feat(components): add missing languages to Code Text Splitter #6290

Open
deepak0x wants to merge 2 commits into FlowiseAI:main from deepak0x:bugfix/add-missing-languages-code-text-splitter
@deepak0x

Proposed changes

The Code Text Splitter node only listed the 16 languages natively supported by the JS @langchain/textsplitters package. The Python LangChain library supports more languages, several of which users have been requesting (C#, COBOL, Kotlin, TypeScript, etc.). This PR adds 9 additional languages with custom separators ported from the Python LangChain source. The existing languages stay on the native fromLanguage() path, so their splitting behavior does not change.

New languages: c, csharp, cobol, elixir, haskell, kotlin, lua, powershell, ts
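
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR diff: `planSplitter`, `SplitterPlan`, `NATIVE_LANGUAGES`, and `CUSTOM_SEPARATORS` are hypothetical names, the native list is an assumption about which 16 languages the JS package covers, and the separator lists are shortened placeholders.

```typescript
// Hypothetical sketch of the dispatch; names are illustrative, not from the PR.
type SplitterPlan =
    | { kind: 'native'; language: string } // delegate to fromLanguage()
    | { kind: 'custom'; separators: string[] } // pass separators to RecursiveCharacterTextSplitter
    | { kind: 'default' } // plain recursive splitter with no language hints

// Assumption: the 16 languages handled natively by the JS package.
const NATIVE_LANGUAGES = new Set<string>([
    'cpp', 'go', 'java', 'js', 'php', 'proto', 'python', 'rst',
    'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol'
])

// Placeholder separator lists; the ported lists would be longer.
const CUSTOM_SEPARATORS = new Map<string, string[]>([
    ['csharp', ['\nnamespace ', '\nclass ', '\nstruct ', '\nif ', '\n\n', '\n', ' ', '']],
    ['lua', ['\nlocal ', '\nfunction ', '\nif ', '\n\n', '\n', ' ', '']]
])

function planSplitter(language: string): SplitterPlan {
    // Existing languages keep their current code path.
    if (NATIVE_LANGUAGES.has(language)) return { kind: 'native', language }
    // New languages use separator lists maintained in the node itself.
    const separators = CUSTOM_SEPARATORS.get(language)
    if (separators !== undefined) return { kind: 'custom', separators }
    // Anything else falls back to a default splitter.
    return { kind: 'default' }
}
```

Checking the native set first is what keeps the original languages on their existing path even if a custom list is later added under the same key.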

The dropdown is now sorted alphabetically for easier discovery.

Issue(s)

Closes #3752

How to test or reproduce

  1. pnpm build && pnpm start
  2. Create a new chatflow with a Code Text Splitter node
  3. Open the Language dropdown and confirm the new languages appear (c, csharp, cobol, elixir, haskell, kotlin, lua, powershell, ts)
  4. Select "csharp" and connect it to a document loader with C# source code
  5. Verify the splitter chunks the code correctly along class/method/control-flow boundaries
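
For step 5, it may help to see why keyword separators produce chunks that start at declaration boundaries. The sketch below is a deliberately simplified recursive splitter written for illustration. It is not the library implementation (it has no chunk overlap and does not merge small pieces), and `splitRecursive`, `splitKeepingSeparator`, the sample source, and the short separator list are all hypothetical.

```typescript
// Simplified illustration only: no chunk overlap and no merging of small pieces,
// both of which the real RecursiveCharacterTextSplitter performs.
function splitKeepingSeparator(text: string, separator: string): string[] {
    const parts = text.split(separator)
    const pieces: string[] = []
    parts.forEach((part, i) => {
        // Re-attach the separator to the start of every piece after the first.
        const piece = i === 0 ? part : separator + part
        if (piece.length > 0) pieces.push(piece)
    })
    return pieces
}

function splitRecursive(text: string, separators: string[], chunkSize: number): string[] {
    if (text.length === 0) return []
    if (text.length <= chunkSize) return [text]
    // Use the first separator that occurs in the text; '' means "split into characters".
    const index = separators.findIndex((s) => s === '' || text.indexOf(s) !== -1)
    if (index === -1) return [text] // nothing left to split on
    const separator = separators[index]
    const remaining = separators.slice(index + 1)
    const pieces = separator === '' ? text.split('') : splitKeepingSeparator(text, separator)
    const chunks: string[] = []
    for (const piece of pieces) {
        // Oversized pieces are split again with the lower-priority separators.
        for (const chunk of splitRecursive(piece, remaining, chunkSize)) chunks.push(chunk)
    }
    return chunks
}

// Hypothetical C# input and a shortened separator list.
const csharpSource =
    'using System;\nclass Greeter {\n    void Hello() {}\n}\nclass Other {\n    void Bye() {}\n}\n'
const csharpChunks = splitRecursive(csharpSource, ['\nclass ', '\n\n', '\n', ' ', ''], 40)
```

Because `'\nclass '` is tried before the generic newline and space separators, each class declaration opens its own chunk as long as the class body fits within `chunkSize`.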

Types of changes

  • Bugfix (non-breaking change which fixes an issue)

Checklist

  • I have read the CONTRIBUTING doc
  • Lint and unit tests pass locally with my changes

The Code Text Splitter only exposed 16 languages natively supported
by @langchain/textsplitters. The Python LangChain library supports
many more. This adds 9 additional languages (c, csharp, cobol,
elixir, haskell, kotlin, lua, powershell, ts) with custom separators
ported from Python LangChain, while keeping existing languages on
the native fromLanguage() path.
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request expands the CodeTextSplitter to support a variety of new programming languages by introducing custom separators and updating the language selection options. The review feedback highlights several language-specific inaccuracies in the separator lists, such as the use of 'class' in C and 'implements' in C#, and recommends refining the fallback logic in the initialization method to ensure better error handling.

5 comment threads on packages/components/nodes/textsplitters/CodeTextSplitter/CodeTextSplitter.ts (all Outdated)

C: replace class with struct/union/enum (C has no class keyword).
C#: remove implements (C# uses :), add namespace and struct.
Elixir: remove while (not a keyword in Elixir).
Kotlin: remove case (Kotlin uses when).
Fallback: return default splitter instead of calling fromLanguage
with an unsupported language.
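
One way to guard against this class of mistake in future ports is a small check that flags separators built on words the target language does not have. The sketch below is an assumption about how such a check could look, not part of the PR: `NOT_KEYWORDS` and `findSuspectSeparators` are hypothetical names, and the table contains only the four cases named in the commit message above.

```typescript
// Words the review flagged as not being keywords in each language.
const NOT_KEYWORDS = new Map<string, string[]>([
    ['c', ['class']],
    ['csharp', ['implements']],
    ['elixir', ['while']],
    ['kotlin', ['case']]
])

// Returns the separators whose keyword (the separator with whitespace trimmed)
// is on the flagged list for the given language.
function findSuspectSeparators(language: string, separators: string[]): string[] {
    const flagged = NOT_KEYWORDS.get(language) ?? []
    return separators.filter((sep) => flagged.indexOf(sep.trim()) !== -1)
}

// Before the fix: a C list that still contains 'class'.
const suspectBefore = findSuspectSeparators('c', ['\nclass ', '\nstruct ', '\nif '])
// After the fix: struct/union/enum only.
const suspectAfter = findSuspectSeparators('c', ['\nstruct ', '\nunion ', '\nenum ', '\nif '])
```

A check like this could run as a unit test over every custom separator list, so a stray keyword from another language fails fast instead of silently never matching.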
@deepak0x
Author

@HenryHengZJ — the Code Text Splitter was missing a bunch of languages that Python LangChain already supports. The JS @langchain/textsplitters only has 16, so I ported the separator definitions from the Python side and pass them as custom separators to RecursiveCharacterTextSplitter. The original 16 languages still go through fromLanguage() as before.

New languages: C, C#, COBOL, Elixir, Haskell, Kotlin, Lua, PowerShell, TypeScript.

The Gemini bot caught a few wrong keywords in the separator lists I ported — C doesn't have class (fixed to struct/union/enum), C# doesn't have implements, Elixir has no while, Kotlin has no case. Also changed the fallback to return a default splitter instead of throwing. All addressed in the latest push.

Development

Successfully merging this pull request may close these issues.

[BUG] ccharp language is not showing on code text spliter