If you are building retrieval for a coding agent, the first thing you will reach for is the chunker you already trust for documents. It will work, in the sense that it produces chunks and never crashes. It will also quietly ruin your retrieval, because the assumptions that make a chunker good at prose make it actively hostile to code.

Why naive splitting fails on code

A RecursiveCharacterTextSplitter set to, say, 1000 characters does exactly what it promises: it splits on the coarsest separator it can find (blank lines, then newlines, then spaces) and packs the pieces into chunks of at most 1000 characters. On an essay, the seams land between paragraphs and nobody notices. On a source file, the seams land wherever the budget runs out: three lines into a function body, halfway through a multi-line string, between a decorator and the def it decorates.

The damage is twofold. First, the embedding gets worse: a chunk holding the back half of parse_config and the front half of validate_config has no coherent meaning, so its vector matches neither query well. Second, the retrieved text is useless even when it comes back. An agent handed a fragment that opens mid-for-loop with an unclosed brace cannot reason about it; it has the symptoms of code without the syntax.
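
To make the failure concrete, here is a minimal sketch that cuts a small module at a fixed character count and checks whether each piece still parses. The sample source, the 120-character size, and the helper names fixed_size_chunks and parses are all made up for illustration; a real splitter is smarter about separators, but it is working from the same blind budget.

```python
import ast

# Hypothetical two-function module used only for this demonstration.
SOURCE = '''\
def parse_config(path):
    with open(path) as handle:
        lines = [line.strip() for line in handle]
    return dict(line.split("=", 1) for line in lines if line)


def validate_config(config):
    missing = [key for key in ("host", "port") if key not in config]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return config
'''


def fixed_size_chunks(text, size):
    """Cut text every `size` characters, ignoring structure entirely."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def parses(chunk):
    """Return True if the chunk is valid Python on its own."""
    try:
        ast.parse(chunk)
        return True
    except SyntaxError:
        return False


chunks = fixed_size_chunks(SOURCE, 120)
broken = [chunk for chunk in chunks if not parses(chunk)]
print(f"{len(broken)} of {len(chunks)} chunks fail to parse")
```

The whole file parses cleanly; the pieces do not. That gap is what the agent inherits at retrieval time.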

Fixed-size chunking treats a function the way a paper shredder treats a contract: the pieces are all there, and not one of them is usable.

Code has structure that prose does not — nesting, scope, explicit boundaries — and that structure is exactly what a character-counting splitter is blind to. The fix is to stop counting characters and start respecting syntax.

Language-aware recursive splitting: the cheap upgrade

The smallest possible improvement costs almost nothing. LangChain's RecursiveCharacterTextSplitter.from_language takes a Language enum value and swaps the generic separators for language-specific ones, so a Python splitter prefers to break on \nclass , \ndef , and \n\tdef before it falls back to blank lines or raw characters.

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Prefer Python-specific separators (class and def boundaries) over generic ones.
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=800, chunk_overlap=0
)

This is genuinely better than the prose default and takes one line to adopt. But notice what it still is: a heuristic over separators. It does not parse anything. It does not know that a def inside a class is a method, or that a brace it just split on was inside a string. It guesses at structure from punctuation. For a fast first pass that guess is fine. For production retrieval you want the real tree.
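
A quick way to see the limit of separator heuristics is to hide a separator inside a string. The sketch below is not LangChain's implementation; split_on_def is a hypothetical stand-in that starts a new chunk at every line beginning with "def ", and Python's built-in ast module plays the role of a real parser for comparison.

```python
import ast
import re

# Hypothetical sample: the second "def" line lives inside a string literal.
SOURCE = '''\
def render_help():
    return """
def is_not_a_function(): this line sits inside a string literal
"""


def real_function():
    return 1
'''


def split_on_def(text):
    """Heuristic: start a new chunk at every line that begins with 'def '."""
    pieces = re.split(r"^(?=def )", text, flags=re.MULTILINE)
    return [piece for piece in pieces if piece]


heuristic_chunks = split_on_def(SOURCE)
real_functions = [
    node.name for node in ast.parse(SOURCE).body
    if isinstance(node, ast.FunctionDef)
]

print(len(heuristic_chunks), "chunks from the separator heuristic")
print(len(real_functions), "functions according to the parser")
```

The heuristic cuts render_help in half because it cannot tell a string from a definition. A parser can, which is the argument for using one.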

AST splitting with tree-sitter

The serious tools all converge on the same engine: tree-sitter, an incremental parsing system that builds a concrete syntax tree for a source file and updates it efficiently as the file changes. Its grammars are maintained as separate packages and cover well over a hundred languages, and it powers in-editor navigation in Neovim, Helix, and Zed, which is to say it is battle-tested far beyond RAG.

Once you have a tree, chunking becomes a tree-walk instead of a character count. You split on node boundaries — a function definition, a class body, a top-level statement — so every chunk is a complete syntactic unit. LlamaIndex's CodeSplitter does precisely this: it parses with tree-sitter and splits by AST nodes, bounded by chunk_lines and max_chars so a single enormous class still gets divided, but on method boundaries rather than mid-line.
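
The tree-walk is easier to picture in code. This is a sketch of the idea only, not LlamaIndex's algorithm: Python's standard-library ast module stands in for tree-sitter so the example runs without extra dependencies, it only handles Python, and the names node_text, split_by_nodes, and the max_chars value of 100 are assumptions chosen for the demo.

```python
import ast
import textwrap

# Hypothetical sample module.
SOURCE = '''\
import os


class Invoice:
    def total(self):
        return sum(self.lines)

    def resolve(self, code):
        return self.cache.get(code)


def load(path):
    return os.path.exists(path)
'''


def node_text(lines, node):
    """Return the source lines a node spans, decorators included, dedented."""
    decorators = getattr(node, "decorator_list", [])
    start = min([node.lineno] + [d.lineno for d in decorators])
    return textwrap.dedent("\n".join(lines[start - 1:node.end_lineno]))


def split_by_nodes(source, max_chars):
    """One chunk per top-level node; oversized classes split per member."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        text = node_text(lines, node)
        if len(text) > max_chars and isinstance(node, ast.ClassDef):
            # Too big for one chunk: cut on member boundaries, never mid-statement.
            chunks.extend(node_text(lines, member) for member in node.body)
        else:
            chunks.append(text)
    return chunks


chunks = split_by_nodes(SOURCE, max_chars=100)
for chunk in chunks:
    ast.parse(chunk)  # raises SyntaxError if a chunk is not a whole unit
```

With this sample the class exceeds the limit, so it is emitted as one chunk per method. Note what the method chunks lose in the process: the class line they came from.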

Chonkie's CodeChunker takes the same approach and pushes on coverage: it uses tree-sitter-language-pack for 165+ languages and can auto-detect the language with Google's Magika before splitting on structure.

The part everyone skips: context enrichment

Here is the non-obvious idea, and it is worth more than the choice of splitter. A perfectly clean AST chunk is still an orphan. Retrieve the method _resolve on its own and the model sees a function with no file path, no idea which class it belongs to, and no clue what self.cache refers to. The chunk is syntactically whole and semantically homeless.

So enrich it. Before you embed a chunk, prepend the context that the AST already knows: the file path, the parent class or function signature, and the relevant imports. A chunk that begins with # file: billing/invoice.py and class Invoice: above the method body is self-explaining — and, just as important, it embeds better, because the file path and class name are exactly the tokens a developer's query ("how does invoice billing resolve discounts") will contain.
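
Here is roughly what that stamping step could look like. Treat it as a sketch under assumptions: the standard-library ast module stands in for tree-sitter, enrich_methods is a hypothetical helper, every top-level import is included rather than only the relevant ones, and the class header is assumed to fit on one line.

```python
import ast

# Hypothetical sample file contents.
SOURCE = '''\
from decimal import Decimal


class Invoice:
    def _resolve(self, code):
        rate = self.cache.get(code, Decimal("0"))
        return self.subtotal * rate
'''


def enrich_methods(source, file_path):
    """Return one chunk per method, prefixed with path, imports, and class line."""
    lines = source.splitlines()
    tree = ast.parse(source)
    # Simplification: keep all top-level imports, not just the ones a method uses.
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    chunks = []
    for cls in (n for n in tree.body if isinstance(n, ast.ClassDef)):
        class_line = lines[cls.lineno - 1]  # assumes a single-line class header
        for member in cls.body:
            if not isinstance(member, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            # Keep the original indentation so the method still sits under its class.
            body = "\n".join(lines[member.lineno - 1:member.end_lineno])
            header = [f"# file: {file_path}", *imports, class_line]
            chunks.append("\n".join(header) + "\n" + body)
    return chunks


enriched = enrich_methods(SOURCE, "billing/invoice.py")
print(enriched[0])
```

The enriched chunk is still valid Python, and it now carries the path, the Decimal import, and the Invoice class line into the embedding alongside the method body.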

This is the same insight that makes Aider's repo map work. Aider does not dump whole files into context. It parses each file with tree-sitter, extracts just the signatures — the functions, classes, and exported types that form a file's public surface — and ranks them with a PageRank-style graph over symbol references, fitting the most-referenced definitions into a token budget. The unit of retrieval is a signature plus its location, not a slab of bytes.
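
A toy version of that pipeline fits in a page. To be clear about what is assumed: this is not Aider's code or its exact graph. The ast module stands in for tree-sitter, the three files are invented, the graph links one symbol to another whenever its body mentions the other's name, and a character budget stands in for a token budget.

```python
import ast

# Hypothetical three-file repository.
FILES = {
    "billing/rates.py": "def lookup_rate(code):\n    return 0.1\n",
    "billing/invoice.py": (
        "def resolve_discount(code):\n    return lookup_rate(code)\n\n"
        "def total(lines, code):\n"
        "    return sum(lines) * (1 - resolve_discount(code))\n"
    ),
    "billing/report.py": (
        "def summarize(lines, code):\n"
        "    return total(lines, code), lookup_rate(code)\n"
    ),
}


def collect_definitions(files):
    """Map each top-level symbol name to (path, signature line, AST node)."""
    defs = {}
    for path, source in files.items():
        lines = source.splitlines()
        for node in ast.parse(source).body:
            if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                defs[node.name] = (path, lines[node.lineno - 1].strip(), node)
    return defs


def reference_graph(defs):
    """Edge A -> B when the body of A mentions the name B."""
    graph = {}
    for name, (_, _, node) in defs.items():
        used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        graph[name] = sorted((used & set(defs)) - {name})
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of adjacency lists."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in graph.items():
            outs = targets or nodes  # dangling nodes spread rank evenly
            share = damping * rank[src] / len(outs)
            for dst in outs:
                new[dst] += share
        rank = new
    return rank


def repo_map(files, max_chars):
    """List 'path: signature' entries, highest rank first, within a budget."""
    defs = collect_definitions(files)
    rank = pagerank(reference_graph(defs))
    entries, used = [], 0
    for name in sorted(defs, key=rank.get, reverse=True):
        path, signature, _ = defs[name]
        entry = f"{path}: {signature}"
        if used + len(entry) > max_chars:
            break
        entries.append(entry)
        used += len(entry)
    return entries


for entry in repo_map(FILES, max_chars=10_000):
    print(entry)
```

Shrink max_chars and the least-referenced signatures fall off the end first, which is the behavior you want from a context budget.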

For the indexing side of this, CocoIndex is built around it: native tree-sitter chunking along real code structure, with incremental processing so only changed files get re-embedded — which matters when your "documents" are a live codebase an agent is editing.
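
The incremental part is simple enough to sketch by hand. The following is not CocoIndex's API; plan_reindex and content_hash are hypothetical helpers showing one common way to decide what needs re-embedding, by comparing content hashes against the previous run.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(files, seen_hashes):
    """Return (paths to re-embed, paths to drop, updated hash table)."""
    current = {path: content_hash(source) for path, source in files.items()}
    changed = sorted(p for p, h in current.items() if seen_hashes.get(p) != h)
    removed = sorted(p for p in seen_hashes if p not in current)
    return changed, removed, current


# First run: nothing has been seen, so every file is embedded.
snapshot_v1 = {"a.py": "x = 1\n", "b.py": "y = 2\n", "c.py": "z = 3\n"}
changed, removed, seen = plan_reindex(snapshot_v1, {})

# Second run: b.py was edited and c.py was deleted; a.py is untouched.
snapshot_v2 = {"a.py": "x = 1\n", "b.py": "y = 20\n"}
changed_again, removed_again, seen = plan_reindex(snapshot_v2, seen)
print("re-embed:", changed_again, "drop:", removed_again)
```

A production index would track hashes per chunk rather than per file, so that editing one method does not re-embed the whole module, but the bookkeeping has the same shape.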

The recommendation

Do not bring a prose chunker to a code problem — the chunking strategy that wins for documents is the one that fails hardest on source. If you want one line of improvement, use a language-aware recursive splitter and move on. If you are building retrieval you intend to keep, split on the tree-sitter AST so every chunk is a whole syntactic unit, then spend your remaining effort on enrichment — stamp each chunk with its file path, parent signature, and imports before you embed it. Chunk size is a knob you will tune in an afternoon. A retrieved fragment that explains itself is the difference between an agent that finds the right code and one that finds plausible-looking nonsense.