From 2d220aabb35857912945a08e8a9d8fc5becaf0f6 Mon Sep 17 00:00:00 2001
From: Gal Shubeli
Date: Thu, 14 May 2026 16:24:35 +0300
Subject: [PATCH] feat(ingestion): default to SentenceTokenCapChunking in ingest()/update()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Changes the default chunker that ``GraphRAG.ingest()`` and
``GraphRAG.update()`` fall back to when the caller doesn't pass an
explicit ``chunker=``. Was ``FixedSizeChunking()``; now
``SentenceTokenCapChunking()`` (sentence-aware, using the strategy's own
defaults of max_tokens=512 and overlap_sentences=2).

Why
---

``FixedSizeChunking`` splits on a hard character window with no
awareness of sentence, word, or paragraph boundaries. When the window
cuts through an entity name, the per-chunk LLM extractor produces a stub
entity for the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in
chunk N plus unparsable text in chunk N+1). These stubs never merge with
their full forms during resolution because their embeddings differ
enough that LLMVerifiedResolution scores them below the soft threshold.
This silently inflates Cypher counts and pollutes "which X" lists.

The strategy that surfaced this, ``CypherFirstAggregationStrategy``, was
hitting a 6/7 ceiling on the internal aggregation benchmark, with one
question failing because of these stubs. Switching to
``SentenceTokenCapChunking`` cleared the benchmark to 7/7, stable across
three runs. The post-ingest graph state went from 11-14 organization
nodes (including ``Glo`` / ``Initech System`` / ``Wayne En``) to exactly
10 clean orgs, and from 66-80 ``Person`` nodes (with ``Carla`` /
``Carla Okafor`` duplicates) to exactly 56 distinct persons, matching
the corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost
always keep a person's first mention in the same chunk as their later
short-form references, so per-chunk FastCoref now binds
``Carla → Carla Okafor`` reliably.
That eliminates the short-form-duplicate class too, not just the
truncation stubs.

Compatibility
-------------

``FixedSizeChunking`` remains exported and fully supported; callers who
explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior.
Existing tests (748 passed, 24 skipped) pass without modification: no
test in the suite asserts on chunk count or content shape from the
default chunker, so switching defaults doesn't break the suite.

Callers who relied on the previous default and want to keep it should
pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out
the new default and reference ``FixedSizeChunking`` as the opt-in
character-window alternative.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 graphrag_sdk/src/graphrag_sdk/api/main.py | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/graphrag_sdk/src/graphrag_sdk/api/main.py b/graphrag_sdk/src/graphrag_sdk/api/main.py
index 9b2486c6..684c7a2d 100644
--- a/graphrag_sdk/src/graphrag_sdk/api/main.py
+++ b/graphrag_sdk/src/graphrag_sdk/api/main.py
@@ -31,7 +31,9 @@
 )
 from graphrag_sdk.core.providers import Embedder, LLMInterface
 from graphrag_sdk.ingestion.chunking_strategies.base import ChunkingStrategy
-from graphrag_sdk.ingestion.chunking_strategies.fixed_size import FixedSizeChunking
+from graphrag_sdk.ingestion.chunking_strategies.sentence_token_cap import (
+    SentenceTokenCapChunking,
+)
 from graphrag_sdk.ingestion.extraction_strategies.base import ExtractionStrategy
 from graphrag_sdk.ingestion.extraction_strategies.graph_extraction import GraphExtraction
 from graphrag_sdk.ingestion.loaders.base import LoaderStrategy
@@ -320,7 +322,10 @@ async def ingest(
 
         Uses sensible defaults for any unspecified strategy:
         - Loader: auto-detected from file extension (PDF or text)
-        - Chunker: FixedSizeChunking(chunk_size=1000)
+        - Chunker: SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
+          — sentence-aware, never splits entity names at chunk boundaries.
+          Override with ``chunker=FixedSizeChunking(...)`` if you need
+          character-window chunking.
         - Extractor: GraphExtraction with configured LLM
         - Resolver: ExactMatchResolution
 
@@ -529,7 +534,7 @@ async def _ingest_single(
 
         pipeline = IngestionPipeline(
             loader=loader or TextLoader(),
-            chunker=chunker or FixedSizeChunking(),
+            chunker=chunker or SentenceTokenCapChunking(),
             extractor=extractor or self._default_extractor(),
             resolver=resolver or ExactMatchResolution(),
             graph_store=self._graph_store,
@@ -1010,7 +1015,7 @@ async def update(
 
         pipeline = IngestionPipeline(
             loader=loader or TextLoader(),  # unused (text is provided below)
-            chunker=chunker or FixedSizeChunking(),
+            chunker=chunker or SentenceTokenCapChunking(),
             extractor=extractor or self._default_extractor(),
             resolver=resolver or ExactMatchResolution(),
             graph_store=self._graph_store,
@@ -1271,7 +1276,7 @@ async def apply_changes(
                 to ``ingest()`` and ``update()``). Defaults to per-extension
                 auto-selection. ``deleted`` ignores this.
             chunker: Override the chunking strategy for ``added``/``modified``.
-                Defaults to ``FixedSizeChunking``. ``deleted`` ignores this.
+                Defaults to ``SentenceTokenCapChunking``. ``deleted`` ignores this.
             extractor: Override the entity-extraction strategy for ``added``/``modified``.
                 ``deleted`` ignores this.
             resolver: Override the resolution strategy for ``added``/