From 2d220aabb35857912945a08e8a9d8fc5becaf0f6 Mon Sep 17 00:00:00 2001
From: Gal Shubeli <galshubeli93@gmail.com>
Date: Thu, 14 May 2026 16:24:35 +0300
Subject: [PATCH] feat(ingestion): default to SentenceTokenCapChunking in
 ingest()/update()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Changes the default chunker that ``GraphRAG.ingest()`` and
``GraphRAG.update()`` fall back to when the caller doesn't pass an
explicit ``chunker=``. The previous default was ``FixedSizeChunking()``;
the new default is ``SentenceTokenCapChunking()`` (sentence-aware, using
the strategy's own defaults: max_tokens=512, overlap_sentences=2).

Why
---
``FixedSizeChunking`` splits on a hard character window with no awareness
of sentence, word, or paragraph boundaries. When the window cuts through
an entity name, the per-chunk LLM extractor produces a stub entity for
the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in chunk N plus
unparsable text in chunk N+1). These stubs never merge with their full
forms during resolution, because their embeddings differ enough that
``LLMVerifiedResolution`` scores them below the soft threshold.
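
The truncation can be reproduced with a minimal sketch. This is a plain
character window written for illustration, not the SDK's
``FixedSizeChunking``; the helper name ``fixed_window_chunks``, the sample
sentence, and the tiny ``chunk_size`` are all assumptions chosen to make
the cut visible:

```python
# Illustrative stand-in for a hard character window (NOT the SDK class).
def fixed_window_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive windows of chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample sentence; chunk_size is tiny so the cut is easy to see.
text = "Carla Okafor joined Wayne Enterprises in 2019."
chunks = fixed_window_chunks(text, chunk_size=28)

for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
# 0 'Carla Okafor joined Wayne En'
# 1 'terprises in 2019.'
```

An extractor that sees only chunk 0 has no way to know that ``Wayne En``
is a fragment of a longer name.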

These stubs silently inflate Cypher count results and pollute "which X"
lists. The strategy that surfaced the problem,
``CypherFirstAggregationStrategy``, was stuck at a 6/7 ceiling on the
internal aggregation benchmark, with one question failing because of
the stubs. Switching to ``SentenceTokenCapChunking`` brought the
benchmark to a stable 7/7 across three runs. The post-ingest graph state
went from 11-14 organization nodes (including ``Glo`` / ``Initech
System`` / ``Wayne En``) to exactly 10 clean orgs, and from 66-80
``Person`` nodes (with ``Carla`` / ``Carla Okafor`` duplicates) to
exactly 56 distinct persons, matching the corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost
always keep a person's first mention in the same chunk as their later
short-form references, so per-chunk ``FastCoref`` now binds ``Carla →
Carla Okafor`` reliably. That removes the short-form duplicates as well,
not just the truncation stubs.
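
For intuition, sentence packing with a token cap and sentence overlap can
be sketched as below. This is not the SDK's ``SentenceTokenCapChunking``:
the function name ``sentence_chunks`` is hypothetical, whitespace-separated
words stand in for tokens, the sentence splitter is a naive regex, and the
sample text and ``max_tokens=20`` are made up for the demo:

```python
import re


def sentence_chunks(text: str, max_tokens: int, overlap_sentences: int) -> list[str]:
    """Greedily pack whole sentences into chunks (illustrative sketch only)."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = current + [sentence]
        # Whitespace words approximate tokens; a real tokenizer would differ.
        if current and sum(len(s.split()) for s in candidate) > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with the last N sentences of this one.
            # (This sketch does not re-check the cap after adding overlap.)
            carried = current[-overlap_sentences:] if overlap_sentences else []
            current = carried + [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Carla Okafor joined Wayne Enterprises in 2019. "
    "Carla leads the robotics group. "
    "The group ships a new arm in March. "
    "Initech Systems supplies the motors."
)
chunks = sentence_chunks(text, max_tokens=20, overlap_sentences=2)
for chunk in chunks:
    print(repr(chunk))
```

Every chunk ends on a sentence boundary, so ``Wayne Enterprises`` stays
whole, and the second chunk begins with the last two sentences of the
first.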

Compatibility
-------------
``FixedSizeChunking`` remains exported and fully supported — callers who
explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior.
The existing suite passes without modification (748 passed, 24 skipped):
no test asserts on chunk count or content shape from the default
chunker.

Callers who relied on the previous default and want to keep it should
pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out
the new default and reference ``FixedSizeChunking`` as the opt-in
character-window alternative.
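
The fallback itself is the ``chunker or <default>()`` pattern visible in
the diff. A self-contained sketch of that behavior follows; the two
classes are empty stand-ins with assumed constructor defaults, not the SDK
classes, and ``resolve_chunker`` is a hypothetical helper that does not
exist in the SDK:

```python
class FixedSizeChunking:
    """Stand-in for the SDK class; chunk_size default is assumed."""

    def __init__(self, chunk_size: int = 1000) -> None:
        self.chunk_size = chunk_size


class SentenceTokenCapChunking:
    """Stand-in for the SDK class, using the defaults named above."""

    def __init__(self, max_tokens: int = 512, overlap_sentences: int = 2) -> None:
        self.max_tokens = max_tokens
        self.overlap_sentences = overlap_sentences


def resolve_chunker(chunker=None):
    """Return the caller's chunker, or the new default when none is given."""
    return chunker or SentenceTokenCapChunking()


default = resolve_chunker()
explicit = resolve_chunker(FixedSizeChunking(chunk_size=1000))
print(type(default).__name__)   # SentenceTokenCapChunking
print(type(explicit).__name__)  # FixedSizeChunking
```

An explicitly passed chunker is returned unchanged, which is why callers
who opt in to ``FixedSizeChunking`` see no difference.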

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 graphrag_sdk/src/graphrag_sdk/api/main.py | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/graphrag_sdk/src/graphrag_sdk/api/main.py b/graphrag_sdk/src/graphrag_sdk/api/main.py
index 9b2486c6..684c7a2d 100644
--- a/graphrag_sdk/src/graphrag_sdk/api/main.py
+++ b/graphrag_sdk/src/graphrag_sdk/api/main.py
@@ -31,7 +31,9 @@
 )
 from graphrag_sdk.core.providers import Embedder, LLMInterface
 from graphrag_sdk.ingestion.chunking_strategies.base import ChunkingStrategy
-from graphrag_sdk.ingestion.chunking_strategies.fixed_size import FixedSizeChunking
+from graphrag_sdk.ingestion.chunking_strategies.sentence_token_cap import (
+    SentenceTokenCapChunking,
+)
 from graphrag_sdk.ingestion.extraction_strategies.base import ExtractionStrategy
 from graphrag_sdk.ingestion.extraction_strategies.graph_extraction import GraphExtraction
 from graphrag_sdk.ingestion.loaders.base import LoaderStrategy
@@ -320,7 +322,10 @@ async def ingest(
 
         Uses sensible defaults for any unspecified strategy:
         - Loader: auto-detected from file extension (PDF or text)
-        - Chunker: FixedSizeChunking(chunk_size=1000)
+        - Chunker: SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
+          — sentence-aware, never splits entity names at chunk boundaries.
+          Override with ``chunker=FixedSizeChunking(...)`` if you need
+          character-window chunking.
         - Extractor: GraphExtraction with configured LLM
         - Resolver: ExactMatchResolution
 
@@ -529,7 +534,7 @@ async def _ingest_single(
 
         pipeline = IngestionPipeline(
             loader=loader or TextLoader(),
-            chunker=chunker or FixedSizeChunking(),
+            chunker=chunker or SentenceTokenCapChunking(),
             extractor=extractor or self._default_extractor(),
             resolver=resolver or ExactMatchResolution(),
             graph_store=self._graph_store,
@@ -1010,7 +1015,7 @@ async def update(
 
         pipeline = IngestionPipeline(
             loader=loader or TextLoader(),  # unused (text is provided below)
-            chunker=chunker or FixedSizeChunking(),
+            chunker=chunker or SentenceTokenCapChunking(),
             extractor=extractor or self._default_extractor(),
             resolver=resolver or ExactMatchResolution(),
             graph_store=self._graph_store,
@@ -1271,7 +1276,7 @@ async def apply_changes(
                 to ``ingest()`` and ``update()``). Defaults to per-extension
                 auto-selection. ``deleted`` ignores this.
             chunker: Override the chunking strategy for ``added``/``modified``.
-                Defaults to ``FixedSizeChunking``. ``deleted`` ignores this.
+                Defaults to ``SentenceTokenCapChunking``. ``deleted`` ignores this.
             extractor: Override the entity-extraction strategy for
                 ``added``/``modified``. ``deleted`` ignores this.
             resolver: Override the resolution strategy for ``added``/