feat(retrieval): CypherFirstAggregationStrategy + cypher-path accuracy fixes #255
galshubeli wants to merge 4 commits into
…ions
Six localized fixes to the cypher_generation pipeline, identified from a
failure-mode investigation on a 56-person/10-org synthetic corpus where
both vector-only and cypher-enabled retrieval were silently producing
wrong answers on counting, top-N, group-by, and intersection questions.
P0 fixes (ship together):
- Smart LIMIT injection. _sanitize_cypher no longer auto-injects LIMIT on
pure aggregations (count/sum/avg without group-by) and uses the new
_DEFAULT_ROW_LIMIT=100 constant otherwise. Paired with raising the
result_assembly slice cap to 100 plus a truncation sentinel, this stops
group-by lists ("orgs with >=5 employees", "top-N city") from being
silently cut at 20 rows.
- Authoritative result framing. The cypher results section is renamed to
"Authoritative Graph Query Results (deterministic; trust over passages
on counts and aggregates)" and a matching rule 8 is added to both
_RAG_SYSTEM_PROMPT variants in api/main.py. Together they stop the LLM
from contradicting a correct numeric cypher answer when verbose passage
text mentions a different entity.
- APOC/GDS/db function blocklist. validate_cypher now rejects dotted-
namespace function calls (apoc.text.regexGroups, gds.*, db.*) which
FalkorDB silently returns 0 rows for. The error feeds the existing
retry-with-feedback loop so attempt 2 has a concrete fix-it.
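A minimal sketch of the smart-LIMIT idea described above, under stated assumptions: `is_pure_aggregation`, `split_top_level_commas`, and `inject_limit` are hypothetical stand-ins (not the SDK's `_is_pure_aggregation`, `_split_top_level_commas`, or `_sanitize_cypher`), and the regexes ignore string literals, WITH-chained queries, and UNION.

```python
import re

DEFAULT_ROW_LIMIT = 100  # same value as the _DEFAULT_ROW_LIMIT named above

# Hypothetical: an item counts as an aggregate if it starts with count/sum/avg(.
_AGG_CALL = re.compile(r"^\s*(count|sum|avg)\s*\(", re.IGNORECASE)
_RETURN_CLAUSE = re.compile(
    r"\bRETURN\b(.*?)(?:\bORDER\s+BY\b|\bSKIP\b|\bLIMIT\b|$)",
    re.IGNORECASE | re.DOTALL,
)


def split_top_level_commas(text: str) -> list[str]:
    """Split on commas that are not nested inside (), [] or {}."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def is_pure_aggregation(cypher: str) -> bool:
    """True when every RETURN item is an aggregate, i.e. there is no group-by key."""
    match = _RETURN_CLAUSE.search(cypher)
    if not match:
        return False
    items = split_top_level_commas(match.group(1))
    return bool(items) and all(_AGG_CALL.match(item) for item in items)


def inject_limit(cypher: str) -> str:
    """Append a default LIMIT unless one exists or the query is a pure aggregation."""
    if re.search(r"\bLIMIT\s+\d+", cypher, re.IGNORECASE) or is_pure_aggregation(cypher):
        return cypher
    return f"{cypher.rstrip().rstrip(';')} LIMIT {DEFAULT_ROW_LIMIT}"
```

In this sketch a group-by query such as `RETURN o.name, count(p) AS n` still receives the default limit, while `RETURN count(p) AS n` is left untouched because it can only produce one row.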
P1 fixes:
- 0-row label-widen fallback. When a typed-label cypher returns 0 rows
AND a name predicate is present (label is a routing hint, not the
filter itself), execute_cypher_retrieval rewrites typed labels to
__Entity__ and re-runs once. Pure-cypher rewrite, no second LLM call.
Recovers cases where the extractor labelled an entity differently than
the schema prompt steered the LLM toward.
- Two new schema-prompt examples replacing redundant ones: a top-N
group-by ("city with most employees", explicit 2-hop) and a set-
intersection ("works at BOTH A and B") with a matching rule.
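The 0-row label-widen fallback above can be sketched as a pure string rewrite. This is an illustration only: `should_widen` and `widen_labels` are hypothetical names (not the SDK's `_should_widen_labels` or `_widen_typed_labels`), the name-predicate check is a rough heuristic, and multi-label nodes such as `(p:A:B)` are not handled.

```python
import re

# Matches the "(var:Label" prefix of a node pattern; relationships use [] and are skipped.
_NODE_LABEL = re.compile(r"\(\s*(\w*)\s*:\s*(\w+)")
# Rough heuristic for "a name predicate is present".
_NAME_PREDICATE = re.compile(
    r"\.name\s*(=|CONTAINS|STARTS\s+WITH)|\{\s*name\s*:", re.IGNORECASE
)


def should_widen(cypher: str, row_count: int) -> bool:
    """Widen only when the typed query returned nothing and filters by name."""
    return row_count == 0 and bool(_NAME_PREDICATE.search(cypher))


def widen_labels(cypher: str) -> str:
    """Rewrite every typed node label to the generic __Entity__ label."""
    return _NODE_LABEL.sub(lambda m: f"({m.group(1)}:__Entity__", cypher)
```

Gating on the name predicate matters: without it, widening a query like `MATCH (p:Person) RETURN count(p)` would count every entity in the graph instead of recovering a mislabelled one.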
P2: cypher metadata (cypher_fallback, cypher_truncated, cypher_rows)
threaded through assemble_raw_result into RetrieverResult.metadata so
operators can monitor fallback firing rate.
Public API: execute_cypher_retrieval now returns a 3-tuple
(facts, entities, metadata) instead of (facts, entities). Internal —
only callers are multi_path.py and tests.
Verification on the 6-question matrix moved 2 questions from wrong to
correct (city group-by, observability existential via label widen) and
preserved the 4 that were already passing. Unit tests added for
_is_pure_aggregation, _split_top_level_commas, _widen_typed_labels,
_should_widen_labels, apoc rejection, and label-widen firing/gating.
49 cypher tests + full SDK suite (629) green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New retrieval strategy that routes quantitative and structural questions
through a deterministic Cypher-first path while delegating
non-aggregation questions to a fallback strategy (default
MultiPathRetrieval). Safe to use as the top-level strategy on GraphRAG.
Implements six mechanisms identified from a failure-mode investigation
where aggregation answers were being silently corrupted by extraction
noise, lossy result formatting, and LLM mistrust of bare cypher numbers:
- Intent classifier routes per question into numeric_math / aggregation /
rag. Catches "how many", "which X", "more X than Y", "BOTH A and B",
"are there any", "average / total of NUMBER".
- Multi-candidate cypher generation. K parallel samples per question,
execute all, pick the highest-row-count result. Beats LLM stochasticity
on structural interpretation without serial retries.
- Column-named markdown table formatting via result.header. Eliminates
the "10 | 7 | True" ambiguity that was swapping comparison answers.
- Description+chunk-text fuzzy hybrid for "shared X" / "BOTH A and B"
shapes when X is a free-text property (role, project) not extracted as
a typed entity. Single batched cypher, sentence-restricted regex
extraction, fuzzy intersect (substring or 2-token overlap). Recovers
cases where graph extraction summarized away the project names.
- Numeric-math sub-path. RETURN raw values, do average / sum / median in
Python. Avoids LLM-arithmetic errors deterministically.
- Negation-existential empty handling. For "are there any X without Y?"
an empty cypher result is the definitive "No"; positive existentials
fall back to vector retrieval since extraction labels are unreliable.
The strategy emits its result under an "Authoritative Graph Query
Results" section heading that rule 8 of the existing _RAG_SYSTEM_PROMPT
(added in bb920c6) is already configured to trust on quantitative
questions.
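The multi-candidate mechanism listed above can be sketched roughly as follows. This is not the strategy's code: `best_candidate`, `generate`, and `execute` are hypothetical names, with `generate` standing in for one LLM sampling call and `execute` for one graph query.

```python
import asyncio
from typing import Awaitable, Callable

Rows = list[list[object]]


async def best_candidate(
    question: str,
    generate: Callable[[str], Awaitable[str]],  # hypothetical: one LLM sample -> cypher
    execute: Callable[[str], Awaitable[Rows]],  # hypothetical: cypher -> result rows
    k: int = 3,
) -> tuple[str, Rows]:
    """Sample k queries in parallel, run them all, keep the one with the most rows."""
    queries = await asyncio.gather(*(generate(question) for _ in range(k)))

    async def run(query: str) -> tuple[str, Rows]:
        try:
            return query, await execute(query)
        except Exception:
            # A failing candidate counts as an empty result instead of aborting the batch.
            return query, []

    # dict.fromkeys drops duplicate candidates while preserving sampling order.
    results = await asyncio.gather(*(run(q) for q in dict.fromkeys(queries)))
    return max(results, key=lambda pair: len(pair[1]))
```

Because the candidates run concurrently, the added latency is roughly that of the slowest sample, not k serial retries.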
Benchmark: prototype scored 7/7 stably across three runs on the
seven-question failure-mode matrix that previous strategies hit 2-5/7
on. The SDK port scored 6/7 in an end-to-end check; the remaining
failure was malformed extracted org names ("Glo" / "Initech System")
leaking through into the answer, an extraction-quality issue orthogonal
to this strategy.
Public API: CypherFirstAggregationStrategy is exported at the top level
and from graphrag_sdk.retrieval.strategies. Pass it as
retrieval_strategy= on GraphRAG construction to opt in.
Tests: 45 unit tests cover the pure-Python helpers (intent classifier,
shape detectors, role/project regex extractors, fuzzy intersect,
markdown table formatter, numeric coercion). Full SDK suite stays green
at 674 passed, 3 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
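To illustrate why the column-named table formatting in this commit removes the "10 | 7 | True" ambiguity, here is a small sketch. `to_markdown_table` is a hypothetical helper, and it assumes the column names have already been pulled out of the query result as plain strings.

```python
from typing import Sequence


def to_markdown_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render query rows as a markdown table so every value keeps its column name."""

    def cell(value: object) -> str:
        # Escape pipes so a value cannot be mistaken for a column boundary.
        return str(value).replace("|", "\\|")

    lines = [
        "| " + " | ".join(cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(cell(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)


print(to_markdown_table(["acme_count", "globex_count", "acme_larger"], [[10, 7, True]]))
```

With the header row present, the answer LLM no longer has to guess which of 10 and 7 belongs to which organization.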
…nStrategy
Addresses concerns from a pre-merge review of 7fa7a1f. Behavior
preserved on the failure-mode benchmark; the changes are about surface
quality, observability, and test coverage.
R1: Make the cypher-authority rule opt-in.
Rule 8 ("trust the 'Authoritative Graph Query Results' section over
passages on counts and aggregates") used to live inside the base
_RAG_SYSTEM_PROMPT and _RAG_SYSTEM_PROMPT_DELIMITED, so it fired on
every completion() call SDK-wide, including for users who never enable
cypher retrieval. The rule is now extracted into a separate constant
_CYPHER_AUTH_RULE and appended to the system prompt only when the
retriever produced a cypher_results section (detected via item metadata
or the canonical heading marker as a defensive fallback). Callers on
MultiPathRetrieval without enable_cypher keep the unchanged 7-rule
prompt.
R2: Add cypher_first_path metadata to every strategy result.
The strategy has five sub-paths plus three RAG-fallback branches; today
operators can't tell which one handled a query from the result alone.
Each result now carries one of seven canonical PATH_* labels:
numeric_math, shared_property_hybrid, cypher_table, negation_empty_no,
rag_fallback, rag_fallback_numeric_fail, rag_fallback_cypher_empty. A
shared _tag_path() helper handles the bookkeeping including the three
delegated-fallback wrappings so the contract is uniform.
R3: Document prose-shape and graph-topology assumptions.
The shared-property hybrid was tuned on graphs produced by the SDK's
default GraphExtraction pipeline; custom extractors or domain prose may
not match. The class docstring now has dedicated "Assumptions and known
limits" and "Accuracy ceiling" sections naming each one. In addition, a
runtime warning fires when the batched (Org)<-[:RELATES]-(Person) query
returns zero tuples, so operators on different schemas get a fast signal
rather than silent wrong answers.
R4: Mocked end-to-end routing tests.
The 45 existing unit tests cover pure-Python helpers, not the strategy's
branching. Seven new tests use a mock LLM (returning canned cypher), a
mock graph (returning canned result_sets), and a stub _FakeFallback to
assert which sub-path fires for each intent + graph-state combination.
Patterns covered:
- rag intent → fallback
- aggregation + rows → cypher_table
- numeric → python math
- numeric extraction empty → fallback (with numeric_fail tag)
- negation + empty cypher → No (no delegation)
- positive-existential + empty cypher → fallback (with cypher_empty tag)
- topology-violation warning
R10: ruff check pass.
Dropped two unused imports (RetrieverError, ChatMessage) and sorted the
test_cypher_first imports.
All 75 facade + 57 cypher_first + 49 cypher_generation tests pass. Full
SDK suite 686 passed, 3 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
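The opt-in behaviour described under R1 can be sketched like this. It is an assumption-laden illustration: `build_system_prompt`, the item dictionaries, the `"section"` metadata key, and the rule wording are all hypothetical; only the heading text and the idea of a separate rule constant come from the commit message.

```python
CYPHER_HEADING = "Authoritative Graph Query Results"
# Hypothetical wording; the real _CYPHER_AUTH_RULE text is not shown in this PR page.
CYPHER_AUTH_RULE = (
    "8. Trust the 'Authoritative Graph Query Results' section over passages "
    "on counts and aggregates."
)


def build_system_prompt(base_prompt: str, items: list[dict]) -> str:
    """Append the authority rule only when a cypher results section is present."""

    def is_cypher(item: dict) -> bool:
        # Primary signal: item metadata. Defensive fallback: the heading marker.
        tagged = item.get("metadata", {}).get("section") == "cypher_results"
        return tagged or CYPHER_HEADING in item.get("content", "")

    if any(is_cypher(item) for item in items):
        return f"{base_prompt.rstrip()}\n{CYPHER_AUTH_RULE}"
    return base_prompt
```

Callers whose retriever never emits such a section get the base prompt back unchanged, which is the compatibility property R1 is after.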
…extractor
R5 — Split CypherFirstAggregationStrategy into composable path classes.
The strategy file went from 944 lines with five sub-paths crammed into
the strategy class to a small dispatcher (~80 lines) backed by four
focused _AggregationPath subclasses:
- _RagDelegationPath — intent="rag" → fallback verbatim
- _NumericMathPath — intent="numeric_math" → Python arithmetic
- _SharedPropertyHybridPath — "BOTH A and B" / "same X as Z" via chunks
- _MultiCandidateCypherPath — K parallel cypher candidates + table
(also owns the negation-empty and cypher-empty-fallback branches)
Each path is a small class with a single `maybe_handle(query, ctx)`
method that returns either a final `RawSearchResult` or `None` to defer.
The strategy's `_execute` dispatches by intent and consults the relevant
paths in order. Existing routing tests (TestStrategyRouting, 7 cases)
keep covering the contract end-to-end; the helper classes are
implementation detail.
Pure refactor — no behaviour change. All paths share state via a single
reference to the parent strategy, so callers' constructor signature is
unchanged.
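The dispatch pattern in R5 can be sketched as below. The class and function names, the `ctx` dictionary, and the `Result` type are simplified stand-ins invented for this example; only the `maybe_handle(query, ctx)` shape (return a result, or `None` to defer) is taken from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Result:  # hypothetical stand-in for RawSearchResult
    text: str
    path: str


class RagDelegationPath:
    def maybe_handle(self, query: str, ctx: dict) -> Optional[Result]:
        if ctx["intent"] != "rag":
            return None
        return Result(f"fallback({query})", "rag_fallback")


class NumericMathPath:
    def maybe_handle(self, query: str, ctx: dict) -> Optional[Result]:
        values = ctx.get("values")
        if ctx["intent"] != "numeric_math" or not values:
            return None  # defer, e.g. when no numeric values were extracted
        return Result(str(sum(values) / len(values)), "numeric_math")


class CypherTablePath:
    def maybe_handle(self, query: str, ctx: dict) -> Optional[Result]:
        rows = ctx.get("rows")
        if not rows:
            return None
        return Result(f"{len(rows)} rows", "cypher_table")


def dispatch(
    query: str,
    ctx: dict,
    paths: Sequence,
    fallback: Callable[[str], Result],
) -> Result:
    """Consult each path in order; the first non-None result wins."""
    for path in paths:
        result = path.maybe_handle(query, ctx)
        if result is not None:
            return result
    return fallback(query)
```

Returning `None` to defer keeps each path ignorant of the others, so adding a path means adding one class and one entry in the ordered list.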
R8 — Pluggable phrase extractor for the shared-property hybrid.
The default role/project regexes target the prose patterns the SDK's
GraphExtraction pipeline produces ("works at X as a <role>", "contributes
to <project>") and a closed set of role-suffix words. Domain-specific
use cases (medical, legal, e-commerce, non-English) need different
vocabularies.
Added a `PhraseExtractor` ABC and a `DefaultPhraseExtractor` that wraps
the existing regexes. Strategies accept a `phrase_extractor=` parameter
on construction; the shared-property hybrid consults it instead of the
module-level `_extract_phrases`. Domain users can now subclass and
inject without forking.
Exports added at strategies/__init__.py and top-level graphrag_sdk
namespace.
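A domain-specific extractor along the lines of R8 might look like the sketch below. The ABC here is a local stand-in written for this example: the real `PhraseExtractor` method name and signature are not shown in this PR page, so `extract(text, kind)` and the medical patterns are assumptions.

```python
import re
from abc import ABC, abstractmethod


class PhraseExtractor(ABC):
    """Local stand-in for the SDK's ABC; the real interface may differ."""

    @abstractmethod
    def extract(self, text: str, kind: str) -> set[str]:
        ...


class MedicalPhraseExtractor(PhraseExtractor):
    """Hypothetical extractor for clinical prose instead of the default role/project regexes."""

    _PATTERNS = {
        "role": re.compile(r"\bserves as (?:an? )?([a-z ]+?)(?: at|,|\.|$)", re.IGNORECASE),
        "project": re.compile(r"\bcontributes to the ([\w-]+) trial\b", re.IGNORECASE),
    }

    def extract(self, text: str, kind: str) -> set[str]:
        pattern = self._PATTERNS.get(kind)
        if pattern is None:
            return set()  # unknown kinds yield an empty set instead of raising
        return {m.strip().lower() for m in pattern.findall(text)}
```

Under these assumptions, an instance would be passed as `phrase_extractor=` when constructing the strategy, so the shared-property hybrid intersects over the custom vocabulary.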
Tests
-----
- 1 unit test for DefaultPhraseExtractor (matches the default regexes,
unknown kinds return empty set rather than raising).
- 1 mocked end-to-end test confirming that a custom extractor passed to
the strategy is consulted by the hybrid path (intersection computed
over the custom vocabulary, not the default).
All 763 unit tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walkthrough
This PR introduces a Cypher-first aggregation strategy that detects query intent and routes to specialized execution paths: numeric-math (Cypher returns raw values and the arithmetic runs in Python), shared-property hybrid (fuzzy intersection over extracted phrases), and multi-candidate Cypher-table (parallel K-candidate generation). Supporting enhancements include auto-injected row limits, pure-aggregation detection, label-widening fallback on 0-row results, a result cap increase from 20 to 100, and authority rule injection into completion prompts.
Summary
Adds a new opt-in retrieval strategy, `CypherFirstAggregationStrategy`, that routes quantitative/structural questions ("how many", "which X has the most", "BOTH A and B", "are there any X without Y", "what is the average …") through a deterministic Cypher-first path. Non-aggregation questions delegate to `MultiPathRetrieval` unchanged.
Alongside the strategy, this PR also lands six targeted SDK-level fixes to the pre-existing cypher path (used by `MultiPathRetrieval(enable_cypher=True)` too): smart `LIMIT` injection, `APOC`/`GDS`/`db.*` function-call blocklist in the validator, 0-row label-widen fallback, two new schema-prompt examples, and metadata threading for observability.
Background
A focused failure-mode investigation on a 56-person / 10-org synthetic benchmark identified seven distinct failure modes on aggregation questions, none of them random one-offs:
- `" | "`-joined strings lose column names, so the answer-LLM can swap row values
- `LIMIT 25` truncates group-by aggregations silently
- `apoc.text.regexGroups(...)`-style function calls slipped past the `CALL` blocklist
The new strategy directly addresses each one. The 7-question failure-mode benchmark went from 2-5/7 (depending on which pre-fix variant) to 7/7 stable across three runs.
What's in this PR (4 commits)
Each commit has its own detailed body.
Compatibility (cypher-off callers)
The PR is additive for users who don't opt in:
- `CypherFirstAggregationStrategy` only runs if passed explicitly as `retrieval_strategy=`
- `enable_cypher=True`
Test coverage (`tests/test_facade.py::TestCypherAuthorityRuleInjection`) pins the contract.
Usage
```python
from graphrag_sdk import (
    CypherFirstAggregationStrategy,
    FastCorefResolver,
    GraphExtraction,
    GraphRAG,
    LLMVerifiedResolution,
    SentenceTokenCapChunking,
)

# `conn`, `llm`, `embedder`, and `doc` are assumed to be defined by the caller.
chunker = SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
extractor = GraphExtraction(llm=llm, coref_resolver=FastCorefResolver())
resolver = LLMVerifiedResolution(llm=llm, embedder=embedder)

# Run this inside an async function (or a REPL with top-level await).
async with GraphRAG(
    connection=conn, llm=llm, embedder=embedder, embedding_dimension=256,
) as rag:
    await rag.ingest(text=doc, chunker=chunker,
                     extractor=extractor, resolver=resolver)
    await rag.finalize()
    # Opt in to the Cypher-first strategy, reusing the stores GraphRAG built.
    rag._retrieval_strategy = CypherFirstAggregationStrategy(
        graph_store=rag._graph_store,
        vector_store=rag._vector_store,
        embedder=embedder,
        llm=llm,
    )
    answer = await rag.completion("Which city has the most employees?")
```
Related PRs
`SentenceTokenCapChunking` the default chunker (recommended pairing; the benchmark required it to hit 7/7)
Test plan
🤖 Generated with Claude Code