
feat: LLM provider abstraction + gazette harness pattern #23

Open: thesocialdev wants to merge 4 commits into main from feat/llm-provider-abstraction

Conversation

@thesocialdev (Collaborator)

Summary

  • LLM provider abstraction: Centralized get_llm() factory replacing hardcoded ChatOpenAI across 11 node files. Supports OpenAI and Anthropic (Claude) via LLM_PROVIDER env var. Zero behavior change when unset (defaults to OpenAI).
  • Gazette harness pattern: Applied the Anthropic harness design pattern to break 4 self-evaluation biases in the gazette pipeline: removal of the score>=7 self-evaluation short-circuit, raw passage passthrough, adversarial contradiction extraction, and independent cross-reference verification.
  • E2E test infrastructure: Reusable test claims fixture (tests/fixtures/test_claims.json), CLI runner that saves timestamped results for A/B comparison (tests/run_e2e.py), OpenAI baseline + harness results saved.

Key changes

Phase 1: LLM Provider Abstraction

| File | Change |
| --- | --- |
| `app/llm.py` | New centralized LLM factory with `get_llm(mini=False)` |
| 11 node/plugin files | `ChatOpenAI(model=...)` → `get_llm()` / `get_llm(mini=True)` |
| `requirements.txt` | Added `langchain-anthropic` |
| `.env-example`, `deployment/app.yml`, `.github/workflows/aws.yml` | New env vars: `LLM_PROVIDER`, `LLM_MODEL`, `ANTHROPIC_API_KEY` |
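For orientation, a minimal sketch of what the factory could look like. `get_llm()`, `LLM_PROVIDER`, `LLM_MODEL`, and `LLM_MODEL_MINI` come from this PR; the fallback model names are illustrative assumptions, not the PR's actual defaults:

```python
# app/llm.py -- sketch only. get_llm(), LLM_PROVIDER, LLM_MODEL and
# LLM_MODEL_MINI come from this PR; the fallback model names below
# are illustrative assumptions.
import os

def get_llm(mini: bool = False):
    provider = os.getenv("LLM_PROVIDER", "openai").lower()
    model = os.getenv("LLM_MODEL_MINI" if mini else "LLM_MODEL")
    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic
        return ChatAnthropic(model=model or ("claude-3-5-haiku-latest" if mini else "claude-sonnet-4-5"))
    # Default path: behavior is unchanged when LLM_PROVIDER is unset.
    from langchain_openai import ChatOpenAI
    return ChatOpenAI(model=model or ("gpt-4o-mini" if mini else "gpt-4o"))
```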

Phase 3: Gazette Harness Pattern

| File | Change |
| --- | --- |
| `app/state.py` | +2 fields: `raw_gazette_passages`, `contradictory_evidence` |
| `app/nodes/gazette/evidence_evaluator.py` | Removed score>=7 self-evaluation short-circuit |
| `app/nodes/gazette/deep_analyzer.py` | Adversarial prompts, raw passage passthrough, dedicated contradiction extraction |
| `app/nodes/gazette/cross_checker.py` | 3-layer evidence input (raw + analysis + contradictions), cross-reference instructions |
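To illustrate the 3-layer input, a hedged sketch of how `cross_checker` might assemble its prompt. Only the state fields `raw_gazette_passages` and `contradictory_evidence` come from this PR; the `analysis` field name and the prompt wording are assumptions:

```python
# Sketch of the 3-layer evidence prompt in cross_checker. Only the
# state fields raw_gazette_passages and contradictory_evidence come
# from this PR; "analysis" and the wording below are assumptions.
def build_cross_check_prompt(state: dict) -> str:
    raw = "\n".join(state.get("raw_gazette_passages", []))
    contradictions = "\n".join(state.get("contradictory_evidence", []))
    analysis = state.get("analysis", "")  # hypothetical field name
    return (
        "Classify the claim using three independent evidence layers.\n"
        "Cross-reference the AI analysis against the raw passages and flag\n"
        "any summary statement the raw passages do not support.\n\n"
        f"RAW PASSAGES (unfiltered FAISS output):\n{raw}\n\n"
        f"AI ANALYSIS (may carry summarization bias):\n{analysis}\n\n"
        f"CONTRADICTORY EVIDENCE (adversarial extraction):\n{contradictions}\n"
    )
```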

Test results

  • Gazette claims: 2/2 pass (vs 1/2 baseline)
  • Online claims: 4/4 pass (no regression)

Next: Instrumentation & Research Phase

Before Phase 4 (MCP plugins), we need metrics to measure pipeline quality. The plan:

Instrumentation

  • Per-node token usage and latency tracking (sketched after this list)
  • Classification confidence extraction from cross_checker output
  • Contradiction detection quality scoring (real findings vs boilerplate)
  • Raw-vs-analyzed agreement metric (does the AI analysis faithfully represent raw passages?)
  • Cost per claim breakdown
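One way the per-node tracking could be wired in, as a sketch of planned work rather than code in this PR. LangChain's `get_openai_callback` covers token and cost accounting on the OpenAI path only; the `node_metrics` state field is an assumption:

```python
# Sketch of planned per-node instrumentation (not part of this PR).
# get_openai_callback is LangChain's OpenAI token accountant; the
# node_metrics state field is an assumption.
import time
from langchain_community.callbacks import get_openai_callback

def instrumented(node_fn):
    def wrapper(state):
        start = time.perf_counter()
        with get_openai_callback() as cb:  # counts tokens/cost for OpenAI calls
            result = node_fn(state)
        state.setdefault("node_metrics", {})[node_fn.__name__] = {
            "latency_s": round(time.perf_counter() - start, 3),
            "prompt_tokens": cb.prompt_tokens,
            "completion_tokens": cb.completion_tokens,
            "cost_usd": cb.total_cost,
        }
        return result
    return wrapper
```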

Research: Benchmark Against Human Reviews

Use reviewtasks.csv (157 human-reviewed fact-checks from AletheiaFact) as ground truth:

  • Run the pipeline against the CSV claims and compare classifications (see the sketch after this list)
  • Measure agreement rate with human reviewers (baseline: human cross-check agreement is 96%)
  • Identify systematic biases (which classification categories does the pipeline get wrong?)
  • Track per-category precision/recall
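A sketch of what the benchmark loop could look like. The claim-text column `data_hash_lookup_sentences.content` comes from this PR's CSV commit; the `classification` column name and `normalize()` are assumptions (see open question 2 below on the taxonomy mismatch):

```python
# Sketch of the planned benchmark against reviewtasks.csv. The
# "classification" column name and normalize() are assumptions;
# the claim-text column comes from this PR's CSV commit.
import csv
from collections import Counter

def normalize(label: str) -> str:
    # Hypothetical fix for the taxonomy mismatch:
    # "trustworthy-but" vs "Trustworthy, but".
    return label.lower().replace("-", " ").replace(",", "").strip()

def benchmark(run_pipeline, csv_path: str = "reviewtasks.csv") -> None:
    agree, total, misses = 0, 0, Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            claim = row["data_hash_lookup_sentences.content"]
            human = normalize(row["classification"])  # assumed column name
            predicted = normalize(run_pipeline(claim))
            total += 1
            if predicted == human:
                agree += 1
            else:
                misses[human] += 1  # per-category error tally
    print(f"agreement: {agree}/{total} ({agree / total:.1%})")
    print("misses by human label:", dict(misses))
```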

Open questions for the research phase

  1. Claim extraction: The CSV has summary and questions but no isolated claim text — do we reconstruct claims from summaries, or is there a separate claims dataset?
  2. Classification taxonomy mismatch: The CSV uses trustworthy-but (hyphenated) while the pipeline uses Trustworthy, but (comma); the labels need normalization before comparison
  3. Cross-check disagreements: 4/108 human cross-checks disagreed — should we use the original or cross-check classification as ground truth?
  4. Incomplete records: 18 rows (11%) have no report/sources/verification — exclude from benchmark or test separately?
  5. Language coverage: All 157 reviews appear to be Portuguese — do we need English/multilingual test cases?
  6. Gazette vs online: The CSV doesn't indicate which search path was used — can we infer from sources (gazette URLs vs web URLs)?
  7. Temporal validity: Some claims reference specific dates/events — should we filter out time-sensitive claims that may have different evidence today?

Test plan

  • python tests/run_e2e.py online-01 online-03 --tag harness — no regression (runner sketched below)
  • python tests/run_e2e.py gazette-01 gazette-02 --tag harness — 2/2 pass
  • Syntax validation on all modified files
  • Set LLM_PROVIDER=anthropic and verify Claude models work (blocked by API credits)
  • Run full 6-claim suite after Anthropic credits are added
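For reference, a sketch of the result-saving side of the tests/run_e2e.py runner. The fixture path and --tag flag come from this PR; the fixture schema (a dict keyed by claim id), the `run_claim` callable, and the results/ directory are assumptions:

```python
# Sketch of the result-saving side of tests/run_e2e.py. The fixture
# path and --tag flag come from this PR; the fixture schema, the
# run_claim callable, and the results/ directory are assumptions.
import json
import time
from pathlib import Path

def save_results(run_claim, claim_ids: list[str], tag: str) -> Path:
    fixture = json.loads(Path("tests/fixtures/test_claims.json").read_text())
    results = {cid: run_claim(fixture[cid]) for cid in claim_ids}
    stamp = time.strftime("%Y%m%d-%H%M%S")  # timestamp enables A/B diffs
    out = Path("tests/results") / f"{stamp}-{tag}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2, ensure_ascii=False))
    return out
```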

🤖 Generated with Claude Code

thesocialdev and others added 4 commits March 27, 2026 12:56
Replace hardcoded ChatOpenAI instantiations across 11 node files with
a centralized get_llm() factory (app/llm.py) that supports both OpenAI
and Anthropic (Claude) models via environment-driven configuration.

- LLM_PROVIDER env var switches between "openai" (default) and "anthropic"
- LLM_MODEL / LLM_MODEL_MINI env vars override default model names
- get_llm(mini=True) for cheap/fast tasks (scoring, classification)
- Zero behavior change when LLM_PROVIDER is unset (backward compatible)
- Add langchain-anthropic dependency to requirements.txt
- Update K8s deployment manifest and CI/CD pipeline with new env vars
- Add E2E test infrastructure: reusable test claims fixture and CLI
  runner that saves timestamped results for A/B provider comparison
- Include OpenAI baseline results (6 claims, 6/6 pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ion bias

Addresses 4 self-evaluation biases identified from the Anthropic harness
design blog where the same LLM both gathers and evaluates evidence:

1. Remove score>=7 short-circuit in evidence_evaluator — the relevance
   score was assigned by the same LLM in fetch_and_score, creating a
   self-referential loop that could skip independent evaluation.

2. Restructure deep_analyzer prompts for adversarial extraction — batch
   and merge prompts now require structured SUPPORTING/CONTRADICTING
   sections with equal attention. Add a dedicated contradiction-hunting
   LLM call that independently searches raw passages for counter-evidence.

3. Pass raw FAISS passages through to cross_checker via new state field
   raw_gazette_passages, so the classifier can cross-reference AI
   summaries against unfiltered source material.

4. Restructure cross_checker prompt to receive 3 evidence layers (raw
   passages, AI analysis, contradictory evidence) with explicit
   cross-reference instructions and confidence decomposition.

Zero graph wiring changes — all modifications are within existing node
functions and state fields. Subgraph topology is preserved.

Results: gazette claims improved from 1/2 to 2/2 pass, no online regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- test_source_selector: mock get_llm instead of removed ChatOpenAI
- test_gazette_accuracy: replace score>=7 short-circuit test with
  verification that high scores now go through LLM evaluation path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New column data_hash_lookup_sentences.content with the original
claim text for each fact-check review. Enables benchmarking the
pipeline against 154 human-reviewed claims.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>