
feat: LLM provider abstraction + gazette harness pattern #23

Open: thesocialdev wants to merge 4 commits into main from feat/llm-provider-abstraction

Conversation

@thesocialdev (Collaborator)

Summary

  • LLM provider abstraction: Centralized get_llm() factory replacing hardcoded ChatOpenAI across 11 node files. Supports OpenAI and Anthropic (Claude) via LLM_PROVIDER env var. Zero behavior change when unset (defaults to OpenAI).
  • Gazette harness pattern: Applied the Anthropic harness design pattern to break 4 self-evaluation biases in the gazette pipeline: removal of the score>=7 self-evaluation short-circuit, raw passage passthrough, adversarial contradiction extraction, and independent cross-reference verification.
  • E2E test infrastructure: Reusable test claims fixture (tests/fixtures/test_claims.json), CLI runner that saves timestamped results for A/B comparison (tests/run_e2e.py), OpenAI baseline + harness results saved.

Key changes

Phase 1: LLM Provider Abstraction

| File | Change |
| --- | --- |
| `app/llm.py` | New centralized LLM factory with `get_llm(mini=False)` |
| 11 node/plugin files | `ChatOpenAI(model=...)` → `get_llm()` / `get_llm(mini=True)` |
| `requirements.txt` | Added `langchain-anthropic` |
| `.env-example`, `deployment/app.yml`, `.github/workflows/aws.yml` | New env vars: `LLM_PROVIDER`, `LLM_MODEL`, `ANTHROPIC_API_KEY` |
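For orientation, a minimal sketch of what the factory could look like. `get_llm()`, `LLM_PROVIDER`, `LLM_MODEL`, and `LLM_MODEL_MINI` come from this PR; the fallback model names are illustrative assumptions, not the PR's actual defaults:

```python
# app/llm.py -- sketch only. get_llm(), LLM_PROVIDER, LLM_MODEL and
# LLM_MODEL_MINI come from this PR; the fallback model names below
# are illustrative assumptions.
import os

def get_llm(mini: bool = False):
    provider = os.getenv("LLM_PROVIDER", "openai").lower()
    model = os.getenv("LLM_MODEL_MINI" if mini else "LLM_MODEL")
    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic
        return ChatAnthropic(model=model or ("claude-3-5-haiku-latest" if mini else "claude-sonnet-4-5"))
    # Default path: behavior is unchanged when LLM_PROVIDER is unset.
    from langchain_openai import ChatOpenAI
    return ChatOpenAI(model=model or ("gpt-4o-mini" if mini else "gpt-4o"))
```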

Phase 3: Gazette Harness Pattern

| File | Change |
| --- | --- |
| `app/state.py` | +2 fields: `raw_gazette_passages`, `contradictory_evidence` |
| `app/nodes/gazette/evidence_evaluator.py` | Removed score>=7 self-evaluation short-circuit |
| `app/nodes/gazette/deep_analyzer.py` | Adversarial prompts, raw passage passthrough, dedicated contradiction extraction |
| `app/nodes/gazette/cross_checker.py` | 3-layer evidence input (raw + analysis + contradictions), cross-reference instructions |
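To illustrate the 3-layer input, a hedged sketch of how `cross_checker` might assemble its prompt. Only the state fields `raw_gazette_passages` and `contradictory_evidence` come from this PR; the `analysis` field name and the prompt wording are assumptions:

```python
# Sketch of the 3-layer evidence prompt in cross_checker. Only the
# state fields raw_gazette_passages and contradictory_evidence come
# from this PR; "analysis" and the wording below are assumptions.
def build_cross_check_prompt(state: dict) -> str:
    raw = "\n".join(state.get("raw_gazette_passages", []))
    contradictions = "\n".join(state.get("contradictory_evidence", []))
    analysis = state.get("analysis", "")  # hypothetical field name
    return (
        "Classify the claim using three independent evidence layers.\n"
        "Cross-reference the AI analysis against the raw passages and flag\n"
        "any summary statement the raw passages do not support.\n\n"
        f"RAW PASSAGES (unfiltered FAISS output):\n{raw}\n\n"
        f"AI ANALYSIS (may carry summarization bias):\n{analysis}\n\n"
        f"CONTRADICTORY EVIDENCE (adversarial extraction):\n{contradictions}\n"
    )
```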

Test results

  • Gazette claims: 2/2 pass (vs 1/2 baseline)
  • Online claims: 4/4 pass (no regression)

Next: Instrumentation & Research Phase

Before Phase 4 (MCP plugins), we need metrics to measure pipeline quality. The plan:

Instrumentation

  • Per-node token usage and latency tracking (sketched after this list)
  • Classification confidence extraction from cross_checker output
  • Contradiction detection quality scoring (real findings vs boilerplate)
  • Raw-vs-analyzed agreement metric (does the AI analysis faithfully represent raw passages?)
  • Cost per claim breakdown
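One way the per-node tracking could be wired in, as a sketch of planned work rather than code in this PR. LangChain's `get_openai_callback` covers token and cost accounting on the OpenAI path only; the `node_metrics` state field is an assumption:

```python
# Sketch of planned per-node instrumentation (not part of this PR).
# get_openai_callback is LangChain's OpenAI token accountant; the
# node_metrics state field is an assumption.
import time
from langchain_community.callbacks import get_openai_callback

def instrumented(node_fn):
    def wrapper(state):
        start = time.perf_counter()
        with get_openai_callback() as cb:  # counts tokens/cost for OpenAI calls
            result = node_fn(state)
        state.setdefault("node_metrics", {})[node_fn.__name__] = {
            "latency_s": round(time.perf_counter() - start, 3),
            "prompt_tokens": cb.prompt_tokens,
            "completion_tokens": cb.completion_tokens,
            "cost_usd": cb.total_cost,
        }
        return result
    return wrapper
```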

Research: Benchmark Against Human Reviews

Use reviewtasks.csv (157 human-reviewed fact-checks from AletheiaFact) as ground truth:

  • Run the pipeline against the CSV claims and compare classifications (see the sketch after this list)
  • Measure agreement rate with human reviewers (baseline: human cross-check agreement is 96%)
  • Identify systematic biases (which classification categories does the pipeline get wrong?)
  • Track per-category precision/recall
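A sketch of what the benchmark loop could look like. The claim-text column `data_hash_lookup_sentences.content` comes from this PR's CSV commit; the `classification` column name and `normalize()` are assumptions (see open question 2 below on the taxonomy mismatch):

```python
# Sketch of the planned benchmark against reviewtasks.csv. The
# "classification" column name and normalize() are assumptions;
# the claim-text column comes from this PR's CSV commit.
import csv
from collections import Counter

def normalize(label: str) -> str:
    # Hypothetical fix for the taxonomy mismatch:
    # "trustworthy-but" vs "Trustworthy, but".
    return label.lower().replace("-", " ").replace(",", "").strip()

def benchmark(run_pipeline, csv_path: str = "reviewtasks.csv") -> None:
    agree, total, misses = 0, 0, Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            claim = row["data_hash_lookup_sentences.content"]
            human = normalize(row["classification"])  # assumed column name
            predicted = normalize(run_pipeline(claim))
            total += 1
            if predicted == human:
                agree += 1
            else:
                misses[human] += 1  # per-category error tally
    print(f"agreement: {agree}/{total} ({agree / total:.1%})")
    print("misses by human label:", dict(misses))
```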

Open questions for the research phase

  1. Claim extraction: The CSV has summary and questions but no isolated claim text — do we reconstruct claims from summaries, or is there a separate claims dataset?
  2. Classification taxonomy mismatch: The CSV uses trustworthy-but (hyphenated) while the pipeline uses Trustworthy, but (comma); the labels need normalization before comparison
  3. Cross-check disagreements: 4/108 human cross-checks disagreed — should we use the original or cross-check classification as ground truth?
  4. Incomplete records: 18 rows (11%) have no report/sources/verification — exclude from benchmark or test separately?
  5. Language coverage: All 157 reviews appear to be Portuguese — do we need English/multilingual test cases?
  6. Gazette vs online: The CSV doesn't indicate which search path was used — can we infer from sources (gazette URLs vs web URLs)?
  7. Temporal validity: Some claims reference specific dates/events — should we filter out time-sensitive claims that may have different evidence today?

Test plan

  • python tests/run_e2e.py online-01 online-03 --tag harness — no regression (runner sketched below)
  • python tests/run_e2e.py gazette-01 gazette-02 --tag harness — 2/2 pass
  • Syntax validation on all modified files
  • Set LLM_PROVIDER=anthropic and verify Claude models work (blocked by API credits)
  • Run full 6-claim suite after Anthropic credits are added
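For reference, a sketch of the result-saving side of the tests/run_e2e.py runner. The fixture path and --tag flag come from this PR; the fixture schema (a dict keyed by claim id), the `run_claim` callable, and the results/ directory are assumptions:

```python
# Sketch of the result-saving side of tests/run_e2e.py. The fixture
# path and --tag flag come from this PR; the fixture schema, the
# run_claim callable, and the results/ directory are assumptions.
import json
import time
from pathlib import Path

def save_results(run_claim, claim_ids: list[str], tag: str) -> Path:
    fixture = json.loads(Path("tests/fixtures/test_claims.json").read_text())
    results = {cid: run_claim(fixture[cid]) for cid in claim_ids}
    stamp = time.strftime("%Y%m%d-%H%M%S")  # timestamp enables A/B diffs
    out = Path("tests/results") / f"{stamp}-{tag}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2, ensure_ascii=False))
    return out
```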

🤖 Generated with Claude Code

thesocialdev and others added 4 commits March 27, 2026 12:56
Replace hardcoded ChatOpenAI instantiations across 11 node files with
a centralized get_llm() factory (app/llm.py) that supports both OpenAI
and Anthropic (Claude) models via environment-driven configuration.

- LLM_PROVIDER env var switches between "openai" (default) and "anthropic"
- LLM_MODEL / LLM_MODEL_MINI env vars override default model names
- get_llm(mini=True) for cheap/fast tasks (scoring, classification)
- Zero behavior change when LLM_PROVIDER is unset (backward compatible)
- Add langchain-anthropic dependency to requirements.txt
- Update K8s deployment manifest and CI/CD pipeline with new env vars
- Add E2E test infrastructure: reusable test claims fixture and CLI
  runner that saves timestamped results for A/B provider comparison
- Include OpenAI baseline results (6 claims, 6/6 pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ion bias

Addresses 4 self-evaluation biases identified from the Anthropic harness
design blog where the same LLM both gathers and evaluates evidence:

1. Remove score>=7 short-circuit in evidence_evaluator — the relevance
   score was assigned by the same LLM in fetch_and_score, creating a
   self-referential loop that could skip independent evaluation.

2. Restructure deep_analyzer prompts for adversarial extraction — batch
   and merge prompts now require structured SUPPORTING/CONTRADICTING
   sections with equal attention. Add a dedicated contradiction-hunting
   LLM call that independently searches raw passages for counter-evidence.

3. Pass raw FAISS passages through to cross_checker via new state field
   raw_gazette_passages, so the classifier can cross-reference AI
   summaries against unfiltered source material.

4. Restructure cross_checker prompt to receive 3 evidence layers (raw
   passages, AI analysis, contradictory evidence) with explicit
   cross-reference instructions and confidence decomposition.

Zero graph wiring changes — all modifications are within existing node
functions and state fields. Subgraph topology is preserved.

Results: gazette claims improved from 1/2 to 2/2 pass, no online regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- test_source_selector: mock get_llm instead of removed ChatOpenAI
- test_gazette_accuracy: replace score>=7 short-circuit test with
  verification that high scores now go through LLM evaluation path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New column data_hash_lookup_sentences.content with the original
claim text for each fact-check review. Enables benchmarking the
pipeline against 154 human-reviewed claims.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>