feat: LLM provider abstraction + gazette harness pattern #23

Open · thesocialdev wants to merge 4 commits into main
Conversation
Replace hardcoded ChatOpenAI instantiations across 11 node files with a centralized get_llm() factory (app/llm.py) that supports both OpenAI and Anthropic (Claude) models via environment-driven configuration.

- LLM_PROVIDER env var switches between "openai" (default) and "anthropic"
- LLM_MODEL / LLM_MODEL_MINI env vars override default model names
- get_llm(mini=True) for cheap/fast tasks (scoring, classification)
- Zero behavior change when LLM_PROVIDER is unset (backward compatible)
- Add langchain-anthropic dependency to requirements.txt
- Update K8s deployment manifest and CI/CD pipeline with new env vars
- Add E2E test infrastructure: reusable test claims fixture and CLI runner that saves timestamped results for A/B provider comparison
- Include OpenAI baseline results (6 claims, 6/6 pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
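A minimal sketch of what such a factory could look like, assuming the langchain-openai and langchain-anthropic chat model classes; the default model names below are illustrative placeholders, not the values used in this PR:

```python
# app/llm.py (sketch): provider-switching factory. Default model names are
# placeholders for illustration, not this PR's actual defaults.
import os

from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI


def get_llm(mini: bool = False):
    """Return a chat model chosen by LLM_PROVIDER (defaults to OpenAI).

    LLM_MODEL / LLM_MODEL_MINI override the per-provider defaults;
    mini=True selects the cheap/fast tier for scoring and classification.
    """
    provider = os.getenv("LLM_PROVIDER", "openai")
    env_var = "LLM_MODEL_MINI" if mini else "LLM_MODEL"
    if provider == "anthropic":
        default = "claude-3-5-haiku-latest" if mini else "claude-sonnet-4-20250514"
        return ChatAnthropic(model=os.getenv(env_var, default))
    default = "gpt-4o-mini" if mini else "gpt-4o"
    return ChatOpenAI(model=os.getenv(env_var, default))
```

With this shape, call sites need only `get_llm()` or `get_llm(mini=True)`, and switching providers becomes a deployment-time config change.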
…ion bias

Addresses 4 self-evaluation biases, identified from the Anthropic harness design blog, that arise when the same LLM both gathers and evaluates evidence:

1. Remove the score>=7 short-circuit in evidence_evaluator — the relevance score was assigned by the same LLM in fetch_and_score, creating a self-referential loop that could skip independent evaluation.
2. Restructure deep_analyzer prompts for adversarial extraction — batch and merge prompts now require structured SUPPORTING/CONTRADICTING sections with equal attention. Add a dedicated contradiction-hunting LLM call that independently searches raw passages for counter-evidence.
3. Pass raw FAISS passages through to cross_checker via a new state field, raw_gazette_passages, so the classifier can cross-reference AI summaries against unfiltered source material.
4. Restructure the cross_checker prompt to receive 3 evidence layers (raw passages, AI analysis, contradictory evidence) with explicit cross-reference instructions and confidence decomposition.

Zero graph wiring changes — all modifications are within existing node functions and state fields. Subgraph topology is preserved.

Results: gazette claims improved from 1/2 to 2/2 pass, no online regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
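As an illustration of point 2, the dedicated contradiction-hunting call might look like the sketch below; the function name and prompt wording are hypothetical, not taken from this PR:

```python
# Sketch of a dedicated contradiction-hunting pass (point 2 above).
# hunt_contradictions() and the prompt text are hypothetical.
from app.llm import get_llm

CONTRADICTION_PROMPT = """You are auditing a fact-check. Search the raw passages
below ONLY for evidence that contradicts the claim; ignore supporting evidence.

Claim: {claim}

Passages:
{passages}

Quote each contradicting passage verbatim, or answer NONE."""


def hunt_contradictions(claim: str, raw_passages: list[str]) -> str:
    llm = get_llm()  # full model: contradiction search is an evaluation task, not a mini task
    prompt = CONTRADICTION_PROMPT.format(
        claim=claim, passages="\n---\n".join(raw_passages)
    )
    return llm.invoke(prompt).content  # stored as contradictory_evidence in state
```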
- test_source_selector: mock get_llm instead of the removed ChatOpenAI
- test_gazette_accuracy: replace the score>=7 short-circuit test with verification that high scores now go through the LLM evaluation path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
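The updated mocking pattern presumably patches the factory at the node's import site; a sketch, with the module path and node function name assumed from the file names in this PR:

```python
# Sketch of patching get_llm where the node imports it. The module path and
# select_source() are assumptions based on file names in this PR.
from unittest.mock import MagicMock, patch


def test_source_selector_uses_llm_factory():
    fake_llm = MagicMock()
    fake_llm.invoke.return_value.content = "gazette"
    with patch("app.nodes.source_selector.get_llm", return_value=fake_llm):
        from app.nodes.source_selector import select_source  # hypothetical name
        select_source({"claim": "example claim"})
    fake_llm.invoke.assert_called_once()
```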
Add a new column, data_hash_lookup_sentences.content, holding the original claim text for each fact-check review. This enables benchmarking the pipeline against 154 human-reviewed claims.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
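If the schema is managed with Alembic (an assumption; the PR does not say how the column is applied), the migration could be as small as:

```python
# Hypothetical Alembic migration adding the content column; assumes
# SQLAlchemy/Alembic, which this PR does not confirm.
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    op.add_column(
        "data_hash_lookup_sentences",
        sa.Column("content", sa.Text(), nullable=True),  # original claim text
    )


def downgrade() -> None:
    op.drop_column("data_hash_lookup_sentences", "content")
```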
Summary
- `get_llm()` factory replacing hardcoded `ChatOpenAI` across 11 node files. Supports OpenAI and Anthropic (Claude) via the `LLM_PROVIDER` env var. Zero behavior change when unset (defaults to OpenAI).
- E2E test infrastructure: reusable test claims fixture (`tests/fixtures/test_claims.json`), CLI runner that saves timestamped results for A/B comparison (`tests/run_e2e.py`), OpenAI baseline + harness results saved.

Key changes
Phase 1: LLM Provider Abstraction
- `app/llm.py`: new `get_llm(mini=False)` factory
- 11 node files: `ChatOpenAI(model=...)` → `get_llm()` / `get_llm(mini=True)` (see the sketch after this list)
- `requirements.txt`: add `langchain-anthropic`
- `.env-example`, `deployment/app.yml`, `.github/workflows/aws.yml`: add `LLM_PROVIDER`, `LLM_MODEL`, `ANTHROPIC_API_KEY`
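The per-node change is mechanical; an illustrative before/after, where the model name in the old call is a placeholder:

```python
# Before (hardcoded, placeholder model name):
#   from langchain_openai import ChatOpenAI
#   llm = ChatOpenAI(model="gpt-4o-mini")

# After:
from app.llm import get_llm

llm = get_llm(mini=True)  # cheap/fast tier for scoring and classification
```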
Phase 3: Gazette Harness Pattern

- `app/state.py`: new state fields `raw_gazette_passages`, `contradictory_evidence` (see the sketch after this list)
- `app/nodes/gazette/evidence_evaluator.py`: remove the score>=7 short-circuit
- `app/nodes/gazette/deep_analyzer.py`: adversarial SUPPORTING/CONTRADICTING extraction plus a dedicated contradiction-hunting call
- `app/nodes/gazette/cross_checker.py`: prompt restructured around 3 evidence layers
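The two new fields might be declared as below, assuming a TypedDict-style LangGraph state in `app/state.py`; the exact types are assumptions:

```python
# Sketch of the new fields on the graph state; existing fields elided.
from typing import TypedDict


class GraphState(TypedDict, total=False):
    # ... existing fields ...
    raw_gazette_passages: list[str]  # unfiltered FAISS passages for cross_checker
    contradictory_evidence: str      # output of the contradiction-hunting call
```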
Test results

- Gazette claims: 2/2 pass (vs 1/2 baseline)
- Online claims: 4/4 pass (no regression)
Next: Instrumentation & Research Phase
Before Phase 4 (MCP plugins), we need metrics to measure pipeline quality. The plan:
Instrumentation
Research: Benchmark Against Human Reviews
Use `reviewtasks.csv` (157 human-reviewed fact-checks from AletheiaFact) as ground truth.
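A benchmarking loop over the CSV could look like the sketch below; `summary` and `questions` are columns named in this PR, while the label column name and the `run_pipeline()` entry point are assumptions:

```python
# Sketch of the benchmark loop. run_pipeline() and the "classification"
# column are hypothetical; normalize_label is sketched in the open-questions
# section below. Using summary as the claim is itself an open question.
import pandas as pd

reviews = pd.read_csv("reviewtasks.csv")
matches = 0
for _, row in reviews.iterrows():
    predicted = run_pipeline(row["summary"])  # hypothetical pipeline entry point
    if normalize_label(predicted) == normalize_label(row["classification"]):
        matches += 1
print(f"agreement: {matches}/{len(reviews)}")
```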
Open questions for the research phase

- The CSV has `summary` and `questions` columns but no isolated claim text — do we reconstruct claims from summaries, or is there a separate claims dataset?
- Label mismatch: human reviews use `trustworthy-but` (hyphenated), the pipeline uses `Trustworthy, but` (comma) — need normalization (see the sketch after this list).
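One way to normalize both label styles into a single canonical form (a sketch; the canonical form is a choice made here, not the project's):

```python
import re


def normalize_label(label: str) -> str:
    """Map variants like 'Trustworthy, but' and 'trustworthy-but' to one
    canonical form: lowercase tokens joined by hyphens."""
    return "-".join(re.findall(r"[a-z]+", label.lower()))


assert normalize_label("Trustworthy, but") == normalize_label("trustworthy-but")
```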
Test plan

- `python tests/run_e2e.py online-01 online-03 --tag harness` — no regression
- `python tests/run_e2e.py gazette-01 gazette-02 --tag harness` — 2/2 pass
- Run with `LLM_PROVIDER=anthropic` and verify Claude models work (blocked by API credits)

🤖 Generated with Claude Code