feat: benchmark + streamlined search flow (#307) #310
Draft
Conversation
notes.md records the investigation findings (current-flow timing: 452s for review-04, plus hypotheses for the slowness). tasks.md tracks the remaining benchmark and flow-refactor work. scenarios-all-30.json is the 30-scenario source used to seed tools/benchmark/.
Adds the benchmark harness skeleton: run.py driver, current/new flow prompts, QA scenario set, and a gitignore for per-run results. Incomplete: single-scenario runs exceed 7 minutes; the next session will profile and trim overhead before running the 30-scenario baseline.
… judging
… by expert review
… stdin prompt handling
Prompt Engineer expert design for the end-to-end redesign of nabledge-6 knowledge search: replaces the BM25 keyword flow with AI-1 facet extraction + mechanical filter + AI-2 section selection.
Replace the Round 2 index-compact plan with the faceted-search pivot. Record user decisions on model comparison (Haiku vs Sonnet), rollout order (v6 pilot, then lock-step), the LLM judge's 4-level scheme, hint-less AI-2, and stream-json logging.
…k-propagation Reason: the facet-design spec is desk-designed, not validated. Starting with a mapping expansion of 295 entries before any AI-1 output is collected reverses the natural order. Reorder so the AI-1 prompt alone runs on 5 scenarios first; use that output to decide whether processing_patterns back-propagation is actually needed.
Drop the processing_patterns axis: simulation showed it is redundant with (type, category) on all 5 benchmark scenarios. Incorporate review fixes: an explicit out_of_scope empty-array rule, a pattern-category dual-role rule, and a non-misleading uncertain example.
- stream-json output captured per scenario for post-hoc audit
- per-axis Jaccard for type and category, plus coverage exact match (see the sketch below)
- markdown summary with got→want diff per scenario
- --model accepts aliases (sonnet / haiku / opus)
- scenarios gain expected_facets derived from expected_sections paths
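For reference, a minimal sketch of the per-axis scoring, assuming extracted facets arrive as lists of strings per axis plus a single coverage value; names and shapes here are illustrative, not the actual run.py code:

```python
def jaccard(got: set[str], want: set[str]) -> float:
    """Jaccard similarity; defined as 1.0 when both sets are empty."""
    if not got and not want:
        return 1.0
    return len(got & want) / len(got | want)

def score_scenario(got: dict, want: dict) -> dict:
    # One Jaccard per facet axis; coverage is a single value, so exact match.
    return {
        "type_jaccard": jaccard(set(got.get("type", [])), set(want.get("type", []))),
        "category_jaccard": jaccard(set(got.get("category", [])), set(want.get("category", []))),
        "coverage_match": got.get("coverage") == want.get("coverage"),
    }
```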
5 scenarios × Haiku + Sonnet. Sonnet selected (+0.05 overall). One recall miss on req-02 identified; the Prompt Engineer review recommends one Round 3 edit before Stage 2.
…307) Three prompt edits per the Prompt Engineer's post-Round-2 review:
- disambiguate the `check` axis (NOT runtime authorization)
- extend the processing-pattern rule to UI/runtime modifiers (画面 (screen) / REST / バッチ (batch) / メッセージ (message))
- add a differently-worded authorization example
Scenario fixes: review-01 and req-09 expected_facets realigned with the Round 3 selection rules. Sonnet now reaches Jaccard=1.0 on all 5 scenarios.
- type×category AND filter with a fallback ladder (drop-category, drop-type, all, empty); sketched below
- scenario-level tests verify every expected file in the 5-scenario benchmark is reachable via the Stage 1-emitted facets
- tests assert all 295 index.toon rows parse cleanly
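A hedged sketch of the fallback ladder, assuming each index row carries flat "type" and "category" fields; function and field names are illustrative stand-ins, not the actual filter code:

```python
def filter_candidates(rows: list[dict], types: set[str], categories: set[str]):
    """type×category AND filter, relaxing rung by rung until something matches."""
    ladder = [
        ("and", True, True),              # strict: type AND category
        ("drop-category", True, False),   # relax: type only
        ("drop-type", False, True),       # relax: category only
        ("all", False, False),            # no filter: every row
    ]
    for rung, use_type, use_cat in ladder:
        hits = [
            r for r in rows
            if (not use_type or r["type"] in types)
            and (not use_cat or r["category"] in categories)
        ]
        if hits:
            return rung, hits
    return "empty", []                    # only reachable when rows is empty
```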
4-level rubric (miss / insufficient / partial / full). Reviewer fixes incorporated: a stronger not-built-in directive, a recall-only level-3 anchor, an explicit title+path-only judging principle, and a tie-break toward 3.
Stage 2 runs Stage 1 → mechanical type×category filter → LLM judge, with the fallback ladder captured in filter_result.json and the raw judge stream-json persisted per scenario. The summary writes the judge level distribution and the mean candidate count.
The Claude CLI returns exit code 1 with subtype=error_max_turns when the 2-turn budget is hit after the model has already emitted StructuredOutput successfully (schema revalidation on the next turn can trigger this). Parse assistant tool_use blocks as a fallback source of the structured output, and only error out when nothing was captured. Without this, 3/5 Stage 2 scenarios reported judge=None despite valid level-3 outputs in the stream log. A sketch of the fallback follows.
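A minimal sketch of that fallback, assuming one JSON event per stream-json line; the event field names follow the shapes we observed in the logs and should be treated as assumptions, not a documented schema:

```python
import json

def extract_structured_output(stream_path: str):
    """Prefer the final result payload; fall back to the last assistant tool_use input."""
    result = None
    tool_use_input = None
    with open(stream_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "assistant":
                # Remember the most recent tool_use input as a fallback source.
                for block in event.get("message", {}).get("content", []):
                    if block.get("type") == "tool_use":
                        tool_use_input = block.get("input")
            elif event.get("type") == "result":
                result = event.get("structured_output") or result
    # error_max_turns can fire after a valid StructuredOutput was emitted,
    # so only give up when neither source produced anything.
    return result if result is not None else tool_use_input
```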
Stage 2 Round 1 (5 scenarios, Sonnet): all judge=3, fallback=none,
mean cost $0.08, wall 27.7s. Prompt Engineer review rated 4/5; applied
all High/Medium/Low fixes to judge_stage2.md:
- Defined "primary file" explicitly
- Required reason format: "Expected primaries: [...]. Present: [...]. {verdict}."
- Added title-fit bar against surface-keyword matches
- Reason language must match question language
- Inlined not-built-in worked example
Re-run after prompt edits: still 5/5 at level 3. No regression.
Pipeline: AI-1 facet extract → facet filter → AI-2 section select (title+path+section-titles, no hints, ≤10 selectors) → read_sections → AI-3 final answer (grounded, cited) → LLM judge (4-level); the stage wiring is sketched below. Prompts pre-reviewed by the Prompt Engineer subagent; applied fixes:
- section_select: whitelist grounding, target 3-6 selectors, worked examples for the batch and not-built-in cases
- stage3_answer: citation whitelist, cited-consistency rule, JP/EN language routing, synthesis grounding rule
- judge_stage3: material-detail definition, anti-verbosity check, Expected-core anchoring before verdict, not-built-in worked example
Round 1 (5 scenarios, Sonnet): all judge=3, mean cost $0.557, wall 81.5s.
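Illustrative wiring of the pipeline above. Every stage is injected as a callable because the real prompt-backed implementations live in tools/benchmark/; all names here are hypothetical placeholders, shown only to make the data flow concrete:

```python
from typing import Callable

def run_scenario(
    question: str,
    index_rows: list[dict],
    ai1_extract_facets: Callable[[str], dict],
    filter_candidates: Callable[[list[dict], set, set], tuple[str, list[dict]]],
    ai2_select_sections: Callable[[str, list[dict]], list[str]],
    read_sections: Callable[[list[str]], str],
    ai3_answer: Callable[[str, str], str],
    judge_stage3: Callable[[str, str], int],
) -> dict:
    facets = ai1_extract_facets(question)                   # AI-1: facet extraction
    rung, candidates = filter_candidates(                   # mechanical filter + ladder
        index_rows, set(facets["type"]), set(facets["category"]))
    selectors = ai2_select_sections(question, candidates)   # AI-2: ≤10 selectors
    sections = read_sections(selectors)                     # fetch section bodies
    answer = ai3_answer(question, sections)                 # AI-3: grounded, cited
    level = judge_stage3(question, answer)                  # judge: 0 (miss) .. 3 (full)
    return {"fallback": rung, "answer": answer, "judge": level}
```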
5/5 scenarios at judge=3 (full). Applied High fixes:
- judge: anti-verbosity Expected-core mapping check
- AI-2: tightened the soft cap to 3-6 selectors; ≥6 from one file is a red flag
- AI-3: synthesis grounding rule (cite all contributing selectors)
New: qa-v6-sample15.json (5 pilot + 10 new, balanced across review/impact/req). AI-2 mean picks dropped 8.6→5.1 after the PE-reviewed 3-6 target rule. No anomalies; proceeding to the 30-scenario baseline.
New-flow (faceted search) baseline: 30/30 judge=3, fallback=none, mean cost $0.528/scenario, mean wall 82.7s (sequential). Baseline saved to tools/benchmark/baseline/20260422-stage3-sonnet/. Notable: review-10, with only 2 candidates and 1 section pick, still achieves judge=3, confirming facet precision on narrow cases.
Remove a_facts that are out of scope or describe internal behavior rather than answering the question directly. Applies to 16 scenarios.
SE H-1: add TestComputeLevel unit tests (10 cases) for scoring engine
SE H-2: fix a _normalize false positive: apply _NORM_MD to the body only,
not to the LLM-supplied quote (so stripping __ no longer causes wrong matches)
SE M-3/4: test helper mkdir(exist_ok=True), add docstring
SE L-5: align a_facts.fact maxLength 200→300 between judge.py and prompt
SE L-6: remove unused REPO_ROOT variable in test_llm_tools.py
PE H-1: Step 4 retry now covers quote-not-in-file case (try other files)
PE H-2: Bash command in Step 4 uses explicit double-quoting for all args
PE M-3: clarify Step 4 Grep budget is separate from Step 3 cap
PE M-4: add COVERED vs PARTIAL decision rule to Step 1
PE M-5: Step 2 B-claims marked provisional with reclassification path
PE L-6: add mixed-language fallback (default Japanese) to reasoning rule
… --variant ids → next)
…nced ids
- tasks.md: summarize the Step 7-B-1 through 7-B-4 analysis results; add judge-fix and human-review tasks
- README: document the evaluation workflow (human_review.json format, 4-step procedure)
…override)
…w-07, impact-01) Three fixes identified by horizontal code review:
1. verify_kb_evidence.py: add _NORM_BACKTICK to strip inline-code backticks from body and quote independently (sketched below).
   - Body: full normalization (MD markers + backticks + whitespace)
   - Quote: backticks + whitespace only (preserves Python dunders like __init__)
   - Add a stdin interface (argv[4] == "-") to avoid shell expansion entirely
2. prompts/judge.md: change the Step 4 verify command to use a single-quoted heredoc (<<'QUOTE_END'); double-quoted args expand backticks and $ in bash before the script receives them (root cause of the impact-01 timeout).
3. bench/search_next.py: add _NORM_BACKTICK_RE to _normalize_for_match(); the same normalization gap existed in the verify_read_notes evidence check.
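A sketch of the asymmetric normalization in fix 1, under the assumption that _NORM_MD targets markdown emphasis markers while leaving identifier dunders alone; the patterns below are simplified stand-ins, not the actual regexes:

```python
import re

_NORM_MD = re.compile(r"\*\*|\*|~~")   # emphasis markers (simplified: the real
                                       # pattern also covers __emphasis__ without
                                       # touching dunders inside identifiers)
_NORM_BACKTICK = re.compile(r"`")      # inline-code backticks
_NORM_WS = re.compile(r"\s+")

def normalize_body(text: str) -> str:
    """Full normalization for the KB body: MD markers + backticks + whitespace."""
    text = _NORM_MD.sub("", text)
    text = _NORM_BACKTICK.sub("", text)
    return _NORM_WS.sub(" ", text).strip()

def normalize_quote(text: str) -> str:
    """LLM-supplied quotes keep MD-looking characters such as __init__;
    only backticks and whitespace are normalized."""
    text = _NORM_BACKTICK.sub("", text)
    return _NORM_WS.sub(" ", text).strip()

def quote_found(quote: str, body: str) -> bool:
    # A quoted `__init__` in the body survives both sides: only its backticks
    # are stripped, so the dunder still matches literally.
    return normalize_quote(quote) in normalize_body(body)
```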
"リソースクラスに JAX-RS アノテーションを付与する" は質問スコープ(構成パターン)外の 実装手順。a_facts に混入していたため誤 MISSING 判定が発生していた。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #307
Approach
Ahead of the nabledge search-flow rework, we are first building a 30-scenario QA benchmark. The old flow (AI keyword extraction → BM25 → AI judging) was dropped because its obvious improvement lever, rewording queries toward the documents' own vocabulary, makes the evaluation a foregone conclusion. Instead, the design pivots to faceted search, on an infrastructure that can measure each stage independently: Stage 1 (facet extraction) → Stage 2 (mechanical filter) → Stage 3 (section selection + answer).
The initial implementation through Stage 3 is done and the 30-scenario baseline (judge=3 on 30/30) has been captured, but we found a problem: the LLM judge had been scoring by self-estimation, with no ground truth. We are now writing model answers (ground truth) and switching the judge to score against them.
This PR is a Draft. It is published early mainly so the model answers themselves can be reviewed.
Progress
Tasks
See tasks.md.
Expert Review
Completed Prompt Engineer reviews (model-answer related):
The model-answer review (3 rounds) was run and its fixes applied via the work log (individual review files were not saved).
Success Criteria Check
tools/benchmark/: Stage 1/2/3 pipeline implemented, 30-scenario baseline captured.