feat: benchmark + streamlined search flow (#307) #310
Draft
Conversation
notes.md records the investigation findings (current-flow timing: 452s for review-04, plus hypotheses for the slowness). tasks.md tracks the remaining benchmark and flow-refactor work. scenarios-all-30.json is the 30-scenario source used to seed tools/benchmark/.
Adds the benchmark harness skeleton: run.py driver, current/new flow prompts, QA scenario set, and a gitignore for per-run results. Incomplete: single-scenario runs exceed 7 minutes; the next session will profile and trim overhead before running the 30-scenario baseline.
… judging
… by expert review
… stdin prompt handling
Prompt Engineer expert design for the end-to-end redesign of nabledge-6 knowledge search: replaces the BM25 keyword flow with AI-1 facet extraction + mechanical filter + AI-2 section selection.
Replace the Round 2 index-compact plan with the faceted-search pivot. Record user decisions on model comparison (Haiku vs Sonnet), rollout order (v6 pilot, then lock-step), the LLM judge's 4-level scheme, hint-less AI-2, and stream-json logging.
…k-propagation Reason: the facet-design spec is desk-designed, not validated. Starting with a mapping expansion of 295 entries before any AI-1 output is collected reverses the natural order. Reorder so the AI-1 prompt alone runs on 5 scenarios first; use that output to decide whether processing_patterns back-propagation is actually needed.
Drop the processing_patterns axis: simulation showed it is redundant with (type, category) on all 5 benchmark scenarios. Incorporate review fixes: an explicit out_of_scope empty-array rule, a pattern-category dual-role rule, and a non-misleading uncertain example.
- stream-json output captured per scenario for post-hoc audit
- per-axis Jaccard for type and category, plus coverage exact match (see the sketch below)
- markdown summary with got→want diff per scenario
- --model accepts aliases (sonnet / haiku / opus)
- scenarios gain expected_facets derived from expected_sections paths
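For reference, a minimal sketch of the per-axis scoring, assuming extracted facets arrive as lists of strings per axis plus a single coverage value; names and shapes here are illustrative, not the actual run.py code:

```python
def jaccard(got: set[str], want: set[str]) -> float:
    """Jaccard similarity; defined as 1.0 when both sets are empty."""
    if not got and not want:
        return 1.0
    return len(got & want) / len(got | want)

def score_scenario(got: dict, want: dict) -> dict:
    # One Jaccard per facet axis; coverage is a single value, so exact match.
    return {
        "type_jaccard": jaccard(set(got.get("type", [])), set(want.get("type", []))),
        "category_jaccard": jaccard(set(got.get("category", [])), set(want.get("category", []))),
        "coverage_match": got.get("coverage") == want.get("coverage"),
    }
```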
5 scenarios × Haiku + Sonnet. Sonnet selected (+0.05 overall). One recall miss on req-02 identified; the Prompt Engineer review recommends one Round 3 edit before Stage 2.
…307) Three prompt edits per the Prompt Engineer's post-Round-2 review:
- disambiguate the `check` axis (NOT runtime authorization)
- extend the processing-pattern rule to UI/runtime modifiers (画面 (screen) / REST / バッチ (batch) / メッセージ (message))
- add a differently-worded authorization example
Scenario fixes: review-01 and req-09 expected_facets realigned with the Round 3 selection rules. Sonnet now reaches Jaccard=1.0 on all 5 scenarios.
- type×category AND filter with a fallback ladder (drop-category, drop-type, all, empty); sketched below
- scenario-level tests verify every expected file in the 5-scenario benchmark is reachable via the Stage 1-emitted facets
- tests assert all 295 index.toon rows parse cleanly
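A hedged sketch of the fallback ladder, assuming each index row carries flat "type" and "category" fields; function and field names are illustrative stand-ins, not the actual filter code:

```python
def filter_candidates(rows: list[dict], types: set[str], categories: set[str]):
    """type×category AND filter, relaxing rung by rung until something matches."""
    ladder = [
        ("and", True, True),              # strict: type AND category
        ("drop-category", True, False),   # relax: type only
        ("drop-type", False, True),       # relax: category only
        ("all", False, False),            # no filter: every row
    ]
    for rung, use_type, use_cat in ladder:
        hits = [
            r for r in rows
            if (not use_type or r["type"] in types)
            and (not use_cat or r["category"] in categories)
        ]
        if hits:
            return rung, hits
    return "empty", []                    # only reachable when rows is empty
```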
4-level rubric (miss / insufficient / partial / full). Reviewer fixes incorporated: a stronger not-built-in directive, a recall-only level-3 anchor, an explicit title+path-only judging principle, and a tie-break toward 3.
Stage 2 runs Stage 1 → mechanical type×category filter → LLM judge, with the fallback ladder captured in filter_result.json and the raw judge stream-json persisted per scenario. The summary writes the judge level distribution and the mean candidate count.
The Claude CLI returns exit code 1 with subtype=error_max_turns when the 2-turn budget is hit after the model has already emitted StructuredOutput successfully (schema revalidation on the next turn can trigger this). Parse assistant tool_use blocks as a fallback source of the structured output, and only error out when nothing was captured. Without this, 3/5 Stage 2 scenarios reported judge=None despite valid level-3 outputs in the stream log. A sketch of the fallback follows.
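A minimal sketch of that fallback, assuming one JSON event per stream-json line; the event field names follow the shapes we observed in the logs and should be treated as assumptions, not a documented schema:

```python
import json

def extract_structured_output(stream_path: str):
    """Prefer the final result payload; fall back to the last assistant tool_use input."""
    result = None
    tool_use_input = None
    with open(stream_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "assistant":
                # Remember the most recent tool_use input as a fallback source.
                for block in event.get("message", {}).get("content", []):
                    if block.get("type") == "tool_use":
                        tool_use_input = block.get("input")
            elif event.get("type") == "result":
                result = event.get("structured_output") or result
    # error_max_turns can fire after a valid StructuredOutput was emitted,
    # so only give up when neither source produced anything.
    return result if result is not None else tool_use_input
```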
Stage 2 Round 1 (5 scenarios, Sonnet): all judge=3, fallback=none,
mean cost $0.08, wall 27.7s. Prompt Engineer review rated 4/5; applied
all High/Medium/Low fixes to judge_stage2.md:
- Defined "primary file" explicitly
- Required reason format: "Expected primaries: [...]. Present: [...]. {verdict}."
- Added title-fit bar against surface-keyword matches
- Reason language must match question language
- Inlined not-built-in worked example
Re-run after prompt edits: still 5/5 at level 3. No regression.
Pipeline: AI-1 facet extract → facet filter → AI-2 section select (title+path+section-titles, no hints, ≤10 selectors) → read_sections → AI-3 final answer (grounded, cited) → LLM judge (4-level); the stage wiring is sketched below. Prompts pre-reviewed by the Prompt Engineer subagent; applied fixes:
- section_select: whitelist grounding, target 3-6 selectors, worked examples for the batch and not-built-in cases
- stage3_answer: citation whitelist, cited-consistency rule, JP/EN language routing, synthesis grounding rule
- judge_stage3: material-detail definition, anti-verbosity check, Expected-core anchoring before verdict, not-built-in worked example
Round 1 (5 scenarios, Sonnet): all judge=3, mean cost $0.557, wall 81.5s.
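Illustrative wiring of the pipeline above. Every stage is injected as a callable because the real prompt-backed implementations live in tools/benchmark/; all names here are hypothetical placeholders, shown only to make the data flow concrete:

```python
from typing import Callable

def run_scenario(
    question: str,
    index_rows: list[dict],
    ai1_extract_facets: Callable[[str], dict],
    filter_candidates: Callable[[list[dict], set, set], tuple[str, list[dict]]],
    ai2_select_sections: Callable[[str, list[dict]], list[str]],
    read_sections: Callable[[list[str]], str],
    ai3_answer: Callable[[str, str], str],
    judge_stage3: Callable[[str, str], int],
) -> dict:
    facets = ai1_extract_facets(question)                   # AI-1: facet extraction
    rung, candidates = filter_candidates(                   # mechanical filter + ladder
        index_rows, set(facets["type"]), set(facets["category"]))
    selectors = ai2_select_sections(question, candidates)   # AI-2: ≤10 selectors
    sections = read_sections(selectors)                     # fetch section bodies
    answer = ai3_answer(question, sections)                 # AI-3: grounded, cited
    level = judge_stage3(question, answer)                  # judge: 0 (miss) .. 3 (full)
    return {"fallback": rung, "answer": answer, "judge": level}
```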
5/5 scenarios at judge=3 (full). Applied High fixes:
- judge: anti-verbosity Expected-core mapping check
- AI-2: tightened the soft cap to 3-6 selectors; ≥6 from one file is a red flag
- AI-3: synthesis grounding rule (cite all contributing selectors)
New: qa-v6-sample15.json (5 pilot + 10 new, balanced across review/impact/req). AI-2 mean picks dropped 8.6→5.1 after the PE-reviewed 3-6 target rule. No anomalies; proceeding to the 30-scenario baseline.
New-flow (faceted search) baseline: 30/30 judge=3, fallback=none, mean cost $0.528/scenario, mean wall 82.7s (sequential). Baseline saved to tools/benchmark/baseline/20260422-stage3-sonnet/. Notable: review-10, with only 2 candidates and 1 section pick, still achieves judge=3, confirming facet precision on narrow cases.
Remove a_facts that are out of scope or describe internal behavior rather than answering the question directly. Applies to 16 scenarios.
SE H-1: add TestComputeLevel unit tests (10 cases) for scoring engine
SE H-2: fix a _normalize false positive: apply _NORM_MD to the body only,
not to the LLM-supplied quote (so stripping __ no longer causes wrong matches)
SE M-3/4: test helper mkdir(exist_ok=True), add docstring
SE L-5: align a_facts.fact maxLength 200→300 between judge.py and prompt
SE L-6: remove unused REPO_ROOT variable in test_llm_tools.py
PE H-1: Step 4 retry now covers quote-not-in-file case (try other files)
PE H-2: Bash command in Step 4 uses explicit double-quoting for all args
PE M-3: clarify Step 4 Grep budget is separate from Step 3 cap
PE M-4: add COVERED vs PARTIAL decision rule to Step 1
PE M-5: Step 2 B-claims marked provisional with reclassification path
PE L-6: add mixed-language fallback (default Japanese) to reasoning rule
… --variant ids → next)
…nced ids
- tasks.md: summarize the Step 7-B-1 through 7-B-4 analysis results; add judge-fix and human-review tasks
- README: document the evaluation workflow (human_review.json format, 4-step procedure)
…override)
…w-07, impact-01) Three fixes identified by horizontal code review:
1. verify_kb_evidence.py: add _NORM_BACKTICK to strip inline-code backticks from body and quote independently (sketched below).
   - Body: full normalization (MD markers + backticks + whitespace)
   - Quote: backticks + whitespace only (preserves Python dunders like __init__)
   - Add a stdin interface (argv[4] == "-") to avoid shell expansion entirely
2. prompts/judge.md: change the Step 4 verify command to use a single-quoted heredoc (<<'QUOTE_END'); double-quoted args expand backticks and $ in bash before the script receives them (root cause of the impact-01 timeout).
3. bench/search_next.py: add _NORM_BACKTICK_RE to _normalize_for_match(); the same normalization gap existed in the verify_read_notes evidence check.
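A sketch of the asymmetric normalization in fix 1, under the assumption that _NORM_MD targets markdown emphasis markers while leaving identifier dunders alone; the patterns below are simplified stand-ins, not the actual regexes:

```python
import re

_NORM_MD = re.compile(r"\*\*|\*|~~")   # emphasis markers (simplified: the real
                                       # pattern also covers __emphasis__ without
                                       # touching dunders inside identifiers)
_NORM_BACKTICK = re.compile(r"`")      # inline-code backticks
_NORM_WS = re.compile(r"\s+")

def normalize_body(text: str) -> str:
    """Full normalization for the KB body: MD markers + backticks + whitespace."""
    text = _NORM_MD.sub("", text)
    text = _NORM_BACKTICK.sub("", text)
    return _NORM_WS.sub(" ", text).strip()

def normalize_quote(text: str) -> str:
    """LLM-supplied quotes keep MD-looking characters such as __init__;
    only backticks and whitespace are normalized."""
    text = _NORM_BACKTICK.sub("", text)
    return _NORM_WS.sub(" ", text).strip()

def quote_found(quote: str, body: str) -> bool:
    # A quoted `__init__` in the body survives both sides: only its backticks
    # are stripped, so the dunder still matches literally.
    return normalize_quote(quote) in normalize_body(body)
```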
"リソースクラスに JAX-RS アノテーションを付与する" は質問スコープ(構成パターン)外の 実装手順。a_facts に混入していたため誤 MISSING 判定が発生していた。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #307
Approach
Ahead of the nabledge search-flow rework, we are first building a 30-scenario QA benchmark. The old flow (AI keyword extraction → BM25 → AI judging) was dropped because its obvious improvement lever, rewording queries toward the documents' own vocabulary, makes the evaluation a foregone conclusion. Instead, the design pivots to faceted search, on an infrastructure that can measure each stage independently: Stage 1 (facet extraction) → Stage 2 (mechanical filter) → Stage 3 (section selection + answer).
The initial implementation through Stage 3 is done and the 30-scenario baseline (judge=3 on 30/30) has been captured, but we found a problem: the LLM judge had been scoring by self-estimation, with no ground truth. We are now writing model answers (ground truth) and switching the judge to score against them.
This PR is a Draft. It is published early mainly so the model answers themselves can be reviewed.
Progress
Tasks
See tasks.md.
Expert Review
Completed Prompt Engineer reviews (model-answer related):
The model-answer review (3 rounds) was run and its fixes applied via the work log (individual review files were not saved).
Success Criteria Check
tools/benchmark/: Stage 1/2/3 pipeline implemented, 30-scenario baseline captured.