
feat: benchmark + streamlined search flow (#307) #310

Draft
kiyotis wants to merge 172 commits into main from 307-benchmark-search-flow

Conversation

@kiyotis (Contributor) commented Apr 22, 2026

Closes #307

Approach

Ahead of reworking the nabledge search flow, we are first building a 30-question QA benchmark. The old flow (AI keyword extraction → BM25 → AI judging) has been dropped: its main room for improvement was nudging queries toward the documents' vocabulary, which effectively rigs the evaluation. We pivoted the design to faceted search, and are building a harness that can measure Stage 1 (facet extraction) → Stage 2 (mechanical filter) → Stage 3 (section selection + answer) independently.
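As a rough illustration of the stage boundaries, here is a minimal Python sketch; the function names and facet shapes are illustrative assumptions, not the actual tools/benchmark API:

```python
# Minimal sketch of the three-stage flow. All names here are illustrative
# assumptions, not the real tools/benchmark API; LLM stages are stubbed.

def extract_facets(question: str) -> dict[str, set[str]]:
    """Stage 1: one LLM call maps the question to facets,
    e.g. {"type": {"check"}, "category": {"authorization"}}."""
    raise NotImplementedError("LLM call in the real harness")

def filter_entries(index: list[dict], facets: dict[str, set[str]]) -> list[dict]:
    """Stage 2: purely mechanical type x category match -- no LLM, so it is
    cheap and its recall can be scored by script against ground-truth paths."""
    return [e for e in index
            if e["type"] in facets["type"] and e["category"] in facets["category"]]

def select_and_answer(question: str, candidates: list[dict]) -> str:
    """Stage 3: LLM selects sections from the candidates and writes a
    grounded, cited answer."""
    raise NotImplementedError("LLM call in the real harness")
```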

An initial implementation now runs through Stage 3, and a 30-scenario baseline (judge=3 on 30/30) has been collected. However, we found that the LLM judge had been self-estimating scores with no ground truth to check against. We are now writing model answers (ground truth) and switching the judge to score against them.

This PR is a Draft, published early mainly so the model answers themselves can be reviewed.

Progress

  • ✅ Stage 1/2/3 pipeline implemented (full stream-json logs saved, per-axis Jaccard, mechanical filter)
  • ✅ 30-scenario baseline collected (Sonnet, judge=3 on 30/30, mean $0.53/scenario, mean 82.7s/scenario)
  • ✅ Model answers written for all 30 scenarios ← this update's main deliverable
    • qa-v6-answers/
    • format spec: README.md
    • 3 Prompt Engineer spot reviews run; all High/Medium findings addressed
  • ⏳ Implement Stage 2 script-based judging (model-answer citations → ground-truth path set → cross-check against filter candidates)
  • ⏳ Rework the Stage 3 judge to compare against the model answers
  • ⏳ Roll the production skill out to all versions (1.2/1.3/1.4/5/6)

Tasks

See tasks.md.

Expert Review

Completed Prompt Engineer reviews (model-answer related):

The model-answer reviews (3 rounds) were run and their findings applied via the working log (individual review files were not saved).

Success Criteria Check

| Criterion | Status | Evidence |
| --- | --- | --- |
| 30-scenario script-based benchmark | ✅ Met | tools/benchmark/ — Stage 1/2/3 pipeline implemented, 30-scenario baseline collected |
| Scoring via LLM-as-Judge | ⏳ In Progress | Switching to ground-truth-based scoring (30 model answers done, script-based judging still to implement) |
| Search flow rework (rolled out to all 5 versions) | ⏳ Not Started | Starts once model-answer-based scoring is established |
| New keyword-search entry published | ⏳ Not Started | Same phase as above |

🤖 Generated with Claude Code

kiyotis and others added 30 commits April 22, 2026 08:04
notes.md records investigation findings (current flow timing: 452s
for review-04, hypotheses for the slowness). tasks.md tracks the
remaining benchmark + flow refactor work. scenarios-all-30.json
is the 30-scenario source used to seed tools/benchmark/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the benchmark harness skeleton — run.py driver, current/new
flow prompts, QA scenario set, and gitignore for per-run results.
Incomplete: 1-scenario runs exceed 7 minutes; next session will
profile and trim overhead before running the 30-scenario baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… judging

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… by expert review

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… stdin prompt handling

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Prompt Engineer expert design for end-to-end redesign of nabledge-6
knowledge search — replaces BM25 keyword flow with AI-1 facet extraction
+ mechanical filter + AI-2 section selection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace Round 2 index-compact plan with faceted search pivot. Record
user decisions on model comparison (Haiku vs Sonnet), rollout order
(v6 pilot then lock-step), LLM judge 4-level scheme, hints-less AI-2,
and stream-json logging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…k-propagation

Reason: facet-design spec is desk-designed, not validated. Starting with mapping
expansion of 295 entries before any AI-1 output is collected reverses the natural
order. Reorder so AI-1 prompt alone runs on 5 scenarios first; use the output to
decide whether processing_patterns back-propagation is actually needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop processing_patterns axis — simulation showed it is redundant with
(type, category) on all 5 benchmark scenarios. Incorporate review fixes:
explicit out_of_scope empty-array rule, pattern-category dual-role rule,
non-misleading uncertain example.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- stream-json output captured per scenario for post-hoc audit
- per-axis Jaccard for type and category + coverage exact match
- markdown summary with got→want diff per scenario
- --model accepts aliases (sonnet / haiku / opus)
- scenarios gain expected_facets derived from expected_sections paths

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
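For reference, the per-axis Jaccard mentioned above could be computed as below; this is a sketch, and the got/want field names are assumptions:

```python
def jaccard(got: set[str], want: set[str]) -> float:
    """Jaccard similarity |got ∩ want| / |got ∪ want|.
    Two empty sets count as a perfect match (1.0)."""
    if not got and not want:
        return 1.0
    return len(got & want) / len(got | want)

# Scored per facet axis rather than over a single merged set, e.g.:
#   jaccard(set(run["type"]), set(expected["type"]))
#   jaccard(set(run["category"]), set(expected["category"]))
```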
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5 scenarios × Haiku+Sonnet. Sonnet selected (+0.05 overall). One recall miss
on req-02 identified; Prompt Engineer review recommends one Round 3 edit
before Stage 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…307)

Three prompt edits per Prompt Engineer post-Round-2 review:
- disambiguate `check` axis (NOT runtime authorization)
- extend processing-pattern rule to UI/runtime modifiers (画面/REST/バッチ/メッセージ, i.e. screen/REST/batch/message)
- add a differently-worded authorization example

Scenario fixes: review-01 and req-09 expected_facets realigned with Round 3
selection rules. Sonnet now reaches Jaccard=1.0 on all 5 scenarios.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- type×category AND filter with fallback ladder (drop-category, drop-type,
  all, empty)
- Scenario-level tests verify every expected file in the 5-scenario
  benchmark is reachable via the Stage 1-emitted facets
- Tests assert all 295 index.toon rows parse cleanly

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
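One reading of that fallback ladder as a Python sketch (row field names are assumed; the real filter lives in tools/benchmark/):

```python
def facet_filter(rows: list[dict], types: set[str], categories: set[str]):
    """type x category AND filter with the fallback ladder:
    strict -> drop-category -> drop-type -> all -> empty."""
    strict = [r for r in rows if r["type"] in types and r["category"] in categories]
    if strict:
        return strict, "none"                 # no fallback needed
    by_type = [r for r in rows if r["type"] in types]
    if by_type:
        return by_type, "drop-category"       # category constraint dropped
    by_category = [r for r in rows if r["category"] in categories]
    if by_category:
        return by_category, "drop-type"       # type constraint dropped
    if rows:
        return rows, "all"                    # give up filtering entirely
    return [], "empty"
```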
4-level rubric (miss / insufficient / partial / full). Reviewer fixes
incorporated: stronger not-built-in directive, recall-only level 3 anchor,
explicit title+path-only judging principle, tie-break toward 3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 2 runs Stage 1 → mechanical type×category filter → LLM judge, with
fallback ladder captured in filter_result.json and raw judge stream-json
persisted per scenario. Summary writes judge level distribution and mean
candidate count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Claude CLI returns exit code 1 with subtype=error_max_turns when the
2-turn budget is hit after the model already emitted StructuredOutput
successfully (schema revalidation on the next turn can trigger this).
Parse assistant tool_use blocks as a fallback source of the structured
output, and only error out when nothing was captured.

Without this, 3/5 Stage 2 scenarios reported judge=None despite valid
level-3 outputs in the stream log.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
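A sketch of that fallback, assuming one JSON event per line and the usual assistant-message shape of the CLI's stream-json output (the exact event schema is an assumption here):

```python
import json

def recover_structured_output(stream_path: str) -> dict | None:
    """Recover the structured output from the stream log when the CLI exits
    with subtype=error_max_turns after a valid tool_use was already emitted."""
    captured = None
    with open(stream_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") != "assistant":
                continue
            for block in event.get("message", {}).get("content", []):
                if block.get("type") == "tool_use":
                    captured = block.get("input")   # keep the last one seen
    return captured  # caller errors out only if this is still None
```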
Stage 2 Round 1 (5 scenarios, Sonnet): all judge=3, fallback=none,
mean cost $0.08, wall 27.7s. Prompt Engineer review rated 4/5; applied
all High/Medium/Low fixes to judge_stage2.md:

- Defined "primary file" explicitly
- Required reason format: "Expected primaries: [...]. Present: [...]. {verdict}."
- Added title-fit bar against surface-keyword matches
- Reason language must match question language
- Inlined not-built-in worked example

Re-run after prompt edits: still 5/5 at level 3. No regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline: AI-1 facet extract → facet filter → AI-2 section select
(title+path+section-titles, no hints, ≤10 selectors) → read_sections
→ AI-3 final answer (grounded, cited) → LLM judge (4-level).

Prompts pre-reviewed by Prompt Engineer subagent; applied fixes:
- section_select: whitelist grounding, target 3-6 selectors, worked
  examples for batch and not-built-in cases
- stage3_answer: citation whitelist, cited consistency rule, JP/EN
  language routing, synthesis grounding rule
- judge_stage3: material-detail definition, anti-verbosity check,
  Expected-core anchoring before verdict, not-built-in worked example

Round 1 (5 scenarios, Sonnet): all judge=3, mean cost $0.557, wall 81.5s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5/5 scenarios at judge=3 (full). Applied High fixes:
- judge: anti-verbosity Expected-core mapping check
- AI-2: tightened soft cap to 3-6 selectors, ≥6 from one file is a red flag
- AI-3: synthesis grounding rule (cite all contributing selectors)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New: qa-v6-sample15.json (5 pilot + 10 new, balanced review/impact/req).
AI-2 mean picks dropped 8.6→5.1 after PE-review 3-6 target rule.
No anomalies; proceeding to 30-scenario baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New flow (faceted search) baseline: 30/30 judge=3, fallback=none,
mean cost $0.528/scenario, mean wall 82.7s (sequential).
Baseline saved to tools/benchmark/baseline/20260422-stage3-sonnet/.

Notable: review-10 with only 2 candidates and 1 section pick still
achieves judge=3 — confirms facet precision at narrow cases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
)

LLM judge (Stage 2/3) was self-estimating without ground truth — not
reliable. New task: create 30 ground-truth answer key facts via grep on
knowledge files + sub-agent review, then rewrite both judge prompts to
score against those facts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
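An illustrative shape for one such ground-truth entry (the real format is specified in qa-v6-answers/README.md; field names and the example path here are hypothetical):

```python
# Hypothetical answer-key entry; see qa-v6-answers/README.md for the real spec.
answer_key = {
    "id": "review-04",
    "a_facts": [
        {
            "fact": "...",                       # one key fact a full answer must cover
            "quote": "...",                      # verbatim evidence from a knowledge file
            "path": "knowledge/.../example.md",  # hypothetical citation path
        },
    ],
}
# The judge then labels each fact (e.g. COVERED / PARTIAL / MISSING) against
# the candidate answer instead of self-estimating overall quality.
```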
kiyotis and others added 30 commits April 27, 2026 16:37
Remove a_facts that are out of scope or describe internal behavior
rather than answering the question directly. Applies to 16 scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SE H-1: add TestComputeLevel unit tests (10 cases) for scoring engine
SE H-2: fix _normalize false-positive — apply _NORM_MD to body only,
        not to LLM-supplied quote (avoids __ stripping wrong matches)
SE M-3/4: test helper mkdir(exist_ok=True), add docstring
SE L-5: align a_facts.fact maxLength 200→300 between judge.py and prompt
SE L-6: remove unused REPO_ROOT variable in test_llm_tools.py
PE H-1: Step 4 retry now covers quote-not-in-file case (try other files)
PE H-2: Bash command in Step 4 uses explicit double-quoting for all args
PE M-3: clarify Step 4 Grep budget is separate from Step 3 cap
PE M-4: add COVERED vs PARTIAL decision rule to Step 1
PE M-5: Step 2 B-claims marked provisional with reclassification path
PE L-6: add mixed-language fallback (default Japanese) to reasoning rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
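For context on SE H-1, a minimal sketch of a level computation over per-fact verdicts; the thresholds below are invented for illustration and do not reproduce the real scoring engine:

```python
def compute_level(verdicts: list[str]) -> int:
    """Map per-fact verdicts (COVERED / PARTIAL / MISSING) onto the 4-level
    scale: 0=miss, 1=insufficient, 2=partial, 3=full. Thresholds illustrative."""
    if not verdicts:
        return 0
    covered = verdicts.count("COVERED")
    partial = verdicts.count("PARTIAL")
    if covered == len(verdicts):
        return 3
    if covered + partial == 0:
        return 0
    return 2 if covered >= len(verdicts) / 2 else 1
```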
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… --variant ids → next)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nced ids

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tasks.md: summarize the Step 7-B-1〜4 analysis results; add judge-fix / human-review tasks
- README: document the evaluation workflow (human_review.json format, 4-step procedure)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…override)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…w-07, impact-01)

Three fixes identified by horizontal code review:

1. verify_kb_evidence.py: add _NORM_BACKTICK to strip inline-code
   backticks from both body and quote independently.
   - Body: full normalization (MD markers + backticks + whitespace)
   - Quote: backticks + whitespace only (preserve Python dunders like __init__)
   - Add stdin interface (argv[4] == "-") to avoid shell expansion entirely

2. prompts/judge.md: change Step 4 verify command to use single-quoted
   heredoc (<<'QUOTE_END') — double-quoted args expand backticks and $
   in bash before the script receives them (root cause of impact-01 timeout)

3. bench/search_next.py: add _NORM_BACKTICK_RE to _normalize_for_match()
   — same normalization gap existed in verify_read_notes evidence check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
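A sketch of the asymmetric normalization in fix 1; the regexes are simplified stand-ins for the real _NORM_* constants:

```python
import re

# Simplified stand-ins for the real _NORM_* constants (assumptions).
_NORM_MD = re.compile(r"[*_]{1,2}")   # markdown emphasis markers
_NORM_BACKTICK = re.compile(r"`")     # inline-code backticks
_NORM_WS = re.compile(r"\s+")

def normalize_body(text: str) -> str:
    """Body side: full normalization (MD markers + backticks + whitespace)."""
    text = _NORM_MD.sub("", text)
    text = _NORM_BACKTICK.sub("", text)
    return _NORM_WS.sub(" ", text).strip()

def normalize_quote(text: str) -> str:
    """Quote side: backticks + whitespace only, so Python dunders like
    __init__ survive and do not produce false matches."""
    text = _NORM_BACKTICK.sub("", text)
    return _NORM_WS.sub(" ", text).strip()
```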
"リソースクラスに JAX-RS アノテーションを付与する" は質問スコープ(構成パターン)外の
実装手順。a_facts に混入していたため誤 MISSING 判定が発生していた。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

As a nabledge developer, I want a streamlined search flow so that answers are faster and cheaper without losing accuracy
