Draft
172 commits
69dcfaf
docs: add work log for issue #307 benchmark search flow
kiyotis Apr 21, 2026
709b64a
wip: scaffold tools/benchmark for search flow comparison (#307)
kiyotis Apr 21, 2026
bfe47a4
docs: update tasks.md — redesign measurement as 3-stage with isolated…
kiyotis Apr 21, 2026
317a47e
docs: update tasks.md — add Round-based workflow and sample 5 scenarios
kiyotis Apr 21, 2026
1b5b5bb
docs: require Prompt Engineer review for prompt changes not triggered…
kiyotis Apr 22, 2026
06c6e34
feat: stage1 benchmark runner — extraction prompt, sample5 scenarios,…
kiyotis Apr 22, 2026
dbb8c95
docs: record Stage 1 Round 1 measurement and Prompt Engineer review
kiyotis Apr 22, 2026
e66a2d9
docs: update tasks.md — mark DECISION checkpoint for H-A approach
kiyotis Apr 22, 2026
9b73d0f
docs: add Prompt Engineer design review for Stage 1 Round 2
kiyotis Apr 22, 2026
2da705f
docs: add faceted search flow design review (#307)
kiyotis Apr 22, 2026
7cd1e77
docs: update tasks.md — pivot to faceted search flow (#307)
kiyotis Apr 22, 2026
9cac561
docs: reorder tasks.md — measure AI-1 facets first, defer mapping bac…
kiyotis Apr 22, 2026
8499c55
feat: add 2-axis stage1_facet.md + Prompt Engineer review (#307)
kiyotis Apr 22, 2026
2b3ffcc
feat: run.py Stage 1 facet extraction with per-axis Jaccard (#307)
kiyotis Apr 22, 2026
8074e5a
docs: tasks.md — switch to 2-axis (type, category) design (#307)
kiyotis Apr 22, 2026
caf005a
docs: record Stage 1 Round 2 results and Prompt Engineer review (#307)
kiyotis Apr 22, 2026
3eaf49d
feat: Stage 1 Round 3 — targeted prompt tweak clears all 5 scenarios …
kiyotis Apr 22, 2026
6f3d0a2
docs: tasks.md — Stage 1 complete, proceed to Stage 2 (#307)
kiyotis Apr 22, 2026
e4138a0
feat: Stage 2 facet filter module with 19 unit tests (#307)
kiyotis Apr 22, 2026
f0d85f1
feat: Stage 2 LLM judge prompt + Prompt Engineer review (#307)
kiyotis Apr 22, 2026
4692d05
feat: run.py Stage 2 — filter + judge + stream-json per scenario (#307)
kiyotis Apr 22, 2026
96402a0
fix(benchmark): recover structured output on error_max_turns
kiyotis Apr 22, 2026
5e8f702
docs: record Stage 2 Round 1 + Prompt Engineer fixes
kiyotis Apr 22, 2026
854b358
docs: update tasks.md — Stage 2 done, Stage 3 in progress
kiyotis Apr 22, 2026
373e12a
feat: run.py Stage 3 — AI-2 section select + answer + judge (#307)
kiyotis Apr 22, 2026
30b492a
docs: record Stage 3 Round 1 + Prompt Engineer reviews (#307)
kiyotis Apr 22, 2026
8fdb708
docs: update tasks.md — Stage 3 Round 1 complete
kiyotis Apr 22, 2026
935ae47
docs: 15-scenario intermediate check — all judge=3, fallback=none (#307)
kiyotis Apr 22, 2026
2cc4ffb
docs: 30-scenario baseline — Stage 3 judge=3 rate 30/30 (#307)
kiyotis Apr 22, 2026
f17cba3
docs: add task — replace LLM judge with ground-truth based scoring (#…
kiyotis Apr 22, 2026
5cc385c
docs: update tasks.md — citation-based scoring, Stage 2 script first
kiyotis Apr 22, 2026
9faea4c
docs: add reference-answer format + review-01 (#307)
kiyotis Apr 22, 2026
0c9a4a2
docs: add review-02..05 reference answers + fix drift (#307)
kiyotis Apr 22, 2026
3b38c9f
docs: add review-06..10 + impact-01..10 reference answers (#307)
kiyotis Apr 22, 2026
adaf63e
docs: complete 30 reference answers (req-01..10 + fixes) (#307)
kiyotis Apr 22, 2026
946421c
docs: link PR #310 in tasks.md
kiyotis Apr 22, 2026
bc7a670
feat(benchmark): reference-answer-based Stage 2 script judge (#307)
kiyotis Apr 22, 2026
8ff5960
docs: reset tasks.md — info design flaw, reassess approach (#307)
kiyotis Apr 22, 2026
fe2fb04
docs: add reset policy to tasks.md, preserve existing plan (#307)
kiyotis Apr 22, 2026
c361d52
docs: revert tasks.md to pre-reset state (#307)
kiyotis Apr 22, 2026
69f9122
docs: update tasks.md — AI-1 info design flaw, Stage 1 redesign neede…
kiyotis Apr 22, 2026
8cf50fb
docs: clarify Stage 1 redesign task with concrete options (#307)
kiyotis Apr 22, 2026
7adc061
docs: redesign AI-1 approach — split LLM/script index (#307)
kiyotis Apr 22, 2026
aa1114a
feat: add LLM/script index generator for direct-ID search (#307)
kiyotis Apr 22, 2026
c47d764
feat: add stage3 ids variant — direct file_id|sid selection flow (#307)
kiyotis Apr 22, 2026
4598d75
docs: update tasks.md — ids variant shipped, measurement next (#307)
kiyotis Apr 22, 2026
db58163
docs: update tasks.md — Haiku 5-scenario run done, Sonnet next (#307)
kiyotis Apr 22, 2026
456bccd
docs: restructure tasks.md — 2-flow clarity, Sonnet fixed (#307)
kiyotis Apr 22, 2026
76c8eb0
feat: add current-flow variant for apples-to-apples benchmark (#307)
kiyotis Apr 22, 2026
76898b0
docs: record 30-scenario comparison ids vs current (#307)
kiyotis Apr 22, 2026
cb81695
feat: add judge_stage3_v2 fact-coverage judge + judge-only mode (#307)
kiyotis Apr 22, 2026
af43acb
docs: rewrite tasks.md to should-be plan (#307)
kiyotis Apr 23, 2026
322f616
docs: add benchmark rules — no parallel run.py invocations (#307)
kiyotis Apr 23, 2026
7abb7f3
refactor: restructure benchmark tool to should-be layout (#307)
kiyotis Apr 23, 2026
a7a016d
docs: rewrite notes.md to design-decisions only (#307)
kiyotis Apr 23, 2026
515ccd0
docs: update tasks.md — refactor done, next is 30-scenario rerun (#307)
kiyotis Apr 23, 2026
ec23377
docs: add rules on observing real LLM output before declaring success…
kiyotis Apr 23, 2026
2d8cc22
refactor: rewrite judge to A/B/C coverage grading — WIP (#307)
kiyotis Apr 23, 2026
1e2118b
docs: update tasks.md — pivot to A-pre-authored judge design (#307)
kiyotis Apr 23, 2026
f6217db
refactor: judge uses pre-authored A-facts per scenario (#307)
kiyotis Apr 23, 2026
781e96a
docs: add pre-authored a_facts to 30 v6 scenarios (#307)
kiyotis Apr 23, 2026
c3546c0
chore: refresh current-sonnet baseline judge.json with new A/B/C judg…
kiyotis Apr 23, 2026
0d00f9e
docs: update tasks.md — improving the 14 ids L1 cases is this PR's goal (#307)
kiyotis Apr 23, 2026
7507547
docs: update tasks.md — impact-01 root cause analysis done
kiyotis Apr 23, 2026
3e17c73
docs: add L1 root cause analysis for 14 ids scenarios (#307)
kiyotis Apr 23, 2026
c92eec7
refactor: recall-first search_ids + a_facts scope fixes (#307)
kiyotis Apr 23, 2026
24874f2
docs: split tasks into Phase 1 (search) → Phase 2 (answer) (#307)
kiyotis Apr 23, 2026
960bfef
feat: add term_queries to AI-1 search_ids with body substring grep (#…
kiyotis Apr 23, 2026
d4316f8
refactor: term_queries coverage-gap first + anti-broad-class rule (#307)
kiyotis Apr 23, 2026
9baf220
feat: --search-only mode + coverage.py for Phase 1 measurement (#307)
kiyotis Apr 23, 2026
e44b79f
docs: update tasks.md — Phase 1 Step 1-2 done, Step 3 pending decisio…
kiyotis Apr 23, 2026
920ed2a
docs: add simulate-before-measure rule to benchmark (#307)
kiyotis Apr 24, 2026
511f205
chore: add scikit-learn to setup.sh for TF-IDF index enrichment (#307)
kiyotis Apr 24, 2026
c16402c
feat: classify_terms.py + index-enrichment docs for TF-IDF-based inde…
kiyotis Apr 24, 2026
e5eb8eb
docs: update tasks.md — Phase 1 Step 3 evolved to index enrichment (#…
kiyotis Apr 24, 2026
67d646e
feat: enrich index-llm.md with section-level keywords (#307)
kiyotis Apr 24, 2026
a590736
docs: update tasks.md — next step is term_queries mechanization + 10-…
kiyotis Apr 24, 2026
a7ad4f1
feat: term_extract module for deterministic question-term extraction …
kiyotis Apr 24, 2026
9039ca4
feat: df_pct stopset generator for v6 term queries (#307)
kiyotis Apr 24, 2026
b0c2ae9
refactor: drop AI-1 term_queries; script extracts terms from question…
kiyotis Apr 24, 2026
b17dfd4
test: update test_build_index for allowlist signature (#307)
kiyotis Apr 24, 2026
cd9dc38
docs: Step A done — term_queries extraction, PE review saved (#307)
kiyotis Apr 24, 2026
bc3fafd
fix: restrict term_extract to 4+ char ASCII identifiers (#307)
kiyotis Apr 24, 2026
87ed59c
docs: pivot to section-level TF-IDF; redo from scratch (#307)
kiyotis Apr 24, 2026
143ee40
docs: tasks.md — TF only (no IDF), stoplist-first approach (#307)
kiyotis Apr 24, 2026
08bf67a
docs: tasks.md — Step 0 narrowed to docs-only (rename during Step 2/3)
kiyotis Apr 24, 2026
93af2c7
docs: rewrite index-enrichment.md for section-level TF approach (#307)
kiyotis Apr 24, 2026
e9f6766
docs: tasks.md — Step 0 done (docs/index-enrichment.md rewritten)
kiyotis Apr 24, 2026
2548161
feat: section_df_ja.py — Japanese term section_df for stoplist judgme…
kiyotis Apr 24, 2026
8959cb7
feat: v6 Japanese stoplist for section-level TF index enrichment (#307)
kiyotis Apr 24, 2026
df56cd4
docs: tasks.md — Step 1 done (51-term stoplist committed)
kiyotis Apr 24, 2026
ed115b1
docs: freeze index enrichment params (tf>=2, section top-N=5) before …
kiyotis Apr 24, 2026
1c66c95
docs: tasks.md — freeze params + 10-case sim done, promote Phase 2 to…
kiyotis Apr 24, 2026
db6baeb
chore: tidy workspace — drop superseded artifacts, compress tasks.md …
kiyotis Apr 24, 2026
94dc02c
feat: classify_terms.py section-level TF (#307)
kiyotis Apr 24, 2026
e4187cf
feat: build_index.py consumes section-keyed keyword map (#307)
kiyotis Apr 24, 2026
b94c604
docs: tasks.md — Step 2/3 done + expert reviews (#307)
kiyotis Apr 24, 2026
c75a99c
feat: regenerate index-llm.md with section-level TF keywords (#307)
kiyotis Apr 24, 2026
9295593
docs: tasks.md — reflect index regeneration commit (#307)
kiyotis Apr 24, 2026
c491dc3
docs: tasks.md — pivot to 4-step Read-based AI-1 (#307)
kiyotis Apr 24, 2026
2b1d7ca
chore: gitignore root .results/ directory (#307)
kiyotis Apr 24, 2026
fa42a20
docs: communication rule — concise first, details on request
kiyotis Apr 24, 2026
3011785
refactor: benchmark — version-parameterized paths (#307)
kiyotis Apr 24, 2026
26f0af9
feat: 4-step Read-based AI-1 with merged answer (#307)
kiyotis Apr 24, 2026
36ae333
feat: judge verifies C-claims against full KB (#307)
kiyotis Apr 24, 2026
415c6aa
docs: Step 4/5 trial scenario subsets (#307)
kiyotis Apr 24, 2026
72f62aa
fix: judge — raise max_turns/timeout, cap tool budget (#307)
kiyotis Apr 24, 2026
220fd9f
docs: tasks.md — Step 5 merged AI-1 done, Step 6 pending (#307)
kiyotis Apr 24, 2026
d8fedda
docs: tasks.md — Step 6 direction settled; 4 search_ids.md quality issues (#307)
kiyotis Apr 27, 2026
47d30e2
wip: search_ids.md — Step 6 prompt rewrite in progress
kiyotis Apr 27, 2026
cf0ebec
wip: H-1 TDD RED — caveats schema {note,cited}[] test added (#307)
kiyotis Apr 27, 2026
bf4e722
chore: add token-efficiency rule, update sv/re for commit-granular se…
kiyotis Apr 27, 2026
da6d77c
docs: tasks.md — session end, H1-L2 pending (#307)
kiyotis Apr 27, 2026
22d8668
feat: H-1 — caveats schema changed to {note,cited}[] (#307)
kiyotis Apr 27, 2026
86e068a
docs: tasks.md — H-1 done, H-2 next (#307)
kiyotis Apr 27, 2026
3f5b0f6
feat: H-2 — scope_note field added to read_notes[].relevant_sections[…
kiyotis Apr 27, 2026
c11c925
docs: tasks.md — H-2 done, H-3 next (#307)
kiyotis Apr 27, 2026
5b09afd
feat: H-3 — self-check substep added before answer composition in Ste…
kiyotis Apr 27, 2026
22704ce
feat: M-1 — candidate_files / read_notes maxItems 20→12
kiyotis Apr 27, 2026
cfaa185
docs: tasks.md — M-1 done, M-2 next (#307)
kiyotis Apr 27, 2026
5788665
docs: auto-run /sv after each commit (token-efficiency rule)
kiyotis Apr 27, 2026
8d7d67a
feat: M-2/M-3/M-4/L-1/L-2 — prompt clarity improvements (#307)
kiyotis Apr 27, 2026
c5f63c0
docs: save PE review for Step 6 H-1 through L-2 (#307)
kiyotis Apr 27, 2026
de84f7f
docs: tasks.md — M-2 through L-2 + PE review done, 3-case trial run next (#307)
kiyotis Apr 27, 2026
b3adf3c
docs: strengthen stop-after-commit rule in token-efficiency.md
kiyotis Apr 27, 2026
8f35f98
docs: fix /re instruction — must be /clear -> /re to reset context
kiyotis Apr 27, 2026
c93fadd
docs: tasks.md — record 3-case trial results + add per-case detailed investigation tasks (#307)
kiyotis Apr 27, 2026
3c503d6
docs: tasks.md — rewrite investigation tasks to be fact-based (#307)
kiyotis Apr 27, 2026
3703ca6
docs: tasks.md — req-05 investigation done, add judge fix task (#307)
kiyotis Apr 27, 2026
6492d52
docs: tasks.md — investigation 2 done, add full a_facts review task (#307)
kiyotis Apr 27, 2026
8dfcaca
docs: tasks.md — investigation 3 done, add judge Grep-only change task (#307)
kiyotis Apr 27, 2026
ae301d5
docs: tasks.md — reorganize into Fix-1 through Fix-3 + Step 6-D (#307)
kiyotis Apr 27, 2026
d9fc742
docs: tasks.md — split Step 6-D into per-case verification tasks (#307)
kiyotis Apr 27, 2026
7596679
docs: tasks.md — revise Fix-1 through Fix-3 background and approach (#307)
kiyotis Apr 27, 2026
90d54ed
feat: add llm_tools/verify_kb_evidence.py for judge self-correction
kiyotis Apr 27, 2026
235b03f
fix: judge self-corrects SUPPORTED_BY_KB and uses Grep-only KB verify
kiyotis Apr 27, 2026
5d49953
fix: remove solution-specific a_fact from review-01 scenario
kiyotis Apr 27, 2026
8055a0f
docs: tasks.md — Fix-1 through Fix-3 done; Step 6-D awaiting re-measurement
kiyotis Apr 27, 2026
5420eb7
docs: tasks.md — add Expert Review tasks (SE + PE)
kiyotis Apr 27, 2026
7358292
Revert "fix: remove solution-specific a_fact from review-01 scenario"
kiyotis Apr 27, 2026
e089348
docs: tasks.md — correct Fix-3 to a full 30-scenario review; move Expert Review before measurement
kiyotis Apr 27, 2026
df4f5a6
docs: tasks.md — expand Fix-3 into 30 individual steps
kiyotis Apr 27, 2026
77f0e07
test: tighten a_facts in qa-v6.json to match question scope
kiyotis Apr 27, 2026
2689373
docs: tasks.md — Fix-3 done; Step 6-D awaiting re-measurement
kiyotis Apr 27, 2026
88bad08
fix: address SE+PE expert review findings on Fix-1/2 (judge + verify)
kiyotis Apr 27, 2026
a6baae4
docs: tasks.md — Expert Review done; Step 6-D awaiting re-measurement
kiyotis Apr 27, 2026
61e976f
docs: tasks.md — Step 6-D re-measurement done (req-05/review-01/review-08 all at L3)
kiyotis Apr 27, 2026
f5edbd4
docs: tasks.md — add Step 6-E (re-score main search with the new judge to establish a comparison baseline)
kiyotis Apr 27, 2026
3beb1d1
docs: tasks.md — merge Step 6-E into Step 7 (run the comparison baseline and 30-case measurement together once the new search is settled)
kiyotis Apr 27, 2026
1d39219
docs: tasks.md — drop Step 7-A (directly comparable to baseline/ids-sonnet)
kiyotis Apr 27, 2026
138b379
docs: tasks.md — add Step 7-0 (ids→next rename, done before measurement)
kiyotis Apr 27, 2026
ee8f937
refactor: rename ids variant to next (search_ids.py → search_next.py,…
kiyotis Apr 27, 2026
7c489c1
docs: tasks.md — Step 7-0 done (ids→next rename ee8f93754)
kiyotis Apr 27, 2026
d99f672
fix: search_next.py — prompt filename and variant string still refere…
kiyotis Apr 27, 2026
3798497
docs: tasks.md — update Step 7-0 completion record (append bugfix d99f672fd)
kiyotis Apr 27, 2026
71861a0
docs: tasks.md — Step 7-B trial run verified (single req-05 case at L3)
kiyotis Apr 28, 2026
28663b4
docs: tasks.md — Step 7-B 14-case measurement results + add L1/L0 analysis tasks
kiyotis Apr 28, 2026
0a646c8
docs: tasks.md — record Step 7-B-1 through 7-B-4 analysis approach and established facts
kiyotis Apr 28, 2026
ab40abc
docs: tasks.md + README — Step 7-B-1 through 7-B-4 done; human review procedure in place
kiyotis Apr 28, 2026
c805281
docs: tasks.md — Step 7-B-1/2 done; split 7-B-3/4 into individual tasks
kiyotis Apr 28, 2026
00a5958
docs: tasks.md — mark Step 7-B-2 as awaiting approval
kiyotis Apr 28, 2026
ac8f65c
docs: tasks.md — revise Step 7-B-2 conclusion (judge false positive; via human_review …
kiyotis Apr 28, 2026
68ea496
docs: tasks.md — Step 7-B-2 done
kiyotis Apr 28, 2026
84385b5
docs: tasks.md — record Step 7-B-3 interim investigation notes
kiyotis Apr 28, 2026
5e7c92d
docs: add L1/L0 analysis report format rule to benchmark.md
kiyotis Apr 28, 2026
7f9697f
docs: tasks.md — Step 7-B-3 done; judge fix task approach settled
kiyotis Apr 28, 2026
d14de0d
fix: fix backtick/shell-escape issues in judge verify pipeline (revie…
kiyotis Apr 28, 2026
04269b8
docs: tasks.md — 7-B-4 done; cross-cutting backtick/escape fixes done; next: review-03 a_fact …
kiyotis Apr 28, 2026
fcecc73
fix: remove out-of-scope a_fact from review-03 (JAX-RS annotation)
kiyotis Apr 28, 2026
3b414c2
docs: tasks.md — review-03 a_fact fix done; next: 14-case re-measurement in Step 7-C
kiyotis Apr 28, 2026
f6a95ca
docs: tasks.md — Step 7-C partially done; investigation tasks queued (review-07/05/08)
kiyotis Apr 28, 2026
d4e88c7
docs: tasks.md — split investigation tasks into one section per case
kiyotis Apr 28, 2026
9 changes: 8 additions & 1 deletion .claude/commands/re.md
@@ -70,7 +70,14 @@ Proceed with the next task immediately. Work through the steps in tasks.md in or
 - Update `**Updated**` date when modifying tasks.md
 - If a step is ambiguous, resolve it from context (notes.md, code) before asking the user
 
-Continue through all remaining tasks unless a `[BLOCKED:]` or `[DECISION:]` marker is encountered.
+**Stop after each commit** — do not automatically continue to the next step.
+After committing, tell the user:
+> 「{step name} 完了 (`{short_hash}`)。`/sv` → `/re` でコンテキストをリセットして次のステップへ進むことを推奨します。」
+
+Continue to the next step only if the user explicitly requests it (e.g. "続けて" / "next").
+This keeps each session to one commit and prevents context window accumulation.
+
+If there is a step marked `[BLOCKED:]` or `[DECISION:]`, use AskUserQuestion to resolve it before proceeding.
 
 When all tasks are complete, notify the user and suggest `/sv` or `/pr create`.
 
18 changes: 9 additions & 9 deletions .claude/commands/sv.md
@@ -74,25 +74,24 @@ Confirm working tree is clean and the latest commit reflects what was done.
 
 ## 5. Output Handoff Summary
 
-Print a concise summary for the next session:
+Print a concise summary for the next session. Keep it short — one line per item:
 
 ```
 ## Saved: #{pr_number} — {PR title}
 
 **Branch**: {branch} (clean)
 **Latest commit**: {short_hash} {commit message}
 
-### What was done this session
-- {task or step completed}
-- {task or step completed}
+### Done this session
+- {step completed — one line}
 
-### Next task
-{first remaining task from tasks.md — be specific}
+### Next step
+{exact next step from tasks.md — one line, be specific}
 
-### Needs decision / blocked
-{anything requiring user input before work can continue, if any}
+### Blocked / needs decision
+{only if something requires user input; omit section if none}
 
-Resume with: /re
+/re で再開
 ```
 
 # Important
@@ -101,3 +100,4 @@ Resume with: /re
 - Never leave modified tracked files uncommitted
 - If a change is too incomplete to commit, use `git stash` and note it in the summary
 - Do not implement anything new — this command only records state
+- Keep the summary minimal — the goal is a clean handoff, not a progress report
64 changes: 64 additions & 0 deletions .claude/rules/benchmark.md
@@ -0,0 +1,64 @@
# Benchmark Rules

Rules for running `tools/benchmark/` (search-flow accuracy benchmark).

## Do not run `run.py` invocations in parallel

Run benchmark commands sequentially. Never start a second `tools/benchmark/run.py`
invocation while another one is still running.

**Why:** each scenario spawns a `claude` CLI with `--max-turns 2` and a 240s
timeout. Running two `run.py` processes in parallel doubles the concurrent CLI
load, pushes API latency past 240s, and causes many scenarios to fail with
`subprocess.TimeoutExpired`. Observed 2026-04-23 — parallel judge-only runs of
the `ids` and `current` flows both crashed partway through while the same
commands had previously succeeded when run sequentially.

**How to apply:**
- Applies to all `run.py` modes: `--variant ids/current`, `--rejudge`.
- When re-scoring both `ids` and `current` flows, run one to completion, then
run the other.
- If timeouts occur even in a single sequential run, investigate other
Anthropic API consumers on the machine — do not just raise the timeout.

This does **not** generalize to all `claude -p` usage. Other tools that invoke
`claude -p` (e.g. `tools/knowledge-creator`, `nabledge-test`) are known to work
in parallel. The rule is specific to `tools/benchmark/run.py` because each
scenario is long (60-70s) and a single `run.py` already saturates one API slot.
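Applied to a re-scoring pass over both flows, sequential execution might look like the sketch below; the helper and the flag usage are illustrative assumptions, not the tool's actual CLI surface.

```python
import subprocess

def benchmark_cmd(variant: str, rejudge: bool = False) -> list[str]:
    """Build one run.py invocation (flag names follow the rule text above)."""
    cmd = ["python", "tools/benchmark/run.py", "--variant", variant]
    if rejudge:
        cmd.append("--rejudge")
    return cmd

def run_sequentially(variants: list[str]) -> None:
    """Run each variant to completion before starting the next one."""
    for variant in variants:  # strictly one at a time; never two live processes
        subprocess.run(benchmark_cmd(variant, rejudge=True), check=True)

# Usage: run_sequentially(["ids", "current"]) runs one full pass of each, in order.
```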

## Simulate before measuring

A full 30-case measurement costs roughly $8 and 20 minutes. When you come up with an
improvement idea, do not jump straight to measurement: first simulate it locally with a
script, confirm it looks promising, and only then run the real measurement.

**Why:** repeating the cycle of trying an idea, waiting 20 minutes, reading the result, and
trying another idea inflates both time and cost. The parts that can be reproduced without
calling the LLM (keyword extraction, grep hits, how the index looks after a change, human
judgment of candidate titles) can all be checked locally first. Once the expected improvement
has been eyeballed, a single measurement is enough to make the call.

**How to apply:**
- When proposing an improvement, first write a small script that reproduces it without calling the LLM
- Dry-run it on a few to 30 cases and show the user whether the change behaves as expected
- Proceed to the real measurement with `run.py` only after the user has agreed
- Do not judge from the measured result alone; cross-check it against the pre-measurement simulation
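An LLM-free dry run of this kind can be as small as a grep pass over the knowledge base. The sketch below is illustrative only: the `knowledge-base/` path and the term list are assumptions, not the actual tool.

```python
from pathlib import Path

def grep_hits(terms: list[str], kb_dir: str) -> dict[str, int]:
    """Count how many KB markdown files contain each candidate term (no LLM calls)."""
    texts = [p.read_text(encoding="utf-8") for p in Path(kb_dir).glob("**/*.md")]
    return {term: sum(term in text for text in texts) for term in terms}

# Show the user these counts before paying for a full run.py measurement.
```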

## Report format for L1/L0 scenario analysis

When analyzing and reporting scenarios that scored low, cover the following four items
**in terms the user can understand, not implementation details**.

| Item | How to write it |
|------|--------|
| **Question** | The question the user asked nabledge, verbatim |
| **Symptom** | What the AI failed to answer / what score it received |
| **Direct cause** | What happened at which AI step ("did not select X", "misjudged Y", etc.) |
| **Root cause** | Why that step behaved that way (design problem / data problem / bug, etc.) |

**Never write:**
- Class, method, or variable names presented as the cause (e.g. "`verify_kb_evidence.py` reports a mismatch" is NG)
- Raw log data pasted verbatim

**Good examples:**
- Direct cause: "when the judge AI quoted a passage that exists in the KB, it named the wrong source section for the quote"
- Root cause: "the judging tool has an implementation bug: when matching a quoted string, it interprets special characters in the text as shell commands"

Record investigation notes in tasks.md in the same format so the approach carries over across sessions.
14 changes: 14 additions & 0 deletions .claude/rules/communication.md
@@ -7,6 +7,20 @@
- Do not mix multiple topics in one message — separate concerns clearly
- When reporting status, always anchor to the task file so the user knows where things stand

### Concise first, details on request

Default to the shortest answer that conveys the conclusion and a supporting
example. Do not preemptively dump analysis tables, per-case breakdowns, or
multi-section explanations — wait until the user asks a follow-up.

- **Bad**: the first response includes 3 large tables covering every case and a
  "root cause" section the user never asked for
- **Good**: the first response is 2-4 lines stating the conclusion + one concrete
  example; offer to expand if needed

When the user says "分からない" ("I don't understand") / "詳しく" ("in detail") / asks a
specific follow-up, provide the targeted detail for that specific question — not the whole dump.

## Proposing, not asking for permission

When a decision needs to be made, propose the "should-be" state derived from the goal,
56 changes: 56 additions & 0 deletions .claude/rules/development.md
@@ -44,6 +44,62 @@ Launch both as subagents in separate contexts (see `.claude/rules/design-decisio

If either review returns **Needs Fix**, address the issues before continuing.

## Observe Real Output Before Claiming Success

Prompt changes and LLM-driven logic cannot be validated by reading the
prompt. The design may look right on paper and still produce broken
output — wrong field types, over-extracted lists, truncated streams,
off-spec JSON, over-granular fact extraction. Always observe real
output before declaring a change "working".

**When this rule fires:**
- Any change to an AI prompt (skill workflow, benchmark prompt, judge prompt, etc.)
- Any change to code that parses LLM structured output
- Any change to scoring / grading / verification logic that consumes LLM output

**Required procedure:**
1. After the change, run ONE real case end-to-end.
2. Dump the full structured output — every field, every list — and
read it. Do not stop at the summary statistic or the level number.
3. Check for shape anomalies: array lengths outside expected range,
fields that should be objects arriving as strings, truncation,
missing required fields, over-extraction (e.g., a "required facts"
list that has 15 items when 6–8 is the reasonable range).
4. Only after the single case is clean, run on a small batch (3-5 cases)
and read their full outputs too. Do not scale to 30+ cases until
3-5 pass inspection.

**Why:** In the 2026-04-23 benchmark judge rewrite, the mean level
looked plausible but per-case inspection revealed a 1314-element
`a_facts` array (truncated JSON parsed into a char-by-char list) and
over-granular fact extraction (15 "required" facts where the real
required set was 6–8). Both slipped past the summary statistic.
Per-case output inspection would have caught them immediately.
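The shape checks in steps 2-3 can be partially mechanized. A minimal sketch, assuming a judge output dict with an `a_facts` list; the field name and the 3-10 bound are illustrative, not the benchmark's actual schema:

```python
def check_shape(output: dict, min_facts: int = 3, max_facts: int = 10) -> list[str]:
    """Flag shape anomalies that a summary statistic would hide."""
    problems = []
    facts = output.get("a_facts")
    if not isinstance(facts, list):
        problems.append(f"a_facts should be a list, got {type(facts).__name__}")
    elif not min_facts <= len(facts) <= max_facts:
        # e.g. hundreds of single-character items means truncated JSON parsed char-by-char
        problems.append(f"a_facts has {len(facts)} items, expected {min_facts}-{max_facts}")
    elif any(not isinstance(f, str) or len(f) < 10 for f in facts):
        problems.append("a_facts contains non-string or suspiciously short entries")
    return problems
```

Reading the full dump is still required; a checker like this only catches gross anomalies.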

## Expert Review for Prompt Changes

Any change to an AI prompt — whether as part of a skill, benchmark, workflow, or one-off script — must pass **Prompt Engineer** expert review before the change is adopted.

**Scope (applies to):**
- `.claude/skills/*/workflows/*.md`
- `.claude/skills/*/assets/*.md` (user-facing prompts)
- `tools/benchmark/prompts/*.md`
- Any new prompt template or schema used with `claude -p` / Agent tool

**When this rule fires:**
- If the trigger for the change is itself an expert review (e.g., Round 1 prompt change was already recommended by a Prompt Engineer review), no additional review is required for that change.
- If the trigger is anything else — new hypothesis, user request, own judgment, refactor — Prompt Engineer review is **required** before the prompt goes live.
- Save the review to `.work/xxxxx/review-by-prompt-engineer-{round or topic}.md` and link from tasks.md / round log.

**Attach real execution output to the review.** Reviewing the prompt
text alone will miss behavior that only shows up in output — field
shape, list granularity, truncation patterns. When the prompt has been
run even once, include a representative full output (one case, every
field) in the review input. If the prompt is new and has not been
executed yet, run it once first, then send the review.

**Purpose:** Prompt quality is load-bearing for downstream accuracy. A change that looks obvious may degrade behavior in ways the author cannot see without an independent read.

## Fix Problems Immediately

When a problem is found — test failure, bug, incorrect behavior, rule violation — fix it immediately. Do not defer it as "out of scope" or "tracked separately."
26 changes: 26 additions & 0 deletions .claude/rules/token-efficiency.md
@@ -0,0 +1,26 @@
# Token Efficiency

## Split tasks at commit boundaries

Each step in tasks.md must map to exactly one `git commit`. This keeps
`/sv` + `/re` cycles short and prevents context window accumulation.

**Good**: step = one commit
**Bad**: one step spans multiple commits, or one step covers multiple concerns

When creating or updating tasks.md, ensure every `- [ ] Step N` can be
completed and committed in a single work unit.

## Run /sv automatically after each commit

**STOP after every commit. Do not continue to the next step.**

After completing a commit:

1. Immediately run the `/sv` skill — do not wait for the user to ask.
2. Tell the user:
> 「{step name} 完了 (`{short_hash}`)。`/clear` → `/re` でコンテキストをリセットして次のステップへ進んでください。」
3. Wait for the user to run `/clear` → `/re`. Do not proceed on your own.

**Why stop here**: each commit = one session boundary. Crossing that boundary
without a `/re` accumulates context and defeats the purpose of this rule.