Draft
172 commits
69dcfaf
docs: add work log for issue #307 benchmark search flow
kiyotis Apr 21, 2026
709b64a
wip: scaffold tools/benchmark for search flow comparison (#307)
kiyotis Apr 21, 2026
bfe47a4
docs: update tasks.md — redesign measurement as 3-stage with isolated…
kiyotis Apr 21, 2026
317a47e
docs: update tasks.md — add Round-based workflow and sample 5 scenarios
kiyotis Apr 21, 2026
1b5b5bb
docs: require Prompt Engineer review for prompt changes not triggered…
kiyotis Apr 22, 2026
06c6e34
feat: stage1 benchmark runner — extraction prompt, sample5 scenarios,…
kiyotis Apr 22, 2026
dbb8c95
docs: record Stage 1 Round 1 measurement and Prompt Engineer review
kiyotis Apr 22, 2026
e66a2d9
docs: update tasks.md — mark DECISION checkpoint for H-A approach
kiyotis Apr 22, 2026
9b73d0f
docs: add Prompt Engineer design review for Stage 1 Round 2
kiyotis Apr 22, 2026
2da705f
docs: add faceted search flow design review (#307)
kiyotis Apr 22, 2026
7cd1e77
docs: update tasks.md — pivot to faceted search flow (#307)
kiyotis Apr 22, 2026
9cac561
docs: reorder tasks.md — measure AI-1 facets first, defer mapping bac…
kiyotis Apr 22, 2026
8499c55
feat: add 2-axis stage1_facet.md + Prompt Engineer review (#307)
kiyotis Apr 22, 2026
2b3ffcc
feat: run.py Stage 1 facet extraction with per-axis Jaccard (#307)
kiyotis Apr 22, 2026
8074e5a
docs: tasks.md — switch to 2-axis (type, category) design (#307)
kiyotis Apr 22, 2026
caf005a
docs: record Stage 1 Round 2 results and Prompt Engineer review (#307)
kiyotis Apr 22, 2026
3eaf49d
feat: Stage 1 Round 3 — targeted prompt tweak clears all 5 scenarios …
kiyotis Apr 22, 2026
6f3d0a2
docs: tasks.md — Stage 1 complete, proceed to Stage 2 (#307)
kiyotis Apr 22, 2026
e4138a0
feat: Stage 2 facet filter module with 19 unit tests (#307)
kiyotis Apr 22, 2026
f0d85f1
feat: Stage 2 LLM judge prompt + Prompt Engineer review (#307)
kiyotis Apr 22, 2026
4692d05
feat: run.py Stage 2 — filter + judge + stream-json per scenario (#307)
kiyotis Apr 22, 2026
96402a0
fix(benchmark): recover structured output on error_max_turns
kiyotis Apr 22, 2026
5e8f702
docs: record Stage 2 Round 1 + Prompt Engineer fixes
kiyotis Apr 22, 2026
854b358
docs: update tasks.md — Stage 2 done, Stage 3 in progress
kiyotis Apr 22, 2026
373e12a
feat: run.py Stage 3 — AI-2 section select + answer + judge (#307)
kiyotis Apr 22, 2026
30b492a
docs: record Stage 3 Round 1 + Prompt Engineer reviews (#307)
kiyotis Apr 22, 2026
8fdb708
docs: update tasks.md — Stage 3 Round 1 complete
kiyotis Apr 22, 2026
935ae47
docs: 15-scenario intermediate check — all judge=3, fallback=none (#307)
kiyotis Apr 22, 2026
2cc4ffb
docs: 30-scenario baseline — Stage 3 judge=3 rate 30/30 (#307)
kiyotis Apr 22, 2026
f17cba3
docs: add task — replace LLM judge with ground-truth based scoring (#…
kiyotis Apr 22, 2026
5cc385c
docs: update tasks.md — citation-based scoring, Stage 2 script first
kiyotis Apr 22, 2026
9faea4c
docs: add reference-answer format + review-01 (#307)
kiyotis Apr 22, 2026
0c9a4a2
docs: add review-02..05 reference answers + fix drift (#307)
kiyotis Apr 22, 2026
3b38c9f
docs: add review-06..10 + impact-01..10 reference answers (#307)
kiyotis Apr 22, 2026
adaf63e
docs: complete 30 reference answers (req-01..10 + fixes) (#307)
kiyotis Apr 22, 2026
946421c
docs: link PR #310 in tasks.md
kiyotis Apr 22, 2026
bc7a670
feat(benchmark): reference-answer-based Stage 2 script judge (#307)
kiyotis Apr 22, 2026
8ff5960
docs: reset tasks.md — info design flaw, reassess approach (#307)
kiyotis Apr 22, 2026
fe2fb04
docs: add reset policy to tasks.md, preserve existing plan (#307)
kiyotis Apr 22, 2026
c361d52
docs: revert tasks.md to pre-reset state (#307)
kiyotis Apr 22, 2026
69f9122
docs: update tasks.md — AI-1 info design flaw, Stage 1 redesign neede…
kiyotis Apr 22, 2026
8cf50fb
docs: clarify Stage 1 redesign task with concrete options (#307)
kiyotis Apr 22, 2026
7adc061
docs: redesign AI-1 approach — split LLM/script index (#307)
kiyotis Apr 22, 2026
aa1114a
feat: add LLM/script index generator for direct-ID search (#307)
kiyotis Apr 22, 2026
c47d764
feat: add stage3 ids variant — direct file_id|sid selection flow (#307)
kiyotis Apr 22, 2026
4598d75
docs: update tasks.md — ids variant shipped, measurement next (#307)
kiyotis Apr 22, 2026
db58163
docs: update tasks.md — Haiku 5-scenario run done, Sonnet next (#307)
kiyotis Apr 22, 2026
456bccd
docs: restructure tasks.md — 2-flow clarity, Sonnet fixed (#307)
kiyotis Apr 22, 2026
76c8eb0
feat: add current-flow variant for apples-to-apples benchmark (#307)
kiyotis Apr 22, 2026
76898b0
docs: record 30-scenario comparison ids vs current (#307)
kiyotis Apr 22, 2026
cb81695
feat: add judge_stage3_v2 fact-coverage judge + judge-only mode (#307)
kiyotis Apr 22, 2026
af43acb
docs: rewrite tasks.md to should-be plan (#307)
kiyotis Apr 23, 2026
322f616
docs: add benchmark rules — no parallel run.py invocations (#307)
kiyotis Apr 23, 2026
7abb7f3
refactor: restructure benchmark tool to should-be layout (#307)
kiyotis Apr 23, 2026
a7a016d
docs: rewrite notes.md to design-decisions only (#307)
kiyotis Apr 23, 2026
515ccd0
docs: update tasks.md — refactor done, next is 30-scenario rerun (#307)
kiyotis Apr 23, 2026
ec23377
docs: add rules on observing real LLM output before declaring success…
kiyotis Apr 23, 2026
2d8cc22
refactor: rewrite judge to A/B/C coverage grading — WIP (#307)
kiyotis Apr 23, 2026
1e2118b
docs: update tasks.md — pivot to A-pre-authored judge design (#307)
kiyotis Apr 23, 2026
f6217db
refactor: judge uses pre-authored A-facts per scenario (#307)
kiyotis Apr 23, 2026
781e96a
docs: add pre-authored a_facts to 30 v6 scenarios (#307)
kiyotis Apr 23, 2026
c3546c0
chore: refresh current-sonnet baseline judge.json with new A/B/C judg…
kiyotis Apr 23, 2026
0d00f9e
docs: update tasks.md — improving the 14 ids L1 cases is this PR's goal (#307)
kiyotis Apr 23, 2026
7507547
docs: update tasks.md — impact-01 root cause analysis done
kiyotis Apr 23, 2026
3e17c73
docs: add L1 root cause analysis for 14 ids scenarios (#307)
kiyotis Apr 23, 2026
c92eec7
refactor: recall-first search_ids + a_facts scope fixes (#307)
kiyotis Apr 23, 2026
24874f2
docs: split tasks into Phase 1 (search) → Phase 2 (answer) (#307)
kiyotis Apr 23, 2026
960bfef
feat: add term_queries to AI-1 search_ids with body substring grep (#…
kiyotis Apr 23, 2026
d4316f8
refactor: term_queries coverage-gap first + anti-broad-class rule (#307)
kiyotis Apr 23, 2026
9baf220
feat: --search-only mode + coverage.py for Phase 1 measurement (#307)
kiyotis Apr 23, 2026
e44b79f
docs: update tasks.md — Phase 1 Step 1-2 done, Step 3 pending decisio…
kiyotis Apr 23, 2026
920ed2a
docs: add simulate-before-measure rule to benchmark (#307)
kiyotis Apr 24, 2026
511f205
chore: add scikit-learn to setup.sh for TF-IDF index enrichment (#307)
kiyotis Apr 24, 2026
c16402c
feat: classify_terms.py + index-enrichment docs for TF-IDF-based inde…
kiyotis Apr 24, 2026
e5eb8eb
docs: update tasks.md — Phase 1 Step 3 evolved to index enrichment (#…
kiyotis Apr 24, 2026
67d646e
feat: enrich index-llm.md with section-level keywords (#307)
kiyotis Apr 24, 2026
a590736
docs: update tasks.md — next step is term_queries mechanization + 10-…
kiyotis Apr 24, 2026
a7ad4f1
feat: term_extract module for deterministic question-term extraction …
kiyotis Apr 24, 2026
9039ca4
feat: df_pct stopset generator for v6 term queries (#307)
kiyotis Apr 24, 2026
b0c2ae9
refactor: drop AI-1 term_queries; script extracts terms from question…
kiyotis Apr 24, 2026
b17dfd4
test: update test_build_index for allowlist signature (#307)
kiyotis Apr 24, 2026
cd9dc38
docs: Step A done — term_queries extraction, PE review saved (#307)
kiyotis Apr 24, 2026
bc3fafd
fix: restrict term_extract to 4+ char ASCII identifiers (#307)
kiyotis Apr 24, 2026
87ed59c
docs: pivot to section-level TF-IDF; redo from scratch (#307)
kiyotis Apr 24, 2026
143ee40
docs: tasks.md — TF only (no IDF), stoplist-first approach (#307)
kiyotis Apr 24, 2026
08bf67a
docs: tasks.md — Step 0 narrowed to docs-only (rename during Step 2/3)
kiyotis Apr 24, 2026
93af2c7
docs: rewrite index-enrichment.md for section-level TF approach (#307)
kiyotis Apr 24, 2026
e9f6766
docs: tasks.md — Step 0 done (docs/index-enrichment.md rewritten)
kiyotis Apr 24, 2026
2548161
feat: section_df_ja.py — Japanese term section_df for stoplist judgme…
kiyotis Apr 24, 2026
8959cb7
feat: v6 Japanese stoplist for section-level TF index enrichment (#307)
kiyotis Apr 24, 2026
df56cd4
docs: tasks.md — Step 1 done (51-term stoplist committed)
kiyotis Apr 24, 2026
ed115b1
docs: freeze index enrichment params (tf>=2, section top-N=5) before …
kiyotis Apr 24, 2026
1c66c95
docs: tasks.md — freeze params + 10-case sim done, promote Phase 2 to…
kiyotis Apr 24, 2026
db6baeb
chore: tidy workspace — drop superseded artifacts, compress tasks.md …
kiyotis Apr 24, 2026
94dc02c
feat: classify_terms.py section-level TF (#307)
kiyotis Apr 24, 2026
e4187cf
feat: build_index.py consumes section-keyed keyword map (#307)
kiyotis Apr 24, 2026
b94c604
docs: tasks.md — Step 2/3 done + expert reviews (#307)
kiyotis Apr 24, 2026
c75a99c
feat: regenerate index-llm.md with section-level TF keywords (#307)
kiyotis Apr 24, 2026
9295593
docs: tasks.md — reflect index regeneration commit (#307)
kiyotis Apr 24, 2026
c491dc3
docs: tasks.md — pivot to 4-step Read-based AI-1 (#307)
kiyotis Apr 24, 2026
2b1d7ca
chore: gitignore root .results/ directory (#307)
kiyotis Apr 24, 2026
fa42a20
docs: communication rule — concise first, details on request
kiyotis Apr 24, 2026
3011785
refactor: benchmark — version-parameterized paths (#307)
kiyotis Apr 24, 2026
26f0af9
feat: 4-step Read-based AI-1 with merged answer (#307)
kiyotis Apr 24, 2026
36ae333
feat: judge verifies C-claims against full KB (#307)
kiyotis Apr 24, 2026
415c6aa
docs: Step 4/5 trial scenario subsets (#307)
kiyotis Apr 24, 2026
72f62aa
fix: judge — raise max_turns/timeout, cap tool budget (#307)
kiyotis Apr 24, 2026
220fd9f
docs: tasks.md — Step 5 merged AI-1 done, Step 6 pending (#307)
kiyotis Apr 24, 2026
d8fedda
docs: tasks.md — Step 6 direction settled; 4 search_ids.md quality issues (#307)
kiyotis Apr 27, 2026
47d30e2
wip: search_ids.md — Step 6 prompt rewrite in progress
kiyotis Apr 27, 2026
cf0ebec
wip: H-1 TDD RED — caveats schema {note,cited}[] test added (#307)
kiyotis Apr 27, 2026
bf4e722
chore: add token-efficiency rule, update sv/re for commit-granular se…
kiyotis Apr 27, 2026
da6d77c
docs: tasks.md — session end, H1-L2 pending (#307)
kiyotis Apr 27, 2026
22d8668
feat: H-1 — caveats schema changed to {note,cited}[] (#307)
kiyotis Apr 27, 2026
86e068a
docs: tasks.md — H-1 done, H-2 next (#307)
kiyotis Apr 27, 2026
3f5b0f6
feat: H-2 — scope_note field added to read_notes[].relevant_sections[…
kiyotis Apr 27, 2026
c11c925
docs: tasks.md — H-2 done, H-3 next (#307)
kiyotis Apr 27, 2026
5b09afd
feat: H-3 — self-check substep added before answer composition in Ste…
kiyotis Apr 27, 2026
22704ce
feat: M-1 — candidate_files / read_notes maxItems 20→12
kiyotis Apr 27, 2026
cfaa185
docs: tasks.md — M-1 done, M-2 next (#307)
kiyotis Apr 27, 2026
5788665
docs: auto-run /sv after each commit (token-efficiency rule)
kiyotis Apr 27, 2026
8d7d67a
feat: M-2/M-3/M-4/L-1/L-2 — prompt clarity improvements (#307)
kiyotis Apr 27, 2026
c5f63c0
docs: save PE review for Step 6 H-1 through L-2 (#307)
kiyotis Apr 27, 2026
de84f7f
docs: tasks.md — M-2 through L-2 + PE review done, 3-case trial run next (#307)
kiyotis Apr 27, 2026
b3adf3c
docs: strengthen stop-after-commit rule in token-efficiency.md
kiyotis Apr 27, 2026
8f35f98
docs: fix /re instruction — must be /clear -> /re to reset context
kiyotis Apr 27, 2026
c93fadd
docs: tasks.md — record 3-case trial results + add per-case detailed investigation tasks (#307)
kiyotis Apr 27, 2026
3c503d6
docs: tasks.md — rewrite investigation tasks to be fact-based (#307)
kiyotis Apr 27, 2026
3703ca6
docs: tasks.md — req-05 investigation done, add judge fix task (#307)
kiyotis Apr 27, 2026
6492d52
docs: tasks.md — investigation 2 done, add full a_facts review task (#307)
kiyotis Apr 27, 2026
8dfcaca
docs: tasks.md — investigation 3 done, add judge Grep-only change task (#307)
kiyotis Apr 27, 2026
ae301d5
docs: tasks.md — reorganize into Fix-1 through Fix-3 + Step 6-D (#307)
kiyotis Apr 27, 2026
d9fc742
docs: tasks.md — split Step 6-D into per-case verification tasks (#307)
kiyotis Apr 27, 2026
7596679
docs: tasks.md — revise Fix-1 through Fix-3 background and approach (#307)
kiyotis Apr 27, 2026
90d54ed
feat: add llm_tools/verify_kb_evidence.py for judge self-correction
kiyotis Apr 27, 2026
235b03f
fix: judge self-corrects SUPPORTED_BY_KB and uses Grep-only KB verify
kiyotis Apr 27, 2026
5d49953
fix: remove solution-specific a_fact from review-01 scenario
kiyotis Apr 27, 2026
8055a0f
docs: tasks.md — Fix-1 through Fix-3 done; Step 6-D awaiting re-measurement
kiyotis Apr 27, 2026
5420eb7
docs: tasks.md — add Expert Review tasks (SE + PE)
kiyotis Apr 27, 2026
7358292
Revert "fix: remove solution-specific a_fact from review-01 scenario"
kiyotis Apr 27, 2026
e089348
docs: tasks.md — correct Fix-3 to a full 30-scenario review; move Expert Review before measurement
kiyotis Apr 27, 2026
df4f5a6
docs: tasks.md — expand Fix-3 into 30 individual steps
kiyotis Apr 27, 2026
77f0e07
test: tighten a_facts in qa-v6.json to match question scope
kiyotis Apr 27, 2026
2689373
docs: tasks.md — Fix-3 done; Step 6-D awaiting re-measurement
kiyotis Apr 27, 2026
88bad08
fix: address SE+PE expert review findings on Fix-1/2 (judge + verify)
kiyotis Apr 27, 2026
a6baae4
docs: tasks.md — Expert Review done; Step 6-D awaiting re-measurement
kiyotis Apr 27, 2026
61e976f
docs: tasks.md — Step 6-D re-measurement done (req-05/review-01/review-08 all at L3)
kiyotis Apr 27, 2026
f5edbd4
docs: tasks.md — add Step 6-E (re-score main search with the new judge to establish a comparison baseline)
kiyotis Apr 27, 2026
3beb1d1
docs: tasks.md — merge Step 6-E into Step 7 (run the comparison baseline and 30-case measurement together once the new search is settled)
kiyotis Apr 27, 2026
1d39219
docs: tasks.md — drop Step 7-A (directly comparable to baseline/ids-sonnet)
kiyotis Apr 27, 2026
138b379
docs: tasks.md — add Step 7-0 (ids→next rename, done before measurement)
kiyotis Apr 27, 2026
ee8f937
refactor: rename ids variant to next (search_ids.py → search_next.py,…
kiyotis Apr 27, 2026
7c489c1
docs: tasks.md — Step 7-0 done (ids→next rename ee8f93754)
kiyotis Apr 27, 2026
d99f672
fix: search_next.py — prompt filename and variant string still refere…
kiyotis Apr 27, 2026
3798497
docs: tasks.md — update Step 7-0 completion record (append bugfix d99f672fd)
kiyotis Apr 27, 2026
71861a0
docs: tasks.md — Step 7-B trial run verified (single req-05 case at L3)
kiyotis Apr 28, 2026
28663b4
docs: tasks.md — Step 7-B 14-case measurement results + add L1/L0 analysis tasks
kiyotis Apr 28, 2026
0a646c8
docs: tasks.md — record Step 7-B-1 through 7-B-4 analysis approach and established facts
kiyotis Apr 28, 2026
ab40abc
docs: tasks.md + README — Step 7-B-1 through 7-B-4 done; human review procedure in place
kiyotis Apr 28, 2026
c805281
docs: tasks.md — Step 7-B-1/2 done; split 7-B-3/4 into individual tasks
kiyotis Apr 28, 2026
00a5958
docs: tasks.md — mark Step 7-B-2 as awaiting approval
kiyotis Apr 28, 2026
ac8f65c
docs: tasks.md — revise Step 7-B-2 conclusion (judge false positive; via human_review …
kiyotis Apr 28, 2026
68ea496
docs: tasks.md — Step 7-B-2 done
kiyotis Apr 28, 2026
84385b5
docs: tasks.md — record Step 7-B-3 interim investigation notes
kiyotis Apr 28, 2026
5e7c92d
docs: add L1/L0 analysis report format rule to benchmark.md
kiyotis Apr 28, 2026
7f9697f
docs: tasks.md — Step 7-B-3 done; judge fix task approach settled
kiyotis Apr 28, 2026
d14de0d
fix: fix backtick/shell-escape issues in judge verify pipeline (revie…
kiyotis Apr 28, 2026
04269b8
docs: tasks.md — 7-B-4 done; cross-cutting backtick/escape fixes done; next: review-03 a_fact …
kiyotis Apr 28, 2026
fcecc73
fix: remove out-of-scope a_fact from review-03 (JAX-RS annotation)
kiyotis Apr 28, 2026
3b414c2
docs: tasks.md — review-03 a_fact fix done; next: 14-case re-measurement in Step 7-C
kiyotis Apr 28, 2026
f6a95ca
docs: tasks.md — Step 7-C partially done; investigation tasks queued (review-07/05/08)
kiyotis Apr 28, 2026
d4e88c7
docs: tasks.md — split investigation tasks into one section per case
kiyotis Apr 28, 2026
9 changes: 8 additions & 1 deletion .claude/commands/re.md
@@ -70,7 +70,14 @@ Proceed with the next task immediately. Work through the steps in tasks.md in or
 - Update `**Updated**` date when modifying tasks.md
 - If a step is ambiguous, resolve it from context (notes.md, code) before asking the user
 
-Continue through all remaining tasks unless a `[BLOCKED:]` or `[DECISION:]` marker is encountered.
+**Stop after each commit** — do not automatically continue to the next step.
+After committing, tell the user:
+> 「{step name} 完了 (`{short_hash}`)。`/sv` → `/re` でコンテキストをリセットして次のステップへ進むことを推奨します。」
+
+Continue to the next step only if the user explicitly requests it (e.g. "続けて" / "next").
+This keeps each session to one commit and prevents context window accumulation.
+
+If there is a step marked `[BLOCKED:]` or `[DECISION:]`, use AskUserQuestion to resolve it before proceeding.
 
 When all tasks are complete, notify the user and suggest `/sv` or `/pr create`.
 
18 changes: 9 additions & 9 deletions .claude/commands/sv.md
@@ -74,25 +74,24 @@ Confirm working tree is clean and the latest commit reflects what was done.
 
 ## 5. Output Handoff Summary
 
-Print a concise summary for the next session:
+Print a concise summary for the next session. Keep it short — one line per item:
 
 ```
 ## Saved: #{pr_number} — {PR title}
 
 **Branch**: {branch} (clean)
 **Latest commit**: {short_hash} {commit message}
 
-### What was done this session
-- {task or step completed}
-- {task or step completed}
+### Done this session
+- {step completed — one line}
 
-### Next task
-{first remaining task from tasks.md — be specific}
+### Next step
+{exact next step from tasks.md — one line, be specific}
 
-### Needs decision / blocked
-{anything requiring user input before work can continue, if any}
+### Blocked / needs decision
+{only if something requires user input; omit section if none}
 
-Resume with: /re
+/re で再開
 ```
 
 # Important
@@ -101,3 +100,4 @@ Resume with: /re
 - Never leave modified tracked files uncommitted
 - If a change is too incomplete to commit, use `git stash` and note it in the summary
 - Do not implement anything new — this command only records state
+- Keep the summary minimal — the goal is a clean handoff, not a progress report
64 changes: 64 additions & 0 deletions .claude/rules/benchmark.md
@@ -0,0 +1,64 @@
# Benchmark Rules

Rules for running `tools/benchmark/` (search-flow accuracy benchmark).

## Do not run `run.py` invocations in parallel

Run benchmark commands sequentially. Never start a second `tools/benchmark/run.py`
invocation while another one is still running.

**Why:** each scenario spawns a `claude` CLI with `--max-turns 2` and a 240s
timeout. Running two `run.py` processes in parallel doubles the concurrent CLI
load, pushes API latency past 240s, and causes many scenarios to fail with
`subprocess.TimeoutExpired`. Observed 2026-04-23 — parallel judge-only runs of
the `ids` and `current` flows both crashed partway through while the same
commands had previously succeeded when run sequentially.

**How to apply:**
- Applies to all `run.py` modes: `--variant ids/current`, `--rejudge`.
- When re-scoring both `ids` and `current` flows, run one to completion, then
run the other.
- If timeouts occur even in a single sequential run, investigate other
Anthropic API consumers on the machine — do not just raise the timeout.

This does **not** generalize to all `claude -p` usage. Other tools that invoke
`claude -p` (e.g. `tools/knowledge-creator`, `nabledge-test`) are known to work
in parallel. The rule is specific to `tools/benchmark/run.py` because each
scenario is long (60-70s) and a single `run.py` already saturates one API slot.
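Applied to a re-scoring pass over both flows, sequential execution might look like the sketch below; the helper and the flag usage are illustrative assumptions, not the tool's actual CLI surface.

```python
import subprocess

def benchmark_cmd(variant: str, rejudge: bool = False) -> list[str]:
    """Build one run.py invocation (flag names follow the rule text above)."""
    cmd = ["python", "tools/benchmark/run.py", "--variant", variant]
    if rejudge:
        cmd.append("--rejudge")
    return cmd

def run_sequentially(variants: list[str]) -> None:
    """Run each variant to completion before starting the next one."""
    for variant in variants:  # strictly one at a time; never two live processes
        subprocess.run(benchmark_cmd(variant, rejudge=True), check=True)

# Usage: run_sequentially(["ids", "current"]) runs one full pass of each, in order.
```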

## Simulate before measuring

A full 30-case measurement costs roughly $8 and 20 minutes. When you come up with an
improvement idea, do not jump straight to measurement: first simulate it locally with a
script, confirm it looks promising, and only then run the real measurement.

**Why:** repeating the cycle of trying an idea, waiting 20 minutes, reading the result, and
trying another idea inflates both time and cost. The parts that can be reproduced without
calling the LLM (keyword extraction, grep hits, how the index looks after a change, human
judgment of candidate titles) can all be checked locally first. Once the expected improvement
has been eyeballed, a single measurement is enough to make the call.

**How to apply:**
- When proposing an improvement, first write a small script that reproduces it without calling the LLM
- Dry-run it on a few to 30 cases and show the user whether the change behaves as expected
- Proceed to the real measurement with `run.py` only after the user has agreed
- Do not judge from the measured result alone; cross-check it against the pre-measurement simulation
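An LLM-free dry run of this kind can be as small as a grep pass over the knowledge base. The sketch below is illustrative only: the `knowledge-base/` path and the term list are assumptions, not the actual tool.

```python
from pathlib import Path

def grep_hits(terms: list[str], kb_dir: str) -> dict[str, int]:
    """Count how many KB markdown files contain each candidate term (no LLM calls)."""
    texts = [p.read_text(encoding="utf-8") for p in Path(kb_dir).glob("**/*.md")]
    return {term: sum(term in text for text in texts) for term in terms}

# Show the user these counts before paying for a full run.py measurement.
```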

## Report format for L1/L0 scenario analysis

When analyzing and reporting scenarios that scored low, cover the following four items
**in terms the user can understand, not implementation details**.

| Item | How to write it |
|------|--------|
| **Question** | The question the user asked nabledge, verbatim |
| **Symptom** | What the AI failed to answer / what score it received |
| **Direct cause** | What happened at which AI step ("did not select X", "misjudged Y", etc.) |
| **Root cause** | Why that step behaved that way (design problem / data problem / bug, etc.) |

**Never write:**
- Class, method, or variable names presented as the cause (e.g. "`verify_kb_evidence.py` reports a mismatch" is NG)
- Raw log data pasted verbatim

**Good examples:**
- Direct cause: "when the judge AI quoted a passage that exists in the KB, it named the wrong source section for the quote"
- Root cause: "the judging tool has an implementation bug: when matching a quoted string, it interprets special characters in the text as shell commands"

Record investigation notes in tasks.md in the same format so the approach carries over across sessions.
14 changes: 14 additions & 0 deletions .claude/rules/communication.md
@@ -7,6 +7,20 @@
- Do not mix multiple topics in one message — separate concerns clearly
- When reporting status, always anchor to the task file so the user knows where things stand

### Concise first, details on request

Default to the shortest answer that conveys the conclusion and a supporting
example. Do not preemptively dump analysis tables, per-case breakdowns, or
multi-section explanations — wait until the user asks a follow-up.

- **Bad**: the first response includes 3 large tables covering every case and a
  "root cause" section the user never asked for
- **Good**: the first response is 2-4 lines stating the conclusion + one concrete
  example; offer to expand if needed

When the user says "分からない" ("I don't understand") / "詳しく" ("in detail") / asks a
specific follow-up, provide the targeted detail for that specific question — not the whole dump.

## Proposing, not asking for permission

When a decision needs to be made, propose the "should-be" state derived from the goal,
56 changes: 56 additions & 0 deletions .claude/rules/development.md
@@ -44,6 +44,62 @@ Launch both as subagents in separate contexts (see `.claude/rules/design-decisio

If either review returns **Needs Fix**, address the issues before continuing.

## Observe Real Output Before Claiming Success

Prompt changes and LLM-driven logic cannot be validated by reading the
prompt. The design may look right on paper and still produce broken
output — wrong field types, over-extracted lists, truncated streams,
off-spec JSON, over-granular fact extraction. Always observe real
output before declaring a change "working".

**When this rule fires:**
- Any change to an AI prompt (skill workflow, benchmark prompt, judge prompt, etc.)
- Any change to code that parses LLM structured output
- Any change to scoring / grading / verification logic that consumes LLM output

**Required procedure:**
1. After the change, run ONE real case end-to-end.
2. Dump the full structured output — every field, every list — and
read it. Do not stop at the summary statistic or the level number.
3. Check for shape anomalies: array lengths outside expected range,
fields that should be objects arriving as strings, truncation,
missing required fields, over-extraction (e.g., a "required facts"
list that has 15 items when 6–8 is the reasonable range).
4. Only after the single case is clean, run on a small batch (3-5 cases)
and read their full outputs too. Do not scale to 30+ cases until
3-5 pass inspection.

**Why:** In the 2026-04-23 benchmark judge rewrite, the mean level
looked plausible but per-case inspection revealed a 1314-element
`a_facts` array (truncated JSON parsed into a char-by-char list) and
over-granular fact extraction (15 "required" facts where the real
required set was 6–8). Both slipped past the summary statistic.
Per-case output inspection would have caught them immediately.
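The shape checks in steps 2-3 can be partially mechanized. A minimal sketch, assuming a judge output dict with an `a_facts` list; the field name and the 3-10 bound are illustrative, not the benchmark's actual schema:

```python
def check_shape(output: dict, min_facts: int = 3, max_facts: int = 10) -> list[str]:
    """Flag shape anomalies that a summary statistic would hide."""
    problems = []
    facts = output.get("a_facts")
    if not isinstance(facts, list):
        problems.append(f"a_facts should be a list, got {type(facts).__name__}")
    elif not min_facts <= len(facts) <= max_facts:
        # e.g. hundreds of single-character items means truncated JSON parsed char-by-char
        problems.append(f"a_facts has {len(facts)} items, expected {min_facts}-{max_facts}")
    elif any(not isinstance(f, str) or len(f) < 10 for f in facts):
        problems.append("a_facts contains non-string or suspiciously short entries")
    return problems
```

Reading the full dump is still required; a checker like this only catches gross anomalies.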

## Expert Review for Prompt Changes

Any change to an AI prompt — whether as part of a skill, benchmark, workflow, or one-off script — must pass **Prompt Engineer** expert review before the change is adopted.

**Scope (applies to):**
- `.claude/skills/*/workflows/*.md`
- `.claude/skills/*/assets/*.md` (user-facing prompts)
- `tools/benchmark/prompts/*.md`
- Any new prompt template or schema used with `claude -p` / Agent tool

**When this rule fires:**
- If the trigger for the change is itself an expert review (e.g., Round 1 prompt change was already recommended by a Prompt Engineer review), no additional review is required for that change.
- If the trigger is anything else — new hypothesis, user request, own judgment, refactor — Prompt Engineer review is **required** before the prompt goes live.
- Save the review to `.work/xxxxx/review-by-prompt-engineer-{round or topic}.md` and link from tasks.md / round log.

**Attach real execution output to the review.** Reviewing the prompt
text alone will miss behavior that only shows up in output — field
shape, list granularity, truncation patterns. When the prompt has been
run even once, include a representative full output (one case, every
field) in the review input. If the prompt is new and has not been
executed yet, run it once first, then send the review.

**Purpose:** Prompt quality is load-bearing for downstream accuracy. A change that looks obvious may degrade behavior in ways the author cannot see without an independent read.

## Fix Problems Immediately

When a problem is found — test failure, bug, incorrect behavior, rule violation — fix it immediately. Do not defer it as "out of scope" or "tracked separately."
26 changes: 26 additions & 0 deletions .claude/rules/token-efficiency.md
@@ -0,0 +1,26 @@
# Token Efficiency

## Split tasks at commit boundaries

Each step in tasks.md must map to exactly one `git commit`. This keeps
`/sv` + `/re` cycles short and prevents context window accumulation.

**Good**: step = one commit
**Bad**: one step spans multiple commits, or one step covers multiple concerns

When creating or updating tasks.md, ensure every `- [ ] Step N` can be
completed and committed in a single work unit.

## Run /sv automatically after each commit

**STOP after every commit. Do not continue to the next step.**

After completing a commit:

1. Immediately run the `/sv` skill — do not wait for the user to ask.
2. Tell the user:
> 「{step name} 完了 (`{short_hash}`)。`/clear` → `/re` でコンテキストをリセットして次のステップへ進んでください。」
3. Wait for the user to run `/clear` → `/re`. Do not proceed on your own.

**Why stop here**: each commit = one session boundary. Crossing that boundary
without a `/re` accumulates context and defeats the purpose of this rule.