
Fix acc_all_stderr grouping by question_id only (drops paragraph_id)#3695

Open
Chessing234 wants to merge 1 commit into EleutherAI:main from Chessing234:fix/acc-all-stderr-paragraph-key

Conversation

@Chessing234

Bug

`acc_all_stderr` in `lm_eval/api/metrics.py` silently merges MultiRC questions from different paragraphs into a single bucket, so the stderr it produces disagrees with the point estimate that `acc_all` computes for the very same rows.

Root cause

`acc_all` and `acc_all_stderr` are a paired metric/stderr for the `acc_all` registered metric — a MultiRC-style "a question counts correct iff every answer row for that question is correct" metric. The metric function keys the per-question bucket by `(paragraph_id, question_id)`:

```python
def acc_all(items):
    ...
    for doc, pred in zip(docs, preds):
        paragraph_id = doc["idx"]["paragraph"]
        question_id = doc["idx"]["question"]
        if (paragraph_id, question_id) not in question_scoring_dict:
            question_scoring_dict[(paragraph_id, question_id)] = []
        ...
```

but the stderr function right below it drops `paragraph_id` entirely and keys by just `question_id`:

```python
def acc_all_stderr(items):
    ...
    for doc, pred in zip(docs, preds):
        question_id = doc["idx"]["question"]
        if question_id not in question_scoring_dict:
            question_scoring_dict[question_id] = []
        ...
```

In SuperGLUE MultiRC, `question_id` is only unique within a paragraph — the same numeric `question_id` (0, 1, 2, ...) is reused across different paragraphs. Dropping `paragraph_id` from the bucket key conflates unrelated questions from different paragraphs: `all(x)` is then taken over a merged bucket containing answer rows from multiple distinct questions, and the resulting 0/1 list handed to `mean_stderr` has the wrong size and wrong per-question correctness values.

Concrete worked example

Paragraph 1 has question 0 with answers `[T, T]` (both correct). Paragraph 2 has question 0 with answers `[T, F]` (one wrong).

  • `acc_all`: keys `(1, 0)` = `[True, True]` → 1, `(2, 0)` = `[True, False]` → 0. `mean = 0.5`.
  • `acc_all_stderr` (current): key `0` = `[True, True, True, False]` → `all(...) = False` → 0. `mean_stderr` of `[0]` = 0.

The stderr function sees 1 “question” (the merged bucket) and marks it wrong, even though `acc_all` correctly sees 2 questions with mean 0.5. Any downstream consumer comparing the two numbers will see inconsistent results.
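The worked example can be reproduced with a few lines of standalone Python (toy rows, not the real harness; field names mirror SuperGLUE MultiRC docs):

```python
# Toy MultiRC-style rows: paragraph 1 / question 0 and paragraph 2 /
# question 0 share the same numeric question_id.
docs = [
    {"idx": {"paragraph": 1, "question": 0}, "label": 1},  # gold True
    {"idx": {"paragraph": 1, "question": 0}, "label": 1},  # gold True
    {"idx": {"paragraph": 2, "question": 0}, "label": 1},  # gold True
    {"idx": {"paragraph": 2, "question": 0}, "label": 0},  # gold False
]
preds = [True, True, True, True]  # model gets the last row wrong

def bucket(docs, preds, key_fn):
    """Group per-row correctness by key_fn, then score each bucket with all()."""
    buckets = {}
    for doc, pred in zip(docs, preds):
        gold = doc["label"] == 1
        buckets.setdefault(key_fn(doc), []).append(gold == pred)
    return [int(all(v)) for v in buckets.values()]

# acc_all's key: (paragraph_id, question_id)
correct = bucket(docs, preds, lambda d: (d["idx"]["paragraph"], d["idx"]["question"]))
# acc_all_stderr's key (buggy): question_id only
buggy = bucket(docs, preds, lambda d: d["idx"]["question"])

print(correct)  # [1, 0] -> 2 questions, mean 0.5
print(buggy)    # [0]    -> 1 merged "question", marked wrong
```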

Fix

Mirror `acc_all`'s bucket key in `acc_all_stderr`:

```diff
 for doc, pred in zip(docs, preds):
+    paragraph_id = doc["idx"]["paragraph"]
     question_id = doc["idx"]["question"]
-    if question_id not in question_scoring_dict:
-        question_scoring_dict[question_id] = []
+    if (paragraph_id, question_id) not in question_scoring_dict:
+        question_scoring_dict[(paragraph_id, question_id)] = []
     gold_label = doc["label"] == 1
-    question_scoring_dict[question_id].append(gold_label == pred)
+    question_scoring_dict[(paragraph_id, question_id)].append(gold_label == pred)
```

No other call sites are affected (grepped `acc_all` across the repo; only `lm_eval/api/metrics.py` references it), so this is a strictly corrective change that makes the stderr function bucket over the same question set as the metric.
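As a sanity check on the worked example, the fixed keying yields the per-question list `[1, 0]`, and its standard error agrees with the `acc_all` point estimate of 0.5. This sketch assumes the harness's `mean_stderr` is the usual sample standard error of the mean (n−1 sample variance divided by n, then square-rooted); the exact implementation in `lm_eval` may differ in details:

```python
import math
from statistics import stdev  # stdev uses the n-1 sample variance

# Buckets under the corrected (paragraph_id, question_id) key,
# matching the worked example above.
buckets = {
    (1, 0): [True, True],   # all rows correct -> question scores 1
    (2, 0): [True, False],  # one row wrong    -> question scores 0
}
per_question = [int(all(v)) for v in buckets.values()]  # [1, 0]

mean = sum(per_question) / len(per_question)             # 0.5, matches acc_all
stderr = stdev(per_question) / math.sqrt(len(per_question))
print(mean, stderr)  # 0.5 0.5
```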

@Chessing234 Chessing234 requested a review from 0xSMT as a code owner April 11, 2026 10:28