
Fix acc_all_stderr grouping by question_id only (drops paragraph_id)#3695

Open
Chessing234 wants to merge 1 commit into EleutherAI:main from Chessing234:fix/acc-all-stderr-paragraph-key

Conversation

@Chessing234

Bug

`acc_all_stderr` in `lm_eval/api/metrics.py` silently merges MultiRC questions from different paragraphs into a single bucket, so the stderr it produces disagrees with the point estimate that `acc_all` computes for the very same rows.

Root cause

`acc_all` and `acc_all_stderr` are a paired metric/stderr for the `acc_all` registered metric — a MultiRC-style "a question counts correct iff every answer row for that question is correct" metric. The metric function keys the per-question bucket by `(paragraph_id, question_id)`:

```python
def acc_all(items):
    ...
    for doc, pred in zip(docs, preds):
        paragraph_id = doc["idx"]["paragraph"]
        question_id = doc["idx"]["question"]
        if (paragraph_id, question_id) not in question_scoring_dict:
            question_scoring_dict[(paragraph_id, question_id)] = []
        ...
```

but the stderr function right below it drops `paragraph_id` entirely and keys by just `question_id`:

```python
def acc_all_stderr(items):
    ...
    for doc, pred in zip(docs, preds):
        question_id = doc["idx"]["question"]
        if question_id not in question_scoring_dict:
            question_scoring_dict[question_id] = []
        ...
```

In SuperGLUE MultiRC, `question_id` is only unique within a paragraph — the same numeric `question_id` (0, 1, 2, ...) is reused across different paragraphs. Dropping `paragraph_id` from the bucket key conflates unrelated questions from different paragraphs: `all(x)` is then taken over a merged bucket containing answer rows from multiple distinct questions, and the resulting 0/1 list handed to `mean_stderr` has the wrong size and wrong per-question correctness values.

Concrete worked example

Paragraph 1 has question 0 with answers `[T, T]` (both correct). Paragraph 2 has question 0 with answers `[T, F]` (one wrong).

  • `acc_all`: keys `(1, 0)` = `[True, True]` → 1, `(2, 0)` = `[True, False]` → 0. `mean = 0.5`.
  • `acc_all_stderr` (current): key `0` = `[True, True, True, False]` → `all(...) = False` → 0. `mean_stderr` of `[0]` = 0.

The stderr function sees 1 “question” (the merged bucket) and marks it wrong, even though `acc_all` correctly sees 2 questions with mean 0.5. Any downstream consumer comparing the two numbers will see inconsistent results.
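The worked example can be reproduced with a few lines of standalone Python (toy rows, not the real harness; field names mirror SuperGLUE MultiRC docs):

```python
# Toy MultiRC-style rows: paragraph 1 / question 0 and paragraph 2 /
# question 0 share the same numeric question_id.
docs = [
    {"idx": {"paragraph": 1, "question": 0}, "label": 1},  # gold True
    {"idx": {"paragraph": 1, "question": 0}, "label": 1},  # gold True
    {"idx": {"paragraph": 2, "question": 0}, "label": 1},  # gold True
    {"idx": {"paragraph": 2, "question": 0}, "label": 0},  # gold False
]
preds = [True, True, True, True]  # model gets the last row wrong

def bucket(docs, preds, key_fn):
    """Group per-row correctness by key_fn, then score each bucket with all()."""
    buckets = {}
    for doc, pred in zip(docs, preds):
        gold = doc["label"] == 1
        buckets.setdefault(key_fn(doc), []).append(gold == pred)
    return [int(all(v)) for v in buckets.values()]

# acc_all's key: (paragraph_id, question_id)
correct = bucket(docs, preds, lambda d: (d["idx"]["paragraph"], d["idx"]["question"]))
# acc_all_stderr's key (buggy): question_id only
buggy = bucket(docs, preds, lambda d: d["idx"]["question"])

print(correct)  # [1, 0] -> 2 questions, mean 0.5
print(buggy)    # [0]    -> 1 merged "question", marked wrong
```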

Fix

Mirror `acc_all`'s bucket key in `acc_all_stderr`:

```diff
 for doc, pred in zip(docs, preds):
+    paragraph_id = doc["idx"]["paragraph"]
     question_id = doc["idx"]["question"]
-    if question_id not in question_scoring_dict:
-        question_scoring_dict[question_id] = []
+    if (paragraph_id, question_id) not in question_scoring_dict:
+        question_scoring_dict[(paragraph_id, question_id)] = []
     gold_label = doc["label"] == 1
-    question_scoring_dict[question_id].append(gold_label == pred)
+    question_scoring_dict[(paragraph_id, question_id)].append(gold_label == pred)
```

No other call sites are affected (grepped `acc_all` across the repo; only `lm_eval/api/metrics.py` references it), so this is a strictly corrective change that makes the stderr function bucket over the same question set as the metric.
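As a sanity check on the worked example, the fixed keying yields the per-question list `[1, 0]`, and its standard error agrees with the `acc_all` point estimate of 0.5. This sketch assumes the harness's `mean_stderr` is the usual sample standard error of the mean (n−1 sample variance divided by n, then square-rooted); the exact implementation in `lm_eval` may differ in details:

```python
import math
from statistics import stdev  # stdev uses the n-1 sample variance

# Buckets under the corrected (paragraph_id, question_id) key,
# matching the worked example above.
buckets = {
    (1, 0): [True, True],   # all rows correct -> question scores 1
    (2, 0): [True, False],  # one row wrong    -> question scores 0
}
per_question = [int(all(v)) for v in buckets.values()]  # [1, 0]

mean = sum(per_question) / len(per_question)             # 0.5, matches acc_all
stderr = stdev(per_question) / math.sqrt(len(per_question))
print(mean, stderr)  # 0.5 0.5
```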

@Chessing234 Chessing234 requested a review from 0xSMT as a code owner April 11, 2026 10:28