Fix acc_all_stderr grouping by question_id only (drops paragraph_id)#3695
Open
Chessing234 wants to merge 1 commit into EleutherAI:main from
Conversation
`acc_all` and `acc_all_stderr` are the metric and stderr implementations for the `acc_all` registered metric, used for MultiRC-style tasks where a question has multiple answer options and the metric counts a question correct only if *every* one of its answer rows is labeled correctly.

`acc_all` bucket-keys by `(paragraph_id, question_id)`:

```python
question_scoring_dict[(paragraph_id, question_id)].append(...)
```

but `acc_all_stderr` right below it keys by only `question_id`:

```python
question_scoring_dict[question_id].append(...)
```

In MultiRC, `question_id` is unique only within a paragraph: the same numeric `question_id` is reused across paragraphs. So `acc_all_stderr` conflates unrelated questions from different paragraphs into a single bucket, runs `all(...)` across the merged bucket, and passes the wrong list to `mean_stderr`. The stderr number then disagrees systematically with the point estimate that `acc_all` produces.

Use the same `(paragraph_id, question_id)` key in `acc_all_stderr` so the metric and its stderr bucket over the same set of questions.
Bug
`acc_all_stderr` in `lm_eval/api/metrics.py` silently merges MultiRC questions from different paragraphs into a single bucket, so the stderr it produces disagrees with the point estimate that `acc_all` computes for the very same rows.
Root cause
`acc_all` and `acc_all_stderr` are a paired metric/stderr for the `acc_all` registered metric — a MultiRC-style "a question counts correct iff every answer row for that question is correct" metric. The metric function keys the per-question bucket by `(paragraph_id, question_id)`:
```python
def acc_all(items):
    ...
    for doc, pred in zip(docs, preds):
        paragraph_id = doc["idx"]["paragraph"]
        question_id = doc["idx"]["question"]
        if (paragraph_id, question_id) not in question_scoring_dict:
            question_scoring_dict[(paragraph_id, question_id)] = []
        ...
```
but the stderr function right below it drops `paragraph_id` entirely and keys by just `question_id`:
```python
def acc_all_stderr(items):
    ...
    for doc, pred in zip(docs, preds):
        question_id = doc["idx"]["question"]
        if question_id not in question_scoring_dict:
            question_scoring_dict[question_id] = []
        ...
```
In SuperGLUE MultiRC, `question_id` is only unique within a paragraph — the same numeric `question_id` (0, 1, 2, ...) is reused across different paragraphs. Dropping `paragraph_id` from the bucket key conflates unrelated questions from different paragraphs: `all(x)` is then taken over a merged bucket containing answer rows from multiple distinct questions, and the resulting 0/1 list handed to `mean_stderr` has the wrong size and wrong per-question correctness values.
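The key collision is easy to demonstrate in isolation. This is a simplified sketch, not lm_eval code: a few MultiRC-style `doc["idx"]` records where `question_id` restarts at 0 in every paragraph, bucketed both ways.

```python
from collections import defaultdict

# Hypothetical MultiRC-style rows: question_id restarts at 0 per paragraph,
# so it is not globally unique across the dataset.
rows = [
    {"idx": {"paragraph": 1, "question": 0}},
    {"idx": {"paragraph": 1, "question": 1}},
    {"idx": {"paragraph": 2, "question": 0}},  # reuses question_id 0
]

by_pair = defaultdict(list)  # acc_all's keying
by_qid = defaultdict(list)   # acc_all_stderr's buggy keying
for doc in rows:
    by_pair[(doc["idx"]["paragraph"], doc["idx"]["question"])].append(doc)
    by_qid[doc["idx"]["question"]].append(doc)

print(len(by_pair))  # 3 -- one bucket per real question
print(len(by_qid))   # 2 -- paragraph 1 q0 and paragraph 2 q0 collide
```

With the full key there is one bucket per real question; with `question_id` alone, rows from unrelated paragraphs land in the same bucket.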
Concrete worked example
Paragraph 1 has question 0 with answers `[T, T]` (both correct). Paragraph 2 has question 0 with answers `[T, F]` (one wrong).
The stderr function sees 1 “question” (the merged bucket) and marks it wrong, even though `acc_all` correctly sees 2 questions with mean 0.5. Any downstream consumer comparing the two numbers will see inconsistent results.
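The numbers above can be checked with a few lines of plain Python (a standalone sketch, not lm_eval code), holding the per-question row correctness from the worked example:

```python
# Worked example: two paragraphs, each reusing question_id 0.
# Paragraph 1 / question 0: rows [True, True]; paragraph 2 / question 0: [True, False].
buckets_correct = {(1, 0): [True, True], (2, 0): [True, False]}  # (paragraph_id, question_id) key
buckets_buggy = {0: [True, True, True, False]}                   # question_id-only key merges them

acc = sum(all(rows) for rows in buckets_correct.values()) / len(buckets_correct)
buggy = sum(all(rows) for rows in buckets_buggy.values()) / len(buckets_buggy)
print(acc)    # 0.5 -- two questions, one fully correct
print(buggy)  # 0.0 -- one merged "question", marked wrong
```

The 0/1 lists handed to `mean_stderr` differ in both length and values, which is exactly the inconsistency described above.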
Fix
Mirror `acc_all`'s bucket key in `acc_all_stderr`:
```diff
 for doc, pred in zip(docs, preds):
+    paragraph_id = doc["idx"]["paragraph"]
     question_id = doc["idx"]["question"]
-    if question_id not in question_scoring_dict:
-        question_scoring_dict[question_id] = []
+    if (paragraph_id, question_id) not in question_scoring_dict:
+        question_scoring_dict[(paragraph_id, question_id)] = []
```
There are no other call sites (grepping for `acc_all` across the repo shows only `lm_eval/api/metrics.py` references it), so this is a strictly corrective change that makes the stderr bucket over the same question set as the metric.