Adding Cruxeval by ThomasHeap · Pull Request #3699 · EleutherAI/lm-evaluation-harness

ThomasHeap · 2026-04-12T16:17:18Z

Summary

Adds CRUXEval, a benchmark of 800 Python functions that tests code reasoning in two directions: predicting a function's output given its input, and predicting a valid input given a known output.
Includes chain-of-thought variants and higher-temperature (temp 0.8) variants for pass@5 evaluation.
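
To make the two directions concrete, here is a minimal sketch of how a single prediction could be checked in each mode. The function `f` and the helpers `check_output_prediction` and `check_input_prediction` are hypothetical illustrations, not code from this PR or from the reference repo, and a real scorer would run model-generated code in a sandbox.

```python
def f(text):
    # Hypothetical CRUXEval-style function: short, deterministic, pure Python.
    return "".join(sorted(set(text)))


def check_output_prediction(func, given_input, predicted_output):
    # Output direction: the model sees the input and must state the result.
    return func(given_input) == predicted_output


def check_input_prediction(func, predicted_input, known_output):
    # Input direction: any input that reproduces the known output is accepted,
    # so several different predictions can be correct.
    return func(predicted_input) == known_output


print(check_output_prediction(f, "banana", "abn"))  # True
print(check_input_prediction(f, "nab", "abn"))      # True
print(check_input_prediction(f, "xyz", "abn"))      # False
```

Note that the input direction is scored by execution rather than by string match against a reference input, which is why `"nab"` passes even though it differs from `"banana"`.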

The scoring logic was validated against the original CRUXEval reference implementation (facebookresearch/cruxeval) using the CodeLlama-7B generations provided in the reference repo (sample_codellama-7b_temp0.2). Both pipelines were run on the same postprocessed generations across all 800 samples.

| Mode | Reference pass@1 | lm-eval pass@1 | Reference pass@5 | lm-eval pass@5 | Disagreements |
| --- | --- | --- | --- | --- | --- |
| output | 34.2% | 34.2% | 40.3% | 40.3% | 0 / 800 |
| input | 36.0% | 36.0% | 45.0% | 45.0% | 0 / 800 |
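
For readers who want to reproduce a comparison like the one in the table, the sketch below shows one way to compute pass@k and count per-sample disagreements between two pipelines. It assumes the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021), which is an assumption about the setup and not something stated in this PR. The function names, the sample IDs, and the value of `n` in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: generations sampled, c: generations that passed, k: budget.
    Returns the probability that at least one of k generations drawn
    without replacement from the n is correct.
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k failures exist, so every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def count_disagreements(reference, candidate):
    # Both arguments map sample ID -> per-sample result (e.g. list of bools).
    return sum(1 for sid in reference if reference[sid] != candidate.get(sid))


# Illustrative numbers only: 10 generations for a problem, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))

ref = {"sample_0": [True, False], "sample_1": [False, False]}
new = {"sample_0": [True, False], "sample_1": [False, True]}
print(count_disagreements(ref, new))  # 1
```

Averaging `pass_at_k` over all problems gives the benchmark-level score, and a disagreement count of zero means the two pipelines agree on every individual generation, which is a stronger check than matching aggregate percentages.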

Task structure

Tasks
`cruxeval_output` — pass@1 at temp 0.2, matches published setup
`cruxeval_output_08` — pass@5 at temp 0.8, matches published setup
`cruxeval_output_cot` — pass@1 with chain-of-thought prompting
`cruxeval_input` — pass@1 at temp 0.2, matches published setup
`cruxeval_input_08` — pass@5 at temp 0.8, matches published setup
`cruxeval_input_cot` — pass@1 with chain-of-thought prompting

Links

Paper: https://arxiv.org/abs/2401.03065
Dataset: https://huggingface.co/datasets/cruxeval-org/cruxeval
Reference implementation: https://github.com/facebookresearch/cruxeval

Checklist

Is the task an existing benchmark in the literature?
Have you referenced the original paper that introduced the task?
Does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
Is the "Main" variant of this task clearly denoted?
Have you provided a short sentence in a README on what each new variant adds / evaluates?
Have you noted which, if any, published evaluation setups are matched by this variant?

adding cruxeval

dfe6108

ThomasHeap requested a review from 0xSMT as a code owner April 12, 2026 16:17

ThomasHeap wants to merge 1 commit into EleutherAI:main from ThomasHeap:main
