
Adding Cruxeval #3699

Open
ThomasHeap wants to merge 1 commit into EleutherAI:main from ThomasHeap:main

Conversation

@ThomasHeap

Summary

  • Adds CRUXEval, a benchmark of 800 Python functions that tests code reasoning in two directions: predicting a function's output given its input, and predicting a valid input given a known output
  • Includes chain-of-thought variants and a higher-temperature variant for pass@5 evaluation
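
As a rough illustration of the two directions, the sketch below scores a prediction by executing an assertion of the form `assert f(<input>) == <output>`. The helper names (`check_output_prediction`, `check_input_prediction`, `_run_assertion`) and the example function are hypothetical and not taken from this PR. A real harness would also sandbox execution and apply timeouts, which this sketch omits.

```python
# Minimal sketch of CRUXEval-style scoring, assuming correctness is decided by
# executing `assert f(<input>) == <output>`. All helper names are hypothetical.
# WARNING: exec() runs untrusted code; a real harness needs sandboxing and timeouts.

def _run_assertion(code: str, assertion: str) -> bool:
    """Execute the function definition followed by one assertion."""
    namespace: dict = {}
    try:
        exec(code + "\n" + assertion, namespace)
    except Exception:
        # Failed assertion, syntax error, or runtime error all count as wrong.
        return False
    return True


def check_output_prediction(code: str, given_input: str, predicted_output: str) -> bool:
    """Output direction: the model predicts f(given_input)."""
    return _run_assertion(code, f"assert f({given_input}) == {predicted_output}")


def check_input_prediction(code: str, predicted_input: str, known_output: str) -> bool:
    """Input direction: any input that produces known_output is accepted."""
    return _run_assertion(code, f"assert f({predicted_input}) == {known_output}")


# Hypothetical example function in the style of the benchmark.
CODE = '''
def f(nums):
    return sorted(nums)[:2]
'''

output_ok = check_output_prediction(CODE, "[3, 1, 2]", "[1, 2]")
output_bad = check_output_prediction(CODE, "[3, 1, 2]", "[3, 1]")
# Two different inputs are both valid answers for the same known output.
input_ok_a = check_input_prediction(CODE, "[2, 1, 3]", "[1, 2]")
input_ok_b = check_input_prediction(CODE, "[1, 2, 9]", "[1, 2]")
```

Because the input direction accepts any input that reproduces the known output, it is checked by execution rather than by string comparison against a single reference input.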

The scoring logic was validated against the original CRUXEval reference implementation (facebookresearch/cruxeval) using the CodeLlama-7B generations provided in the reference repo (sample_codellama-7b_temp0.2).
Both pipelines were run on the same postprocessed generations across all 800 samples.

| Mode | Reference pass@1 | lm-eval pass@1 | Reference pass@5 | lm-eval pass@5 | Disagreements |
|---|---|---|---|---|---|
| output | 34.2% | 34.2% | 40.3% | 40.3% | 0 / 800 |
| input | 36.0% | 36.0% | 45.0% | 45.0% | 0 / 800 |
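
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of generations per problem and c the number that pass. The sketch below assumes this PR uses that estimator. The function names and the sample counts in the example are hypothetical and are not the numbers behind the table above.

```python
# Sketch of the standard unbiased pass@k estimator, assuming n generations per
# problem of which c are correct. Names and example data are hypothetical.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: 4 problems, 10 generations each.
counts = [0, 1, 3, 10]
p1 = mean_pass_at_k(counts, n=10, k=1)
p5 = mean_pass_at_k(counts, n=10, k=5)
```

With k = 1 the estimator reduces to the fraction of correct generations, c / n, averaged over problems.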

Task structure

Tasks
cruxeval_output — pass@1 at temp 0.2, matches published setup
cruxeval_output_08 — pass@5 at temp 0.8, matches published setup
cruxeval_output_cot — pass@1 with chain-of-thought prompting
cruxeval_input — pass@1 at temp 0.2, matches published setup
cruxeval_input_08 — pass@5 at temp 0.8, matches published setup
cruxeval_input_cot — pass@1 with chain-of-thought prompting

Links

Checklist

  • Is the task an existing benchmark in the literature?
  • Have you referenced the original paper that introduced the task?
  • Does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?

@ThomasHeap requested a review from 0xSMT as a code owner on April 12, 2026, 16:17