Question from @MarinaMancoridis
I was wondering whether you happen to have access to graded or annotated model responses for the dataset (i.e., per-question correctness for specific models such as GPT-4/5). In particular, I’m curious whether question-level performance labels across models are available or were collected during your experiments.
Yes, it is here: https://huggingface.co/datasets/bigcode/evaluation :)