22 commits
- 3625fb4 added `ifeval_pt` to the harness (Nkluge-correa, Dec 9, 2025)
- 4543919 made sure imports come from `ifeval_pt` (Nkluge-correa, Dec 15, 2025)
- 2b18fac Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Dec 15, 2025)
- 6ba3881 make sure `self._nth_paragraph` is always an int (Nkluge-correa, Dec 15, 2025)
- 41fd6e2 Merge branch 'ifeval_pt' of https://github.com/Nkluge-correa/lm-evalu… (Nkluge-correa, Dec 15, 2025)
- 55a5e0c Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Dec 20, 2025)
- aa2c078 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Dec 29, 2025)
- 9ad7d33 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Jan 12, 2026)
- b0b76ec Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Jan 14, 2026)
- 120441b Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Jan 21, 2026)
- 51676b8 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Jan 23, 2026)
- b34292c Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Jan 26, 2026)
- 57516a6 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Jan 27, 2026)
- 2e8cd81 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Feb 5, 2026)
- 52ff2df Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Feb 11, 2026)
- 9ffcac4 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Feb 11, 2026)
- cac6335 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Feb 23, 2026)
- f7b768d Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Mar 1, 2026)
- 3806219 Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Mar 5, 2026)
- 65d468a Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Mar 15, 2026)
- 97603cb Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Mar 27, 2026)
- 2ccc9ac Merge branch 'EleutherAI:main' into ifeval_pt (Nkluge-correa, Apr 14, 2026)
47 changes: 47 additions & 0 deletions lm_eval/tasks/ifeval_pt/README.md
@@ -0,0 +1,47 @@
# IFEval-PT

**This is a Portuguese translation of the original IFEval benchmark. It contains 300 prompts translated into Portuguese. The prompts were translated by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) and manually verified by native Portuguese speakers. Samples have also been adapted to ensure cultural alignment.**

### Paper

Title: Instruction-Following Evaluation for Large Language Models
Abstract: https://arxiv.org/abs/2311.07911

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
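To make the "verifiable instructions" idea concrete, here is a minimal, hypothetical checker for two of the instruction types quoted above (minimum word count and keyword frequency). The function names are illustrative, not taken from the harness: the real verifiers live in the task's `utils.py` and cover many more instruction types, including the Portuguese adaptations.

```python
import re


def check_min_words(response: str, n: int) -> bool:
    # "write in more than N words": count whitespace-separated tokens.
    return len(response.split()) > n


def check_keyword_frequency(response: str, keyword: str, n: int) -> bool:
    # "mention the keyword X at least N times": case-insensitive,
    # whole-word matches only.
    pattern = rf"\b{re.escape(keyword)}\b"
    return len(re.findall(pattern, response, re.IGNORECASE)) >= n
```

Because each check is a deterministic function of the response text, results are reproducible without a human or LLM judge.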

Homepage: https://github.com/google-research/google-research/tree/master/instruction_following_eval

### Citation

```
@article{zhou2023instructionfollowing,
  title={Instruction-Following Evaluation for Large Language Models},
  author={Jeffrey Zhou and Tianjian Lu and Swaroop Mishra and Siddhartha Brahma and Sujoy Basu and Yi Luan and Denny Zhou and Le Hou},
  journal={arXiv preprint arXiv:2311.07911},
  year={2023},
}
```

### Groups and Tasks

#### Groups

- Not part of a group yet

#### Tasks

- `ifeval_pt`

### Checklist

For adding novel benchmarks/datasets to the library:

- [x] Is the task an existing benchmark in the literature?
- [x] Have you referenced the original paper that introduced the task?
- [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

- [ ] Is the "Main" variant of this task clearly denoted?
- [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
29 changes: 29 additions & 0 deletions lm_eval/tasks/ifeval_pt/ifeval_pt.yaml
@@ -0,0 +1,29 @@
```yaml
task: ifeval_pt
dataset_path: Polygl0t/IFEval-PT
dataset_name: null
output_type: generate_until
test_split: train
num_fewshot: 0
doc_to_text: prompt
doc_to_target: 0
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0.0
  max_gen_toks: 1280
process_results: !function utils.process_results
metric_list:
  - metric: prompt_level_strict_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_strict_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
  - metric: prompt_level_loose_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_loose_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
metadata:
  version: 4.0
```
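The YAML delegates scoring to helpers in the task's `utils.py`: `process_results` scores each response and emits one value per metric listed above, while `agg_inst_level_acc` aggregates the instruction-level metrics. Since each prompt can carry several verifiable instructions, a plain mean over prompts would not weight instructions correctly; the aggregator presumably flattens the per-prompt lists of per-instruction pass/fail results first. A rough sketch of that contract (the harness's actual implementation may differ in detail):

```python
def agg_inst_level_acc(items):
    """Flatten per-prompt lists of per-instruction booleans and return
    the overall fraction of instructions that were followed."""
    flat = [passed for per_prompt in items for passed in per_prompt]
    return sum(flat) / len(flat) if flat else 0.0
```

For example, a prompt with two instructions where one passes, plus a single-instruction prompt that passes, yields 2 of 3 instructions followed, not the mean of the two prompt scores.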