Difficulty-Diversity Collaborative Filtering

This repository contains the source code for the paper "Difficulty–Diversity Collaborative Filtering for Data-Efficient LLM Fine-Tuning".

If you have any questions, please don't hesitate to contact me: longhp1618@gmail.com

Environment Setup

Create a virtual environment:

```
conda create -n ddcf python=3.10 -y
conda activate ddcf
```

Install dependencies:
```
pip install -r requirements.txt
```

Scripts Overview

| Script | Description |
| --- | --- |
| `infer.py` | Run inference and verification on the seed data for a specific model. Running it for every model listed in `DDCF_data/model_order.csv` produces the full binary correctness data. |
| `create_training_data.py` | Split the full binary correctness data into training and validation sets for the correctness predictor. |
| `get_question_embedding.py` | Compute embeddings for the seed corpus and the full corpus. |
| `train_correctness_predictor.py` | Train the correctness predictor. |
| `fullcorpus_difficulty_estimation.py` | Estimate per-example difficulty for a given model; results are saved under `factorized_data/`. |
| `k_greedy_selection.py` | Perform k-greedy selection to balance difficulty and diversity. |

Data Preparation

You can either download our processed datasets (including the curated 1,000-example sets) from Hugging Face or prepare the data yourself.

Download full corpus and seed corpus
```
python download_ddcf_math_from_hf.py
```
Binary correctness matrix. Either download the precomputed data:
```
python download_binary_correctness_from_hf.py
```
or run inference for each collaborative model (running all models in `DDCF_data/model_order.csv` yields the full binary correctness data), then split the results into training and validation sets:
```
bash run_infer_all_models.sh
python create_training_data.py
```
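
As a rough illustration of what this split might look like, the sketch below flattens a toy models-by-questions 0/1 matrix into (model, question, label) triples and holds out a random fraction for validation. The function name `split_correctness_entries`, the `val_fraction` parameter, and the toy matrix are assumptions made for illustration; `create_training_data.py` may split the data differently (for example, by question rather than by entry).

```python
import random


def split_correctness_entries(matrix, val_fraction=0.1, seed=0):
    """Flatten a models x questions 0/1 matrix into (model, question, label)
    triples and split them at random into training and validation lists."""
    entries = [
        (m, q, label)
        for m, row in enumerate(matrix)
        for q, label in enumerate(row)
    ]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(entries)
    n_val = int(len(entries) * val_fraction)
    return entries[n_val:], entries[:n_val]


# Toy matrix (hypothetical): 3 models, 4 seed questions; 1 = answered correctly.
toy_matrix = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
]
train_entries, val_entries = split_correctness_entries(toy_matrix, val_fraction=0.25)
print(len(train_entries), len(val_entries))
```
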
Question embeddings
```
python get_question_embedding.py
```
Train correctness predictor
```
python train_correctness_predictor.py
```
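
To make the idea of a correctness predictor concrete, here is a minimal sketch under stated assumptions: each model gets a learned vector and bias, and the probability that it answers a question correctly is the sigmoid of the dot product between that vector and the question embedding. This logistic, factorization-style form is a stand-in chosen for illustration, not necessarily the architecture in `train_correctness_predictor.py`; the names `train_predictor` and `predict_correct`, the hyperparameters, and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_correct(model_vectors, model_biases, m, emb):
    """Predicted probability that model m answers a question with embedding emb."""
    logit = sum(w * x for w, x in zip(model_vectors[m], emb)) + model_biases[m]
    return sigmoid(logit)


def train_predictor(entries, question_embeddings, num_models,
                    epochs=200, lr=0.5, seed=0):
    """Fit one vector and bias per model with plain SGD on the log loss."""
    dim = len(question_embeddings[0])
    rng = random.Random(seed)
    model_vectors = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(num_models)]
    model_biases = [0.0] * num_models
    for _ in range(epochs):
        for m, q, label in entries:
            emb = question_embeddings[q]
            grad = predict_correct(model_vectors, model_biases, m, emb) - label
            for d in range(dim):
                model_vectors[m][d] -= lr * grad * emb[d]
            model_biases[m] -= lr * grad
    return model_vectors, model_biases


# Toy data (hypothetical): 2 models, 2 questions with 2-dimensional embeddings.
question_embeddings = [[1.0, 0.0], [0.0, 1.0]]
entries = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
model_vectors, model_biases = train_predictor(entries, question_embeddings, num_models=2)
p_solved = predict_correct(model_vectors, model_biases, 0, [1.0, 0.0])
p_failed = predict_correct(model_vectors, model_biases, 0, [0.0, 1.0])
print(round(p_solved, 3), round(p_failed, 3))
```

Because the predictor only needs a question embedding as input, a form like this can score questions that no model was ever run on, which is what makes difficulty estimation over the full corpus possible.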

Difficulty estimation and k-greedy selection

```
python fullcorpus_difficulty_estimation.py --model_name Qwen/Qwen2.5-Math-7B
python k_greedy_selection.py --model_name Qwen/Qwen2.5-Math-7B --lamda 0.2 --num_select 1000
```
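
The trade-off that `--lamda` and `--num_select` control can be sketched as follows. This is not the repository's k-greedy procedure; it is a simpler greedy rule in the style of maximal marginal relevance, used here as a stand-in. It assumes difficulty is a score in [0, 1] (for instance, one minus a predicted correctness probability) and penalizes a candidate by its highest cosine similarity to anything already selected. The function names, the scoring formula, and the toy inputs are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_select(difficulty, embeddings, num_select, lamda=0.2):
    """Greedily pick examples that are difficult but not redundant.

    Score = (1 - lamda) * difficulty - lamda * (max similarity to selected).
    """
    selected = []
    remaining = set(range(len(difficulty)))
    while remaining and len(selected) < num_select:
        def score(i):
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return (1 - lamda) * difficulty[i] - lamda * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy inputs (hypothetical): examples 0 and 1 are near-duplicates.
difficulty = [0.9, 0.85, 0.3]
embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
diverse_pick = greedy_select(difficulty, embeddings, num_select=2, lamda=0.5)
hardest_pick = greedy_select(difficulty, embeddings, num_select=2, lamda=0.0)
print(diverse_pick, hardest_pick)  # [0, 2] [0, 1]
```

With `lamda=0.0` the rule reduces to taking the hardest examples, so both near-duplicates are chosen; raising `lamda` swaps the second one for an easier but dissimilar example.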

Training and Evaluation

```
bash train/run.sh
```

Citation

If our framework is useful for your research, please consider citing the paper:

```
@inproceedings{
hoang2026difficultydiversity,
title={Difficulty{\textendash}Diversity Collaborative Filtering for Data-Efficient {LLM} Fine-Tuning},
author={Long P. Hoang and Wenxuan Zhang and Wei Lu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=n9mXlqD2SJ}
}
```
