This repository contains the source code for the paper *Difficulty–Diversity Collaborative Filtering for Data-Efficient LLM Fine-Tuning*.
If you have any questions, please don't hesitate to contact me: longhp1618@gmail.com
- Create a virtual environment:

      conda create -n ddcf python=3.10 -y
      conda activate ddcf

- Install dependencies:

      pip install -r requirements.txt

| Script | Description |
|---|---|
| `infer.py` | Run inference and verification on seed data for a specific model. Running all models listed in `DDCF_data/model_order.csv` produces the full binary correctness data. |
| `create_training_data.py` | Split the full binary correctness data into training and validation sets for the correctness predictor. |
| `get_question_embeddings.py` | Compute seed embeddings and full-corpus embeddings. |
| `train_correctness_predictor.py` | Train the correctness predictor. |
| `fullcorpus_difficulty_estimation.py` | Estimate difficulty per example for a given model; results are saved under `factorized_data/`. |
| `k_greedy_selection.py` | Perform k-greedy selection to balance difficulty and diversity. |

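To make the "binary correctness data" and its train/validation split more concrete, here is a minimal sketch. It assumes the data can be viewed as a models × questions matrix of 0/1 entries and that the split is done at random over individual (model, question) entries; both are assumptions for illustration, and `create_training_data.py` may store and split the data differently. The function name `split_correctness` and its parameters are hypothetical.

```python
import numpy as np


def split_correctness(matrix, val_fraction=0.1, seed=0):
    """Randomly split the entries of a binary correctness matrix.

    matrix[i, j] is assumed to be 1 if model i answered question j
    correctly and 0 otherwise (hypothetical layout).
    Returns two integer arrays of rows (model_idx, question_idx, label).
    """
    matrix = np.asarray(matrix).astype(int)

    # Enumerate every (model_idx, question_idx) position in the matrix.
    model_idx, question_idx = np.indices(matrix.shape)
    triples = np.stack(
        [model_idx.ravel(), question_idx.ravel(), matrix.ravel()], axis=1
    )

    # Shuffle the entries reproducibly, then cut off the validation part.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(triples))
    n_val = int(round(val_fraction * len(triples)))

    val = triples[perm[:n_val]]
    train = triples[perm[n_val:]]
    return train, val
```

Splitting by entry means a question can appear in both sets, paired with different models. Holding out whole questions instead would be an equally plausible design, depending on what the predictor is meant to generalize to.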
You can either download our processed datasets (including 1,000-example curated sets) from Hugging Face or prepare data yourself.
- Download the full corpus and seed corpus:

      python download_ddcf_math_from_hf.py

- Binary correctness matrix: either download the precomputed data:

      python download_binary_correctness_from_hf.py

  or run inference for each collaborative model (running all models in `DDCF_data/model_order.csv` yields the full binary correctness data):

      bash run_infer_all_models.sh
      python create_training_data.py

- Question embeddings:

      python get_question_embeddings.py

- Train the correctness predictor:

      python train_correctness_predictor.py

- Difficulty estimation and k-greedy selection:

      python fullcorpus_difficulty_estimation.py --model_name Qwen/Qwen2.5-Math-7B
      python k_greedy_selection.py --model_name Qwen/Qwen2.5-Math-7B --lamda 0.2 --num_select 1000

Run training:

    bash train/run.sh

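The selection step above trades off difficulty against diversity. As a rough illustration of that idea only, the sketch below implements a generic greedy selection that scores each candidate by a weighted sum of its difficulty and its dissimilarity to the examples already chosen. This is not taken from `k_greedy_selection.py`: the scoring formula, the way `lam` weights the two terms, and the function name `greedy_select` are all assumptions, and the repository's k-greedy procedure and its `--lamda` flag may behave differently.

```python
import numpy as np


def greedy_select(difficulty, embeddings, num_select, lam=0.2):
    """Greedily pick examples that are difficult and mutually dissimilar.

    Assumed score (hypothetical, not the repository's definition):
        (1 - lam) * difficulty + lam * (1 - max cosine similarity to picks)
    Embeddings are assumed to be non-zero vectors.
    """
    difficulty = np.asarray(difficulty, dtype=float)
    emb = np.asarray(embeddings, dtype=float)

    # L2-normalize so that dot products are cosine similarities.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    n = len(difficulty)
    num_select = min(num_select, n)
    selected = []
    available = np.ones(n, dtype=bool)

    # Highest similarity of each candidate to any selected example.
    # Starts at 0 and is only ever raised, so it stays in [0, 1].
    max_sim = np.zeros(n)

    for _ in range(num_select):
        score = (1.0 - lam) * difficulty + lam * (1.0 - max_sim)
        score[~available] = -np.inf  # never re-pick an example
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        # Update similarities against the newly selected example.
        max_sim = np.maximum(max_sim, emb @ emb[best])

    return selected
```

With `lam=0.0` the sketch reduces to picking the hardest examples. Raising `lam` increasingly penalizes candidates that sit close to an already selected example in embedding space.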
If our framework is useful for your research, please consider citing the paper:
@inproceedings{
hoang2026difficultydiversity,
title={Difficulty{\textendash}Diversity Collaborative Filtering for Data-Efficient {LLM} Fine-Tuning},
author={Long P. Hoang and Wenxuan Zhang and Wei Lu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=n9mXlqD2SJ}
}