This repository contains AraSum, the first large-scale monolingual Arabic corpus for abstractive text summarization, as well as two newly released datasets for preference-based fine-tuning:
AraRLHF: A dataset formatted for Reinforcement Learning from Human Feedback (RLHF)AraDPO: A dataset formatted for Direct Preference Optimization (DPO)
AraSum is a monolingual Arabic summarization corpus containing 49,604 articles and their corresponding leads. It was constructed from the Arabic version of the Deutsche Welle (DW) news website, covering diverse domains such as politics, sports, and culture. This diversity promotes better generalization in summarization models.
- Format:
.csv - Structure: Each line contains an article and its lead summary, separated by a tab (
\t)
Related Publications
For more details on the creation and usage of this dataset, please refer to the following papers:
-
Cross-lingual Fine-tuning for Abstractive Arabic Text Summarization
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021) -
Fine-tuning and Multilingual Pre-training for Abstractive Summarization Task for the Arabic Language
Annales Mathematicae et Informaticae (2022)
A JSON-formatted dataset designed for training reward models in Reinforcement Learning from Human Feedback (RLHF) pipelines. The AraRLHF dataset consists of 1,746 samples, derived from manual evaluation results conducted in our prior research.
This dataset is used to train a reward model (RM) that predicts the quality of a generated summary based on human preferences.
- Format:
.json - Structure: Each record includes: An article, its lead summary, a ranking label reflecting human preference, and the evaluator ID (author)
A JSON-formatted dataset designed for Direct Preference Optimization (DPO), following the same structure as AraRLHF. The AraDPO dataset is used to fine-tune language models directly using binary preference pairs. To construct AraDPO, each ranked preference entry in AraRLHF was converted into all possible pairwise comparisons. We then applied deduplication to remove redundant or overlapping pairs, resulting in a final set of 2,309 unique preference records.
This dataset is used to Fine-tuning language models with DPO.
- Format:
.json - Structure: Each record includes: article_id, article, chosen (the preferred summary ), and rejected (the dispreferred summary).
If you use AraSum, please cite:
@inproceedings{kahla-etal-2021-cross,
title = "Cross-lingual Fine-tuning for Abstractive {A}rabic Text Summarization",
author = "Kahla, Mram and
Yang, Zijian Gy{\H{o}}z{\H{o}} and
Nov{\'a}k, Attila",
booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)",
month = sep,
year = "2021",
address = "Held Online",
publisher = "INCOMA Ltd.",
url = "https://aclanthology.org/2021.ranlp-1.74",
pages = "655--663",
}If you use AraRLHF, AraDPO, please cite:
@inproceedings{kahla2025optimizing,
title = "Optimizing {A}bstractive {A}rabic {S}ummarization via {RLHF} and {DPO} with {L}lama 2",
author = " Kahla, Mram and
Yang, Zijian Gy{\H{o}}z{\H{o}},
booktitle = "Magyar Számítógépes Nyelvészeti Konferencia",
volume = "XXI",
pages = "41--55",
url = "https://rgai.inf.u-szeged.hu/sites/rgai.inf.u-szeged.hu/files/mszny2025%20%281%29.pdf",
year = "2025",
}