AraSum

This repository contains AraSum, the first large-scale monolingual Arabic corpus for abstractive text summarization, as well as two newly released datasets for preference-based fine-tuning:

AraRLHF: A dataset formatted for Reinforcement Learning from Human Feedback (RLHF)
AraDPO: A dataset formatted for Direct Preference Optimization (DPO)

📚 Datasets

🔹 AraSum

AraSum is a monolingual Arabic summarization corpus containing 49,604 articles and their corresponding leads. It was constructed from the Arabic version of the Deutsche Welle (DW) news website, covering diverse domains such as politics, sports, and culture. This diversity promotes better generalization in summarization models.

Format: .csv
Structure: Each line contains an article and its lead summary, separated by a tab (\t)
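
As a minimal sketch of reading this layout, the snippet below parses tab-separated article/lead lines with Python's standard csv module. The function name, the in-memory sample text, and the assumption that the file has no header row and exactly two columns per line are illustrative and not confirmed by the dataset itself.

```python
import csv
import io


def read_article_lead_pairs(handle):
    """Yield (article, lead) tuples from a tab-separated text stream."""
    # QUOTE_NONE treats quotation marks inside articles as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]


# Tiny in-memory stand-in for the real file (hypothetical content).
sample = io.StringIO("article one text\tlead one\narticle two text\tlead two\n")
pairs = list(read_article_lead_pairs(sample))
print(len(pairs), "pairs read")
```

For the real corpus, the same function could be given a file opened with encoding="utf-8" and newline="" after extracting the archive. If the file turns out to start with a header row, the first pair would need to be skipped.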

Related Publications
For more details on the creation and usage of this dataset, please refer to the following papers:

Cross-lingual Fine-tuning for Abstractive Arabic Text Summarization. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Fine-tuning and Multilingual Pre-training for Abstractive Summarization Task for the Arabic Language. Annales Mathematicae et Informaticae (2022)

🔹 AraRLHF

A JSON-formatted dataset designed for training reward models in Reinforcement Learning from Human Feedback (RLHF) pipelines. The AraRLHF dataset consists of 1,746 samples, derived from manual evaluation results conducted in our prior research.

This dataset is used to train a reward model (RM) that predicts the quality of a generated summary based on human preferences.

Format: .json
Structure: Each record includes an article, its lead summary, a ranking label reflecting human preference, and the evaluator ID (author)

🔹 AraDPO

A JSON-formatted dataset designed for Direct Preference Optimization (DPO), following the same structure as AraRLHF. The AraDPO dataset is used to fine-tune language models directly using binary preference pairs. To construct AraDPO, each ranked preference entry in AraRLHF was converted into all possible pairwise comparisons. We then applied deduplication to remove redundant or overlapping pairs, resulting in a final set of 2,309 unique preference records.
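
The conversion described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual script: it assumes each ranked entry is a list of (summary, rank) tuples where a lower rank means a better summary, that ties yield no pair, and that deduplication means dropping exact repeats of the same (article_id, chosen, rejected) triple. The function names and sample values are hypothetical.

```python
from itertools import combinations


def ranked_to_pairs(article_id, article, ranked):
    """Expand one ranking into all pairwise (chosen, rejected) records.

    ranked: list of (summary, rank) tuples; rank 1 is assumed to be best.
    """
    pairs = []
    for (sum_a, rank_a), (sum_b, rank_b) in combinations(ranked, 2):
        if rank_a == rank_b:
            continue  # assumption: a tie expresses no preference
        chosen, rejected = (sum_a, sum_b) if rank_a < rank_b else (sum_b, sum_a)
        pairs.append({
            "article_id": article_id,
            "article": article,
            "chosen": chosen,
            "rejected": rejected,
        })
    return pairs


def deduplicate(pairs):
    """Keep the first occurrence of each (article_id, chosen, rejected) triple."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["article_id"], pair["chosen"], pair["rejected"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Two hypothetical evaluators who ranked the same three summaries identically.
ranking = [("summary A", 1), ("summary B", 2), ("summary C", 3)]
all_pairs = ranked_to_pairs(7, "article text", ranking) + ranked_to_pairs(7, "article text", ranking)
unique_pairs = deduplicate(all_pairs)
print(len(all_pairs), "pairs before deduplication,", len(unique_pairs), "after")
```

How conflicting preferences between evaluators were handled is not stated here, so the sketch leaves such pairs untouched.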

This dataset is used to fine-tune language models with DPO.

Format: .json
Structure: Each record includes article_id, article, chosen (the preferred summary), and rejected (the dispreferred summary).
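
A quick sanity check of records in this shape might look like the sketch below. It assumes the file parses to a JSON array of objects carrying the four fields listed above; whether AraDPO.json is stored as a single array or in another layout is not confirmed here, and the helper name and sample records are hypothetical.

```python
import json

REQUIRED_FIELDS = ("article_id", "article", "chosen", "rejected")


def validate_records(records):
    """Return only records that have every required field and a real preference."""
    valid = []
    for record in records:
        has_fields = all(field in record for field in REQUIRED_FIELDS)
        # A pair whose two summaries are identical carries no preference signal.
        if has_fields and record["chosen"] != record["rejected"]:
            valid.append(record)
    return valid


# In-memory stand-in for the file contents (hypothetical values).
sample_json = """[
  {"article_id": 1, "article": "a", "chosen": "good", "rejected": "bad"},
  {"article_id": 2, "article": "b", "chosen": "only one summary"}
]"""
records = validate_records(json.loads(sample_json))
print(len(records), "valid record(s)")
```

With the real file, json.load on a handle opened with encoding="utf-8" would replace the in-memory string.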

🔖 Citation

If you use AraSum, please cite:

@inproceedings{kahla-etal-2021-cross,
    title = "Cross-lingual Fine-tuning for Abstractive {A}rabic Text Summarization",
    author = "Kahla, Mram  and
      Yang, Zijian Gy{\H{o}}z{\H{o}}  and
      Nov{\'a}k, Attila",
    booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)",
    month = sep,
    year = "2021",
    address = "Held Online",
    publisher = "INCOMA Ltd.",
    url = "https://aclanthology.org/2021.ranlp-1.74",
    pages = "655--663",
}

If you use AraRLHF or AraDPO, please cite:

@inproceedings{kahla2025optimizing,
    title = "Optimizing {A}bstractive {A}rabic {S}ummarization via {RLHF} and {DPO} with {L}lama 2",
    author = "Kahla, Mram  and
      Yang, Zijian Gy{\H{o}}z{\H{o}}",
    booktitle = "Magyar Számítógépes Nyelvészeti Konferencia",
    volume = "XXI",
    pages = "41--55",
    url = "https://rgai.inf.u-szeged.hu/sites/rgai.inf.u-szeged.hu/files/mszny2025%20%281%29.pdf",
    year = "2025",
}
