ppke-nlpg/AraSum

AraSum

This repository contains AraSum, the first large-scale monolingual Arabic corpus for abstractive text summarization, as well as two newly released datasets for preference-based fine-tuning:

  • AraRLHF: A dataset formatted for Reinforcement Learning from Human Feedback (RLHF)
  • AraDPO: A dataset formatted for Direct Preference Optimization (DPO)

📚 Datasets

🔹 AraSum

AraSum is a monolingual Arabic summarization corpus containing 49,604 articles and their corresponding leads. It was constructed from the Arabic version of the Deutsche Welle (DW) news website, covering diverse domains such as politics, sports, and culture. This diversity promotes better generalization in summarization models.

  • Format: .csv
  • Structure: Each line contains an article and its lead summary, separated by a tab (\t)
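As a minimal sketch of reading this layout, the snippet below splits each line on the first tab into an (article, lead) pair. The function name `read_pairs` and the file name in the usage comment are hypothetical; whether the file has a header row or quoted fields is not stated here, so check the data before relying on this.

```python
def read_pairs(lines):
    """Return (article, lead) tuples from an iterable of tab-separated lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so a stray tab in the lead is preserved.
        article, sep, lead = line.partition("\t")
        if not sep:
            continue  # no tab found: not an article/lead line
        pairs.append((article, lead))
    return pairs


# Usage (hypothetical file name):
# with open("arasum.csv", encoding="utf-8") as handle:
#     pairs = read_pairs(handle)
```

Passing an open file handle works because iterating over it yields one line at a time.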

Related Publications
For more details on the creation and usage of this dataset, please refer to the papers listed in the Citation section.

🔹 AraRLHF

A JSON-formatted dataset designed for training reward models in Reinforcement Learning from Human Feedback (RLHF) pipelines. The AraRLHF dataset consists of 1,746 samples, derived from the manual evaluation results of our prior research.

This dataset is used to train a reward model (RM) that predicts the quality of a generated summary based on human preferences.

  • Format: .json
  • Structure: Each record includes an article, its lead summary, a ranking label reflecting human preference, and the evaluator ID (author)

🔹 AraDPO

A JSON-formatted dataset designed for Direct Preference Optimization (DPO), following the same structure as AraRLHF. The AraDPO dataset is used to fine-tune language models directly using binary preference pairs. To construct AraDPO, each ranked preference entry in AraRLHF was converted into all possible pairwise comparisons. We then applied deduplication to remove redundant or overlapping pairs, resulting in a final set of 2,309 unique preference records.

This dataset is used to fine-tune language models with DPO.

  • Format: .json
  • Structure: Each record includes article_id, article, chosen (the preferred summary), and rejected (the dispreferred summary).
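The conversion from ranked entries to pairwise records described above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' script: the input field `ranked_summaries` (summaries ordered best first) and the function name `ranked_to_pairs` are hypothetical, and deduplication here simply drops exact repeats of the same (article_id, chosen, rejected) triple.

```python
from itertools import combinations


def ranked_to_pairs(entries):
    """Expand ranked entries into unique chosen/rejected preference records."""
    seen = set()
    records = []
    for entry in entries:
        # Hypothetical field: summaries for one article, ordered best first.
        ranked = entry["ranked_summaries"]
        # combinations() keeps input order, so the first element of each
        # pair always ranks higher than the second.
        for chosen, rejected in combinations(ranked, 2):
            if chosen == rejected:
                continue  # identical texts carry no preference signal
            key = (entry["article_id"], chosen, rejected)
            if key in seen:
                continue  # drop exact duplicate pairs
            seen.add(key)
            records.append({
                "article_id": entry["article_id"],
                "article": entry["article"],
                "chosen": chosen,
                "rejected": rejected,
            })
    return records
```

A ranking of n summaries yields up to n(n-1)/2 pairs, so deduplication matters when several evaluators ranked the same article.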

🔖 Citation

If you use AraSum, please cite:

@inproceedings{kahla-etal-2021-cross,
    title = "Cross-lingual Fine-tuning for Abstractive {A}rabic Text Summarization",
    author = "Kahla, Mram  and
      Yang, Zijian Gy{\H{o}}z{\H{o}}  and
      Nov{\'a}k, Attila",
    booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)",
    month = sep,
    year = "2021",
    address = "Held Online",
    publisher = "INCOMA Ltd.",
    url = "https://aclanthology.org/2021.ranlp-1.74",
    pages = "655--663",
}

If you use AraRLHF or AraDPO, please cite:

@inproceedings{kahla2025optimizing,
    title = "Optimizing {A}bstractive {A}rabic {S}ummarization via {RLHF} and {DPO} with {L}lama 2",
    author = "Kahla, Mram  and
      Yang, Zijian Gy{\H{o}}z{\H{o}}",
    booktitle = "Magyar Számítógépes Nyelvészeti Konferencia",
    volume = "XXI",
    pages = "41--55",
    url = "https://rgai.inf.u-szeged.hu/sites/rgai.inf.u-szeged.hu/files/mszny2025%20%281%29.pdf",
    year = "2025",
}
