Abstract

We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents and 112k+ annotated summaries in different target languages. Based on ClidSum, we introduce two benchmark settings, one for the supervised scenario and one for the semi-supervised scenario. We then build various baseline systems in two paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART, which extends mBART via further pre-training; its multiple pre-training objectives help the model capture the structural characteristics and key content of dialogues as well as the transformation from the source to the target language. Experimental results show that mDialBART, as an end-to-end model, outperforms strong pipeline models on ClidSum. Finally, we discuss specific challenges that current approaches face on this task and give multiple promising directions for future research. We have released the dataset and code at https://github.com/krystalan/ClidSum.

Keywords:
Automatic summarization; Benchmark (surveying); Computer science; Pipeline (software); Baseline (sea); Key (lock); Task (project management); Artificial intelligence; Code (set theory); Natural language processing; Transformation (genetics); Machine learning; Language model; Source code; Information retrieval; Programming language
