JOURNAL ARTICLE

Multi-Document Summarization with Centroid-Based Pretraining

Abstract

In Multi-Document Summarization (MDS), the input can be modeled as a set of documents, and the output is its summary. In this paper, we focus on pretraining objectives for MDS. Specifically, we introduce a novel pretraining objective, which involves selecting the ROUGE-based centroid of each document cluster as a proxy for its summary. Our objective thus does not require human-written summaries and can be utilized for pretraining on a dataset consisting solely of document sets. Through zero-shot, few-shot, and fully supervised experiments on multiple MDS datasets, we show that our model, Centrum, performs better than or comparably to a state-of-the-art model. We make the pretrained and fine-tuned models freely available to the research community at https://github.com/ratishsp/centrum.
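
The centroid-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`rouge1_f1`, `select_centroid`, `make_pretraining_pair`) are hypothetical, and a plain unigram-overlap F1 is assumed as a simplified stand-in for the ROUGE score used in the paper. The sketch picks the document with the highest average similarity to the other documents in its cluster and treats it as the proxy summary, with the remaining documents as the input.

```python
from collections import Counter
from typing import List, Tuple


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1; an assumed, simplified stand-in for full ROUGE."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_centroid(cluster: List[str]) -> int:
    """Index of the document with the highest mean score against the others."""
    if len(cluster) < 2:
        raise ValueError("a cluster needs at least two documents")
    scores = []
    for i, doc in enumerate(cluster):
        others = [d for j, d in enumerate(cluster) if j != i]
        scores.append(sum(rouge1_f1(doc, o) for o in others) / len(others))
    return max(range(len(cluster)), key=scores.__getitem__)


def make_pretraining_pair(cluster: List[str]) -> Tuple[List[str], str]:
    """Split a cluster into (input documents, proxy summary)."""
    idx = select_centroid(cluster)
    return cluster[:idx] + cluster[idx + 1:], cluster[idx]


cluster = [
    "storm hits coast",
    "storm hits coast and floods towns",
    "floods towns after storm",
    "markets rose",
]
sources, target = make_pretraining_pair(cluster)
```

In this toy cluster the second document overlaps most with the others, so it becomes the target and the other three documents form the input. No human-written summary is involved at any point, which is the property the abstract highlights.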

Keywords:
Automatic summarization, Centroid, Computer science, Focus (optics), Set (abstract data type), Artificial intelligence, Proxy (statistics), Single shot, Information retrieval, One shot, Multi-document summarization, Data mining, Natural language processing, Pattern recognition (psychology), Machine learning

Metrics

Cited By: 10
FWCI (Field Weighted Citation Impact): 2.55
Refs: 20
Citation Normalized Percentile: 0.88

Topics

Topic Modeling: Physical Sciences → Computer Science → Artificial Intelligence
Natural Language Processing Techniques: Physical Sciences → Computer Science → Artificial Intelligence
Advanced Text Analysis Techniques: Physical Sciences → Computer Science → Artificial Intelligence