Aravind Dendukuri, Sagar Goyal, Jannat Arora, Abhinav Pradeep
The past few years have seen massive growth in the number of daily internet users whose primary language of communication is Hindi. Hindi is now one of the most widely spoken languages in the world and an official language of the Government of India. Given the resulting rise in the volume of Hindi data, managing, analyzing, and summarizing documents has become a significant task with many applications. However, language models and Natural Language Processing work catering to this demographic have been very limited in scope, and even state-of-the-art multilingual models struggle with the nuances of the language. To bridge this gap, the MuRIL [37] language model was developed and trained on large-scale Indian text corpora. The present work focuses on the summarization of Hindi documents. We build on the MuRIL model and develop a novel extractive summarization approach that uses the language model's embeddings. Newspaper articles spanning several categories serve as our training data, and comprehensive testing shows that our model exceeds the performance of previous baselines on the accuracy metric.
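As a rough illustration of what embedding-based extractive summarization can look like, the sketch below scores each sentence by its cosine similarity to the document centroid and keeps the top-scoring sentences in their original order. The abstract does not describe the actual selection procedure, so centroid scoring is an assumption chosen for simplicity, not the authors' method. The names `split_sentences`, `toy_embed`, and `summarize` are hypothetical, and `toy_embed` is only a placeholder for real MuRIL sentence embeddings.

```python
import math
import re
from typing import Callable, List

Vector = List[float]


def split_sentences(text: str) -> List[str]:
    """Split Hindi text on the danda (।), '?' and '!'."""
    parts = re.split(r"[।?!]\s*", text)
    return [p.strip() for p in parts if p.strip()]


def toy_embed(sentence: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in for a sentence embedding (NOT MuRIL).

    Each whitespace token is hashed into one of `dim` buckets.
    In practice this would be replaced by language-model embeddings.
    """
    vec = [0.0] * dim
    for token in sentence.split():
        bucket = sum(ord(ch) for ch in token) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(
    sentences: List[str],
    k: int,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[str]:
    """Return up to k sentences closest to the document centroid,
    kept in their original document order."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    scores = [cosine(v, centroid) for v in vectors]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    doc = "भारत एक विशाल देश है। यहाँ कई भाषाएँ बोली जाती हैं। हिंदी भारत की प्रमुख भाषा है।"
    for line in summarize(split_sentences(doc), k=2):
        print(line)
```

Passing a different `embed` callable (for example, one that wraps a pretrained model) leaves the selection logic unchanged, which is the main reason the embedder is kept as a parameter here.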
Mukesh More, Pallavi Yevale, Abhang Mandwale, K.C. Agrawal, Om Mahale, Swati Rajput
Priyadarshini Patil, Chandan Rao, G.V. Rithin Kumar Reddy, Riteesh Ram, S. M. Meena