Automatic Extractive Text Summarization using Multiple Linguistic Features

Pooja Gupta; Swati Nigam; Rajiv Singh

doi:10.1145/3656471

JOURNAL ARTICLE

Year: 2024

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Publisher: Association for Computing Machinery

Abstract

Automatic text summarization (ATS) uses natural language processing (NLP) to produce summaries of distinct categories of information. For low-resource languages such as Hindi, applications of these techniques remain limited. This study proposes a method for automatically generating summaries of Hindi documents using an extractive technique. The approach retrieves pertinent sentences from the source documents by combining multiple linguistic features with machine learning (ML) based on maximum likelihood estimation (MLE) and maximum entropy (ME). We pre-processed the input documents by eliminating Hindi stop words and applying stemming. We obtained 15 linguistic feature scores from each document to identify the high-scoring phrases for summary generation. We performed experiments on BBC News articles, CNN News, DUC 2004, the Hindi Text Short Summarization Corpus, the Indian Language News Text Summarization Corpus, and Wikipedia articles. The Hindi Text Short Summarization Corpus and the Indian Language News Text Summarization Corpus are in Hindi, whereas the BBC News, CNN News, and DUC 2004 datasets were translated into Hindi using the Google, Microsoft Bing, and Systran translators. The summarization results are reported for both Hindi and English to compare performance on a low-resource and a rich-resource language. Multiple ROUGE metrics, along with precision, recall, and F-measure, were used for the evaluation, and the proposed method performs better on these scores. We compared the proposed method with supervised and unsupervised machine learning methods, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and found that it outperforms them.
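The pipeline the abstract describes (pre-process, score sentences with linguistic features, extract the top-scoring ones, evaluate with ROUGE) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the 15 features or the MLE/ME models, so two common stand-in features (mean term frequency and sentence position) are used here, stemming is omitted, and the stop-word list and all function names are hypothetical placeholders.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the list used in the paper.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to",
              "है", "का", "की", "के", "और", "में"}
PUNCT = ".,;:!?\u0964\"'()"


def words(text):
    """Whitespace tokens, lowercased, with surrounding punctuation stripped."""
    stripped = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in stripped if w]


def content_words(text):
    """Tokens with stop words removed (stemming is omitted in this sketch)."""
    return [w for w in words(text) if w not in STOP_WORDS]


def split_sentences(document):
    """Split on '.', '!', '?' and the Devanagari danda (U+0964)."""
    parts = re.split(r"[.!?\u0964]+", document)
    return [p.strip() for p in parts if p.strip()]


def score_sentences(sentences):
    """Stand-in score: mean document-level term frequency plus a position bonus."""
    tokens = [content_words(s) for s in sentences]
    tf = Counter(t for sent in tokens for t in sent)
    n = len(sentences)
    scores = []
    for i, sent in enumerate(tokens):
        tf_score = sum(tf[t] for t in sent) / len(sent) if sent else 0.0
        position_score = (n - i) / n  # earlier sentences score higher
        scores.append(tf_score + position_score)
    return scores


def summarize(document, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(document)
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def rouge_1(candidate, reference):
    """ROUGE-1 precision, recall and F-measure from clipped unigram overlap."""
    cand, ref = Counter(words(candidate)), Counter(words(reference))
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, `summarize("Cats chase mice. Dogs chase cats. Birds sing.", k=2)` keeps the first two sentences, and `rouge_1` can then compare the joined summary against a reference summary. Tokens are split on whitespace rather than with a `\w` regular expression because Devanagari vowel signs are combining marks that `\w` does not match.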

Keywords:

Automatic summarization; Computer science; Natural language processing; Artificial intelligence; Linguistics; Information retrieval; Philosophy

Metrics

FWCI (Field Weighted Citation Impact): 4.47

Citation Normalized Percentile: 0.91

Topics

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Text Analysis Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Statistical Features for Extractive Automatic Text Summarization

Extractive Text Summarization Using Topological Features

Optimal Features Set for Extractive Automatic Text Summarization

Automatic Persian Text Summarization Using Linguistic Features from Text Structure Analysis