JOURNAL ARTICLE

A novel feature selection technique for enhancing performance of unbalanced text classification problem

Santosh Kumar Behera, Rajashree Dash

Year: 2022   Journal: Intelligent Decision Technologies   Vol: 16 (1)   Pages: 51-69   Publisher: IOS Press

Abstract

Over the last few decades, Text Classification (TC) has emerged as an important research direction due to the huge volume of digital text documents available on the web, which would be tedious for human experts to organize and label manually. Moreover, the large number of highly sparse terms and the skewed categories present in such documents pose serious challenges to the correct labeling of unlabeled documents. Feature selection is therefore an essential step in text classification, aiming to select more concise and relevant features for further mining of the documents. The task becomes still harder when documents are associated with multiple categories and the class distribution of the dataset is unbalanced. In this paper, a Modified Chi-Square (ModCHI) based feature selection technique is proposed for enhancing the classification performance on multi-labeled text documents with unbalanced class distributions. It is an improved version of the Chi-square (Chi) method that emphasizes selecting more features from the classes with a large number of training and testing documents. Unlike Chi, in which the top features are simply those with the highest Chi values, the proposed technique computes a score from the number of relevant documents in each class relative to the total number of documents in the original dataset. According to this score, features that are associated with highly relevant classes and that also have high Chi-square values are selected for further processing. The proposed technique is verified with four different classifiers, namely Linear SVM (LSVM), Decision Tree (DT), Multilevel KNN (MLKNN), and Random Forest (RF), on the Reuters benchmark multi-labeled, multi-class, unbalanced dataset.
The effectiveness of the model is also tested by comparing it with three traditional feature selection techniques: term frequency-inverse document frequency (TF-IDF), Chi-square, and Mutual Information (MI). The experimental outcomes clearly show that LSVM with ModCHI produces the highest precision value of 0.94, recall value of 0.80, F-measure of 0.86, and the lowest Hamming loss of 0.003 with a feature size of 1000. The proposed feature selection technique with LSVM yields improvements of 3.33%, 2.19%, and 16.25% in average precision, 3.03%, 33.33%, and 21.42% in average recall, 4%, 34.48%, and 14.70% in average F-measure, and 14%, 37.68%, and 31.74% in average Hamming loss compared to the TF-IDF, Chi, and MI techniques, respectively. These findings indicate the better performance of the proposed feature selection technique compared to TF-IDF, Chi, and MI on the unbalanced Reuters dataset.
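The abstract describes ModCHI as weighting per-class Chi-square values by a score derived from each class's share of documents in the dataset. The exact formula is not given in the abstract, so the sketch below is only an illustration under that reading: it computes the standard per-(term, class) Chi-square statistic for binary term presence, multiplies it by the class's document proportion, and keeps the top-k terms. The function names `chi_square` and `modchi_select` are placeholders, not the authors' code.

```python
import numpy as np

def chi_square(term_doc, labels, n_classes):
    """Per-(term, class) chi-square statistics for a binary term-document matrix.

    term_doc : (n_docs, n_terms) array of 0/1 term-presence indicators.
    labels   : (n_docs,) integer class labels in [0, n_classes).
    """
    n_docs, n_terms = term_doc.shape
    chi = np.zeros((n_terms, n_classes))
    for c in range(n_classes):
        in_c = labels == c
        A = term_doc[in_c].sum(axis=0)    # term present, doc in class c
        B = term_doc[~in_c].sum(axis=0)   # term present, doc not in class c
        C = in_c.sum() - A                # term absent, doc in class c
        D = (~in_c).sum() - B             # term absent, doc not in class c
        num = n_docs * (A * D - B * C) ** 2
        den = (A + B) * (C + D) * (A + C) * (B + D)
        chi[:, c] = np.where(den > 0, num / np.maximum(den, 1), 0.0)
    return chi

def modchi_select(term_doc, labels, n_classes, k):
    """Select k terms by chi-square weighted with class prevalence (assumed scoring)."""
    chi = chi_square(term_doc, labels, n_classes)
    # Assumed class score: fraction of all documents belonging to each class,
    # so classes with more documents contribute more selected features.
    class_share = np.bincount(labels, minlength=n_classes) / len(labels)
    score = (chi * class_share).max(axis=1)
    return np.argsort(score)[::-1][:k]
```

Under this weighting, a term that is strongly associated with a large class outranks an equally associated term from a small class, which matches the stated goal of drawing more features from classes with many documents.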

Keywords:
Computer science, Selection (genetic algorithm), Feature selection, Class (philosophy), Set (abstract data type), Document classification, Feature (linguistics), Value (mathematics), Information retrieval, Artificial intelligence, Data mining, Pattern recognition (psychology), Machine learning

Metrics

Cited By: 7
FWCI (Field Weighted Citation Impact): 1.17
Refs: 29
Citation Normalized Percentile: 0.76

Topics

Text and Document Classification Technologies
Physical Sciences →  Computer Science →  Artificial Intelligence
Imbalanced Data Classification Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Currency Recognition and Detection
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Performance Enhancement of the Unbalanced Text Classification Problem Through a Modified Chi Square-Based Feature Selection Technique

Santosh Kumar Behera, Rajashree Dash

Journal: International Journal of Intelligent Information Technologies   Year: 2022   Vol: 18 (1)   Pages: 1-23
BOOK-CHAPTER

A Novel Feature Selection Technique for Text Classification

D. S. Guru, Mostafa Z. Ali, Mahamad Suhil

Advances in Intelligent Systems and Computing   Year: 2018   Pages: 721-733
JOURNAL ARTICLE

Effective feature selection technique for text classification

Hari Seetha, M. Narasimha Murty, R. Saravanan

Journal: International Journal of Data Mining Modelling and Management   Year: 2015   Vol: 7 (3)   Pages: 165-165
JOURNAL ARTICLE

A Novel Feature Selection Technique for Text Classification Using Naïve Bayes

Subhajit Dey Sarkar, Saptarsi Goswami, Aman Agarwal, Javed Aktar

Journal: International Scholarly Research Notices   Year: 2014   Vol: 2014   Pages: 1-10