JOURNAL ARTICLE

An embedded feature selection method for imbalanced data classification

Haoyue Liu, MengChu Zhou, Qing Liu

Year: 2019 Journal: IEEE/CAA Journal of Automatica Sinica Vol: 6 (3) Pages: 703-715 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Imbalanced datasets are frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy with which the minority class is identified is a critically important issue. Feature selection is one method to address this issue. An effective feature selection method can choose a subset of features that favors the accurate determination of the minority class. A decision tree is a classifier that can be built by using different splitting criteria. Its advantage is the ease of detecting which feature is used as a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, an embedded feature selection method using a weighted Gini index (WGI) is proposed. Comparisons with the Chi2, F-statistic, and Gini index feature selection methods show that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, the proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC performance can be high even if only a few features are selected and used, and that it changes only slightly as more features are selected. However, F-measure achieves excellent performance only if 20% or more of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.
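The idea of using a decision tree splitting criterion as a feature selection method can be illustrated with a minimal sketch. The code below scores each feature by the largest decrease in the standard Gini impurity that any single threshold split on that feature achieves, then ranks features by that score. It is not the authors' weighted Gini index (WGI), whose definition is not given in the abstract; the function names and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_gini_gain(values, labels):
    """Largest impurity decrease over all threshold splits of one feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


def rank_features(X, y):
    """Return (feature indices sorted by decreasing gain, per-feature gains)."""
    n_features = len(X[0])
    scores = [best_gini_gain([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical imbalanced toy data: 8 majority samples (0), 2 minority samples (1).
# Feature 0 separates the classes cleanly; feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0], [0.5, 3.0],
     [0.6, 5.5], [0.7, 1.5], [0.8, 4.5], [2.0, 2.5], [2.1, 3.5]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

order, scores = rank_features(X, y)
print("feature ranking:", order)
print("gini gains:", scores)
```

Keeping the top-ranked features from `order` gives a selected subset. A class-weighted criterion such as the one the abstract names would presumably change how `gini` treats the minority class, but that detail is outside what the abstract states.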

Keywords:
Feature selection; Statistic; Computer science; Classifier (UML); Pattern recognition (psychology); Decision tree; Information gain ratio; Artificial intelligence; Receiver operating characteristic; Feature (linguistics); Data mining; Selection (genetic algorithm); Statistics; Machine learning; Mathematics

Metrics

Cited By: 409
FWCI (Field Weighted Citation Impact): 26.73
Refs: 38
Citation Normalized Percentile: 1.00
Is in top 1%
Is in top 10%

Topics

Imbalanced Data Classification Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Text and Document Classification Technologies
Physical Sciences →  Computer Science →  Artificial Intelligence
Face and Expression Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

BOOK-CHAPTER

Imbalanced Data Classification Based on Feature Selection Techniques

Paweł Ksieniewicz, Michał Woźniak

Lecture notes in computer science Year: 2018 Pages: 296-303
JOURNAL ARTICLE

Imbalanced Big Data Classification using Feature Selection Under-Sampling

C. Sarada, M. Sathya Devi

Journal: CVR Journal of Science & Technology Year: 2019 Vol: 17 (1) Pages: 78-82
JOURNAL ARTICLE

Feature Selection in Imbalanced Data

Firuz Kamalov, Fadi Thabtah, Ho‐Hon Leung

Journal: Annals of Data Science Year: 2022 Vol: 10 (6) Pages: 1527-1541
JOURNAL ARTICLE

Feature Selection based Improved Seagull Optimization for Imbalanced Data Classification

Journal: International journal of intelligent engineering and systems Year: 2024 Vol: 17 (6) Pages: 852-866