JOURNAL ARTICLE

Predicting Thalassemia Using Feature Selection Techniques: A Comparative Analysis

M.M. Saleem, Waqar Aslam, M. Ikram Ullah Lali, Hafiz Tayyab Rauf, Emad Abouel Nasr

Year: 2023   Journal: Diagnostics   Vol: 13 (22)   Pages: 3441-3441   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Thalassemia is one of the most common genetic disorders worldwide and is characterized by defects in hemoglobin synthesis. Affected individuals have malfunctions in one or more of the four globin genes, leading to chronic hemolytic anemia, an imbalance in the hemoglobin chain ratio, iron overload, and ineffective erythropoiesis. Despite the challenges posed by this condition, recent years have seen substantial advances in diagnosis, therapy, and transfusion support that have significantly improved the prognosis for thalassemia patients. This research empirically evaluates the efficacy of models built with classification methods and explores the effectiveness of relevant features derived using various machine-learning techniques. Five feature selection approaches, namely Chi-Square (χ²), Exploratory Factor Score (EFS), tree-based Recursive Feature Elimination (RFE), gradient-based RFE, and Linear Regression Coefficient, were employed to determine the optimal feature set. Nine classifiers, namely K-Nearest Neighbors (KNN), Decision Trees (DT), Gradient Boosting Classifier (GBC), Linear Regression (LR), AdaBoost, Extreme Gradient Boosting (XGB), Random Forest (RF), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM), were used to evaluate performance. When paired with the LR classifier, the χ² method registered 91.56% precision, 91.04% recall, and a 92.65% F-score. Moreover, the results show that combining over-sampling via the Synthetic Minority Over-sampling Technique (SMOTE) with RFE and 10-fold cross-validation markedly improves detection accuracy for αT patients. Notably, the Gradient Boosting Classifier (GBC) achieves 93.46% accuracy, 93.89% recall, and a 92.72% F1 score.
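
To make two of the techniques named in the abstract concrete, the following is a minimal, dependency-free Python sketch of chi-square feature scoring and SMOTE-style over-sampling. It is not the authors' code: the function names `chi2_scores` and `smote_like`, the parameter defaults, and the toy data are all assumptions made up for illustration, and the toy values are not drawn from the study. The chi-square score here assumes non-negative features and compares per-class feature sums against what class frequency alone would predict; the over-sampler interpolates between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def chi2_scores(X, y):
    """Chi-square score per feature, for non-negative features and class labels.

    Observed values are per-class sums of a feature; expected values assume the
    feature total is split across classes in proportion to class frequency.
    """
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[5, 1], [6, 1], [5, 2], [1, 1], [1, 2], [0, 1]]
y = [1, 1, 1, 0, 0, 0]
scores = chi2_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Hypothetical minority class with three samples, over-sampled by four points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=4)
```

In this toy setup feature 0 ranks first because its per-class sums differ sharply from the class-frequency expectation, while feature 1 scores zero. In practice, library implementations such as those in scikit-learn and imbalanced-learn would normally be used instead of hand-written versions, with over-sampling applied only to the training folds during cross-validation.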

Keywords:
Support vector machine; Artificial intelligence; Feature selection; Decision tree; AdaBoost; Gradient boosting; Random forest; Pattern recognition (psychology); Computer science; Classifier (UML); Machine learning; Precision and recall

Metrics

Cited By: 21
FWCI (Field Weighted Citation Impact): 7.89
Refs: 93
Citation Normalized Percentile: 0.97

Topics

Hemoglobinopathies and Related Disorders
Health Sciences →  Medicine →  Genetics
Iron Metabolism and Disorders
Health Sciences →  Medicine →  Hematology
Imbalanced Data Classification Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence