JOURNAL ARTICLE

An empirical analysis of feature selection techniques for Software Defect Prediction

Abstract

Predicting software defects before they occur is crucial in software engineering because defects affect the quality and reliability of software systems. Previous studies on software defect prediction have typically used software features such as code size, complexity, coupling, cohesion, inheritance, and other software metrics to forecast whether a code file or commit is likely to be defect-prone. However, it is advantageous to restrict the number of features used in a defect prediction model, both to avoid the problems associated with multicollinearity and the “curse of dimensionality” and to simplify data analysis. With fewer features, the model can concentrate on the most significant variables, which can improve its accuracy. This paper investigates the impact of eight feature selection methods on the accuracy and stability of six supervised learning models. The study is novel in that it exhaustively evaluates each of the eight feature selection techniques with each of the six supervised learning models. Two notable findings emerge. First, the association and coherence-based techniques achieved the highest defect prediction accuracy, and models built on the features they selected outperformed models built on the original features. Second, Correlation feature selection, Recursive feature elimination, and Ridge feature selection, when combined with the Support vector machine and Decision tree classifiers, consistently selected low-variance features across multiple supervised defect prediction models. Combined with different classifiers, these techniques performed strongly on the publicly available NASA datasets CM1 and PC2: accuracy exceeded 85% for CM1 and 95% for PC2, with precision, recall, and F-measure values exceeding 95%. These were the highest performance levels achieved in the evaluation.

Keywords:
Feature selection; Computer science; Artificial intelligence; Machine learning; Software bug; Data mining; Software; Support vector machine; Predictive modelling; Feature (linguistics); Software quality; Classifier (UML); Random forest; Multicollinearity; Software metric; Decision tree; Pattern recognition (psychology); Regression analysis; Software development

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 3.06
Refs: 0
Citation Normalized Percentile: 0.86

Topics

Software Engineering Research
Physical Sciences →  Computer Science →  Information Systems
Software System Performance and Reliability
Physical Sciences →  Computer Science →  Computer Networks and Communications
Software Reliability and Analysis Research
Physical Sciences →  Computer Science →  Software

Related Documents

JOURNAL ARTICLE

Empirical validation of feature selection techniques for cross-project defect prediction

Ruchika Malhotra, Shweta Meena

Journal: International Journal of Systems Assurance Engineering and Management, Year: 2023, Vol: 15 (5), Pages: 1743-1755

JOURNAL ARTICLE

ELM and KELM based software defect prediction using feature selection techniques

Ishani Arora, Anju Saha

Journal: Journal of Information and Optimization Sciences, Year: 2019, Vol: 40 (5), Pages: 1025-1045

JOURNAL ARTICLE

Genetic Feature Selection for Software Defect Prediction

Romi Satria Wahono, Nanna Suryana

Journal: Advanced Science Letters, Year: 2013, Vol: 20 (1), Pages: 239-244