An empirical analysis of feature selection techniques for Software Defect Prediction

Tarunim Sharma; Aman Jatain; Shalini Bhaskar; Kavita Pabreja

doi:10.32629/jai.v7i3.1097

JOURNAL ARTICLE

Year: 2024 Journal: Journal of Autonomous Intelligence Vol: 7 (3)

Abstract

Predicting software defects before they occur is crucial in software engineering because it affects the quality and reliability of software systems. Previous studies on software defect prediction have typically used software features, such as code size, complexity, coupling, cohesion, inheritance, and other software metrics, to forecast whether a code file or commit is likely to be defect-prone. However, it is advantageous to limit the number of features used in a defect prediction model, both to avoid the problems associated with multicollinearity and the “curse of dimensionality” and to simplify the data analysis. With fewer features, the model can concentrate on the most significant variables, which can improve its accuracy. This paper investigates the impact of eight feature selection methods on the accuracy and stability of six supervised learning models. The study is novel in that it exhaustively evaluates each of the eight feature selection techniques with each of the six supervised learning models. Two notable findings emerged. First, the association and coherence-based techniques achieved the highest accuracy in defect prediction, and models built on the selected features outperformed those using the original features. Second, Correlation feature selection, Recursive feature elimination, and Ridge feature selection, when combined with the Support vector machine and Decision tree classifiers, consistently selected low-variance features across multiple supervised defect prediction models. When combined with different classifiers, these techniques performed strongly on the publicly available NASA datasets CM1 and PC2, reaching accuracy of over 85% for CM1 and 95% for PC2, with precision, recall, and F-measure values exceeding 95%. These were the best results obtained in the evaluation.
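To make the idea of filter-style feature selection concrete, here is a minimal sketch that ranks features by their absolute Pearson correlation with the defect label and keeps the top k. It is not the authors' implementation: the abstract does not specify their tooling, and this univariate ranking is a simpler stand-in for the Correlation feature selection method it names. The function names (`pearson`, `select_top_k`, `make_toy_data`) and the synthetic data are assumptions made for illustration; the NASA CM1 and PC2 datasets are not used here.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_k(rows, labels, k):
    """Return the indices of the k features most correlated (in absolute value) with the label."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:k])


def make_toy_data(n=200, seed=0):
    """Hypothetical stand-in for a defect dataset: features 0 and 1 drive the label, 2-4 are noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in range(5)]
        labels.append(1 if row[0] + row[1] > 0 else 0)
        rows.append(row)
    return rows, labels


rows, labels = make_toy_data()
selected = select_top_k(rows, labels, k=2)
print("selected feature indices:", selected)
```

A classifier would then be trained only on the selected columns. Wrapper methods such as Recursive feature elimination differ in that they repeatedly refit a model and drop the weakest features, rather than scoring each feature independently as this sketch does.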

Keywords:

Feature selection; Computer science; Artificial intelligence; Machine learning; Software bug; Data mining; Software; Support vector machine; Predictive modelling; Feature (linguistics); Software quality; Classifier (UML); Random forest; Multicollinearity; Software metric; Decision tree; Pattern recognition (psychology); Regression analysis; Software development

Metrics

FWCI (Field Weighted Citation Impact): 3.06

Citation Normalized Percentile: 0.86

Topics

Software Engineering Research

Physical Sciences → Computer Science → Information Systems

Software System Performance and Reliability

Physical Sciences → Computer Science → Computer Networks and Communications

Software Reliability and Analysis Research

Physical Sciences → Computer Science → Software

Related Documents

Search-Based Wrapper Feature Selection Methods in Software Defect Prediction: An Empirical Analysis

Performance Analysis of Feature Selection Techniques in Software Defect Prediction using Machine Learning

Empirical validation of feature selection techniques for cross-project defect prediction

ELM and KELM based software defect prediction using feature selection techniques

Genetic Feature Selection for Software Defect Prediction