Software defect prediction is an important software quality assurance technique. It utilizes historical project data and previously discovered defects to predict potential defects. However, most of existing methods assume that large amounts of labeled historical data are available for prediction, while in the early stage of the life cycle, projects may lack the data needed for building such predictors. In addition, most of existing techniques use static code metrics as predictors, while they omit change information that may introduce risks into software development. In this paper, we take these two issues into consideration, and propose a semi-supervised based defect prediction approach - extRF. extRF extends the classical supervised Random Forest algorithm by self-training paradigm. It also employs change burst information for improving accuracy of software defect prediction. We also conduct an experiment to evaluate extRF against three other supervised machine learners (i.e. Logistic Regression, Naive Bayes, Random Forest) and compare the effectiveness of code metrics, change burst metrics, and a combination of them. Experimental results show that extRF trained with a small size of labeled dataset achieves comparable performance to some supervised learning approaches trained with a larger size of labeled dataset. When only 2% of Eclipse 2.0 data are used for training, extRF can achieve F-measure about 0.562, approximate to that of LR (a supervised learning approach) at labeled sampling rate of 50%. Besides, change burst metrics outperform code metrics in that F-measure rises to a peak value of 0.75 for Eclipse 3.0 and JDT.Core.
Huihua LuBojan ČukićMark V. Culp
Ming ChengGuoqing WuMengting YuanHongyan Wan
Ming LiHongyu ZhangRongxin WuZhi‐Hua Zhou
Zhiwu ZhangXiao‐Yuan JingTiejian Wang
MaYingPanWeiweiZhuShunzhiYinHuayiLuoJian