DoR - Division of Research

Title: Training data selection using ensemble dataset approach for software defect prediction

Author: Md. Fahimuzzman Sohan, Md Alamgir Kabir, Md. Mostafijur Rahman, S. M. Hasan Mahmud, Touhid Bhuiyan,

Abstract: Cross-project defect prediction (CPDP) is using due to the limitation of within project defect prediction (WPDP) in Software Defect Prediction (SDP) research. CPDP aims to train one project data to predict another project using the machine learning technique. The source and target projects are different in the CPDP setting, because of various structured source-target projects, sometimes it may not be a perfect combination. This study represents a categorical data set ensemble technique, where multiple data sets have been aggregated for source data instead of using a single data set. The method has been evaluated on nine data sets, taken from the publicly accessible repository with two performance indicators. The results of this data set ensemble approach show the improvement of the prediction performance over 65% combinations compared with traditional CPDP models. The results also show that same categories (homogeneous) train-test data set pairs give high performance; otherwise, the prediction performances of different category data sets are mostly collapsed. Therefore, the proposed scheme is recommended as an alternative to predict defects that can improve the prediction of most of the cases compared with traditional cross-project SDP models.

Keywords: Software defect prediction Cross-project defect prediction Training data selection Data set ensemble

Journal or Conference Name: Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST