Assessing the Effect of Imbalanced Learning on Cross-project Software Defect Prediction
Software Defect Prediction (SDP) identifies defect-prone modules in software source code, which helps teams deliver good-quality software. Most previous cross-project SDP models were built from the data of a single project. In contrast, this investigation presents an empirical study of SDP in which data from multiple projects are used to prepare the prediction models. The multi-project data are used to prepare one balanced and one imbalanced training dataset, and each dataset is then used to train prediction models with eight different classification algorithms. The trained models are cross-checked against a balanced and an imbalanced test dataset, and five evaluation metrics are used to assess their performance. The experimental results show no significant overall difference between the models trained on balanced data and those trained on imbalanced data. For the balanced training model, only the AUC (Area Under the Curve) score increased significantly on the imbalanced test dataset, while both accuracy and AUC increased significantly on the balanced test dataset. The study achieves broad coverage by building the classification models from the historical data of multiple projects. It further shows that, when a sufficient number of defective and non-defective instances are supplied to the prediction model, the model predicts balanced and imbalanced datasets alike. The recommendation is therefore to consider imbalanced learning when building cross-project prediction models.
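The data preparation and evaluation steps summarized above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: random undersampling as the balancing method, label 1 for defective and 0 for non-defective modules, and a rank-based AUC computation. The helper names (`pool_projects`, `undersample`, `auc`) and the toy data are hypothetical.

```python
import random

def pool_projects(projects):
    """Merge per-project lists of (features, label) rows into one dataset."""
    return [row for rows in projects.values() for row in rows]

def undersample(rows, seed=0):
    """Balance classes by randomly dropping rows from the larger class.

    Assumption: label 1 = defective, label 0 = non-defective.
    """
    rng = random.Random(seed)
    defective = [r for r in rows if r[1] == 1]
    clean = [r for r in rows if r[1] == 0]
    n = min(len(defective), len(clean))
    balanced = rng.sample(defective, n) + rng.sample(clean, n)
    rng.shuffle(balanced)
    return balanced

def auc(labels, scores):
    """AUC as the fraction of (defective, clean) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one module of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: two projects, one code metric per module.
projects = {
    "project_a": [([120], 1), ([30], 0), ([45], 0), ([20], 0)],
    "project_b": [([200], 1), ([15], 0), ([60], 0), ([150], 1), ([25], 0)],
}
imbalanced_train = pool_projects(projects)      # 3 defective, 6 clean
balanced_train = undersample(imbalanced_train)  # 3 defective, 3 clean
```

Either training set could then be passed to any classifier, and the resulting scores evaluated with `auc` on a balanced or an imbalanced test set. Because this AUC is computed only from pairs of one defective and one non-defective module, it does not depend directly on the class ratio of the test set, whereas accuracy does.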