Xiao-Yuan Jing · Haowen Chen · Baowen Xu
Intelligent Software Defect Prediction
Xiao-Yuan Jing, School of Computer Science, Wuhan University, Wuhan, Hubei, China
Haowen Chen, School of Computer Science, Wuhan University, Wuhan, Hubei, China
Baowen Xu, Computer Science & Technology, Nanjing University, Nanjing, Jiangsu, China
ISBN 978-981-99-2841-5    ISBN 978-981-99-2842-2 (eBook)
https://doi.org/10.1007/978-981-99-2842-2
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Paper in this product is recyclable.
Preface
As software grows in complexity and interdependency, software products may suffer from low quality, high cost, poor maintainability, and the occurrence of defects. A software defect usually produces incorrect or unexpected results and causes the software to behave in unintended ways. Software defect prediction (SDP) is one of the most active research fields in software engineering and plays an important role in software quality assurance. Guided by SDP results, developers can subsequently conduct defect localization and repair under reasonable resource allocation, which helps reduce maintenance cost.

The early task of SDP is performed within a single project. Developers can make use of the well-labeled historical data of the currently maintained project to build the model and predict the defect-proneness of the remaining instances. This process is called within-project defect prediction (WPDP). However, annotating defect data (i.e., defective or defective-free) is time-consuming and costly, which makes it a hard task for practitioners in the development or maintenance cycle. To solve this problem, researchers consider introducing other projects with sufficient historical data to conduct cross-project defect prediction (CPDP), which has received extensive attention in recent years. As a special case of CPDP, heterogeneous defect prediction (HDP) refers to the scenario in which training and test data have different metrics, which relaxes the restriction on source and target projects' metrics. Besides, there also exist other research questions of SDP to be further studied, such as cross-version defect prediction, just-in-time (JIT) defect prediction, and effort-aware JIT defect prediction.

In the past few decades, more and more researchers have paid attention to SDP, and many intelligent SDP techniques have been presented. To obtain high-quality representations of defect data, machine learning techniques such as dictionary learning, semi-supervised learning, multi-view learning, and deep learning have been applied to SDP problems. Besides, transfer learning techniques are also used to eliminate the divergence between different project data in the CPDP scenario. Therefore, the combination with machine learning techniques is conducive to improving prediction efficiency and accuracy, which promotes significant progress in intelligent SDP research.
We drafted this book to provide a comprehensive picture of the current state of SDP research, rather than to improve or compare existing SDP approaches. More specifically, this book introduces a range of machine learning-based SDP approaches proposed for different scenarios (i.e., WPDP, CPDP, and HDP). Besides, this book also provides deep insight into the performance of current SDP approaches and the lessons learned for further SDP research. This book is mainly applicable to graduate students, researchers who work in or have interest in the areas of SDP, and developers who are responsible for software maintenance.

Wuhan, China
December 2022
Xiao-Yuan Jing
Haowen Chen
Acknowledgments
We thank Li Zhiqiang, Wu Fei, Wang Tiejian, Zhang Zhiwu, and Sun Ying from Wuhan University for their contributions to this research. We would like to express our heartfelt gratitude to Professor Baowen Xu and his team from Nanjing University for their selfless technical assistance in the compilation of this book. We are also thankful for the invaluable help and support provided by Professor Xiaoyuan Xie from Wuhan University, whose advice and guidance were crucial to the successful completion of this book.

We sincerely appreciate the unwavering support provided by Nanjing University, Wuhan University, and Nanjing University of Posts and Telecommunications, as well as the editing suggestions provided by Kamesh and Wei Zhu from Springer. Finally, we would like to express our heartfelt appreciation to two master students, Hanwei and Xiuting Huang, who participated in the editing process and made indelible contributions to the compilation of this book.
Contents

1 Introduction
  1.1 Software Quality Assurance
  1.2 Software Defect Prediction
  1.3 Research Directions of SDP
    1.3.1 Within-Project Defect Prediction (WPDP)
    1.3.2 Cross-Project Defect Prediction (CPDP)
    1.3.3 Heterogeneous Defect Prediction (HDP)
    1.3.4 Other Research Questions of SDP
  1.4 Notations and Corresponding Descriptions
  1.5 Structure of This Book
  References
2 Machine Learning Techniques for Intelligent SDP
  2.1 Transfer Learning
  2.2 Deep Learning
  2.3 Other Techniques
    2.3.1 Dictionary Learning
    2.3.2 Semi-Supervised Learning
    2.3.3 Multi-View Learning
  References
3 Within-Project Defect Prediction
  3.1 Basic WPDP
    3.1.1 Dictionary Learning Based Software Defect Prediction
    3.1.2 Collaborative Representation Classification Based Software Defect Prediction
  3.2 Semi-supervised WPDP
    3.2.1 Sample-Based Software Defect Prediction with Active and Semi-supervised Learning
  References
4 Cross-Project Defect Prediction
  4.1 Basic CPDP
    4.1.1 Manifold Embedded Distribution Adaptation
  4.2 Class Imbalance Problem in CPDP
    4.2.1 An Improved SDA Based Defect Prediction Framework
  4.3 Semi-Supervised CPDP
    4.3.1 Cost-Sensitive Kernelized Semi-supervised Dictionary Learning
  References
5 Heterogeneous Defect Prediction
  5.1 Basic HDP
    5.1.1 Unified Metric Representation and CCA-Based Transfer Learning
  5.2 Class Imbalance Problem in HDP
    5.2.1 Cost-Sensitive Transfer Kernel Canonical Correlation Analysis
    5.2.2 Other Solutions
  5.3 Multiple Sources and Privacy Preservation Problems in HDP
    5.3.1 Multi-Source Selection Based Manifold Discriminant Alignment
    5.3.2 Sparse Representation Based Double Obfuscation Algorithm
  References
6 An Empirical Study on HDP Approaches
  6.1 Goal Question Metric (GQM) Based Research Methodology
    6.1.1 Major Challenges
    6.1.2 Review of Research Status
    6.1.3 Analysis on Research Status
    6.1.4 Research Goal
    6.1.5 Research Questions
    6.1.6 Evaluation Metrics
  6.2 Experiments
    6.2.1 Datasets
    6.2.2 SDP Approaches for Comparisons
    6.2.3 Experimental Design
    6.2.4 Experimental Results
  6.3 Discussions
  References
7 Other Research Questions of SDP
  7.1 Cross-Version Defect Prediction
    7.1.1 Methodology
    7.1.2 Experiments
    7.1.3 Discussions
  7.2 Just-in-Time Defect Prediction
    7.2.1 Methodology
    7.2.2 Experiments
    7.2.3 Discussions
  7.3 Effort-Aware Just-in-Time Defect Prediction
    7.3.1 Methodology
    7.3.2 Experiments
    7.3.3 Discussions
  References
8 Conclusion
  8.1 Conclusion
Chapter 1
Introduction
1.1 Software Quality Assurance

With the increasing pressure of expediting software projects that are always growing in size and complexity to meet rapidly changing business needs, quality assurance activities such as fault prediction models have become extremely important. The main purpose of a fault prediction model is the effective allocation or prioritization of quality assurance effort (test effort and code inspection effort). Construction of these prediction models is mostly dependent on historical or previous software project data, referred to as a dataset.

However, a prevalent problem in data mining is the skewness of a dataset, and fault prediction datasets are not excluded from this phenomenon. In most datasets the majority of instances are clean (not faulty), while conventional learning methods are primarily designed for balanced datasets. Common classifiers such as Neural Networks (NN), Support Vector Machines (SVM), and decision trees work best toward optimizing their objective functions, which leads to the maximum overall accuracy, that is, the ratio of correctly predicted instances to the total number of instances. Training a classifier on an imbalanced dataset will most likely produce a classifier that tends to over-predict the presence of the majority class and has a lower probability of predicting the minority (faulty) modules. When the model does predict the minority class, it often has a higher error rate than for the majority class. This affects the prediction performance of classifiers, and in machine learning this issue is known as learning from imbalanced datasets.

Several methods have been proposed in machine learning for dealing with the class imbalance issue, such as random over- and under-sampling, creating synthetic data, application of cleaning techniques for data sampling, and cluster-based sampling. Despite a significant amount of machine learning literature on imbalanced datasets, very few studies have tackled the problem in the area of fault prediction. The first of such studies, by Kamei et al. [1], showed that
sampling techniques improved the prediction performance of linear and logistic models, while the other two models (neural network and classification tree) did not perform better after applying the sampling techniques. Interestingly, sampling techniques applied to datasets during fault prediction are mostly evaluated in terms of Accuracy, AUC, F1-measure, and Geometric Mean Accuracy, to name a few; however, these measures ignore the effort needed to fix faults, that is, they do not distinguish between a predicted fault in a small module and a predicted fault in a large module. Nickerson et al. [2] conclude that, to evaluate the performance of classifiers on imbalanced datasets, accuracy or its inverse, error rate, should never be used. Chawla et al. [3] also allude to the conclusion that simple predictive accuracy might not be appropriate for an imbalanced dataset.

The goal of this research is to improve the prediction performance of fault-prone module prediction models by applying over- and under-sampling approaches to rebalance the numbers of fault-prone and non-fault-prone modules in the training dataset, and to find the distribution or proportion of faulty and non-faulty modules that results in the best performance. The experiment focuses on the use of Norm(Popt), an effort-aware measure proposed by Kamei et al. [4], to evaluate the effect of over/under sampling on prediction models and to find out whether over/under sampling is still effective in a more realistic setting.
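To make the rebalancing step concrete, the following is a minimal sketch of random over- and under-sampling on a labeled training set. It is an illustration only, not the exact procedure used in the studies cited above; the feature matrix X_train and label vector y_train are assumed placeholders.

```python
import numpy as np

def random_resample(X, y, strategy="over", rng=None):
    """Rebalance a binary training set by randomly duplicating minority
    instances (over-sampling) or dropping majority instances (under-sampling)."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)

    if strategy == "over":   # duplicate minority instances up to the majority size
        extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
        keep = np.concatenate([maj_idx, min_idx, extra])
    else:                    # "under": keep only a minority-sized sample of the majority
        sampled_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        keep = np.concatenate([sampled_maj, min_idx])

    rng.shuffle(keep)
    return X[keep], y[keep]

# Example: X_train is an (n_modules, n_metrics) array, y_train holds 0/1 defect labels.
# X_bal, y_bal = random_resample(X_train, y_train, strategy="over", rng=0)
```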
1.2 Software Defect Prediction

A defect is a flaw in a component or system that can cause it to fail to perform its desired function, that is, an incorrect statement or data definition. A defect, if encountered during execution, may cause a failure of the system or a component. Defect prediction helps in identifying vulnerabilities in the project plan in terms of lack of resources, improperly defined timelines, predictable defects, etc. It can help organizations gain large benefits without delaying planned schedules or overrunning budget estimates, and it helps in modifying parameters in order to meet schedule variations. Methods used to estimate software defects include regression, genetic programming, clustering, neural networks, the statistical technique of discriminant analysis, dictionary learning, hybrid attribute selection, classification, attribute selection and instance filtering, Bayesian belief networks, K-means clustering, and association rule mining.

In the domain of software defect prediction, many prediction models have been developed. These models mostly fall into two classes. One class applies in the later period of the software life cycle (the testing phase): having obtained defect data, it predicts how many defects remain in the software from these data. Models in this class include the capture-recapture based model, the neural network based model, and measurement methods based on the scale and complexity of source code. The other class, which applies before the software development phase, aims to predict the number of defects that will arise during the software development process by analyzing defect data from previous projects. Presently, published models in this class include the phase based
model proposed by Gaffney and Davis, the Ada programming defect prediction model proposed by Agresti and Evanco, the early prediction model proposed by the USA ROME lab, the software development early prediction method proposed by Carol Smidts at Maryland University, and the early fuzzy neural network based model. However, there are a number of serious theoretical and practical problems in these methods. Software development is an extremely complicated process, and defects relate to many factors. To measure exactly, one would have to consider as many correlated factors as possible, which would make the model more complicated; to keep the model solvable, one would have to simplify it, which in turn would not yield convincing answers. The neural network based prediction model, for instance, has many problems in training and in collecting verification samples. Software testing in many organizations is still at a primitive stage, so much software can hardly provide the requested defect numbers, which brings certain difficulties to sample collection. Early models consider the uncertain factors in the software development process inadequately and, in addition, depend heavily on data factors. Therefore, many of these methods are difficult to apply in practice.
1.3 Research Directions of SDP

1.3.1 Within-Project Defect Prediction (WPDP)

In within-project defect prediction, some defect data from a project are used as the training set to build the prediction model, and the remaining data from the same project are used as the test set to evaluate the performance. At present, researchers mainly use machine learning algorithms to construct within-project defect prediction models. In addition, how to optimize the data structure and extract effective features is also a focus of current research. Some important research works are summarized below.

Elish et al. [5] use support vector machines (SVM) to conduct defect prediction and compare the predictive performance with eight statistical and machine learning models on four NASA datasets. Lu et al. [6] leverage active learning to predict defects, and they also use feature compression techniques to perform feature reduction on defect data. Li et al. [7] propose a novel semi-supervised learning method, ACoForest, which can sample the modules that are most helpful for learning. Rodriguez et al. [8] compare different methods for different data preprocessing problems, such as sampling methods, cost-sensitive methods, ensemble methods, and hybrid methods; the experimental results show that these methods can effectively improve the accuracy of defect prediction after handling the class imbalance. Seiffert et al. [9] analyze 11 different algorithms and seven different data sampling techniques and find that class imbalance and data noise have a negative impact on prediction performance.
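As a concrete illustration of the basic WPDP workflow described above, the sketch below trains an SVM on part of a single project's labeled data and evaluates it on the held-out remainder. The dataset files are placeholder assumptions, and the choice of classifier and split ratio is illustrative rather than a reproduction of any study cited here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score, roc_auc_score

# Placeholder: X holds module-level metrics, y holds 0/1 defect labels for ONE project.
X, y = np.load("project_metrics.npy"), np.load("project_labels.npy")

# Within-project split: train on part of the project's history, test on the rest.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_tr)          # fit scaling on training data only
clf = SVC(kernel="rbf", probability=True).fit(scaler.transform(X_tr), y_tr)

proba = clf.predict_proba(scaler.transform(X_te))[:, 1]
pred = (proba >= 0.5).astype(int)
print("F1 =", f1_score(y_te, pred), "AUC =", roc_auc_score(y_te, proba))
```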
1.3.2 Cross-Project Defect Prediction (CPDP)

When data are insufficient or non-existent for building quality defect predictors, software engineers can use data from other organizations or projects. This is called cross-project defect prediction (CPDP). Acquiring data from other sources is a non-trivial task when data owners are concerned about confidentiality. In practice, extracting project data from organizations is often difficult due to the business sensitivity associated with the data. For example, in a keynote address at ESEM'11, Elaine Weyuker doubted that she would ever be able to release the AT&T data she used to build defect predictors [10]. Due to similar privacy concerns, we were only able to add seven records from two years of work to our NASA-wide software cost metrics repository [11]. In a personal communication, Barry Boehm stated that he was able to publish less than 200 cost estimation records even after 30 years of COCOMO effort.

To enable sharing, we must assure confidentiality. In our view, confidentiality is the next grand challenge for CPDP in software engineering. In previous work, we allowed data owners to generate minimized and obfuscated versions of their original data. Our MORPH algorithm [12] reflects on the boundary between an instance and its nearest instance of another class, and MORPH's restricted mutation policy never pushes an instance across that boundary. MORPH can be usefully combined with the CLIFF data minimization algorithm [13]. CLIFF is an instance selector that returns a subset of instances that best predict for the target class. Previously we reported that this combination of CLIFF and MORPH resulted in 7/10 defect datasets studied retaining high privacy scores while remaining useful for CPDP [13]. This is a startling result, since research by Grechanik et al. [14] and Brickell et al. [15] showed that standard privacy methods increase privacy while decreasing data mining efficacy. While useful, CLIFF and MORPH only considered a single-party scenario where each data owner privatized their data individually without considering privatized data from others. This resulted in privatized data that were directly proportional in size (number of instances) to the original data. Therefore, when the original data are small enough, any minimization might be meaningless; but when the original data are large, minimization may not be enough to matter in practice.
1.3.3 Heterogeneous Defect Prediction (HDP)

Existing CPDP approaches are based on the underlying assumption that both source and target project data exhibit the same data distribution or are drawn from the same feature space (i.e., the same software metrics). When the distribution of the data changes, or when the metric features of source and target projects are different, one cannot expect the resulting prediction performance to be satisfactory. We refer to these scenarios as Heterogeneous Cross-Project Defect Prediction (HCPDP). Moreover, software defect datasets are mostly imbalanced, which
means that the number of defective modules is usually much smaller than that of defective-free modules. The imbalanced nature of the data can cause poor prediction performance: the probability of detecting defective modules can be low even while the overall performance appears high. Without taking this issue into account, the effectiveness of software defect prediction in many real-world tasks would be greatly reduced.

Recently, some researchers have noticed the importance of these problems in software defect prediction. For example, Nam et al. [16] used metric selection and metric matching to select similar metrics for building a prediction model with heterogeneous metric sets; they discarded dissimilar metrics, which may contain useful information for training. Jing et al. [17] introduced Canonical Correlation Analysis (CCA) into HCPDP by constructing a common correlation space to associate cross-project data; one can then simply project the source and target project data into this space for defect prediction. Like previous CPDP methods, however, this approach did not take the class imbalance problem of software defect datasets into account. Ryu et al. [18] designed the Value-Cognitive Boosting with Support Vector Machine (VCB-SVM) algorithm, which exploits sampling techniques to solve the class imbalance issue for cross-project environments. Nevertheless, a sampling strategy alters the distribution of the original data and may discard some potentially useful samples that could be important for the prediction process. Therefore, these methods are not good solutions for addressing the class imbalance issue under heterogeneous cross-project environments.
1.3.4 Other Research Questions of SDP

1.3.4.1 Cross-Version Defect Prediction
Cross-Version Defect Prediction (CVDP) is a practical scenario in which the classification model is trained on the historical data of a prior version and then used to predict the defect labels of modules in the current version. Bennin et al. [19] evaluated the defect prediction performance of 11 basic classification models in the IVDP and CVDP scenarios with an effort-aware indicator. They conducted experiments on 25 projects (each with two versions and process metrics) and found that the optimal models for the two defect prediction scenarios are not identical due to the different training data, although the performance differences among the 11 models are not significant in either scenario. Premraj et al. [20] investigated the impacts of code and network metrics on the defect prediction performance of six classification models. They considered three scenarios, including IVDP, CVDP, and CPDP, where CPDP uses the defect data of another project as the training set. Experiments on three projects (each with two versions) suggested that the network metrics are better than the code metrics in most cases. Holschuh et al. [21] explored the performance of CVDP on a large software system by collecting four types of metrics. The experiments on six projects (each with three versions) showed that the overall performance is unsatisfactory. Monden et al. [22] evaluated the cost effectiveness of defect
prediction with three classification models by comparing seven test effort allocation strategies. The results on one project with five versions revealed that the reduction of test effort relies on an appropriate test strategy. Khoshgoftaar et al. [23] studied the performance of six classification models on one project with four versions and found that the CART model with least absolute deviation performed the best. Zhao et al. [24] investigated the relationship between context-based cohesion metrics and defect-proneness in the IVDP and CVDP scenarios. They conducted a CVDP study on four projects with 19 versions in total and found that context-based cohesion metrics have negative impacts on defect prediction performance but can be complementary to non-context-based metrics. Yang et al. [25] surveyed the impacts of code, process, and slice-based cohesion metrics on defect prediction performance in the IVDP, CVDP, and CPDP scenarios. They conducted a CVDP study on one project with seven versions and found that slice-based cohesion metrics have adverse impacts on defect prediction performance but can be complementary to the commonly used metrics. Wang et al. [26] explored the performance of their proposed semantic metrics on defect prediction in the CVDP and CPDP scenarios. The experiments on ten projects with 26 versions showed the superiority of the semantic metrics compared with traditional CK metrics and AST metrics.
1.3.4.2 Just-in-Time Defect Prediction
Just-in-time defect prediction aims to predict whether a particular file involved in a commit (i.e., a change) is buggy or not. Traditional just-in-time defect prediction techniques typically follow these steps:

Training Data Extraction. For each change, label it as buggy or clean by mining a project's revision history and issue tracking system. A buggy change contains one or more bugs, while a clean change contains none.

Feature Extraction. Extract the values of various features from each change. Many different features have been used in past change classification studies.

Model Learning. Build a model by using a classification algorithm based on the labeled changes and their corresponding features.

Model Application. For a new change, extract the values of the various features and input them to the learned model to predict whether the change is buggy or clean.

The studies by Kamei et al. [32] are a great source of inspiration for our work. They proposed a just-in-time quality assurance technique that predicts defects at commit level, trying to reduce the effort of a reviewer. Later on, they also evaluated how just-in-time models perform in the context of cross-project defect prediction [19]. Findings report good accuracy for the models not only in terms of precision and recall but also in terms of saved inspection effort. Our work is complementary to these papers: in particular, we start from their basis of detecting defective commits and complement this model with the attributes necessary to filter only those files that are defect-prone and should be more thoroughly reviewed. Yang et al. [25] proposed
the usage of alternative techniques for just-in-time quality assurance, such as cached history, deep learning, and textual analysis, reporting promising results. We did not investigate these further in the current chapter, but studies can be designed and carried out to determine if and how these techniques can be used within the model we present in this chapter to further increase its accuracy.
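The four steps above map naturally onto a small pipeline. The sketch below is a simplified, hypothetical illustration of commit-level (just-in-time) prediction using a handful of common change metrics; the metric names and the CSV file are assumptions, and real studies mine the version control and issue tracking systems to produce such a table.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Steps 1-2 (assumed already done): each row is one commit with change metrics
# such as lines added/deleted, number of files touched, developer experience,
# plus a 0/1 label obtained by linking commits to bug-fix history.
changes = pd.read_csv("commit_features.csv")          # hypothetical file
feature_cols = ["la", "ld", "nf", "entropy", "exp"]   # hypothetical metric names
X, y = changes[feature_cols].values, changes["buggy"].values

# Step 3: model learning.
model = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

# Step 4: model application to a new, unlabeled commit.
model.fit(X, y)
new_commit = [[120, 30, 4, 0.8, 15]]                  # illustrative feature values
print("P(buggy):", model.predict_proba(new_commit)[0, 1])
```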
1.3.4.3 Effort-Aware Defect Prediction
Traditional SDP models based on binary classification algorithms are not sufficient for software testing in practice, since they do not distinguish between a module with many defects or a high defect density (i.e., number of defects divided by lines of source code) and a module with a small number of defects or a low defect density. Clearly, the two modules require different amounts of effort to inspect and fix, yet they are considered equal and allocated the same testing resources. Therefore, Mende et al. [27] proposed effort-aware defect prediction (EADP) models to rank software modules based on the possibility of these modules being defective, their predicted number of defects, or their defect density. Generally, EADP models are constructed by using learning to rank techniques [28]. These techniques can be grouped into three categories, namely the pointwise approach, the pairwise approach, and the listwise approach [29–31]. There exists a vast variety of learning to rank algorithms in the literature, so it is important to empirically and statistically compare the impact and effectiveness of different learning to rank algorithms for EADP. To the best of our knowledge, few prior studies [32–36] evaluated and compared existing learning to rank algorithms for EADP, and most of them considered only a handful of learning to rank algorithms across a small number of datasets. Previous studies [34–36] used at most five EADP models and few datasets. For example, Jiang et al. [34] investigated the performance of only five classification-based pointwise algorithms for EADP on two NASA datasets. Nguyen et al. [36] investigated three regression-based pointwise algorithms and two pairwise algorithms for EADP on five Eclipse CVS datasets.
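To illustrate the effort-aware idea with a minimal pointwise example, the sketch below ranks modules by predicted defect density (predicted defects per line of code) so that limited inspection effort is spent on the densest modules first. The regression model, metrics, and data files are assumptions, not a reimplementation of any EADP algorithm surveyed above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder arrays: module metrics, known defect counts (training), and module sizes (LOC).
X_train, defects_train = np.load("train_metrics.npy"), np.load("train_defects.npy")
X_new, loc_new = np.load("new_metrics.npy"), np.load("new_loc.npy")

# Pointwise approach: regress the number of defects per module.
reg = GradientBoostingRegressor(random_state=0).fit(X_train, defects_train)
pred_defects = np.clip(reg.predict(X_new), 0, None)

# Effort-aware ranking: inspect modules with the highest predicted defect density first.
density = pred_defects / np.maximum(loc_new, 1)
inspection_order = np.argsort(-density)
print("Inspect modules in this order:", inspection_order[:10])
```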
1.4 Notations and Corresponding Descriptions

We briefly introduce some of the symbols and abbreviations that appear in this book, as listed in Table 1.1. Symbols and abbreviations that are not listed in the table are described in detail in the corresponding text.
Table 1.1 Symbols and corresponding descriptions

Symbol/Abbreviation      Description
SDP                      Software Defect Prediction
WPDP                     Within-Project Defect Prediction
HCCDP                    Heterogeneous Cross-Company Defect Prediction
CPDP                     Cross-Project Defect Prediction
HDP                      Heterogeneous Defect Prediction
CCA                      Canonical Correlation Analysis
TKCCA                    Transfer Kernel Canonical Correlation Analysis
CTKCCA                   Cost-sensitive Transfer Kernel Canonical Correlation Analysis
GQM                      Goal Question Metric
ROC                      Receiver Operating Characteristic
MDA                      Manifold Embedded Distribution Adaptation
SDA                      Subclass Discriminant Analysis
⇒                        The left side of "⇒" represents the source company data and the right side represents the target company data
a = [a1, a2, ..., an]    a is a vector, and ai is its ith component
|a|                      The length of a vector
∈                        An element belongs to a set
tr(·)                    The trace of a matrix
1.5 Structure of This Book

In the second chapter of this book, several common learning algorithms and their applications in software defect prediction are briefly introduced, including deep learning, transfer learning, dictionary learning, semi-supervised learning, and multi-view learning.

In Chap. 3, we discuss within-project defect prediction. We first introduce basic WPDP, including dictionary learning based software defect prediction and collaborative representation classification based software defect prediction, and then introduce sample-based software defect prediction with active and semi-supervised learning, which belongs to semi-supervised WPDP.

In Chap. 4, we expound several methodologies for cross-project defect prediction. For basic CPDP, we introduce manifold embedded distribution adaptation; for the class imbalance problem in CPDP, we propose an improved SDA based defect prediction framework; finally, for semi-supervised CPDP, we introduce cost-sensitive kernelized semi-supervised dictionary learning.

In Chap. 5, we introduce heterogeneous defect prediction (HDP). We first explain unified metric representation and CCA-based transfer learning for basic HDP; then, for the class imbalance problem in HDP, we introduce cost-sensitive transfer kernel canonical correlation analysis. Finally, regarding the multiple sources and privacy preservation problems in HDP, we introduce multi-source selection based
manifold discriminant alignment and sparse representation based double obfuscation algorithm.

In Chap. 6, an empirical study on HDP approaches is introduced, including heterogeneous defect prediction and a Goal Question Metric (GQM) based research methodology. Finally, in Chap. 7 of this book, we discuss other research questions of SDP, mainly including cross-version defect prediction, just-in-time defect prediction, and effort-aware just-in-time defect prediction.
References 1. Kamei Y, Monden A, Matsumoto S, Kakimoto T, Matsumoto KI (2007) The Effects of Over and Under Sampling on Fault-prone Module Detection. In: Proceedings of the First International Symposium on Empirical Software Engineering and Measurement, pp 196–204. https://doi.org/10.1109/ESEM.2007.28 2. Nickerson A, Japkowicz N, Milios EE (2001) Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics. http://www.gatsby.ucl.ac.uk/aistats/aistats2001/files/ nickerson155.ps 3. Chawla NV (2010) Data Mining for Imbalanced Datasets: An Overview. In: Proceedings of the Data Mining and Knowledge Discovery Handbook, pp 875–886. https://doi.org/10.1007/ 978-0-387-09823-4_45 4. Kamei Y, Matsumoto S, Monden A, Matsumoto K, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: Proceedings of the 26th IEEE International Conference on Software Maintenance, pp 1–10. https://doi.org/10.1109/ICSM. 2010.5609530 5. Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81(5):649–660. https://doi.org/10.1016/j.jss.2007.07.040 6. Lu H, Kocaguneli E, Cukic B (2014) Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction. In: Proceedings of the 25th IEEE International Symposium on Software Reliability Engineering, pp 312–322. https://doi.org/10.1109/ISSRE. 2014.35 7. Li M, Zhang H, Wu R, Zhou Z (2012) Sample-based software defect prediction with active and semi-supervised learning. Autom Softw Eng 19(2):201–230. https://doi.org/10.1007/s10515011-0092-1 8. Rodríguez D, Herraiz I, Harrison R, Dolado JJ, Riquelme JC (2014) Preliminary comparison of techniques for dealing with imbalance in software defect prediction. In: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, pp 43:1–43:10. https://doi.org/10.1145/2601248.2601294 9. Seiffert C, Khoshgoftaar TM, Hulse JV, Folleco A (2007) An Empirical Study of the Classification Performance of Learners on Imbalanced and Noisy Software Quality Data. In: Proceedings of the IEEE International Conference on Information Reuse and Integration, pp 651–658. https://doi.org/10.1109/IRI.2007.4296694 10. Weyuker EJ, Ostrand TJ, Bell RM (2008) Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empir Softw Eng 13(5):539–559. https:// doi.org/10.1007/s10664-008-9082-8 11. Menzies T, El-Rawas O, Hihn J, Feather MS, Madachy RJ, Boehm BW (2007) The business case for automated software engineering. In: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering ASE 2007, pp 303–312. https://doi.org/10. 1145/1321631.1321676
12. Peters F, Menzies T (2012) Privacy and utility for defect prediction: Experiments with MORPH. In: Proceedings of the 34th International Conference on Software Engineering, pp 189–199. https://doi.org/10.1109/ICSE.2012.6227194 13. Peters F, Menzies T, Gong L, Zhang H (2013) Balancing Privacy and Utility in Cross-Company Defect Prediction. IEEE Trans Software Eng 39(8):1054–1068. https://doi.org/10.1109/TSE. 2013.6 14. Grechanik M, Csallner C, Fu C, Xie Q (2010) Is Data Privacy Always Good for Software Testing?. In: Proceedings of the IEEE 21st International Symposium on Software Reliability Engineering, pp 368–377. https://doi.org/10.1109/ISSRE.2010.13 15. Brickell J, Shmatikov V (2008) The cost of privacy: destruction of data-mining utility in anonymized data publishing. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 70–78. https://doi.org/10.1145/ 1401890.1401904 16. Nam J, Kim S (2015) Heterogeneous defect prediction. In: Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, pp 508–519. https://doi.org/10.1145/2786805. 2786814 17. Jing X, Wu F, Dong X, Qi F, Xu B (2015) Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pp 496–507. https://doi.org/10. 1145/2786805.2786813 18. Ryu D, Choi O, Baik J (2016) Value-cognitive boosting with a support vector machine for cross-project defect prediction. Empir Softw Eng 21(1):43–71. https://doi.org/10.1007/ s10664-014-9346-4 19. Bennin KE, Toda K, Kamei Y, Keung J, Monden A, Ubayashi N (2016) Empirical Evaluation of Cross-Release Effort-Aware Defect Prediction Models. In: Proceedings of the 2016 IEEE International Conference on Software Quality, pp 214–221. https://doi.org/10.1109/QRS.2016. 33 20. Premraj R, Herzig K (2011) Network Versus Code Metrics to Predict Defects: A Replication Study. In: Proceedings of the 5th International Symposium on Empirical Software Engineering and Measurement, pp 215–224. https://doi.org/10.1109/ESEM.2011.30 21. Holschuh T, Pauser M, Herzig K, Zimmermann T, Premraj R, Zeller A (2009) Predicting defects in SAP Java code: An experience report. In: Proceedings of the 31st International Conference on Software Engineering, pp 172–181. https://doi.org/10.1109/ICSE-COMPANION. 2009.5070975 22. Monden A, Hayashi T, Shinoda S, Shirai K, Yoshida J, Barker M, Matsumoto K (2013) Assessing the Cost Effectiveness of Fault Prediction in Acceptance Testing. IEEE Trans Softw Eng 39(10):1345–1357. https://doi.org/10.1109/TSE.2013.21 23. Khoshgoftaar TM, Seliya N (2003) Fault Prediction Modeling for Software Quality Estimation: Comparing Commonly Used Techniques Empir. Softw Eng 8(3):255–283. https://doi.org/10. 1023/A:1024424811345 24. Zhao Y, Yang Y, Lu H, Liu J, Leung H, Wu Y, Zhou Y, Xu B (2017) Understanding the value of considering client usage context in package cohesion for fault-proneness prediction Autom. Softw Eng 24(2):393–453. https://doi.org/10.1007/s10515-016-0198-6 25. Yang Y, Zhou Y, Lu H, Chen L, Chen Z, Xu B, Leung HKN, Zhang Z (2015) Are Slice-Based Cohesion Metrics Actually Useful in Effort-Aware Post-Release Fault-Proneness Prediction? An Empirical Study IEEE Trans. Softw Eng 41(4):331–357. https://doi.org/10.1109/TSE.2014. 2370048 26. Wang S, Liu T, Tan L (2016) Automatically learning semantic features for defect prediction. 
In: Proceedings of the 38th International Conference on Software Engineering, pp 297–308. https://doi.org/10.1145/2884781.2884804 27. Mende T, Koschke R (2010) Effort-Aware Defect Prediction Models. In: Proceedings of the 14th European Conference on Software Maintenance and Reengineering, pp 107–116. https:// doi.org/10.1109/CSMR.2010.18
28. Wang F, Huang J, Ma Y (2018) A Top-k Learning to Rank Approach to Cross-Project Software Defect Prediction. In: Proceedings of the 25th Asia-Pacific Software Engineering Conference, pp 335–344. https://doi.org/10.1109/APSEC.2018.00048
29. Shi Z, Keung J, Bennin KE, Zhang X (2018) Comparing learning to rank techniques in hybrid bug localization. Appl Soft Comput 62:636–648. https://doi.org/10.1016/j.asoc.2017.10.048
30. Liu T (2010) Learning to rank for information retrieval. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 904. https://doi.org/10.1145/1835449.1835676
31. Yu X, Li Q, Liu J (2019) Scalable and parallel sequential pattern mining using spark. World Wide Web 22(1):295–324. https://doi.org/10.1007/s11280-018-0566-1
32. Bennin KE, Toda K, Kamei Y, Keung J, Monden A, Ubayashi N (2016) Empirical Evaluation of Cross-Release Effort-Aware Defect Prediction Models. In: Proceedings of the 2016 IEEE International Conference on Software Quality, pp 214–221. https://doi.org/10.1109/QRS.2016.33
33. Yang X, Wen W (2018) Ridge and Lasso Regression Models for Cross-Version Defect Prediction. IEEE Trans Reliab 67(3):885–896. https://doi.org/10.1109/TR.2018.2847353
34. Jiang Y, Cukic B, Ma Y (2008) Techniques for evaluating fault prediction models. Empir Softw Eng 13(5):561–595. https://doi.org/10.1007/s10664-008-9079-3
35. Mende T, Koschke R (2009) Revisiting the evaluation of defect prediction models. In: Proceedings of the 5th International Workshop on Predictive Models in Software Engineering, pp 7. https://doi.org/10.1145/1540438.1540448
36. Nguyen TT, An TQ, Hai VT, Phuong TM (2014) Similarity-based and rank-based defect prediction. In: Proceedings of the 2014 International Conference on Advanced Technologies for Communications (ATC 2014), pp 321–325.
Chapter 2
Machine Learning Techniques for Intelligent SDP
Abstract In this chapter, several common learning algorithms and their applications in software defect prediction are briefly introduced, including deep learning, transfer learning, dictionary learning, semi-supervised learning, and multi-view learning.
2.1 Transfer Learning

In many real-world applications, it is expensive or impossible to recollect the needed training data and rebuild the models, so it would be desirable to reduce the need and effort to recollect training data. In such cases, transfer learning (TL) between task domains is attractive. Transfer learning exploits the knowledge gained from a previous task to improve generalization on another related task. It can be useful when there is not enough labeled data for the new problem or when the computational cost of training a model from scratch is too high.

Traditional data mining and machine learning algorithms make predictions on future data using statistical models that are trained on previously collected labeled or unlabeled training data, and most of them assume that the distributions of the labeled and unlabeled data are the same. Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and testing to be different; it is used to improve a learner in one domain by transferring information from a related domain. Research on transfer learning has attracted more and more attention since 1995, and transfer learning methods now appear in several top venues, most notably in data mining and in applications of machine learning and data mining.

Due to their strong domain adaptation ability, researchers have introduced TL techniques to cross-project or heterogeneous defect prediction in recent years. The application of TL in cross-project defect prediction (CPDP) aims to reduce the distribution difference between source and target data. For example, Nam et al. [1] proposed a new CPDP method called TCA+, which extends transfer component analysis (TCA) by introducing a set of rules for selecting an appropriate normalization method to obtain better CPDP performance. Krishna and Menzies [2] introduced a baseline method named Bellwether for cross-project defect prediction
based on existing CPDP methods. For heterogeneous defect prediction (HDP), TL techniques are applied not only to reduce the distribution difference between source and target data but also to eliminate the heterogeneity of metrics between source and target projects. Jing et al. [3] proposed an HDP method named CCA+, which uses the canonical correlation analysis (CCA) technique and the unified metric representation (UMR) to find the latent common feature space between the source and target projects. Specifically, the UMR is made of three kinds of metrics, including the common metrics of the source and target data, source-specific metrics, and target-specific metrics. Based on UMR, the transfer learning method based on CCA is introduced to find common metrics by maximizing the canonical correlation coefficient between source and target data.
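As a minimal, hypothetical illustration of the normalization-based flavor of transfer used in CPDP (and not a reimplementation of TCA+, Bellwether, or CCA+), the sketch below standardizes the source and target projects with their own statistics before training on the source and predicting on the target. The data files are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def zscore(X):
    """Standardize each metric within its own project to reduce distribution shift."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

# Placeholders: labeled source project and unlabeled target project with the SAME metrics.
X_src, y_src = np.load("source_metrics.npy"), np.load("source_labels.npy")
X_tgt = np.load("target_metrics.npy")

# Normalize each project separately (one of the options a TCA+-style rule set may pick),
# then train on the source and predict defect-proneness for the target modules.
clf = LogisticRegression(max_iter=1000).fit(zscore(X_src), y_src)
target_scores = clf.predict_proba(zscore(X_tgt))[:, 1]
print("Predicted defect probability of first 5 target modules:", target_scores[:5])
```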
2.2 Deep Learning

Deep learning (DL) is an extension of prior work on neural networks, where the "deep" refers to the use of multiple layers in the network. In the 1960s and 1970s, it was found that very simple neural nets can be poor classifiers unless they are extended with (a) extra layers between inputs and outputs and (b) a nonlinear activation function controlling links from inputs to a hidden layer (which can be very wide) to an output layer. Essentially, deep learning is a modern variation on the above, concerned with a potentially unbounded number of layers of bounded size. In the last century, most neural networks used the "sigmoid" activation function f(x) = 1/(1 + e^{-x}), which was subpar to other learners in several tasks. It was only when the ReLU activation function f(x) = max(0, x) was introduced by Nair and Hinton [4] that their performance increased dramatically, and they became popular.

With its strong representation learning ability, deep learning technology has quickly gained favor in the field of software engineering. In software defect prediction (SDP), researchers began to use DL techniques to extract deep features of defect data. Wang et al. [6] first introduced the Deep Belief Network (DBN) [5], which learns semantic features that are then fed to classical learners to perform defect prediction. In this approach, for each file in the source code, they extract tokens, disregarding ones that do not affect the semantics of the code, such as variable names. These tokens are vectorized and given unique numbers, forming a vector of integers for each source file. Wen et al. [7] utilized a Recurrent Neural Network (RNN) to encode features from sequence data automatically. They propose a novel approach called FENCES, which extracts six types of change sequences covering different aspects of software changes via fine-grained change analysis. It approaches defect prediction by mapping it to a sequence labeling problem solvable by RNN.
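To make the token-vectorization step concrete, below is a small, hypothetical sketch of turning source-file token sequences into fixed-length integer vectors of the kind a DBN- or RNN-based model could consume. The token lists are illustrative, and this is not the exact preprocessing of the approaches cited above.

```python
# Each source file is represented by its parsed tokens (AST node types, method calls, etc.).
files_tokens = [
    ["if", "for", "Method:open", "Method:read"],
    ["while", "Method:close", "if"],
]

# Build a vocabulary that maps every distinct token to a unique positive integer.
vocab = {}
for tokens in files_tokens:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)

def encode(tokens, max_len=10):
    """Map tokens to integers and pad/truncate to a fixed length (0 = padding)."""
    ids = [vocab[tok] for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))

encoded = [encode(tokens) for tokens in files_tokens]
print(encoded)   # integer vectors ready to feed a DBN/RNN-style feature learner
```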
2.3 Other Techniques

2.3.1 Dictionary Learning

Both sparse representation and dictionary learning have been successfully applied to many application fields, including image clustering, compressed sensing, and image classification tasks. In sparse representation based classification, the dictionary for sparse coding can be predefined. For example, Wright et al. [8] directly used the training samples of all classes as the dictionary to code a query face image and classified the query image by evaluating which class leads to the minimal reconstruction error. However, the dictionary in this method may not be effective enough to represent the query images, due to the uncertain and noisy information in the original training images. In addition, the number of dictionary atoms made up of image samples can be very large, which increases the coding complexity.

Dictionary learning (DL) aims to learn, from the training samples, a space in which a given signal can be well represented or coded for processing. Most DL methods attempt to learn a common dictionary shared by all classes as well as a classifier over the coefficients for classification. Usually, the dictionary can be constructed by directly using the original training samples, but the original samples contain much redundancy and noise, which are adverse to prediction. To further improve the classification ability, DL techniques have recently been adopted in SDP tasks to represent project modules well. For example, Jing et al. [14] were the first to apply dictionary learning technology to the field of software defect prediction and proposed a cost-sensitive discriminative dictionary learning (CDDL) approach. Specifically, CDDL introduces misclassification costs and builds an over-complete dictionary for the software project modules.
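The following is a minimal sketch of the sparse representation based classification idea described above: code a query module over a dictionary built from the training modules themselves and assign it to the class with the smallest reconstruction error. It is a generic illustration (not CDDL); the data are synthetic placeholders, and scikit-learn's SparseCoder with OMP is used for the sparse coding step.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
# Placeholder training modules: 20 clean (class 0) and 20 defective (class 1), 10 metrics each.
X_train = rng.normal(size=(40, 10))
y_train = np.array([0] * 20 + [1] * 20)
query = X_train[3] + 0.05 * rng.normal(size=10)        # a query module to classify

# Dictionary atoms are the (normalized) training modules themselves.
D = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
coder = SparseCoder(dictionary=D, transform_algorithm="omp", transform_n_nonzero_coefs=5)
code = coder.transform(query.reshape(1, -1))[0]        # sparse coefficients over all atoms

# Classify by the class whose atoms reconstruct the query with minimal error.
errors = {c: np.linalg.norm(query - code[y_train == c] @ D[y_train == c])
          for c in (0, 1)}
print("Predicted class:", min(errors, key=errors.get), errors)
```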
2.3.2 Semi-Supervised Learning

Due to the lack of labeled data, semi-supervised learning (SSL) has always been a hot topic in machine learning, and a myriad of SSL methods have been proposed. For example, co-training is a well-known disagreement-based SSL method, which trains different learners to exploit unlabeled data. Pseudo-label style methods label unlabeled data with pseudo labels. Graph-based methods construct a similarity graph through which label information propagates to unlabeled nodes. Local smoothness regularization-based methods represent another widely recognized category of SSL techniques, which leverage the inherent structure of the data to improve learning accuracy; different methods apply different regularizers, such as Laplacian regularization, manifold regularization, and virtual adversarial regularization. For example, Miyato et al. [11] proposed a smooth regularization method called virtual adversarial training, which enables the model
to output a smooth label distribution for local perturbations of a given input. Other popular methods exist as well, for example, the Ladder Network.

Since large amounts of unlabeled data exist in software projects, many SSL techniques have been considered for SDP tasks. Wang et al. [9] proposed a non-negative sparse-based SemiBoost learning approach for software defect prediction. Benefiting from the idea of semi-supervised learning, this approach is capable of exploiting both labeled and unlabeled data and is formulated in a boosting framework. Besides, Zhang et al. [10] used a graph-based semi-supervised learning technique to predict software defects; this approach utilizes not only the few labeled data but also the abundant unlabeled data to improve generalization capability.
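A minimal sketch of the pseudo-label idea mentioned above (not a reimplementation of the SemiBoost or label propagation approaches cited here): train on the labeled modules, pseudo-label the unlabeled ones on which the model is most confident, and retrain. The data arrays are assumed placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholders: a small labeled set and a large unlabeled set from the same project.
X_lab, y_lab = np.load("labeled_X.npy"), np.load("labeled_y.npy")
X_unlab = np.load("unlabeled_X.npy")

model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

for _ in range(3):                                   # a few self-training rounds
    proba = model.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.9              # keep only high-confidence pseudo-labels
    if not confident.any():
        break
    X_aug = np.vstack([X_lab, X_unlab[confident]])
    y_aug = np.concatenate([y_lab, proba.argmax(axis=1)[confident]])
    model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```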
2.3.3 Multi-View Learning

Representation learning is a prerequisite step in many multi-view learning tasks. In recent years, a variety of classical multi-view representation learning methods have been proposed. These methods follow the previously presented taxonomy, that is, joint representation, alignment representation, and shared-and-specific representation. For example, based on a Markov network, Chen et al. [12] presented a large-margin predictive multi-view subspace learning method, which jointly learns features from multiple views. Jing et al. [13] proposed an intra-view and inter-view supervised correlation analysis method for image classification, in which CCA was applied to align multi-view features. Deep multi-view representation learning works also follow the joint representation, alignment representation, and shared-and-specific representation paradigm. For example, Kan et al. [14] proposed a multi-view deep network for cross-view classification. This network first extracts view-specific features with a sub-network and then concatenates and feeds these features into a common network, which projects them into one uniform space. Harwath et al. [15] presented an unsupervised audiovisual matchmap neural network, which applies a similarity metric and a pairwise ranking criterion to align visual objects and spoken words. Hu et al. [16] introduced a sharable and individual multi-view deep metric learning method. It leverages view-specific networks to extract individual features from each view and employs a common network to extract shared features from all views.
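For illustration, the following hedged sketch aligns two hypothetical feature views with CCA, in the spirit of the alignment-representation methods discussed above; the generated data and dimensionalities are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Two hypothetical views of the same 200 objects (e.g., two metric sets).
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 5))     # latent structure shared by both views
view1 = shared @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(200, 30))
view2 = shared @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(200, 40))

# CCA finds projections of each view that are maximally correlated,
# i.e., it aligns the two views in a common low-dimensional space.
cca = CCA(n_components=5)
z1, z2 = cca.fit_transform(view1, view2)

# Correlation of the first pair of canonical variates.
corr = np.corrcoef(z1[:, 0], z2[:, 0])[0, 1]
print(f"first canonical correlation: {corr:.3f}")
```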
References

1. Nam, Jaechang and Pan, Sinno Jialin and Kim, Sunghun. Transfer defect learning. In 35th International Conference on Software Engineering (ICSE), pages 382–391, 2013.
2. Krishna, Rahul and Menzies, Tim. Bellwethers: A baseline method for transfer learning. IEEE Transactions on Software Engineering, 45(11):1081–1105, 2018.
3. Jing, Xiaoyuan and Wu, Fei and Dong, Xiwei and Qi, Fumin and Xu, Baowen. Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 496–507, 2015.
4. Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML'10, 2010.
5. Hinton, Geoffrey E. Deep belief networks. Scholarpedia, 4(5):5947, 2009.
6. Wang, Song and Liu, Taiyue and Tan, Lin. Automatically learning semantic features for defect prediction. In IEEE/ACM 38th International Conference on Software Engineering (ICSE), pages 297–308, 2016.
7. Wen, Ming and Wu, Rongxin and Cheung, Shing-Chi. How well do change sequences predict defects? Sequence learning from software changes. IEEE Transactions on Software Engineering, 46(11):1155–1175, 2018.
8. Wright, John and Yang, Allen Y and Ganesh, Arvind and Sastry, S Shankar and Ma, Yi. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2008.
9. Wang, Tiejian and Zhang, Zhiwu and Jing, Xiaoyuan and Liu, Yanli. Non-negative sparse-based SemiBoost for software defect prediction. Software Testing, Verification and Reliability, 26(7):498–515, 2016.
10. Zhang, Zhi-Wu and Jing, Xiao-Yuan and Wang, Tie-Jian. Label propagation based semi-supervised learning for software defect prediction. Automated Software Engineering, 24(7):47–69, 2017.
11. Miyato, Takeru and Maeda, Shin-ichi and Koyama, Masanori and Ishii, Shin. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.
12. Chen, Ning and Zhu, Jun and Sun, Fuchun and Xing, Eric Poe. Large-margin predictive latent subspace learning for multiview data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2365–2378, 2012.
13. Jing, Xiao-Yuan and Hu, Rui-Min and Zhu, Yang-Ping and Wu, Shan-Shan and Liang, Chao and Yang, Jing-Yu. Intra-view and inter-view supervised correlation analysis for multi-view feature learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1882–1889, 2014.
14. Kan, Meina and Shan, Shiguang and Chen, Xilin. Multi-view deep network for cross-view classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4847–4855, 2016.
15. Harwath, David and Torralba, Antonio and Glass, James. Unsupervised learning of spoken language with visual context. In Advances in Neural Information Processing Systems, pages 1858–1866, 2016.
16. Hu, Junlin and Lu, Jiwen and Tan, Yap-Peng. Sharable and individual multi-view metric learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(9):2281–2288, 2017.
Chapter 3
Within-Project Defect Prediction
Abstract In order to improve the quality of a software system, software defect prediction aims to automatically identify defective software modules for efficient software testing. To predict software defects, classification methods based on static code attributes have attracted a great deal of attention, and in recent years machine learning techniques have been applied to defect prediction. Since there exists similarity among different software modules, one software module can be approximately represented by a small proportion of other modules, and the representation coefficients over a pre-defined dictionary, which consists of historical software module data, are generally sparse. We propose a cost-sensitive discriminative dictionary learning (CDDL) approach for software defect classification and prediction. The widely used datasets from NASA projects are employed as test data to evaluate the performance of all compared methods. Experimental results show that CDDL outperforms several representative state-of-the-art defect prediction methods.
3.1 Basic WPDP

3.1.1 Dictionary Learning Based Software Defect Prediction

3.1.1.1 Methodology
To fully exploit the discriminative information of training samples for improving the performance of classification, we design a supervised dictionary learning approach, which learns a dictionary that can represent the given software module more effectively. Moreover, the supervised dictionary learning can also reduce both the number of dictionary atoms and the sparse coding complexity. Instead of learning a shared dictionary for all classes, we learn a structured dictionary $D = [D_1, \ldots, D_i, \ldots, D_c]$, where $D_i$ is the class-specific sub-dictionary associated with class $i$, and $c$ is the total number of classes. We use the reconstruction error to do classification with such a dictionary $D$, as the SRC method does.
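The decision rule just described (assign a module to the class whose sub-dictionary reconstructs it with the smallest error) can be sketched as follows; the sub-dictionaries, the least-squares coder, and the data below are illustrative stand-ins, not the actual CDDL components.

```python
import numpy as np

def classify_by_reconstruction_error(y, sub_dicts, code_fn):
    """Assign y to the class whose sub-dictionary D_i reconstructs it best.

    sub_dicts: list of sub-dictionaries D_i with shape (n_features, n_atoms_i)
    code_fn:   a coding routine returning coefficients x_i of y over D_i
               (least squares here as a stand-in for sparse coding).
    """
    errors = []
    for D_i in sub_dicts:
        x_i = code_fn(D_i, y)
        errors.append(np.linalg.norm(y - D_i @ x_i))
    return int(np.argmin(errors))

# Minimal usage with hypothetical sub-dictionaries.
rng = np.random.default_rng(0)
D_defective = rng.normal(size=(20, 8))   # sub-dictionary for class index 0 (defective)
D_clean = rng.normal(size=(20, 8))       # sub-dictionary for class index 1 (defective-free)
y = D_defective @ rng.normal(size=8)     # a module well explained by the defective class

lstsq_code = lambda D, v: np.linalg.lstsq(D, v, rcond=None)[0]
print(classify_by_reconstruction_error(y, [D_defective, D_clean], lstsq_code))  # -> 0
```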
Suppose that $A = [A_1, \ldots, A_i, \ldots, A_c]$ is the set of training samples (labeled software modules), $A_i$ is the subset of the training samples from class $i$, and $X = [X_1, \ldots, X_i, \ldots, X_c]$ is the coding coefficient matrix of $A$ over $D$, that is, $A \approx DX$, where $X_i$ is the sub-matrix containing the coding coefficients of $A_i$ over $D$. We require that $D$ should have not only powerful reconstruction capability of $A$ but also powerful discriminative capability for the classes in $A$. Thus, we propose the cost-sensitive discriminative dictionary learning (CDDL) model as follows:

$$J_{(D,X)} = \arg\min_{(D,X)} \left\{ r(A, D, X) + \lambda \|X\|_1 \right\} \tag{3.1}$$
where $r(A, D, X)$ is the discriminative fidelity term, $\|X\|_1$ is the sparsity constraint, and $\lambda$ is a balance factor.

Let $X_i = [X_i^1, X_i^2, \ldots, X_i^c]$, where $X_i^j$ is the coding coefficient matrix of $A_i$ over the sub-dictionary $D_j$, and denote the representation of $D_k$ to $A_i$ as $R_k = D_k X_i^k$. First of all, the dictionary $D$ should be able to represent $A_i$ well, and therefore $A_i \approx D X_i = D_1 X_i^1 + \cdots + D_i X_i^i + \cdots + D_c X_i^c$. Secondly, since $D_i$ is associated with the $i$th class, it is expected that $A_i$ should be well represented by $D_i$ (not by $D_j$, $j \neq i$), which means that both $\|A_i - D_i X_i^i\|_F^2$ and $\|D_j X_i^j\|_F^2$ should be minimized. Thus the discriminative fidelity term is
$$r(A, D, X) = \sum_{i=1}^{c} r(A_i, D, X_i) = \sum_{i=1}^{c} \left( \|A_i - D X_i\|_F^2 + \left\|A_i - D_i X_i^i\right\|_F^2 + \sum_{\substack{j=1 \\ j \neq i}}^{c} \left\|D_j X_i^j\right\|_F^2 \right) \tag{3.2}$$

An intuitive explanation of the three terms in $r(A_i, D, X_i)$ is shown in Fig. 3.1.

[Fig. 3.1 Illustration of the discriminative fidelity term]

In software defect prediction, there are two kinds of modules: the defective modules and the defective-free modules. Figure 3.1a shows that if we only minimize
the term $\|A_i - D X_i\|_F^2$ on the total dictionary $D$, $R_i$ may deviate much from $A_i$, so that the sub-dictionary $D_i$ could not represent $A_i$ well. In order to achieve both powerful reconstruction capability and powerful discriminative capability, we add two further terms, $\|A_i - D_i X_i^i\|_F^2$ (which minimizes the reconstruction error on the sub-dictionary of its own class) and $\|D_j X_i^j\|_F^2$ (which minimizes the reconstruction term on the sub-dictionary of the other class); both of them should also be minimized. Figure 3.1b shows that the proposed discriminative fidelity term can overcome the problem in Fig. 3.1a.

As previously stated, misclassifying defective-free modules increases the development cost, while misclassifying defective ones incurs risk cost. Cost-sensitive learning can incorporate the different misclassification costs into the classification process. In this section, we emphasize the risk cost, so we add the penalty factor $cost(i, j)$ to increase the punishment when a defective software module is predicted as a defective-free software module. As a result, cost-sensitive dictionary learning makes the prediction incline to classify a module as a defective one and generates a dictionary for classification with minimum misclassification cost. The discriminative fidelity term with penalty factors is

$$r(A, D, X) = \sum_{i=1}^{c} r(A_i, D, X_i) = \sum_{i=1}^{c} \left[ \|A_i - D X_i\|_F^2 + \left\|A_i - D_i X_i^i\right\|_F^2 + \sum_{j=1}^{c} cost(i, j) \left\|D_j X_i^j\right\|_F^2 \right] \tag{3.3}$$
Since there are only two classes in software defect prediction (the defective class and the defective-free class), that is, $c = 2$, the model of cost-sensitive discriminative dictionary learning is

$$J_{(D,X)} = \arg\min_{(D,X)} \left\{ \sum_{i=1}^{2} \left[ \|A_i - D X_i\|_F^2 + \left\|A_i - D_i X_i^i\right\|_F^2 + \sum_{j=1}^{2} cost(i, j) \left\|D_j X_i^j\right\|_F^2 \right] + \lambda \|X\|_1 \right\} \tag{3.4}$$
where the cost matrix is shown in Table 3.1.

Table 3.1 Cost matrix for CDDL

                          Predicts defective one    Predicts defective-free one
Actually defective        0                         cost(1, 2)
Actually defective-free   cost(2, 1)                0
The CDDL objective function in Formula 3.4 can be divided into two sub-problems: updating $X$ with $D$ fixed and updating $D$ with $X$ fixed. The optimization procedure is iteratively implemented to obtain the desired discriminative dictionary $D$ and the corresponding coefficient matrix $X$.

First, suppose that $D$ is fixed; the objective function is then reduced to a sparse coding problem to compute $X = [X_1, X_2]$. Here $X_1$ and $X_2$ are calculated one by one: we calculate $X_1$ with $X_2$ fixed and then compute $X_2$ with $X_1$ fixed. Thus, Formula 3.4 is rewritten as

$$J_{(X_i)} = \arg\min_{(X_i)} \left\{ \|A_i - D X_i\|_F^2 + \left\|A_i - D_i X_i^i\right\|_F^2 + \sum_{j=1}^{2} cost(i, j) \left\|D_j X_i^j\right\|_F^2 + \lambda \|X_i\|_1 \right\} \tag{3.5}$$
Formula 3.5 can be solved by using the IPM algorithm in [1]. When $X$ is fixed, we in turn update $D_1$ and $D_2$: when we calculate $D_1$, $D_2$ is fixed, and when we compute $D_2$, $D_1$ is fixed. Thus, Formula 3.4 is rewritten as

$$J_{(D_i)} = \arg\min_{(D_i)} \left\{ \left\| A - D_i X^i - \sum_{\substack{j=1 \\ j \neq i}}^{2} D_j X^j \right\|_F^2 + \left\|A_i - D_i X_i^i\right\|_F^2 + \sum_{j=1}^{2} cost(i, j) \left\|D_j X_i^j\right\|_F^2 \right\} \tag{3.6}$$
where $X^i$ is the coding coefficient matrix of $A$ over $D_i$. Formula 3.6 is a quadratic programming problem, and we can solve it by using the algorithm in [2]. By utilizing the PCA technique, we initialize the sub-dictionary for each class. Given the low data dimension in software defect prediction, PCA can create a fully initialized sub-dictionary for every class; this means that all sub-dictionaries have an equal number of atoms, which is generally equal to the data dimension. The algorithm of CDDL converges since its two alternating optimization sub-problems are both convex. Figure 3.2 illustrates the convergence of the algorithm.
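As a structural illustration of the alternating scheme described above, the sketch below alternates a sparse-coding step (D fixed) with a dictionary-update step (X fixed), starting from PCA-initialized sub-dictionaries. It is a simplified stand-in: the lasso coder replaces the IPM solver, a least-squares fit replaces the quadratic-programming update of [2], and the cost-sensitive discriminative terms of CDDL are omitted.

```python
import numpy as np
from sklearn.decomposition import PCA, sparse_encode

def init_subdictionary(A_i, n_atoms):
    """PCA-based initialization of a class-specific sub-dictionary (atoms as rows)."""
    pca = PCA(n_components=n_atoms)
    pca.fit(A_i)
    D_i = pca.components_                       # shape (n_atoms, n_features)
    return D_i / np.linalg.norm(D_i, axis=1, keepdims=True)

def alternating_dictionary_learning(A, D, lam=0.1, n_iter=15):
    """Alternate between sparse coding (D fixed) and dictionary update (X fixed)."""
    for _ in range(n_iter):
        X = sparse_encode(A, D, algorithm="lasso_lars", alpha=lam)   # coding step
        D = np.linalg.lstsq(X, A, rcond=None)[0]                     # dictionary step
        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12        # renormalize atoms
    return D, X

# Hypothetical defective / defective-free module metrics.
rng = np.random.default_rng(0)
A1, A2 = rng.normal(size=(40, 20)), rng.normal(size=(120, 20))
D0 = np.vstack([init_subdictionary(A1, 10), init_subdictionary(A2, 10)])
D, X = alternating_dictionary_learning(np.vstack([A1, A2]), D0)
print(D.shape, X.shape)   # (20, 20) (160, 20)
```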
3.1.1.2 Experiments
To evaluate our CDDL approach, we conduct a series of experiments. For all selected datasets, we use a 1:1 random division to obtain the training and testing sets for all compared methods. Since the random division may affect the prediction performance, we repeat the random division and prediction 20 times and report the average prediction results in the following discussions.
[Fig. 3.2 Convergence of the realization algorithm of CDDL on four NASA benchmark datasets (total objective function value vs. iteration number). (a) CM1 dataset. (b) KC1 dataset. (c) MW1 dataset. (d) PC1 dataset]
In our approach, in order to emphasize the risk cost, the ratio of cost(1, 2) to cost(2, 1) is set to 1:5. For other projects, users can select a different cost ratio [3]. The parameter is determined by searching a wide range of values and choosing the one that yields the best F-measure value. We compare the proposed CDDL approach with several representative methods, particularly those presented in the last five years, including support vector machine (SVM) [4], compressed C4.5 decision tree (CC4.5) [5], weighted Naive Bayes (NB) [6], coding based ensemble learning (CEL) [7], and cost-sensitive boosting neural network (CBNN) [8]. In this section, we present the detailed experimental results of our CDDL approach and the compared methods.
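For illustration, the sketch below emulates the 1:5 cost ratio with class weights in a stand-in classifier and selects a regularization value by the best F-measure on a held-out split; the classifier, data, and search grid are assumptions and do not reproduce the CDDL parameter search itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced defect data and a 1:1 train/test split.
X, y = make_classification(n_samples=500, n_features=21,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

# Emulate the 1:5 misclassification-cost ratio through class weights and pick the
# regularization strength that maximizes F-measure, mirroring the search described above.
best = (None, -1.0)
for c in np.logspace(-3, 3, 13):
    clf = LogisticRegression(C=c, class_weight={0: 1, 1: 5},
                             max_iter=1000).fit(X_tr, y_tr)
    f1 = f1_score(y_te, clf.predict(X_te))
    if f1 > best[1]:
        best = (c, f1)
print(f"selected C={best[0]:g}, F-measure={best[1]:.3f}")
```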
3.1.1.3 Discussions
Table 3.2 shows the Pd and Pf values of our approach and the compared methods on ten NASA datasets. For each dataset, the Pd and Pf values of all methods are the means over 20 runs. The Pf results suggest that, despite not acquiring the best Pf values on most datasets, CDDL can achieve
Table 3.2 Experimental results: Pd and Pf comparisons on NASA's ten datasets

Dataset        SVM     CC4.5   NB      CEL     CBNN    CDDL
CM1      Pd    0.15    0.26    0.44    0.43    0.59    0.74
         Pf    0.04    0.11    0.18    0.15    0.29    0.37
JM1      Pd    0.53    0.37    0.14    0.32    0.54    0.68
         Pf    0.45    0.17    0.32    0.14    0.29    0.35
KC1      Pd    0.19    0.40    0.31    0.37    0.69    0.81
         Pf    0.02    0.12    0.06    0.13    0.30    0.37
KC3      Pd    0.33    0.41    0.46    0.29    0.51    0.71
         Pf    0.08    0.16    0.21    0.12    0.25    0.34
MC2      Pd    0.51    0.64    0.35    0.56    0.79    0.83
         Pf    0.24    0.49    0.09    0.38    0.54    0.29
MW1      Pd    0.21    0.29    0.49    0.25    0.61    0.79
         Pf    0.04    0.09    0.19    0.11    0.25    0.25
PC1      Pd    0.66    0.38    0.36    0.46    0.54    0.86
         Pf    0.19    0.09    0.11    0.13    0.17    0.29
PC3      Pd    0.64    0.34    0.28    0.41    0.65    0.77
         Pf    0.41    0.08    0.09    0.13    0.25    0.28
PC4      Pd    0.72    0.49    0.39    0.48    0.66    0.89
         Pf    0.16    0.07    0.13    0.06    0.18    0.28
PC5      Pd    0.71    0.50    0.32    0.37    0.79    0.84
         Pf    0.22    0.02    0.14    0.13    0.08    0.06

Table 3.3 Average Pd value of 10 NASA datasets

          SVM     CC4.5   NB      CEL     CBNN    CDDL
Average   0.47    0.41    0.35    0.39    0.64    0.79
comparatively better results in contrast with other methods. We can also observe that the Pd values of CDDL are higher than the corresponding values of all other methods: CDDL achieves the highest Pd value on every dataset. The results reflect that the proposed CDDL approach takes the misclassification costs into consideration, which makes the prediction tend to classify defective-free modules as defective ones in order to obtain higher Pd values. We also calculate the average Pd values over the ten NASA datasets in Table 3.3. The average Pd value of our approach is higher than that of the other methods, and CDDL improves the average Pd value by at least 0.15 (= 0.79 − 0.64).

Table 3.4 shows the F-measure values of our approach and the compared methods on the ten NASA datasets. The F-measure values of CDDL are better than those of the other methods on all datasets, which means that our proposed approach outperforms the other methods and achieves the desired prediction effect. According to the average F-measure values shown in Table 3.4, CDDL improves the average F-measure value at
Table 3.4 F-measure values on ten NASA datasets

Datasets   SVM     CC4.5   NB      CEL     CBNN    CDDL
CM1        0.20    0.25    0.32    0.27    0.33    0.38
JM1        0.29    0.34    0.33    0.33    0.38    0.40
KC1        0.29    0.39    0.38    0.36    0.41    0.47
KC3        0.38    0.38    0.38    0.33    0.38    0.44
MC2        0.52    0.48    0.45    0.49    0.56    0.63
MW1        0.27    0.27    0.31    0.27    0.33    0.38
PC1        0.35    0.32    0.28    0.32    0.32    0.41
PC3        0.28    0.29    0.29    0.36    0.38    0.42
PC4        0.47    0.49    0.36    0.48    0.46    0.55
PC5        0.16    0.48    0.33    0.36    0.37    0.59
Average    0.32    0.37    0.34    0.35    0.39    0.47
Table 3.5 P-values between CDDL and other compared methods on ten NASA datasets

Dataset   CDDL vs. SVM    CDDL vs. CC4.5   CDDL vs. NB      CDDL vs. CEL     CDDL vs. CBNN
CM1       1.23 × 10^-8    3.51 × 10^-6     4.24 × 10^-4     1.80 × 10^-4     1.01 × 10^-4
JM1       7.51 × 10^-18   2.33 × 10^-13    1.27 × 10^-14    1.58 × 10^-13    0.0564
KC1       1.20 × 10^-14   1.23 × 10^-9     8.38 × 10^-13    2.80 × 10^-11    9.69 × 10^-6
KC3       0.0265          0.0089           3.22 × 10^-4     1.61 × 10^-4     4.24 × 10^-4
MC2       1.26 × 10^-4    2.61 × 10^-5     1.13 × 10^-8     7.58 × 10^-6     1.01 × 10^-4
MW1       1.14 × 10^-3    2.31 × 10^-4     1.10 × 10^-3     1.84 × 10^-5     2.20 × 10^-3
PC1       2.64 × 10^-4    2.41 × 10^-5     1.60 × 10^-8     1.69 × 10^-5     1.68 × 10^-8
PC3       7.79 × 10^-14   7.73 × 10^-9     1.04 × 10^-8     4.03 × 10^-5     4.31 × 10^-5
PC4       7.32 × 10^-8    7.26 × 10^-4     2.81 × 10^-16    4.26 × 10^-6     1.75 × 10^-10
PC5       3.01 × 10^-18   7.00 × 10^-9     1.90 × 10^-14    1.30 × 10^-12    2.13 × 10^-11
least by 0.08 (= 0.47 − 0.39). To sum up, Tables 3.3 and 3.4 show that our approach achieves the best Pd and F-measure values. To statistically analyze the F-measure results given in Table 3.4, we conduct a statistical test, namely McNemar's test [9]. This test can assess the statistical significance of the differences between CDDL and the other methods. Here, McNemar's test uses a significance level of 0.05: if the p-value is below 0.05, the performance difference between two compared methods is considered statistically significant. Table 3.5 shows the p-values between CDDL and the other compared methods on the ten NASA datasets, where only one value is slightly above 0.05. According to Table 3.5, the proposed approach indeed makes a significant difference in comparison with the other methods for software defect prediction.
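As an illustration of how such a test can be run, the sketch below applies McNemar's test to the predictions of two hypothetical defect predictors using statsmodels; the simulated predictions are assumptions, not the experimental data reported above.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical predictions of two defect predictors on the same test modules.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
pred_a = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # ~80% correct
pred_b = np.where(rng.random(200) < 0.7, y_true, 1 - y_true)   # ~70% correct

# 2x2 table of correct/incorrect agreement between the two predictors.
a_ok, b_ok = pred_a == y_true, pred_b == y_true
table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]

result = mcnemar(table, exact=False, correction=True)
print(f"McNemar p-value: {result.pvalue:.4f}")   # significant if below 0.05
```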
3.1.2 Collaborative Representation Classification Based Software Defect Prediction

3.1.2.1 Methodology
Figure 3.3 shows the flowchart of defect prediction in our approach, which includes three steps. The first step is a Laplace sampling process on the defective-free modules to construct the training dataset. Second, the prediction model is trained by using the CRC based learner. Finally, the CRC based predictor classifies whether new modules are defective or defective-free.

In metric based software defect prediction, the number of defective-free modules is much larger than that of defective ones, that is, the class imbalance problem may occur. In this section, we conduct Laplace score sampling for the training samples, which addresses the class imbalance problem effectively.

Sparse representation classification (SRC) represents a testing sample collaboratively by samples of all classes. SRC assumes that there are enough training samples for each class so that the dictionary is over-complete. Unfortunately, the number of defective modules is usually small. If we use such an under-complete dictionary to represent a defective module, the representation error may be large and the classification will be unstable. Fortunately, software modules share similarities, so some samples from one class may be very helpful in representing the testing samples of other classes. In CRC, this "lack of samples" problem is addressed by taking the software modules from the other class as possible samples of each class.

The main idea of the CRC technique is that the information of a signal can be collaboratively represented by a linear combination of a few elementary signals. We utilize $A = [A_1, A_2] \in \mathbb{R}^{m \times n}$ to denote the set of training samples processed by Laplace sampling, and $y$ denotes a testing sample. In order to collaboratively represent the query sample using $A$ with low computational burden, we use the regularized least square method as follows:

$$\hat{X} = \arg\min_{X} \left\{ \|y - A \cdot X\|_2^2 + \lambda \|X\|_2^2 \right\} \tag{3.7}$$
[Fig. 3.3 CRC based software defect prediction flowchart]
where $\lambda$ is the regularization parameter. The role of the regularization term is twofold. First, it makes the least square solution stable. Second, it introduces a certain amount of "sparsity" into the solution $\hat{X}$, although this sparsity is much weaker than that induced by the $l_1$-norm. The solution of collaborative representation with regularized least squares in Eq. 3.7 can be derived analytically as

$$\hat{X} = \left( A^T A + \lambda \cdot I \right)^{-1} A^T y \tag{3.8}$$
Let $P = (A^T A + \lambda \cdot I)^{-1} A^T$. Clearly, $P$ is independent of $y$, so it can be precomputed as a projection matrix. Hence, a query sample $y$ can simply be projected onto $P$ via $Py$, which makes the collaborative representation very fast. After training the CRC based learner, we can use the collaborative representation classification with regularized least squares (CRC_RLS) algorithm for prediction. For a test sample $y$, we code $y$ over $A$ and get $\hat{X}$. In addition to the class-specific representation residual $\|y - A_i \cdot \hat{X}_i\|_2$, where $\hat{X}_i$ is the coefficient vector associated with class $i$ ($i = 1, 2$), the $l_2$-norm "sparsity" $\|\hat{X}_i\|_2$ can also bring some discrimination information for classification. Thus we use both of them in classification and calculate the regularized residual of each class as $r_i = \|y - A_i \cdot \hat{X}_i\|_2 / \|\hat{X}_i\|_2$. The test sample $y$ is assigned to the $i$th class corresponding to the smallest regularized residual $r_i$.
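The complete CRC_RLS prediction step (Eqs. 3.7 and 3.8 plus the regularized residual rule) can be sketched in a few lines of NumPy; the synthetic class-structured data, the value of λ, and the omission of Laplace sampling are illustrative assumptions.

```python
import numpy as np

def crc_rls_fit(A, lam=0.5):
    """Precompute the CRC projection matrix P = (A^T A + lam*I)^(-1) A^T (cf. Eq. 3.8)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T)      # shape (n, m)

def crc_rls_predict(y, A, P, labels):
    """Code y over A via x = P y, then assign y to the class with the
    smallest regularized residual r_i = ||y - A_i x_i|| / ||x_i||."""
    x_hat = P @ y
    best_class, best_r = None, np.inf
    for c in np.unique(labels):
        idx = labels == c
        r = np.linalg.norm(y - A[:, idx] @ x_hat[idx]) / np.linalg.norm(x_hat[idx])
        if r < best_r:
            best_class, best_r = c, r
    return best_class

# Hypothetical usage: columns of A are metric vectors of labeled modules.
rng = np.random.default_rng(0)
center_def, center_clean = rng.normal(size=(20, 1)), rng.normal(size=(20, 1))
A = np.hstack([center_def + 0.1 * rng.normal(size=(20, 20)),     # 20 defective modules
               center_clean + 0.1 * rng.normal(size=(20, 40))])  # 40 defective-free modules
labels = np.array([1] * 20 + [2] * 40)                           # 1: defective, 2: defective-free

P = crc_rls_fit(A)
y = center_def[:, 0] + 0.1 * rng.normal(size=20)                 # a new defect-like module
print(crc_rls_predict(y, A, P, labels))                          # expected: 1
```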
3.1.2.2 Experiments
In the experiments, ten datasets from the NASA Metrics Data Program are taken as test data. We compare the proposed approach with several representative software defect prediction methods, including compressed C4.5 decision tree (CC4.5), weighted Naive Bayes (NB), cost-sensitive boosting neural network (CBNN), and coding based ensemble learning (CEL).
3.1.2.3 Discussions
We use recall (Pd), false positive rate (Pf), precision (Pre), and F-measure as prediction accuracy evaluation indexes. A good prediction model should achieve high recall and precision; however, there exists a trade-off between the two, and F-measure is the harmonic mean of precision and recall. Note that these quality indexes are commonly used in the field of software defect prediction. Table 3.6 shows the average Pd, Pf, Pre, and F-measure values of our CSDP approach and the compared methods on the ten NASA datasets, where each value is the mean over 20 random runs.
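For reference, the following small helper computes these indexes from predicted and actual labels (treating 1 as defective); it is a generic sketch, not the evaluation code used in the experiments.

```python
import numpy as np

def prediction_indexes(y_true, y_pred):
    """Compute Pd (recall), Pf (false positive rate), Pre (precision), and F-measure,
    treating label 1 as defective and 0 as defective-free."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    pd = tp / (tp + fn)
    pf = fp / (fp + tn)
    pre = tp / (tp + fp)
    f_measure = 2 * pre * pd / (pre + pd)   # harmonic mean of precision and recall
    return pd, pf, pre, f_measure

print(prediction_indexes([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```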
Table 3.6 Average Pd, Pf, Pre, and F-measure values of 20 random runs on ten NASA datasets

Evaluation indexes   CC4.5   NB      CEL     CBNN    CSDP
Pd                   0.408   0.354   0.394   0.637   0.745
Pf                   0.140   0.152   0.148   0.260   0.211
Pre                  0.342   0.347   0.324   0.288   0.343
F-measure            0.371   0.342   0.354   0.390   0.465
Our approach acquires better prediction accuracy than the other methods. In particular, it improves the average Pd by at least 16.95% (= (0.745 − 0.637)/0.637) and the average F-measure by at least 19.23% (= (0.465 − 0.390)/0.390).
3.2 Semi-supervised WPDP

3.2.1 Sample-Based Software Defect Prediction with Active and Semi-supervised Learning

3.2.1.1 Methodology
Software defect prediction, which aims to predict whether a particular software module contains any defects, can be cast as a classification problem in machine learning, where software metrics are extracted from each software module to form an example with a manually assigned label: defective (having one or more defects) or non-defective (no defects). A classifier is then learned from these training examples for the purpose of predicting the defect-proneness of unknown software modules. In this section, we propose a sample-based defect prediction approach which does not rely on the assumption that the current project has the same defect characteristics as the historical projects. Given a newly finished project, unlike the previous studies that leverage the modules in historical projects for classifier learning, sample-based defect prediction samples a small portion of modules for extensive testing in order to reliably label the sampled modules, while the defect-proneness of unsampled modules remains unknown. Then, a classifier is constructed based on the sample of software modules (the labeled data) and is expected to provide accurate predictions for the unsampled modules (the unlabeled data). Here, conventional machine learners (e.g., logistic regression, decision tree, Naive Bayes, etc.) can be applied for classification.

In practice, modern software systems often consist of hundreds or even thousands of modules. An organization is usually not able to afford extensive testing for all modules, especially when time and resources are limited. In this case, the organization can only manage to sample a small percentage of modules and test them for defect-proneness. A classifier would then have to be learned from a small training
set with defect-proneness labels. Thus, the key for sample-based defect prediction to be cost-effective is to learn a well-performing classifier while keeping the sample size small. To improve the performance of sample-based defect prediction, we propose to apply semi-supervised learning for classifier construction, which first learns an initial classifier from a small sample of labeled training data and then refines it by further exploiting a larger number of available unlabeled data.

In semi-supervised learning, an effective paradigm is known as disagreement-based semi-supervised learning, where multiple learners are trained for the same task and the disagreements among the learners are exploited during learning. In this paradigm, unlabeled data can be regarded as a special information exchange "platform." If one learner is much more confident on a disagreed unlabeled example than the other learner(s), then this learner will teach the other(s) with this example; if all learners are comparably confident on a disagreed unlabeled example, then this example may be selected for query. Many well-known disagreement-based semi-supervised learning methods have been developed. In this study, we apply CoForest for defect prediction. It builds on a well-known ensemble learning algorithm named random forest [10] to tackle the problems of determining the most confident examples to label and producing the final hypothesis. The pseudocode of CoForest is presented in Table 3.1. Briefly, it works as follows. Let L denote the labeled dataset and U denote the unlabeled dataset. First, N random trees are initialized from training sets bootstrap-sampled from the labeled dataset L to create a random forest. Then, in each learning iteration, each random tree is refined with the original labeled examples L and the newly labeled examples $L'$ selected by its concomitant ensemble (i.e., the ensemble of the other random trees except for the current tree). The learning process iterates until a certain stopping criterion is reached. Finally, the prediction is made by majority voting over the ensemble of random trees. Note that in this way, CoForest is able to exploit the advantages of both semi-supervised learning and ensemble learning simultaneously, as suggested in Xu et al. [11].

In CoForest, the stopping criterion is essential to guarantee good performance. Li and Zhou [12] derived a stopping criterion based on the theoretical findings in Angluin and Laird [13]. By enforcing the worst case generalization error of a random tree in the current round to be less than that in the preceding round, they derived that the semi-supervised learning process will be beneficial if the following condition is satisfied:
$$\frac{\hat{e}_{i,t}}{\hat{e}_{i,t-1}} < \frac{W_{i,t-1}}{W_{i,t}} < 1$$
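To make the procedure concrete, the sketch below implements a heavily simplified, disagreement-based variant of the idea: each tree is retrained with unlabeled modules that the ensemble of the other trees labels confidently, and the final prediction is a majority vote. The error-estimation, example-weighting, and stopping-criterion machinery of the real CoForest algorithm is omitted, and the data, thresholds, and round counts are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def coforest_like(X_l, y_l, X_u, n_trees=6, n_rounds=5, threshold=0.75, seed=0):
    """Simplified disagreement-based sketch (binary labels 0/1 assumed):
    each tree is refined with unlabeled examples confidently labeled by the
    ensemble of the *other* trees (its concomitant ensemble)."""
    trees = []
    for i in range(n_trees):                      # bootstrap initialization from L
        Xb, yb = resample(X_l, y_l, random_state=seed + i)
        trees.append(DecisionTreeClassifier(random_state=seed + i).fit(Xb, yb))
    for _ in range(n_rounds):
        new_trees = []
        for i in range(n_trees):
            others = [t for j, t in enumerate(trees) if j != i]
            votes = np.stack([t.predict(X_u) for t in others])   # hard votes of the others
            frac = votes.mean(axis=0)                            # fraction voting "defective"
            pseudo = (frac >= 0.5).astype(int)
            conf = np.maximum(frac, 1 - frac)                    # agreement as confidence
            pick = conf >= threshold                             # confidently labeled examples L'
            X_aug = np.vstack([X_l, X_u[pick]])
            y_aug = np.concatenate([y_l, pseudo[pick]])
            new_trees.append(DecisionTreeClassifier(random_state=seed + i).fit(X_aug, y_aug))
        trees = new_trees
    return trees

def predict_majority(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)               # majority vote

# Minimal usage with synthetic module metrics: a small labeled set and a large unlabeled pool.
X, y = make_classification(n_samples=400, n_features=20, weights=[0.8, 0.2], random_state=1)
X_l, y_l, X_u = X[:60], y[:60], X[60:]
forest = coforest_like(X_l, y_l, X_u)
print(predict_majority(forest, X_u)[:10])
```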