Modern programming environments automatically collect lots of data on software development, notably changes and defects. The field of mining software archives deals with the automated extraction, collection, and abstraction of information from this data. This is the introduction to a special issue of IEEE Software on Mining Software Archives presenting a selection of the exciting research that is taking place in the field.
Which components of a large software system are the most defect-prone? In a study on a large SAP Java system, we evaluated and compared a number of defect predictors, based on code features such as complexity metrics, static error detectors, change frequency, or component imports, thus replicating a number of earlier case studies in an industrial context. We found the overall predictive power to be lower than expected; still, the resulting regression models successfully predicted 50–60% of the 20% most defect-prone components.
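A figure such as "50–60% of the 20% most defect-prone components" can be read as an overlap between two rankings. The following is a minimal sketch of that reading, not the study's evaluation code: the function name, the component names, and all numbers are hypothetical, and the assumption is that the hit rate is the share of the actual top 20% that also appears in the predicted top 20%.

```python
def top_fraction_hit_rate(predicted, actual, fraction=0.2):
    """Share of the actual top `fraction` most defect-prone components
    that also appear in the predicted top `fraction`.

    `predicted` and `actual` map component names to scores
    (predicted defect-proneness and observed defect counts)."""
    if predicted.keys() != actual.keys():
        raise ValueError("predicted and actual must cover the same components")
    if not actual:
        raise ValueError("at least one component is required")

    # Number of components in the top slice (at least one).
    k = max(1, round(len(actual) * fraction))

    def top_k(scores):
        # Highest score first; ties are broken by name for determinism.
        ranked = sorted(scores, key=lambda name: (-scores[name], name))
        return set(ranked[:k])

    return len(top_k(predicted) & top_k(actual)) / k


# Hypothetical data for ten components (illustrative values only).
actual_defects = {"A": 12, "B": 9, "C": 1, "D": 0, "E": 3,
                  "F": 2, "G": 0, "H": 5, "I": 1, "J": 4}
predicted_scores = {"A": 10.5, "B": 2.0, "C": 1.0, "D": 0.5, "E": 3.0,
                    "F": 2.5, "G": 0.1, "H": 8.0, "I": 1.2, "J": 4.1}

hit_rate = top_fraction_hit_rate(predicted_scores, actual_defects)
```

With ten components the top 20% holds two of them, so the hit rate here can only be 0, 0.5, or 1; a real evaluation would use far more components.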
Scrutinize is a web-based tool that takes information from a source code repository and presents it in a way that lets project team members learn how the project has been changing and who has made those changes. Try Scrutinize.
In software development, bug reports provide crucial information to developers. However, these reports differ widely in quality. We conducted a survey among developers and users of APACHE, ECLIPSE, and MOZILLA to find out what makes a good bug report. The analysis of the 466 responses revealed an information mismatch between what developers need and what users supply. Most developers consider steps to reproduce, stack traces, and test cases helpful, yet these are also the items users find most difficult to provide. Such insight helps in designing new bug tracking tools that guide users in collecting and providing more helpful information. Our CUEZILLA prototype is such a tool: it measures the quality of new bug reports and recommends which elements should be added to improve that quality. We trained CUEZILLA on a sample of 289 bug reports, rated by developers as part of the survey. In our experiments, CUEZILLA was able to predict the quality of 31–48% of bug reports accurately.
In a survey, we found that most developers have experienced duplicate bug reports; however, only a few considered them a serious problem. This contradicts the popular wisdom that bug duplicates are a serious problem for open source projects. In the survey, developers also pointed out that the additional information provided by duplicates helps to resolve bugs more quickly. In this paper, we therefore propose to merge bug duplicates rather than treating them separately. We quantify the amount of information that is added for developers and show that automatic triaging can be improved as well. In addition, we discuss the different reasons why users submit duplicate bug reports in the first place.
Developers typically rely on the information submitted by end users to resolve bugs. We conducted a survey on information needs and commonly faced problems with bug reporting among several hundred developers and users of the APACHE, ECLIPSE, and MOZILLA projects. In this paper, we present the results of a card sort on the 175 comments sent back to us by the survey respondents. The card sort revealed several hurdles involved in reporting and resolving bugs, which we present in a collection of recommendations for the design of new bug tracking systems. Such systems could provide contextual assistance, reminders to add information, and, most importantly, assistance in collecting and reporting crucial information to developers.
How do problem domains impact software features? We mine software code bases to relate problem domains (characterized by imports) to code features such as complexity, size, or quality. The resulting predictors take the specific imports of a component and predict its size, complexity, and quality metrics. In an experiment involving 89 plug-ins of the ECLIPSE project, we found good prediction accuracy for most metrics. Since the predictors rely only on import relationships, and since these are available at design time, our approach allows for early estimation of crucial software metrics.
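To illustrate how import relationships alone can drive a prediction, here is a minimal sketch that estimates a metric for a new component from the components whose import sets are most similar. It is not the predictor used in the study: nearest-neighbour averaging over Jaccard similarity is a stand-in technique chosen for brevity, and the function names and metric values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def predict_metric(imports, training, k=2):
    """Estimate a metric for a component from its imports alone.

    `training` is a list of (import_set, metric_value) pairs for known
    components. The estimate is the mean metric of the k components
    whose import sets are most similar to `imports`."""
    if not training:
        raise ValueError("training data must not be empty")
    # sorted() is stable, so equally similar components keep input order.
    ranked = sorted(training,
                    key=lambda item: jaccard(imports, item[0]),
                    reverse=True)
    nearest = ranked[:k]
    return sum(metric for _, metric in nearest) / len(nearest)


# Hypothetical training data: import sets and an illustrative size metric.
training = [
    ({"org.eclipse.swt", "org.eclipse.jface"}, 120.0),
    ({"org.eclipse.swt"}, 80.0),
    ({"org.eclipse.core.resources"}, 300.0),
]

estimate = predict_metric({"org.eclipse.swt", "org.eclipse.jface"}, training)
```

Because the imports of a component are known at design time, an estimate like this can be computed before any of the component's code is written.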
Objective: This paper aims to generate explanations from a series of data points obtained from ReleasePlanner®, a decision support system for product release planning that is treated as a black box. Method: Concept analysis is applied to 1085 data points received from running 10 scenarios of a real-world product release planning project with 35 candidate solutions generated by ReleasePlanner®. Results: Three main results are obtained: (1) patterns between inputs and outputs; (2) an evaluation of the impact of individual input parameters on outputs; and (3) the sensitivity level of outputs with respect to inputs. Conclusion: Concept analysis is shown to be a feasible technique for gaining more insight into the structure of results obtained from a black box input-output system, such as, but not limited to, ReleasePlanner®.
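As a rough illustration of how concept analysis can surface input-output patterns, the sketch below enumerates the formal concepts of a small binary context. It assumes that "concept analysis" here means formal concept analysis; the scenario names and attributes are hypothetical and are not taken from the ReleasePlanner® data.

```python
def formal_concepts(context):
    """Enumerate all formal concepts (extent, intent) of a binary context.

    `context` maps each object to the set of attributes it has. Every
    intent is an intersection of object attribute sets (or the full
    attribute set), so the intents are built by closing under
    intersection, and each extent is then derived from its intent."""
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for attrs in context.values():
        # Intersect the new object's attributes with every known intent.
        intents |= {frozenset(attrs) & intent for intent in intents}

    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(obj for obj, attrs in context.items()
                           if intent <= attrs)
        concepts.append((extent, intent))
    # Order by intent size, then alphabetically, for stable output.
    return sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))


# Hypothetical context: planning scenarios and the input or output
# properties observed in each of them.
context = {
    "scenario_1": {"high_budget", "feature_A_released"},
    "scenario_2": {"high_budget"},
    "scenario_3": {"feature_A_released"},
}

concepts = formal_concepts(context)
```

Each concept groups the scenarios that share exactly a given set of properties, which is one way to read off patterns such as "feature A was released in scenarios 1 and 3, of which only scenario 1 had a high budget".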