How to Get Data to Mine Software Repositories

Looking for data to mine software repositories? This post lists six pointers on how to get data. Happy Mining. And don’t forget: the submission deadline for MSR 2010 – Mining Software Repositories is soon (January 11th/14th).

Please post any datasets that I missed in the comments.

1. MSR Mining Challenge

Every year the MSR conference hosts a mining challenge, which features data from open-source projects. The following data is available:

  • Mirrors of the version archives and bug databases of Mozilla/Firefox (MSR Challenge 2007) and Eclipse (MSR Challenge 2007 and 2008).
  • Repository logs of more than 500 Gnome projects, an XML dump of the Gnome bug database, and the complete SVN repositories of 69 Gnome projects (MSR Challenge 2009).
  • The Ultimate Debian Database and mirrors of the FreeBSD operating system and distribution (SVN, CVS, and bugs), as well as preprocessed data for these projects (MSR Challenge 2010).

2. Eclipse Bug Data! and iBUGS

The Eclipse bug dataset contains the number of pre-release and post-release defects in Eclipse 2.0, 2.1, and 3.0. The dataset also contains complexity metrics and links bug reports to the checkins that fixed them. The data is available at both the file level and the package level.

iBUGS is a related dataset that focuses on providing a benchmark for defect localization tools. For AspectJ and Rhino, it contains 382 bugs together with their fixes; for 252 bugs it also includes an associated test case that exhibits the bug.

3. FLOSSMetrics

The LibreSoft group in Spain runs the FLOSSMetrics project, which has data about version archives, mailing lists, and bug databases for 2,800+ open-source projects. The tools to create the data are also open-source.

4. PROMISE Repository

The PROMISE repository has a great collection of more than 90 software engineering datasets. And don’t miss the PROMISE 2010 conference in Timisoara, Romania.

5. Rsync, CVSup, and the Eclipse Archive

Many open-source projects run Rsync and CVSup servers that allow you to mirror repositories. For example, to show all available modules for KDE, use
rsync master.kde.org::

To mirror the module “svnmirror” (which is the entire SVN repository), use
rsync --progress -za --timeout=3600 --delete master.kde.org::svnmirror PATH_TO_MIRROR
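
The two commands above can be wrapped in a small script so that the host, module, and target directory are easy to swap. This is only a sketch: the variable names MIRROR_HOST, MIRROR_MODULE, MIRROR_DEST, and DRY_RUN and the helper mirror_cmd are made up for illustration, the defaults simply repeat the KDE example, and the script assumes the server still exports that module. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch: mirror one rsync module into a local directory.
# All variable names here are illustrative; the defaults repeat the KDE example.
MIRROR_HOST="${MIRROR_HOST:-master.kde.org}"
MIRROR_MODULE="${MIRROR_MODULE:-svnmirror}"
MIRROR_DEST="${MIRROR_DEST:-./mirror}"

# Print the rsync command line for mirroring module $2 on host $1 into $3.
mirror_cmd() {
    printf 'rsync --progress -za --timeout=3600 --delete %s::%s %s\n' "$1" "$2" "$3"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (the default): show the command, do not contact the server.
    mirror_cmd "$MIRROR_HOST" "$MIRROR_MODULE" "$MIRROR_DEST"
else
    # Real run: create the target directory and start (or resume) the mirror.
    mkdir -p "$MIRROR_DEST"
    rsync --progress -za --timeout=3600 --delete \
        "$MIRROR_HOST::$MIRROR_MODULE" "$MIRROR_DEST"
fi
```

Run it with DRY_RUN=0 to actually start the transfer. Because of --delete, files that disappear on the server are also removed from the local copy, so point MIRROR_DEST at a directory that is used only for the mirror.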

Not all projects have Rsync or CVSup servers, and finding the servers can be tricky: it often requires intensive searching (for example, with Bing) and some guesswork.

Another data gem is the Eclipse Archive, where you get a weekly snapshot of the entire CVS repository of Eclipse. Use this data to set up a local CVS mirror and your analysis will run much faster. There are also archived builds.
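
Once a snapshot is downloaded, a local mirror can be as simple as an unpacked directory that you pass to cvs with -d. The sketch below rests on several assumptions: the snapshot file name, the target directory, and the module name are placeholders, and the archive is assumed to be a gzipped tarball that unpacks with the CVSROOT directory at its top level. Check the actual layout of the snapshot before relying on it.

```shell
#!/bin/sh
# Sketch: use an unpacked CVS snapshot as a local repository.
# SNAPSHOT, CVS_MIRROR and MODULE are placeholders, not real archive names.
SNAPSHOT="${SNAPSHOT:-eclipse-cvs-snapshot.tar.gz}"
CVS_MIRROR="${CVS_MIRROR:-$PWD/eclipse-cvs-mirror}"   # must be an absolute path
MODULE="${MODULE:-org.eclipse.jdt.core}"

# A directory can serve as a local CVS root only if it contains CVSROOT/.
is_cvs_root() {
    [ -d "$1/CVSROOT" ]
}

if [ ! -f "$SNAPSHOT" ]; then
    echo "Snapshot $SNAPSHOT not found; download it first." >&2
else
    mkdir -p "$CVS_MIRROR"
    tar -xzf "$SNAPSHOT" -C "$CVS_MIRROR"
    if is_cvs_root "$CVS_MIRROR"; then
        # Read the full revision history straight from the local repository,
        # without checking out a working copy.
        cvs -d "$CVS_MIRROR" rlog "$MODULE" > "$MODULE.rlog"
    else
        echo "No CVSROOT directory in $CVS_MIRROR; adjust CVS_MIRROR." >&2
    fi
fi
```

Since the repository is on the local disk, commands such as rlog or checkout no longer go over the network, which is where the speedup comes from.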

6. CodePlex

Last but not least, CodePlex is a project hosting community for open-source software. CodePlex hosts Team Foundation Server (TFS) repositories, which can be easily mined through the TFS API.