Looking for data to mine software repositories? This post lists six pointers on how to get data. Happy Mining. And don’t forget: the submission deadline for MSR 2010 – Mining Software Repositories is soon (January 11th/14th).
Please post any datasets that I missed in the comments.
1. MSR Mining Challenge
Every year the MSR conference hosts a mining challenge, which features data from open-source projects. The following data is available:
- Mirrors of the version archives and bug databases of Mozilla/Firefox (MSR Challenge 2007) and Eclipse (MSR Challenge 2007 and 2008).
- Repository logs of more than 500 Gnome projects, an XML dump of the Gnome bug database, and the complete SVN repositories of 69 Gnome projects (MSR Challenge 2009).
- The Ultimate Debian Database and mirrors of the FreeBSD operating system and distribution (SVN, CVS, and bugs), as well as preprocessed data for these projects (MSR Challenge 2010).
2. Eclipse Bug Data! and iBUGS
The Eclipse bug dataset contains the number of pre-release and post-release defects in Eclipse 2.0, 2.1, and 3.0. The dataset also contains complexity metrics and links bug reports to the checkins that fixed them. The data is available at both the file and the package level.
iBUGS is a related dataset, with a focus on providing a benchmark for defect localization tools. For AspectJ and Rhino, the dataset contains 382 bugs including their fixes; for 252 bugs, it also provides an associated test case that exhibits the bug.
3. FLOSSMetrics
The LibreSoft group in Spain runs the FLOSSMetrics project, which has data about version archives, mailing lists, and bug databases for 2,800+ open-source projects. The tools to create the data are also open-source.
4. PROMISE Repository
5. Rsync, CVSup, and the Eclipse Archive
To mirror the KDE module “svnmirror” (which is the entire SVN repository), use:
rsync --progress -za --timeout=3600 --delete master.kde.org::svnmirror PATH_TO_MIRROR
Not all projects have rsync or CVSup servers, and finding the servers can be tricky; it often requires intensive searching (Bing) and guesswork.
Another data gem is the Eclipse Archive, where you get a weekly snapshot of the entire CVS repository of Eclipse. Use this data to set up a local CVS mirror and your analysis will run much faster. There are also archived builds.
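Once you have found a server, the rsync call above is easy to reuse. Here is a minimal sketch of a wrapper, assuming the other server accepts the same options as the KDE one. The function names `mirror_cmd` and `mirror` and the `RUN_MIRROR` variable are made up for this example; they are not part of rsync.

```shell
#!/bin/sh
# Sketch only: mirror_cmd, mirror, and RUN_MIRROR are hypothetical names.
# The rsync options are the ones from the KDE example above.

# Print the rsync command for a given host, module, and destination.
mirror_cmd() {
    host=$1      # rsync server, e.g. master.kde.org
    module=$2    # rsync module, e.g. svnmirror
    dest=$3      # local directory that receives the mirror
    printf 'rsync --progress -za --timeout=3600 --delete %s::%s %s\n' \
        "$host" "$module" "$dest"
}

# Show the command first; start the transfer only when RUN_MIRROR=1 is set,
# because --delete removes local files that are gone on the server.
mirror() {
    mirror_cmd "$1" "$2" "$3"
    if [ "${RUN_MIRROR:-0}" = "1" ]; then
        rsync --progress -za --timeout=3600 --delete "$1::$2" "$3"
    fi
}
```

Calling `mirror master.kde.org svnmirror PATH_TO_MIRROR` only prints the command; prefix the call with `RUN_MIRROR=1` to actually start the transfer.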