awesome-msr

A curated repository of software engineering repository mining data sets

GitHub

415 stars
36 watching
67 forks
last commit: almost 4 years ago
Linked from 4 awesome lists

awesome, awesome-list, dataset, ghtorrent, mining, msr

Awesome Empirical Software Engineering

This list requires your input for its continuous improvement. Read the contribution guide for instructions on how you can contribute. Alternatively, you can contact me if you find the process too cumbersome or confusing.
For more awesome lists, see awesome (327,194 stars, last commit 26 days ago).

Awesome Empirical Software Engineering / Repositories

SIR Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data
PROMISE About 20 datasets related to software engineering research
FLOSSmole Collaborative collection and analysis of free/libre/open source project data
Zenodo Software data collections in CERN's open-access repository

Awesome Empirical Software Engineering / Repositories / Zenodo

Software Engineering Artifacts Can Really Assist Future Tasks
Empirical Software Engineering
Mining Software Repositories

Awesome Empirical Software Engineering / Data Sets

AndroidTimeMachine Graph-based dataset of commit history of 8,431 real-world Android apps
AndroZoo Collection of Android Applications
Bug Prediction Dataset Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories
Code Reviews Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse
CoREBench Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils
Cryptocurrency GitHub Activity and Market Cap Dataset Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also
Defects4J 714 17 days ago Collection of 395 reproducible bugs collected with the goal of advancing software testing research
Eclipse AERI stacktraces Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system
Enron Spreadsheets and Emails All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'
Findbugs-maven 2 about 9 years ago Set of FindBugs reports for the Java projects of the
GHTorrent Scalable, queriable, offline mirror of data offered through the GitHub REST API
GitHub Bug Dataset Bug Dataset of 15 Java open-source projects characterized by static source code metrics
GitHub on Google BigQuery GitHub data accessible through Google's BigQuery platform
Grammar Zoo Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata
KaVE Developer tool interaction data
Linux Kernel 4.21 Call Graphs The Linux Kernel 4.21 Call Graphs produced using
Maven metrics 0 over 9 years ago Collection of software complexity & sizing metrics for the
Maven Dependency Graph Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database
mzdata 7 over 8 years ago Multi-extract and multi-level dataset of Mozilla issue tracking history
npm-miner 1 over 4 years ago The dataset contains the analysis results of 5 open source software quality tools (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2000 popular (in terms of stars and downloads) npm packages
OCL Expressions on GitHub 4 almost 2 years ago Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories
RepoReapers Data Set Data set containing a collection of from GHTorrent
Software Heritage Graph Dataset Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, GitLab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation
STAMINA (STAte Machine INference Approaches) data are used to benchmark techniques for learning deterministic finite state machines (FSMs)
Stack Exchange Anonymized dump of all user-contributed content on the Stack Exchange network
TravisTorrent Provides free and easy-to-use Travis CI build analyses
Ultimate Debian Database (UDD) Data about various aspects of Debian (e.g. packages, bugs, maintainers) in the same SQL database
Unified Bug Dataset Static source code based datasets, including the Bugcatchers Bug Dataset and several other bug datasets
Unix history 6,542 over 2 years ago Git repository with 46 years of Unix history evolution

Awesome Empirical Software Engineering / Tools

astminer 280 10 months ago Library and tool for mining of path-based representations of code and other data derived from ASTs
Boa Domain-specific language and infrastructure that eases mining software repositories
buckwheat 24 over 2 years ago Multi-language tokenizer for extracting identifiers from source code
ckjm Chidamber and Kemerer Java Metrics
Coming 92 4 months ago A Java framework for analyzing code changes and mining instances of change patterns from Git repositories
CryptOSS 8 over 5 years ago Mine GitHub activity and market cap data for cryptocurrency projects
DbDeo 13 over 6 years ago Extract embedded SQL statements and detect database schema smells
Designite Compute source code metrics and detect a variety of implementation, design, and architecture smells for C#
DesigniteJava 172 6 months ago Compute source code metrics and detect a variety of implementation and design smells for Java
Diggit 20 about 3 years ago Agile Ruby Tool to analyze Git repositories
GrimoireLab Free/Libre/Open Source tools for Software Development Analytics
MetricMiner Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories
Maven-miner 30 about 2 years ago Java tools and infrastructure to resolve the whole Maven dependency graph, hosted in Maven Central, in the form of a Graph
Perceval 288 13 days ago Fetch repository data from tens of back-ends
Puppeteer 38 about 4 years ago Detect configuration smells in Puppet code
PyDriller 831 about 1 month ago Python Framework to analyse Git repositories
qmcalc 63 over 2 years ago Calculate quality metrics from C source code
reaper 106 about 4 years ago Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is
RefactoringMiner 356 13 days ago Library/API for detection of refactorings in changes of Java code
VulData7 40 over 5 years ago Java framework enabling the automated collection of commits fixing vulnerabilities that are reported in NVD (links NVD with Git)
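
Several of the tools above extract per-commit data (authors, dates, messages) from Git repositories. The following is a minimal standard-library sketch of that kind of extraction. It is not the API of PyDriller, Perceval, or any other listed tool. The function names (`read_git_log`, `parse_git_log`, `commits_per_author`) and the separator characters are assumptions chosen for illustration, and it assumes the `git` command-line client is installed.

```python
import subprocess
from collections import Counter

# Control characters used as separators; they are unlikely to occur in
# commit metadata. (Illustrative choice, not a convention of any listed tool.)
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"
LOG_FORMAT = FIELD_SEP.join(["%H", "%an", "%ad", "%s"]) + RECORD_SEP


def read_git_log(repo_path):
    """Return raw `git log` output for the repository at repo_path."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--date=short",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(raw):
    """Split raw log text into a list of commit dictionaries."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip("\n")
        if not record:
            continue  # skip the empty tail after the last separator
        sha, author, date, subject = record.split(FIELD_SEP, 3)
        commits.append(
            {"sha": sha, "author": author, "date": date, "subject": subject}
        )
    return commits


def commits_per_author(commits):
    """Count commits by author name."""
    return Counter(commit["author"] for commit in commits)
```

For a local clone, `commits_per_author(parse_git_log(read_git_log(".")))` would give a per-author commit count. The dedicated tools in this list go much further (diffs, refactorings, multiple back-ends), so this sketch only illustrates the basic idea.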

Awesome Empirical Software Engineering / Research Outlets / Outlets exclusively devoted to empirical software engineering research

Empirical Software Engineering journal
MSR: Mining Software Repositories conference
PROMISE: Predictive Models and Data Analytics in Software Engineering conference

Awesome Empirical Software Engineering / Research Outlets / Outlets that publish empirical software engineering research

ACM Transactions on Software Engineering and Methodology (TOSEM)
ESEC/FSE: ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ICSE: International Conference on Software Engineering
IEEE Software magazine
IEEE Transactions on Software Engineering
Journal of Systems and Software
SANER: IEEE International Conference on Software Analysis, Evolution and Reengineering
