awesome-msr
Software engineering datasets
A curated collection of data sets and tools for research in software engineering
Awesome Empirical Software Engineering | |||
contribution guide | This list requires your input for its continuous improvement. Read the contribution guide for instructions on how you can contribute. Alternatively, you can send me an email if you find the process too cumbersome or confusing. | ||
awesome | 334,113 | about 4 hours ago | For more awesome lists, see |
Awesome Empirical Software Engineering / Repositories | |||
SIR | Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data | ||
PROMISE | About 20 datasets related to software engineering research | ||
FLOSSmole | Collaborative collection and analysis of free/libre/open source project data | ||
Zenodo | Software data collections in CERN's open-access repository | ||
Awesome Empirical Software Engineering / Repositories / Zenodo | |||
Software Engineering Artifacts Can Really Assist Future Tasks | |||
Empirical Software Engineering | |||
Mining Software Repositories | |||
Awesome Empirical Software Engineering / Data Sets | |||
AndroidTimeMachine | Graph-based dataset of commit history of 8,431 real-world Android apps | ||
AndroZoo | Collection of Android Applications | ||
Bug Prediction Dataset | Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories | ||
Code Reviews | Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse | ||
CoREBench | Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils | ||
Cryptocurrency GitHub Activity and Market Cap Dataset | Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also available | ||
Defects4J | 743 | 18 days ago | Collection of 395 reproducible bugs collected with the goal of advancing software testing research |
Eclipse AERI stacktraces | Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system | ||
Enron Spreadsheets and Emails | All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis' | ||
Findbugs-maven | 2 | over 9 years ago | Set of FindBugs reports for the Java projects of the |
GHTorrent | Scalable, queryable, offline mirror of data offered through the GitHub REST API | ||
GitHub Bug Dataset | Bug Dataset of 15 Java open-source projects characterized by static source code metrics | ||
GitHub on Google BigQuery | GitHub data accessible through Google's BigQuery platform | ||
Grammar Zoo | Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata | ||
KaVE | Developer tool interaction data | ||
Linux Kernel 4.21 Call Graphs | The Linux Kernel 4.21 Call Graphs produced using | ||
Maven metrics | 0 | over 9 years ago | Collection of software complexity & sizing metrics for the |
Maven Dependency Graph | Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database | ||
mzdata | 7 | over 8 years ago | Multi-extract and multi-level dataset of Mozilla issue tracking history |
npm-miner | 1 | over 4 years ago | Analysis results of five open source software quality tools (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2000 popular (in terms of stars and downloads) npm packages |
OCL Expressions on GitHub | 4 | about 2 years ago | Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories |
RepoReapers Data Set | Data set containing a collection of from GHTorrent | ||
Software Heritage Graph Dataset | Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, GitLab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation ( ) | ||
STAMINA | (STAte Machine INference Approaches) data are used to benchmark techniques for learning deterministic finite state machines (FSMs) | ||
Stack Exchange | Anonymized dump of all user-contributed content on the Stack Exchange network | ||
TravisTorrent | Provides free and easy-to-use Travis CI build analyses | ||
Ultimate Debian Database (UDD) | Data about various aspects of Debian (e.g. packages, bugs, maintainers) in the same SQL database | ||
Unified Bug Dataset | Static source code based datasets which includes the Bugcatchers Bug Dataset, the , the , the , some datasets from the repository | ||
Unix history | 6,595 | over 2 years ago | Git repository with 46 years of Unix history evolution |
Awesome Empirical Software Engineering / Tools | |||
astminer | 282 | 12 months ago | Library and tool for mining of path-based representations of code and other data derived from ASTs |
Boa | Domain-specific language and infrastructure that eases mining software repositories | ||
buckwheat | 24 | over 2 years ago | Multi-language tokenizer for extracting identifiers from source code |
ckjm | Chidamber and Kemerer Java Metrics | ||
Coming | 92 | 8 days ago | A Java framework for analyzing code changes and mining instances of change patterns from Git repositories |
CryptOSS | 7 | over 5 years ago | Mine GitHub activity and market cap data for cryptocurrency projects |
DbDeo | 13 | almost 7 years ago | Extract embedded SQL statements and detect database schema smells |
Designite | Compute source code metrics and detect a variety of implementation, design, and architecture smells for C# | ||
DesigniteJava | 173 | 8 months ago | Compute source code metrics and detect a variety of implementation and design smells for Java |
Diggit | 20 | about 3 years ago | Agile Ruby Tool to analyze Git repositories |
GrimoireLab | Free/Libre/Open Source tools for Software Development Analytics | ||
MetricMiner | Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories | ||
Maven-miner | 31 | about 2 years ago | Java tools and infrastructure to resolve the whole Maven dependency graph, hosted in Maven Central, in the form of a Graph |
Perceval | 290 | 8 days ago | Fetch repository data from tens of back-ends |
Puppeteer | 38 | about 4 years ago | Detect configuration smells in Puppet code |
PyDriller | 840 | 21 days ago | Python Framework to analyse Git repositories |
qmcalc | 64 | over 2 years ago | Calculate quality metrics from C source code |
reaper | 107 | about 4 years ago | Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is |
RefactoringMiner | 372 | 4 days ago | Library/API for detection of refactorings in changes of Java code |
VulData7 | 40 | almost 6 years ago | Java framework enabling the automated collection of commits fixing vulnerabilities that are reported in NVD (links NVD with Git) |
Awesome Empirical Software Engineering / Research Outlets / Outlets exclusively devoted to empirical software engineering research | |||
Empirical Software Engineering journal | |||
MSR: Mining Software Repositories conference | |||
PROMISE: Predictive Models and Data Analytics in Software Engineering conference | |||
Awesome Empirical Software Engineering / Research Outlets / Outlets that publish empirical software engineering research | |||
ACM Transactions on Software Engineering and Methodology (TOSEM) | |||
ESEC/FSE: ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering | |||
ICSE: International Conference on Software Engineering | |||
IEEE Software magazine | |||
IEEE Transactions on Software Engineering | |||
Journal of Systems and Software | |||
SANER: IEEE International Conference on Software Analysis, Evolution and Reengineering |