awesome-machine-learning-on-source-code

Source Code ML

A curated list of research papers, datasets, and projects exploring machine learning applications on source code

Cool links & research papers related to Machine Learning applied to source code (MLonCode)

GitHub

6k stars

357 watching

843 forks

last commit: about 5 years ago

Linked from 3 awesome lists

awesome, awesome-list, machine-learning, machine-learning-on-source-code

Awesome Machine Learning On Source Code / Digests
Learning from "Big Code"			Techniques, challenges, tools, datasets on "Big Code"
A Survey of Machine Learning for Big Code and Naturalness			Survey and literature review on Machine Learning on Source Code
Awesome Machine Learning On Source Code / Conferences
ACM International Conference on Software Engineering, ICSE
ACM International Conference on Automated Software Engineering, ASE
ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE)
2018 IEEE 25th International Conference on Software Analysis, Evolution, and Reengineering (SANER)
Machine Learning for Programming
Workshop on NLP for Software Engineering
SysML
	Talks
Mining Software Repositories
AIFORSE
source{d} tech talks
	Talks
NIPS Neural Abstract Machines and Program Induction workshop
	Talks
CamAIML
	Learning to Code: Machine Learning for Program Induction			Alexander Gaunt
MASES 2018
Awesome Machine Learning On Source Code / Competitions
CodRep	92	over 6 years ago	competition on automatic program repair: given a source line, find the insertion point
Awesome Machine Learning On Source Code / Papers
Program Synthesis and Semantic Parsing with Learned Code Idioms			Richard Shin, Miltiadis Allamanis, Marc Brockschmidt, Oleksandr Polozov, 2019
Synthetic Datasets for Neural Program Synthesis			Richard Shin, Neel Kant, Kavi Gupta, Chris Bender, Brandon Trabucco, Rishabh Singh, Dawn Song, ICLR 2019
Execution-Guided Neural Program Synthesis			Xinyun Chen, Chang Liu, Dawn Song, ICLR 2019
DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing			Xiao Liu, Xiaoting Li, Rupesh Prajapati, Dinghao Wu, AAAI 2019
NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System			Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst, LREC 2018
Recent Advances in Neural Program Synthesis			Neel Kant, 2018
Neural Sketch Learning for Conditional Program Generation			Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine, ICLR 2018
Neural Program Search: Solving Programming Tasks from Description and Examples			Illia Polosukhin, Alexander Skidanov, ICLR 2018
Neural Program Synthesis with Priority Queue Training			Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le, 2018
Towards Synthesizing Complex Programs from Input-Output Examples			Xinyun Chen, Chang Liu, Dawn Song, ICLR 2018
Glass-Box Program Synthesis: A Machine Learning Approach			Konstantina Christakopoulou, Adam Tauman Kalai, AAAI 2018
Synthesizing Benchmarks for Predictive Modeling			Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather, CGO 2017
Program Synthesis for Character Level Language Modeling			Pavol Bielik, Veselin Raychev, Martin Vechev, ICLR 2017
SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning			Xiaojun Xu, Chang Liu, Dawn Song, 2017
Learning to Select Examples for Program Synthesis			Yewen Pu, Zachery Miranda, Armando Solar-Lezama, Leslie Pack Kaelbling, 2017
Neural Program Meta-Induction			Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew Hausknecht, Pushmeet Kohli, NIPS 2017
Learning to Infer Graphics Programs from Hand-Drawn Images			Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Joshua B. Tenenbaum, 2017
Neural Attribute Machines for Program Generation			Matthew Amodio, Swarat Chaudhuri, Thomas Reps, 2017
Abstract Syntax Networks for Code Generation and Semantic Parsing			Maxim Rabinovich, Mitchell Stern, Dan Klein, ACL 2017
Making Neural Programming Architectures Generalize via Recursion			Jonathon Cai, Richard Shin, Dawn Song, ICLR 2017
A Syntactic Neural Model for General-Purpose Code Generation			Pengcheng Yin, Graham Neubig, ACL 2017
Program Synthesis from Natural Language Using Recurrent Neural Networks			Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, Michael Ernst, 2017
RobustFill: Neural Program Learning under Noisy I/O			Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli, ICML 2017
Lifelong Perceptual Programming By Example			Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow, 2017
Neural Programming by Example			Chengxun Shu, Hongyu Zhang, AAAI 2017
DeepCoder: Learning to Write Programs			Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow, ICLR 2017
A Differentiable Approach to Inductive Logic Programming			Fan Yang, Zhilin Yang, and William W. Cohen, 2017
Latent Attention For If-Then Program Synthesis			Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, Mingcheng Chen, NIPS 2016
Latent Predictor Networks for Code Generation			Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom, ACL 2016
Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version)			Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao, NIPS 2016
Programs as Black-Box Explanations			Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin, NIPS 2016
Search-Based Generalization and Refinement of Code Templates			Tim Molderez, Coen De Roover, SSBSE 2016
Structured Generative Models of Natural Source Code			Chris J. Maddison, Daniel Tarlow, ICML 2014
Modeling Vocabulary for Big Code Machine Learning			Hlib Babii, Andrea Janes, Romain Robbes, 2019
Generative Code Modeling with Graphs			Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, Oleksandr Polozov, ICLR 2019
NL2Type: Inferring JavaScript Function Types from Natural Language Information			Rabee Sohail Malik, Jibesh Patra, Michael Pradel, ICSE 2019
A Novel Neural Source Code Representation based on Abstract Syntax Tree			Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, Xudong Liu, ICSE 2019
Deep Learning Type Inference			Vincent J. Hellendoorn, Christian Bird, Earl T. Barr and Miltiadis Allamanis, FSE 2018
Tree2Tree Neural Translation Model for Learning Source Code Changes			Saikat Chakraborty, Miltiadis Allamanis, Baishakhi Ray, 2018
code2seq: Generating Sequences from Structured Representations of Code			Uri Alon, Omer Levy, Eran Yahav, 2018
Syntax and Sensibility: Using language models to detect and correct syntax errors			Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral, SANER 2018
code2vec: Learning Distributed Representations of Code			Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, 2018
Learning to Represent Programs with Graphs			Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi, ICLR 2018
A Survey of Machine Learning for Big Code and Naturalness			Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton, 2017
Are Deep Neural Networks the Best Choice for Modeling Source Code?			Vincent J. Hellendoorn, Premkumar Devanbu, FSE 2017
A deep language model for software code			Hoa Khanh Dam, Truyen Tran, Trang Pham, 2016
Convolutional Neural Networks over Tree Structures for Programming Language Processing			Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin, AAAI 2016
Suggesting Accurate Method and Class Names			Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton, FSE 2015
Mining Source Code Repositories at Massive Scale using Language Modeling			Miltiadis Allamanis, Charles Sutton, MSR 2013
Learning Compositional Neural Programs with Recursive Tree Search and Planning			Thomas Pierrot, Guillaume Ligner, Scott Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, Nando de Freitas, 2019
From Programs to Interpretable Deep Models and Back			Eran Yahav, ICCAV 2018
Neural Code Comprehension: A Learnable Representation of Code Semantics			Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler, NIPS 2018
A General Path-Based Representation for Predicting Program Properties			Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, PLDI 2018
Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks			Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu, AAAI 2018
Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification			Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang, SANER 2018
Syntax-Directed Variational Autoencoder for Structured Data			Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, Le Song, ICLR 2018
Divide and Conquer with Neural Networks			Nowak, Alex, and Joan Bruna, ICLR 2018
Hierarchical multiscale recurrent neural networks			Junyoung Chung, Sungjin Ahn, and Yoshua Bengio, ICLR 2017
Learning Efficient Algorithms with Hierarchical Attentive Memory			Andrychowicz, Marcin, and Karol Kurach, 2016
Learning Operations on a Stack with Neural Turing Machines			Deleu, Tristan, and Joseph Dureau, NIPS 2016
Probabilistic Neural Programs			Murray, Kenton W., and Jayant Krishnamurthy, NIPS 2016
Neural Programmer-Interpreters			Reed, Scott, and Nando de Freitas, ICLR 2016
Neural GPUs Learn Algorithms			Kaiser, Łukasz, and Ilya Sutskever, ICLR 2016
Neural Random-Access Machines			Karol Kurach, Marcin Andrychowicz, Ilya Sutskever, ERCIM News 2016
Neural Programmer: Inducing Latent Programs with Gradient Descent			Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever, ICLR 2015
Learning to Execute			Wojciech Zaremba, Ilya Sutskever, 2015
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets			Joulin, Armand, and Tomas Mikolov, NIPS 2015
Neural Turing Machines			Graves, Alex, Greg Wayne, and Ivo Danihelka, 2014
From Machine Learning to Machine Reasoning			Léon Bottou, Journal of Machine Learning 2011
A Literature Study of Embeddings on Source Code			Zimin Chen and Martin Monperrus, 2019
AST-Based Deep Learning for Detecting Malicious PowerShell			Gili Rusak, Abdullah Al-Dujaili, Una-May O'Reilly, 2018
Deep Code Search			Xiaodong Gu, Hongyu Zhang, Sunghun Kim, ICSE 2018
Word Embeddings for the Software Engineering Domain	40	almost 8 years ago	Vasiliki Efstathiou, Christos Chatzilenas, Diomidis Spinellis, MSR 2018
Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces			Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, Thomas Reps, FSE 2018
Document Distance Estimation via Code Graph Embedding			Zeqi Lin, Junfeng Zhao, Yanzhen Zou, Bing Xie, Internetware 2017
Combining Word2Vec with revised vector space model for better code retrieval			Thanh Van Nguyen, Anh Tuan Nguyen, Hung Dang Phan, Trong Duc Nguyen, Tien N. Nguyen, ICSE 2017
From word embeddings to document similarities for improved information retrieval in software engineering			Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, Chang Liu, ICSE 2016
Mapping API Elements for Code Migration with Vector Representation			Trong Duc Nguyen, Anh Tuan Nguyen, Tien N. Nguyen, ICSE 2016
Towards Neural Decompilation			Omer Katz, Yuval Olshaker, Yoav Goldberg, Eran Yahav, 2019
Tree-to-tree Neural Networks for Program Translation			Xinyun Chen, Chang Liu, Dawn Song, ICLR 2018
Code Attention: Translating Code to Comments by Exploiting Domain Features			Wenhao Zheng, Hong-Yu Zhou, Ming Li, Jianxin Wu, 2017
Automatically Generating Commit Messages from Diffs using Neural Machine Translation			Siyuan Jiang, Ameer Armaly, Collin McMillan, ASE 2017
A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation			Antonio Valerio Miceli Barone, Rico Sennrich, IJCNLP 2017
A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes			Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo, ACL 2017
Aroma: Code Recommendation via Structural Code Search			Sifei Luan, Di Yang, Koushik Sen and Satish Chandra, 2019
Intelligent Code Reviews Using Deep Learning			Anshul Gupta, Neel Sundaresan, KDD DL Day 2018
Code Completion with Neural Attention and Pointer Networks			Jian Li, Yue Wang, Irwin King, Michael R. Lyu, 2017
Learning Python Code Suggestion with a Sparse Pointer Network			Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel, 2016
Code Completion with Statistical Language Models			Veselin Raychev, Martin Vechev, Eran Yahav, PLDI 2014
SampleFix: Learning to Correct Programs by Sampling Diverse Fixes			Hossein Hajipour, Apratim Bhattacharya, Mario Fritz, 2019
Maximal Divergence Sequential Autoencoder for Binary Software Vulnerability Detection			Tue Le, Tuan Nguyen, Trung Le, Dinh Phung, Paul Montague, Olivier De Vel, Lizhen Qu, ICLR 2019
Neural Program Repair by Jointly Learning to Localize and Repair			Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, Rishabh Singh, ICLR 2019
Compiler Fuzzing through Deep Learning			Chris Cummins, Pavlos Petoumenos, Alastair Murray, Hugh Leather, ISSTA 2018
Automatically assessing vulnerabilities discovered by compositional analysis			Saahil Ognawala, Ricardo Nales Amato, Alexander Pretschner and Pooja Kulkarni, MASES 2018
An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation			Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk, ASE 2018
DeepBugs: A Learning Approach to Name-based Bug Detection			Michael Pradel, Koushik Sen, 2018
Learning How to Mutate Source Code from Bug-Fixes			Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk, 2018
A deep tree-based model for software defect prediction			HK Dam, T Pham, SW Ng, J Grundy, A Ghose, T Kim, CJ Kim, 2018
Automated Vulnerability Detection in Source Code Using Deep Representation Learning			Rebecca L. Russell, Louis Kim, Lei H. Hamilton, Tomo Lazovich, Jacob A. Harer, Onur Ozdemir, Paul M. Ellingwood, Marc W. McConley, 2018
Shaping Program Repair Space with Existing Patches and Similar Code			Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, Xiangqun Chen, 2018
Learning to Repair Software Vulnerabilities with Generative Adversarial Networks			Jacob A. Harer, Onur Ozdemir, Tomo Lazovich, Christopher P. Reale, Rebecca L. Russell, Louis Y. Kim, Peter Chin, 2018
Dynamic Neural Program Embedding for Program Repair			Ke Wang, Rishabh Singh, Zhendong Su, ICLR 2018
Estimating defectiveness of source code: A predictive model using GitHub content			Ritu Kapur, Balwinder Sodhi, 2018
Automated software vulnerability detection with machine learning			Jacob A. Harer, Louis Y. Kim, Rebecca L. Russell, Onur Ozdemir, Leonard R. Kosta, Akshay Rangamani, Lei H. Hamilton, Gabriel I. Centeno, Jonathan R. Key, Paul M. Ellingwood, Marc W. McConley, Jeffrey M. Opper, Peter Chin, Tomo Lazovich, IWSPA 2018
Learning a Static Analyzer from Data			Pavol Bielik, Veselin Raychev, Martin Vechev, CAV 2017
To Type or Not to Type: Quantifying Detectable Bugs in JavaScript			Zheng Gao, Christian Bird, Earl Barr, ICSE 2017
Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities			Martin White, Michele Tufano, Matías Martínez, Martin Monperrus, Denys Poshyvanyk, 2017
Semantic Code Repair using Neuro-Symbolic Transformation Networks			Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli, 2017
Automated Identification of Security Issues from Commit Messages and Bug Reports			Yaqin Zhou and Asankhaya Sharma, FSE 2017
SmartPaste: Learning to Adapt Source Code			Miltiadis Allamanis, Marc Brockschmidt, 2017
End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks			Min-je Choi, Sehun Jeong, Hakjoo Oh, Jaegul Choo, IJCAI 2017
Tailored Mutants Fit Bugs Better			Miltiadis Allamanis, Earl T. Barr, René Just, Charles Sutton, 2016
SAR: Learning Cross-Language API Mappings with Little Knowledge			Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang, FSE 2019
Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code			Nghi D. Q. Bui, Lingxiao Jiang, ICSE 2018
DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning			Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, IJCAI 2017
Mining Change Histories for Unknown Systematic Edits			Tim Molderez, Reinout Stevens, Coen De Roover, MSR 2017
Deep API Learning			Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, FSE 2016
Exploring API Embedding for API Usages and Applications			Nguyen, Nguyen, Phan and Nguyen, Journal of Systems and Software 2017
API usage pattern recommendation for software development			Haoran Niu, Iman Keivanloo, Ying Zou, 2017
Parameter-Free Probabilistic API Mining across GitHub			Jaroslav Fowkes, Charles Sutton, FSE 2016
A Subsequence Interleaving Model for Sequential Pattern Mining			Jaroslav Fowkes, Charles Sutton, KDD 2016
Lean GHTorrent: GitHub data on demand			Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, Andy Zaidman, MSR 2014
Mining idioms from source code			Miltiadis Allamanis, Charles Sutton, FSE 2014
The GHTorrent Dataset and Tool Suite			Georgios Gousios, MSR 2013
The Case for Learned Index Structures			Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis, SIGMOD 2018
End-to-end Deep Learning of Optimization Heuristics			Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather, PACT 2017
Learning to superoptimize programs			Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H.S. Torr, Pushmeet Kohli, ICLR 2017
Neural Nets Can Learn Function Type Signatures From Binaries			Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang, USENIX Security Symposium 2017
Adaptive Neural Compilation			Rudy Bunel, Alban Desmaison, Pushmeet Kohli, Philip H.S. Torr, M. Pawan Kumar, NIPS 2016
Learning to Superoptimize Programs - Workshop Version			Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli, NIPS 2016
A Language-Agnostic Model for Semantic Source Code Labeling			Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe and David Slater, MASES 2018
Topic modeling of public repositories at scale using names in source code			Vadim Markovtsev, Eiso Kant, 2017
Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code			Miltiadis Allamanis, Charles Sutton, MSR 2013
Semantic clustering: Identifying topics in source code			Adrian Kuhn, Stéphane Ducasse, Tudor Girba, Information & Software Technology 2007
A Benchmark Study on Sentiment Analysis for Software Engineering Research			Nicole Novielli, Daniela Girardi, Filippo Lanubile, MSR 2018
Sentiment Analysis for Software Engineering: How Far Can We Go?			Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, Rocco Oliveto, ICSE 2018
Leveraging Automated Sentiment Analysis in Software Engineering			Md Rakibul Islam, Minhaz F. Zibran, MSR 2017
Sentiment Polarity Detection for Software Development			Fabio Calefato, Filippo Lanubile, Federico Maiorano, Nicole Novielli, Empirical Software Engineering 2017
SentiCR: A Customized Sentiment Analysis Tool for Code Review Interactions			Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, Shahram Rahimi, ASE 2017
Summarizing Source Code with Transferred API Knowledge			Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, Zhi Jin, IJCAI 2018
Deep Code Comment Generation			Xing Hu, Ge Li, Xin Xia, David Lo, Zhi Jin, ICPC 2018
A Neural Framework for Retrieval and Summarization of Source Code			Qingying Chen, Minghui Zhou, ASE 2018
Improving Automatic Source Code Summarization via Deep Reinforcement Learning			Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu and Philip S. Yu, ASE 2018
A Convolutional Attention Network for Extreme Summarization of Source Code			Miltiadis Allamanis, Hao Peng, Charles Sutton, ICML 2016
TASSAL: Autofolding for Source Code Summarization			Jaroslav Fowkes, Pankajan Chanthirasegaran, Razvan Ranca, Miltiadis Allamanis, Mirella Lapata, Charles Sutton, ICSE 2016
Summarizing Source Code using a Neural Attention Model	237	almost 3 years ago	Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer, ACL 2016
Automatic Generation of Pull Request Descriptions			Zhongxin Liu, Xin Xia, Christoph Treude, David Lo, Shanping Li, ASE 2019
Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection			Lutz Büch and Artur Andrzejak, SANER 2019
Oreo: detection of clones in the twilight zone			Vaibhav Saini, Farima Farmahinifarahani, Yadong Lu, Pierre Baldi, and Cristina V. Lopes, FSE 2018
A Deep Learning Approach to Program Similarity			Niccolò Marastoni, Roberto Giacobazzi and Mila Dalla Preda, MASES 2018
Recurrent Neural Network for Code Clone Detection			Arseny Zorin and Vladimir Itsykson, SEIM 2018
The Adverse Effects of Code Duplication in Machine Learning Models of Code			Miltiadis Allamanis, 2018
DéjàVu: a map of code duplicates on GitHub			Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek, Programming Languages OOPSLA 2017
Some from Here, Some from There: Cross-project Code Reuse in GitHub			Mohammad Gharehyazie, Baishakhi Ray, Vladimir Filkov, MSR 2017
Deep Learning Code Fragments for Code Clone Detection			Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk, ASE 2016
A study of repetitiveness of code changes in software evolution			HA Nguyen, AT Nguyen, TT Nguyen, TN Nguyen, H Rajan, ASE 2013
DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer			Joseph Suarez, Justin Johnson, Fei-Fei Li, 2018
Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator Abstraction			Da Xiao, Jo-Yu Liao, Xingyuan Yuan, ICLR 2018
Differentiable Programs with Neural Libraries			Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow, ICML 2017
Differentiable Functional Program Interpreters			John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow, 2017
Programming with a Differentiable Forth Interpreter			Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel, ICML 2017
Neural Functional Programming			John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow, ICLR 2017
TerpreT: A Probabilistic Programming Language for Program Induction			Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow, NIPS 2016
ClDiff: Generating Concise Linked Code Differences			Kaifeng Huang, Bihuan Chen, Xin Peng, Daihong Zhou, Ying Wang, Yang Liu, Wenyun Zhao, ASE 2018
Generating Accurate and Compact Edit Scripts Using Tree Differencing			Veit Frick, Thomas Grassauer, Fabian Beck, Martin Pinzger, ICSME 2018
Fine-grained and Accurate Source Code Differencing			Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, Martin Monperrus, ASE 2014
Clustering Binary Data with Bernoulli Mixture Models			Neal S. Grantham
A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data			Matthieu Marbac and Mohammed Sedki, Computational Statistics & Data Analysis 2017
BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data			Panagiotis Papastamoulis and Magnus Rattray, R Journal 2016
Robust mixture modelling using the t distribution			D. Peel and G. J. McLachlan, Statistics and Computing 2000
Robust mixture modeling using the skew t distribution			Tsung I. Lin, Jack C. Lee and Wan J. Hsieh, Statistics and Computing 2010
A Fast Unified Model for Parsing and Sentence Understanding			Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, Christopher Potts, ACL 2016
Awesome Machine Learning On Source Code / Posts
Semantic Code Search
Learning from Source Code
Training a Model to Summarize Github Issues
Sequence Intent Classification Using Hierarchical Attention Networks
Syntax-Directed Variational Autoencoder for Structured Data
Weighted MinHash on GPU helps to find duplicate GitHub repositories.
Source Code Identifier Embeddings
Using recurrent neural networks to predict next tokens in the Java solutions
The half-life of code & the ship of Theseus
The eigenvector of "Why we moved from language X to language Y"
Analyzing Github, How Developers Change Programming Languages Over Time
Topic Modeling of GitHub Repositories
Aroma: Using machine learning for code recommendation
Awesome Machine Learning On Source Code / Talks
Machine Learning on Source Code
Similarity of GitHub Repositories by Source Code Identifiers
Using deep RNN to model source code
Source code abstracts classification using CNN (1)
Source code abstracts classification using CNN (2)
Source code abstracts classification using CNN (3)
Embedding the GitHub contribution graph
Measuring code sentiment in a Git repository
Awesome Machine Learning On Source Code / Software
Differentiable Neural Computer (DNC)	2,501	over 4 years ago	TensorFlow implementation of the Differentiable Neural Computer
sourced.ml	141	over 6 years ago	Abstracts feature extraction from source code syntax trees and working with ML models
vecino	48	over 6 years ago	Finds similar Git repositories
apollo	52	over 3 years ago	Source code deduplication at scale, research
gemini	54	over 6 years ago	Source code deduplication at scale, production
enry	460	about 4 years ago	Insanely fast file based programming language detector
hercules	2,643	almost 3 years ago	Git repository mining framework with batteries on top of go-git
DeepCS	279	over 3 years ago	Keras and Pytorch implementations of DeepCS (Deep Code Search)
Code Neuron	12	about 7 years ago	Recurrent neural network to detect code blocks in natural language text
Naturalize	56	over 10 years ago	Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code
Extreme Source Code Summarization	120	over 9 years ago	Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens
Summarizing Source Code using a Neural Attention Model	237	almost 3 years ago	CODE-NN, uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch over C#/SQL
Probabilistic API Miner	53	about 8 years ago	Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences
Interesting Sequence Miner	44	over 7 years ago	Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database
TASSAL	42	over 9 years ago	Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks
JNice2Predict			Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly
Clone Digger			clone detection for Python and Java
Sensibility	18	about 4 years ago	Uses LSTMs to detect and correct syntax errors in Java source code
DeepBugs	148	almost 5 years ago	Framework for learning bug detectors from an existing code corpus
DeepSim	60	over 6 years ago	a deep learning-based approach to measure code functional similarity
rnn-autocomplete	9	over 6 years ago	Neural code autocompletion with RNN (bachelor's thesis)
MindsDB	26,915	about 1 year ago	MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train and use state-of-the-art ML models with as little as one line of code
go-git	4,897	over 3 years ago	Highly extensible Git implementation in pure Go which is friendly to data mining
bblfsh			Self-hosted server for source code parsing
engine	187	about 6 years ago	Scalable and distributed data retrieval pipeline for source code
minhashcuda	114	about 2 years ago	Weighted MinHash implementation on CUDA to efficiently find duplicates
kmcuda	809	about 3 years ago	k-means on CUDA to cluster and to search for nearest neighbors in dense space
wmd-relax	461	over 2 years ago	Python package which finds nearest neighbors at Word Mover's Distance
Tregex, Tsurgeon and Semgrex			Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions")
source{d} models	19	about 6 years ago	Machine Learning models for MLonCode trained using the source{d} stack
Neural-Code-Search-Evaluation-Dataset	123	over 1 year ago	dataset contains links to 4.7M methods from 24k+ repositories with 287 StackOverflow questions and code snippet answers
CodeSearchNet	2,229	almost 4 years ago	collection of datasets and benchmarks for code retrieval using natural language. Contains 2M pairs of (comment, code)
Public Git Archive	323	about 6 years ago	6 TB of Git repositories from GitHub
StackOverflow Question-Code Dataset	166	over 4 years ago	~148K Python and ~120K SQL question-code pairs mined from StackOverflow
GitHub Issue Titles and Descriptions for NLP Analysis			~8 million GitHub issue titles and descriptions from 2017
GitHub repositories - languages distribution			Programming languages distribution in 14,000,000 repositories on GitHub (October 2016)
452M commits on GitHub			≈ 452M commits' metadata from 16M repositories on GitHub (October 2016)
GitHub readme files			Readme files of all GitHub repositories (16M) (October 2016)
from language X to Y			Cache file Erik Bernhardsson collected for his awesome blog post
GitHub word2vec 120k			Sequences of identifiers extracted from top starred 120,000 GitHub repositories
GitHub Source Code Names			Names of source code identifiers (not people) extracted from 13M GitHub repositories
GitHub duplicate repositories			GitHub repositories not marked as forks but very similar to each other
GitHub lng keyword frequencies			Programming language keyword frequency extracted from 16M GitHub repositories
GitHub Java Corpus			GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC
150k Python Dataset			Dataset consisting of 150,000 Python ASTs
150k JavaScript Dataset			Dataset consisting of 150,000 JavaScript files and their parsed ASTs
card2code	242	about 8 years ago	This dataset contains the language to code datasets described in the paper
NL2Bash	452	over 1 year ago	This dataset contains a set of ~10,000 bash one-liners collected from websites such as StackOverflow and their English descriptions written by Bash programmers, as described in the accompanying paper
GitHub JavaScript Dump October 2016			Dataset consisting of 494,352 syntactically-valid JavaScript files obtained from the top ~10000 starred JavaScript repositories on GitHub, with licenses, and parsed ASTs
BigCloneBench			Clone detection benchmark of 8 million function clone pairs in the IJaDataset
Awesome Machine Learning On Source Code / Credits
mast-group			Source of many of the references and articles
Awesome Machine Learning	66,380	about 1 year ago	Inspiration for this list
