Awesome-Code-LLM
[TMLR] A curated list of research on language models for code generation and software development, along with related datasets and benchmarks.
Awesome-Code-LLM / 8. Datasets / 8.2 Benchmarks
paper | "NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System" [ ] [ ] | ||
paper | "Mapping Language to Code in Programmatic Context" [ ] [ ] | ||
paper | "JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation" [ ] [ ] | ||
paper | "Measuring Coding Challenge Competence With APPS" [ ] [ ] | ||
paper | "Evaluating Large Language Models Trained on Code" [ ] [ ] | ||
paper | "Program Synthesis with Large Language Models" [ ] [ ] [ ] | ||
paper | "PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context" [ ] [ ] | ||
paper | "Training and Evaluating a Jupyter Notebook Data Science Assistant" [ ] [ ] | ||
paper | "Competition-Level Code Generation with AlphaCode" [ ] [ ] | ||
paper | "MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages" [ ] [ ] | ||
paper | "AixBench: A Code Generation Benchmark Dataset" [ ] [ ] | ||
paper | "MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation", [ ] [ ] | ||
paper | "Multi-lingual Evaluation of Code Generation Models" [ ] [ ] | ||
paper | "Multi-lingual Evaluation of Code Generation Models" [ ] [ ] | ||
paper | "Multi-lingual Evaluation of Code Generation Models" [ ] [ ] | ||
paper | "Execution-based Evaluation for Data Science Code Generation Models" [ ] [ ] | ||
paper | "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation" [ ] [ ] | ||
paper | "Execution-Based Evaluation for Open-Domain Code Generation" [ ] [ ] | ||
paper | "CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models" [ ] [ ] | ||
paper | "XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [ ] [ ] | ||
paper | "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [ ] [ ] | ||
paper | "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" [ ] [ ] | ||
paper | "StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code" [ ] [ ] | ||
paper | "OctoPack: Instruction Tuning Code Large Language Models" [ ] [ ] | ||
paper | "Guiding Language Models of Code with Global Context using Monitors" [ ] [ ] | ||
paper | "CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models" [ ] [ ] | ||
paper | "VerilogEval: Evaluating Large Language Models for Verilog Code Generation" [ ] [ ] | ||
paper | "ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks" [ ] [ ] | ||
paper | "TACO: Topics in Algorithmic COde generation dataset" [ ] [ ] | ||
paper | "Can Large Language Models Write Parallel Code?" [ ] [ ] | ||
paper | "OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models" [ ] [ ] | ||
paper | "HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization" [ ] [ ] | ||
paper | "Can Language Models Solve Olympiad Programming?" [ ] [ ] | ||
paper | "PECC: Problem Extraction and Coding Challenges" [ ] [ ] | ||
paper | "Constrained Decoding for Secure Code Generation" [ ] [ ] | ||
paper | "NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts" [ ] [ ] | ||
paper | "MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation" [ ] [ ] | ||
paper | "VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation" [ ] | ||
paper | "AICoderEval: Improving AI Domain Code Generation of Large Language Models" [ ] [ ] | ||
paper | "VersiCode: Towards Version-controllable Code Generation" [ ] [ ] | ||
paper | "ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation" [ ] | ||
paper | "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions" [ ] [ ] | ||
paper | "CodeUpdateArena: Benchmarking Knowledge Editing on API Updates" [ ] [ ] | ||
paper | "On Leakage of Code Generation Evaluation Datasets" [ ] [ ] | ||
paper | "NoviCode: Generating Programs from Natural Language Utterances by Novices" [ ] [ ] | ||
paper | "Case2Code: Learning Inductive Reasoning with Synthetic Data" [ ] [ ] | ||
paper | "SciCode: A Research Coding Benchmark Curated by Scientists" [ ] [ ] | ||
paper | "Generating Unseen Code Tests In Infinitum" [ ] | ||
paper | "WebApp1K: A Practical Code-Generation Benchmark for Web App Development" [ ] [ ] | ||
paper | "CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow" [ ] [ ] | ||
paper | "DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation" [ ] [ ] | ||
paper | "ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code" [ ] [ ] | ||
paper | "Contextualized Data-Wrangling Code Generation in Computational Notebooks" [ ] [ ] | ||
paper | "Evaluation of Code LLMs on Geospatial Code Generation" [ ] [ ] | ||
paper | "mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation" [ ] [ ] | ||
paper | "Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists" [ ] [ ] | ||
paper | "GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models" [ ] [ ] | ||
paper | "MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems" [ ] [ ] | ||
paper | "Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots" [ ] [ ] | ||
paper | "ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation" [ ] [ ] | ||
paper | "HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks" [ ] [ ] | ||
paper | "TurtleBench: A Visual Programming Benchmark in Turtle Geometry" [ ] [ ] | ||
paper | "CodeQA: A Question Answering Dataset for Source Code Comprehension" [ ] [ ] | ||
paper | "CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course" [ ] [ ] | ||
paper | "CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models" [ ] [ ] | ||
paper | "CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution" [ ] [ ] | ||
paper | "Multiple-Choice Questions are Efficient and Robust LLM Evaluators" [ ] [ ] | ||
paper | "Aligning LLMs through Multi-perspective User Preference Ranking-based Feedback for Programming Question Answering" [ ] [ ] | ||
paper | "RepoQA: Evaluating Long Context Code Understanding" [ ] [ ] | ||
paper | "CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution" [ ] [ ] | ||
paper | "SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications" [ ] [ ] | ||
paper | "CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs" [ ] [ ] | ||
paper | "Leveraging Large Language Models in Code Question Answering: Baselines and Issues" [ ] [ ] | ||
paper | "Deep learning driven natural languages text to SQL query conversion: A survey", 2022-08, arXiv, [ ] | ||
paper | "Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect", 2022-08, COLING 2022, [ ] | ||
paper | "A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions", 2022-08, arXiv, [ ] | ||
paper | "A survey on deep learning approaches for text-to-SQL", 2023-01, VLDB J., [ ] | ||
paper | "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning" [ ] [ ] | ||
paper | "Improving Text-to-SQL Evaluation Methodology" [ ] [ ] | ||
paper | "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task" [ ] [ ] | ||
paper | "SParC: Cross-Domain Semantic Parsing in Context" [ ] [ ] | ||
paper | "Text-to-SQL Generation for Question Answering on Electronic Medical Records" [ ] [ ] | ||
paper | "CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases" [ ] [ ] | ||
paper | "Dataset and Enhanced Model for Eligibility Criteria-to-SQL Semantic Parsing" [ ] [ ] | ||
paper | "On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries" [ ] [ ] | ||
paper | "Structure-Grounded Pretraining for Text-to-SQL" [ ] [ ] | ||
paper | "Towards Robustness of Text-to-SQL Models against Synonym Substitution" [ ] [ ] | ||
paper | "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data" [ ] [ ] | ||
paper | "KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers" [ ] [ ] | ||
paper | "Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization" [ ] [ ] | ||
paper | "Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment" [ ] [ ] | ||
paper | "Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs" [ ] [ ] | ||
paper | "XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations" [ ] [ ] | ||
paper | "EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records" [ ] | ||
paper | "BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain" [ ] [ ] | ||
paper | "MultiSQL: A Schema-Integrated Context-Dependent Text2SQL Dataset with Diverse SQL Operations" [ ] [ ] | ||
paper | "BEAVER: An Enterprise Benchmark for Text-to-SQL" [ ] | ||
paper | "PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous and Unanswerable Queries" [ ] | ||
paper | "BIS: NL2SQL Service Evaluation Benchmark for Business Intelligence Scenarios" [ ] [ ] | ||
paper | "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows" [ ] [ ] | ||
paper | "Unsupervised Translation of Programming Languages" [ ] [ ] | ||
paper | "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [ ] [ ] | ||
paper | "AVATAR: A Parallel Corpus for Java-Python Program Translation" [ ] [ ] | ||
paper | "Multilingual Code Snippets Training for Program Translation" [ ] [ ] | ||
paper | "XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence" [ ] [ ] | ||
paper | "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [ ] [ ] | ||
paper | "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [ ] [ ] | ||
paper | "On the Evaluation of Neural Code Translation: Taxonomy and Benchmark" [ ] [ ] | ||
paper | "CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation" [ ] [ ] | ||
paper | "Escalating LLM-based Code Translation Benchmarking into the Class-level Era" [ ] | ||
paper | "Neural Program Repair: Systems, Challenges and Solutions", 2022-02, Internetware 2022, [ ] | ||
paper | "A Survey of Learning-based Automated Program Repair", 2023-01, arXiv, [ ] | ||
paper | "A Survey on Automated Program Repair Techniques", 2023-03, arXiv, [ ] | ||
paper | "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs" [ ] [ ] | ||
paper | "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs" [ ] [ ] | ||
paper | "Discovering Bug Patterns in JavaScript" [ ] [ ] | ||
paper | "DeepFix: Fixing Common C Language Errors by Deep Learning" [ ] [ ] | ||
paper | "DeepFix: Fixing Common C Language Errors by Deep Learning" [ ] [ ] | ||
paper | "QuixBugs: a multi-lingual program repair benchmark set based on the quixey challenge" [ ] [ ] | ||
paper | "Bugs.jar: a large-scale, diverse dataset of real-world Java bugs" [ ] [ ] | ||
paper | "An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation" [ ] [ ] | ||
paper | "Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies" [ ] [ ] | ||
paper | "On Learning Meaningful Code Changes via Neural Machine Translation" [ ] [ ] | ||
paper | "BugsJS: a Benchmark of JavaScript Bugs" [ ] [ ] | ||
paper | "BugSwarm: mining and continuously growing a dataset of reproducible failures and fixes" [ ] [ ] | ||
paper | "Graph-based mining of in-the-wild, fine-grained, semantic code change patterns" [ ] [ ] | ||
paper | "How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset" [ ] [ ] | ||
paper | "Re-factoring based program repair applied to programming assignments" [ ] [ ] | ||
paper | "CoCoNuT: combining context-aware neural translation models using ensemble for program repair" [ ] [ ] | ||
paper | "Review4Repair: Code Review Aided Automatic Program Repairing" [ ] [ ] | ||
paper | "BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies" [ ] [ ] | ||
paper | "TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer" [ ] [ ] | ||
paper | "Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size" [ ] [ ] | ||
paper | "TSSB-3M: Mining single statement bugs at massive scale" [ ] [ ] | ||
paper | "FixJS: a dataset of bug-fixing JavaScript commits" [ ] [ ] | ||
paper | "PyTER: Effective Program Repair for Python Type Errors" [ ] [ ] | ||
paper | "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [ ] [ ] | ||
paper | "RunBugRun -- An Executable Dataset for Automated Program Repair" [ ] [ ] | ||
paper | "OctoPack: Instruction Tuning Code Large Language Models" [ ] [ ] | ||
paper | "DebugBench: Evaluating Debugging Capability of Large Language Models" [ ] [ ] | ||
paper | "MdEval: Massively Multilingual Code Debugging" [ ] | ||
paper | "A Survey of Automatic Source Code Summarization", 2022-02, Symmetry, [ ] | ||
paper | "Summarizing Source Code using a Neural Attention Model" [ ] [ ] | ||
paper | "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation" [ ] [ ] | ||
paper | "Deep code comment generation" [ ] [ ] | ||
paper | "Summarizing Source Code with Transferred API Knowledge" [ ] [ ] | ||
paper | "Improving Automatic Source Code Summarization via Deep Reinforcement Learning" [ ] [ ] | ||
paper | "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [ ] [ ] | ||
paper | "OctoPack: Instruction Tuning Code Large Language Models" [ ] [ ] | ||
paper | "Benchmarking Software Vulnerability Detection Techniques: A Survey", 2023-03, arXiv, [ ] | ||
paper | "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection" [ ] [ ] | ||
paper | "Cross-Project Transfer Representation Learning for Vulnerable Function Discovery" [ ] [ ] | ||
paper | "Automated Vulnerability Detection in Source Code Using Deep Representation Learning" [ ] [ ] | ||
paper | "SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities" [ ] [ ] | ||
paper | "A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software" [ ] [ ] | ||
paper | "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks" [ ] [ ] | ||
paper | "Software Vulnerability Discovery via Learning Multi-Domain Knowledge Bases" [ ] [ ] | ||
paper | "Global Relational Models of Source Code" [ ] [ ] | ||
paper | "μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection" [ ] [ ] | ||
paper | "Deep Learning-Based Vulnerable Function Detection: A Benchmark" [ ] [ ] | ||
paper | "Deep Learning based Vulnerability Detection: Are We There Yet?" [ ] [ ] | ||
paper | "A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries" [ ] [ ] | ||
paper | "D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis" [ ] [ ] | ||
paper | "Self-Supervised Bug Detection and Repair" [ ] [ ] | ||
paper | "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software" [ ] [ ] | ||
paper | "CrossVul: a cross-language vulnerability dataset with commit data" [ ] [ ] | ||
paper | "DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection" [ ] [ ] | ||
paper | "Limits of Machine Learning for Automatic Vulnerability Detection" [ ] [ ] | ||
paper | "How Far Have We Gone in Vulnerability Detection Using Large Language Models" [ ] [ ] | ||
paper | "Vulnerability Detection with Code Language Models: How Far Are We?" [ ] | ||
paper | "VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models" [ ] [ ] | ||
paper | "CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?" [ ] [ ] | ||
paper | "A Survey of Deep Code Search", 2023-05, arXiv, [ ] | ||
paper | "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" [ ] [ ] | ||
paper | "Deep Code Search" [ ] [ ] | ||
paper | "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow" [ ] [ ] | ||
paper | "Neural Code Search Evaluation Dataset" [ ] [ ] | ||
paper | "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [ ] [ ] | ||
paper | "Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries" [ ] [ ] | ||
paper | "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent" [ ] [ ] | ||
paper | "Deep Graph Matching and Searching for Semantic Code Retrieval" [ ] [ ] | ||
paper | "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [ ] [[data]] | ||
paper | "CoSQA: 20,000+ Web Queries for Code Search and Question Answering" [ ] [ ] | ||
paper | "ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search" [ ] [ ] | ||
paper | "CoSQA+: Enhancing Code Search Dataset with Matching Code" [ ] [ ] | ||
paper | "CoIR: A Comprehensive Benchmark for Code Information Retrieval Models" [ ] [ ] | ||
paper | "What can Large Language Models Capture about Code Functional Equivalence?" [ ] | ||
paper | "TypeWriter: Neural Type Prediction with Search-based Validation" [ ] [ ] | ||
paper | "Typilus: Neural Type Hints" [ ] [ ] | ||
paper | "LambdaNet: Probabilistic Type Inference using Graph Neural Networks" [ ] [ ] | ||
paper | "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference" [ ] [ ] | ||
paper | "ManyTypes4TypeScript: a comprehensive TypeScript dataset for sequence-based type inference" [ ] [ ] | ||
paper | "Do Machine Learning Models Produce TypeScript Types That Type Check?" [ ] [ ] | ||
paper | "TypeT5: Seq2seq Type Inference using Static Analysis" [ ] [ ] | ||
paper | "Type Prediction With Program Decomposition and Fill-in-the-Type Training" [ ] [ ] | ||
paper | "On the Evaluation of Commit Message Generation Models: An Experimental Study", 2021-07, ICSME 2021, [ ] | ||
paper | "Towards Automatic Generation of Short Summaries of Commits" [ ] [ ] | ||
paper | "A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes" [ ] [ ] | ||
paper | "Automatically Generating Commit Messages from Diffs using Neural Machine Translation" [ ] [ ] | ||
paper | "Neural-machine-translation-based commit message generation: how far are we?" [ ] [ ] | ||
paper | "Generating commit messages from diffs using pointer-generator network" [ ] [[data( | ||
paper | "Commit message generation for source code changes" [ ] [ ] | ||
paper | "ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking" [ ] [ ] | ||
paper | "CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model" [ ] [ ] | ||
paper | "On the Evaluation of Commit Message Generation Models: An Experimental Study" [ ] [ ] | ||
paper | "Context-aware Retrieval-based Deep Commit Message Generation" [ ] [ ] | ||
paper | "Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models" [ ] [ ] | ||
paper | "From Commit Message Generation to History-Aware Commit Message Completion" [ ] [ ] | ||
paper | "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation" [ ] [ ] | ||
paper | "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems" [ ] [ ] | ||
paper | "Guiding Language Models of Code with Global Context using Monitors" [ ] [ ] | ||
paper | "RepoFusion: Training Code Models to Understand Your Repository" [ ] [ ] | ||
paper | "BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models" [ ] [ ] | ||
paper | "CodePlan: Repository-level Coding using LLMs and Planning" [ ] [ ] | ||
paper | "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [ ] [ ] | ||
paper | "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion" [ ] [ ] | ||
paper | "EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories" [ ] [ ] | ||
paper | "DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories" [ ] [ ] | ||
paper | "Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench" [ ] [ ] | ||
paper | "Towards more realistic evaluation of LLM-based code generation: an experimental study and beyond" [ ] [ ] | ||
paper | "REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark" [ ] | ||
paper | "RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale" [ ] [ ] | ||
paper | "SWE-bench-java: A GitHub Issue Resolving Benchmark for Java" [ ] [ ] | ||
paper | "Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?" [ ] [ ] | ||
paper | "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" [ ] [ ] | ||
paper | "SWE-Bench+: Enhanced Coding Benchmark for LLMs" [ ] [ ] | ||
paper | "DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models" [ ] [ ] | ||
paper | "Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'" [ ] | ||
paper | "M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation" [ ] [ ] | ||
Awesome-Code-LLM / 9. Recommended Readings
- Neural Machine Translation by Jointly Learning to Align and Translate
- Neural Machine Translation of Rare Words with Subword Units
- Attention Is All You Need (scaled dot-product attention; see the sketch after this list)
- Mixed Precision Training
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- Improving Language Understanding by Generative Pre-Training
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Language Models are Unsupervised Multitask Learners
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Language Models are Few-Shot Learners
- Measuring Massive Multitask Language Understanding
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling
- LoRA: Low-Rank Adaptation of Large Language Models
- Finetuned Language Models Are Zero-Shot Learners
- Multitask Prompted Training Enables Zero-Shot Task Generalization
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Training language models to follow instructions with human feedback
- Training Compute-Optimal Large Language Models
- PaLM: Scaling Language Modeling with Pathways
- Large Language Models are Zero-Shot Reasoners
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Emergent Abilities of Large Language Models
- Scaling Instruction-Finetuned Language Models
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- Self-Instruct: Aligning Language Models with Self-Generated Instructions
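Most of the readings above build on the Transformer. As a quick reference, here is a minimal single-head NumPy sketch of the scaled dot-product attention defined in "Attention Is All You Need", softmax(QK^T / sqrt(d_k)) V; the function name and toy shapes are illustrative, and real implementations add batching, multiple heads, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head attention: Q and K have shape (seq_len, d_k); V has shape (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max to stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # each output row is a weighted sum of values

# Example: 4 tokens with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```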