Awesome-Code-LLM

Code benchmarks

A curated list of language modeling research for code and software engineering activities.

[TMLR] A curated list of language modeling research for code (and other software engineering activities), plus related datasets.

GitHub

2k stars
53 watching
114 forks
last commit: about 1 month ago
Linked from 1 awesome list

ai awesome datasets llm nlp papers software-engineering survey tmlr

Awesome-Code-LLM / 5. Methods/Models for Downstream Tasks / Code QA

paper "DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production" [2024-12] [ ]

Awesome-Code-LLM / 8. Datasets / 8.2 Benchmarks

paper "NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System" [ ] [ ]
paper "Mapping Language to Code in Programmatic Context" [ ] [ ]
paper "JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation" [ ] [ ]
paper "Measuring Coding Challenge Competence With APPS" [ ] [ ]
paper "Evaluating Large Language Models Trained on Code" [ ] [ ]
paper "Program Synthesis with Large Language Models" [ ] [ ] [ ]
paper "PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context" [ ] [ ]
paper "Training and Evaluating a Jupyter Notebook Data Science Assistant" [ ] [ ]
paper "Competition-Level Code Generation with AlphaCode" [ ] [ ]
paper "MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages" [ ] [ ]
paper "AixBench: A Code Generation Benchmark Dataset" [ ] [ ]
paper "MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation" [ ] [ ]
paper "Multi-lingual Evaluation of Code Generation Models" [ ] [ ]
paper "Execution-based Evaluation for Data Science Code Generation Models" [ ] [ ]
paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation" [ ] [ ]
paper "Execution-Based Evaluation for Open-Domain Code Generation" [ ] [ ]
paper "CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models" [ ] [ ]
paper "XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [ ] [ ]
paper "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [ ] [ ]
paper "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" [ ] [ ]
paper "StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code" [ ] [ ]
paper "OctoPack: Instruction Tuning Code Large Language Models" [ ] [ ]
paper "Guiding Language Models of Code with Global Context using Monitors" [ ] [ ]
paper "CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models" [ ] [ ]
paper "VerilogEval: Evaluating Large Language Models for Verilog Code Generation" [ ] [ ]
paper "ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks" [ ] [ ]
paper "TACO: Topics in Algorithmic COde generation dataset" [ ] [ ]
paper "PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs" [ ] [ ]
paper "Can Large Language Models Write Parallel Code?" [ ] [ ]
paper "OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models" [ ] [ ]
paper "HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization" [ ] [ ]
paper "Can Language Models Solve Olympiad Programming?" [ ] [ ]
paper "PECC: Problem Extraction and Coding Challenges" [ ] [ ]
paper "Constrained Decoding for Secure Code Generation" [ ] [ ]
paper "NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts" [ ] [ ]
paper "MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation" [ ] [ ]
paper "VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation" [ ]
paper "AICoderEval: Improving AI Domain Code Generation of Large Language Models" [ ] [ ]
paper "VersiCode: Towards Version-controllable Code Generation" [ ] [ ]
paper "ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation" [ ]
paper "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions" [ ] [ ]
paper "CodeUpdateArena: Benchmarking Knowledge Editing on API Updates" [ ] [ ]
paper "On Leakage of Code Generation Evaluation Datasets" [ ] [ ]
paper "NoviCode: Generating Programs from Natural Language Utterances by Novices" [ ] [ ]
paper "Case2Code: Learning Inductive Reasoning with Synthetic Data" [ ] [ ]
paper "SciCode: A Research Coding Benchmark Curated by Scientists" [ ] [ ]
paper "Generating Unseen Code Tests In Infinitum" [ ]
paper "WebApp1K: A Practical Code-Generation Benchmark for Web App Development" [ ] [ ]
paper "CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow" [ ] [ ]
paper "DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation" [ ] [ ]
paper "ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code" [ ] [ ]
paper "Contextualized Data-Wrangling Code Generation in Computational Notebooks" [ ] [ ]
paper "Evaluation of Code LLMs on Geospatial Code Generation" [ ] [ ]
paper "mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation" [ ] [ ]
paper "Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists" [ ] [ ]
paper "GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models" [ ] [ ]
paper "One-to-many testing for code generation from (just) natural language" [ ] [ ]
paper "LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation" [ ]
paper "Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code" [ ] [ ]
paper "Evaluating and Aligning CodeLLMs on Human Preference" [ ] [ ]
paper "Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar" [ ] [ ]
paper "MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems" [ ] [ ]
paper "Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots" [ ] [ ]
paper "ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation" [ ] [ ]
paper "HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks" [ ] [ ]
paper "TurtleBench: A Visual Programming Benchmark in Turtle Geometry" [ ] [ ]
paper "BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks" [ ] [ ]
paper "CodeQA: A Question Answering Dataset for Source Code Comprehension" [ ] [ ]
paper "CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course" [ ] [ ]
paper "CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models" [ ] [ ]
paper "CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution" [ ] [ ]
paper "Multiple-Choice Questions are Efficient and Robust LLM Evaluators" [ ] [ ]
paper "Aligning LLMs through Multi-perspective User Preference Ranking-based Feedback for Programming Question Answering" [ ] [ ]
paper "RepoQA: Evaluating Long Context Code Understanding" [ ] [ ]
paper "CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution" [ ] [ ]
paper "SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications" [ ] [ ]
paper "CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs" [ ] [ ]
paper "Leveraging Large Language Models in Code Question Answering: Baselines and Issues" [ ] [ ]
paper "ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges" [ ] [ ]
paper "Deep learning driven natural languages text to SQL query conversion: A survey", 2022-08, arXiv, [ ]
paper "Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect", 2022-08, COLING 2022, [ ]
paper "A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions", 2022-08, arXiv, [ ]
paper "A survey on deep learning approaches for text-to-SQL", 2023-01, VLDB J., [ ]
paper "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning" [ ] [ ]
paper "Improving Text-to-SQL Evaluation Methodology" [ ] [ ]
paper "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task" [ ] [ ]
paper "SParC: Cross-Domain Semantic Parsing in Context" [ ] [ ]
paper "Text-to-SQL Generation for Question Answering on Electronic Medical Records" [ ] [ ]
paper "CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases" [ ] [ ]
paper "Dataset and Enhanced Model for Eligibility Criteria-to-SQL Semantic Parsing" [ ] [ ]
paper "On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries" [ ] [ ]
paper "Structure-Grounded Pretraining for Text-to-SQL" [ ] [ ]
paper "Towards Robustness of Text-to-SQL Models against Synonym Substitution" [ ] [ ]
paper "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data" [ ] [ ]
paper "KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers" [ ] [ ]
paper "Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization" [ ] [ ]
paper "Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment" [ ] [ ]
paper "Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs" [ ] [ ]
paper "XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations" [ ] [ ]
paper "EHR-SeqSQL: A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records" [ ]
paper "BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain" [ ] [ ]
paper "MultiSQL: A Schema-Integrated Context-Dependent Text2SQL Dataset with Diverse SQL Operations" [ ] [ ]
paper "BEAVER: An Enterprise Benchmark for Text-to-SQL" [ ]
paper "PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous and Unanswerable Queries" [ ]
paper "BIS: NL2SQL Service Evaluation Benchmark for Business Intelligence Scenarios" [ ] [ ]
paper "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows" [ ] [ ]
paper "Unsupervised Translation of Programming Languages" [ ] [ ]
paper "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [ ] [ ]
paper "AVATAR: A Parallel Corpus for Java-Python Program Translation" [ ] [ ]
paper "Multilingual Code Snippets Training for Program Translation" [ ] [ ]
paper "XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence" [ ] [ ]
paper "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [ ] [ ]
paper "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [ ] [ ]
paper "On the Evaluation of Neural Code Translation: Taxonomy and Benchmark" [ ] [ ]
paper "CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation" [ ] [ ]
paper "Escalating LLM-based Code Translation Benchmarking into the Class-level Era" [ ]
paper "Repository-level Code Translation Benchmark Targeting Rust" [ ] [ ]
paper "Neural Program Repair: Systems, Challenges and Solutions", 2022-02, Internetware 2022, [ ]
paper "A Survey of Learning-based Automated Program Repair", 2023-01, arXiv, [ ]
paper "A Survey on Automated Program Repair Techniques", 2023-03, arXiv, [ ]
paper "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs" [ ] [ ]
paper "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs" [ ] [ ]
paper "Discovering Bug Patterns in JavaScript" [ ] [ ]
paper "DeepFix: Fixing Common C Language Errors by Deep Learning" [ ] [ ]
paper "QuixBugs: a multi-lingual program repair benchmark set based on the quixey challenge" [ ] [ ]
paper "Bugs.jar: a large-scale, diverse dataset of real-world Java bugs" [ ] [ ]
paper "An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation" [ ] [ ]
paper "Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies" [ ] [ ]
paper "On Learning Meaningful Code Changes via Neural Machine Translation" [ ] [ ]
paper "BugsJS: a Benchmark of JavaScript Bugs" [ ] [ ]
paper "BugSwarm: mining and continuously growing a dataset of reproducible failures and fixes" [ ] [ ]
paper "Graph-based mining of in-the-wild, fine-grained, semantic code change patterns" [ ] [ ]
paper "How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset" [ ] [ ]
paper "Re-factoring based program repair applied to programming assignments" [ ] [ ]
paper "CoCoNuT: combining context-aware neural translation models using ensemble for program repair" [ ] [ ]
paper "Review4Repair: Code Review Aided Automatic Program Repairing" [ ] [ ]
paper "BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies" [ ] [ ]
paper "TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer" [ ] [ ]
paper "Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size" [ ] [ ]
paper "TSSB-3M: Mining single statement bugs at massive scale" [ ] [ ]
paper "FixJS: a dataset of bug-fixing JavaScript commits" [ ] [ ]
paper "PyTER: Effective Program Repair for Python Type Errors" [ ] [ ]
paper "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [ ] [ ]
paper "RunBugRun -- An Executable Dataset for Automated Program Repair" [ ] [ ]
paper "OctoPack: Instruction Tuning Code Large Language Models" [ ] [ ]
paper "DebugBench: Evaluating Debugging Capability of Large Language Models" [ ] [ ]
paper "MdEval: Massively Multilingual Code Debugging" [ ]
paper "A Survey of Automatic Source Code Summarization", 2022-02, Symmetry, [ ]
paper "Summarizing Source Code using a Neural Attention Model" [ ] [ ]
paper "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation" [ ] [ ]
paper "Deep code comment generation" [ ] [ ]
paper "Summarizing Source Code with Transferred API Knowledge" [ ] [ ]
paper "Improving Automatic Source Code Summarization via Deep Reinforcement Learning" [ ] [ ]
paper "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [ ] [ ]
paper "OctoPack: Instruction Tuning Code Large Language Models" [ ] [ ]
paper "Benchmarking Software Vulnerability Detection Techniques: A Survey", 2023-03, arXiv, [ ]
paper "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection" [ ] [ ]
paper "Cross-Project Transfer Representation Learning for Vulnerable Function Discovery" [ ] [ ]
paper "Automated Vulnerability Detection in Source Code Using Deep Representation Learning" [ ] [ ]
paper "SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities" [ ] [ ]
paper "A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software" [ ] [ ]
paper "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks" [ ] [ ]
paper "Software Vulnerability Discovery via Learning Multi-Domain Knowledge Bases" [ ] [ ]
paper "Global Relational Models of Source Code" [ ] [ ]
paper "μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection" [ ] [ ]
paper "Deep Learning-Based Vulnerable Function Detection: A Benchmark" [ ] [ ]
paper "Deep Learning based Vulnerability Detection: Are We There Yet?" [ ] [ ]
paper "A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries" [ ] [ ]
paper "D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis" [ ] [ ]
paper "Self-Supervised Bug Detection and Repair" [ ] [ ]
paper "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software" [ ] [ ]
paper "CrossVul: a cross-language vulnerability dataset with commit data" [ ] [ ]
paper "DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection" [ ] [ ]
paper "Limits of Machine Learning for Automatic Vulnerability Detection" [ ] [ ]
paper "How Far Have We Gone in Vulnerability Detection Using Large Language Models" [ ] [ ]
paper "Vulnerability Detection with Code Language Models: How Far Are We?" [ ]
paper "VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models" [ ] [ ]
paper "CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?" [ ] [ ]
paper "CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics" [ ] [ ]
paper "A Survey of Deep Code Search", 2023-05, arXiv, [ ]
paper "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" [ ] [ ]
paper "Deep Code Search" [ ] [ ]
paper "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow" [ ] [ ]
paper "Neural Code Search Evaluation Dataset" [ ] [ ]
paper "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [ ] [ ]
paper "Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries" [ ] [ ]
paper "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent" [ ] [ ]
paper "Deep Graph Matching and Searching for Semantic Code Retrieval" [ ] [ ]
paper "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [ ] [ ]
paper "CoSQA: 20,000+ Web Queries for Code Search and Question Answering" [ ] [ ]
paper "ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search" [ ] [ ]
paper "CoSQA+: Enhancing Code Search Dataset with Matching Code" [ ] [ ]
paper "CoIR: A Comprehensive Benchmark for Code Information Retrieval Models" [ ] [ ]
paper "What can Large Language Models Capture about Code Functional Equivalence?" [ ]
paper "TypeWriter: Neural Type Prediction with Search-based Validation" [ ] [ ]
paper "Typilus: Neural Type Hints" [ ] [ ]
paper "LambdaNet: Probabilistic Type Inference using Graph Neural Networks" [ ] [ ]
paper "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference" [ ] [ ]
paper "ManyTypes4TypeScript: a comprehensive TypeScript dataset for sequence-based type inference" [ ] [ ]
paper "Do Machine Learning Models Produce TypeScript Types That Type Check?" [ ] [ ]
paper "TypeT5: Seq2seq Type Inference using Static Analysis" [ ] [ ]
paper "Type Prediction With Program Decomposition and Fill-in-the-Type Training" [ ] [ ]
paper "On the Evaluation of Commit Message Generation Models: An Experimental Study", 2021-07, ICSME 2021, [ ]
paper "Towards Automatic Generation of Short Summaries of Commits" [ ] [ ]
paper "A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes" [ ] [ ]
paper "Automatically Generating Commit Messages from Diffs using Neural Machine Translation" [ ] [ ]
paper "Neural-machine-translation-based commit message generation: how far are we?" [ ] [ ]
paper "Generating commit messages from diffs using pointer-generator network" [ ] [ ]
paper "Commit message generation for source code changes" [ ] [ ]
paper "ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking" [ ] [ ]
paper "CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model" [ ] [ ]
paper "On the Evaluation of Commit Message Generation Models: An Experimental Study" [ ] [ ]
paper "Context-aware Retrieval-based Deep Commit Message Generation" [ ] [ ]
paper "Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models" [ ] [ ]
paper "From Commit Message Generation to History-Aware Commit Message Completion" [ ] [ ]
paper "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation" [ ] [ ]
paper "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems" [ ] [ ]
paper "Guiding Language Models of Code with Global Context using Monitors" [ ] [ ]
paper "RepoFusion: Training Code Models to Understand Your Repository" [ ] [ ]
paper "BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models" [ ] [ ]
paper "CodePlan: Repository-level Coding using LLMs and Planning" [ ] [ ]
paper "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [ ] [ ]
paper "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion" [ ] [ ]
paper "EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories" [ ] [ ]
paper "DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories" [ ] [ ]
paper "Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench" [ ] [ ]
paper "Towards more realistic evaluation of LLM-based code generation: an experimental study and beyond" [ ] [ ]
paper "REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark" [ ]
paper "RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale" [ ] [ ]
paper "SWE-bench-java: A GitHub Issue Resolving Benchmark for Java" [ ] [ ]
paper "Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?" [ ] [ ]
paper "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" [ ] [ ]
paper "SWE-Bench+: Enhanced Coding Benchmark for LLMs" [ ] [ ]
paper "DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models" [ ] [ ]
paper "Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'" [ ]
paper "M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation" [ ] [ ]
paper "A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models" [ ] [ ]
paper "Commit0: Library Generation from Scratch" [ ] [ ]
Neural Machine Translation by Jointly Learning to Align and Translate
Neural Machine Translation of Rare Words with Subword Units
Attention Is All You Need
Mixed Precision Training
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Improving Language Understanding by Generative Pre-Training
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Language Models are Unsupervised Multitask Learners
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Language Models are Few-Shot Learners
Measuring Massive Multitask Language Understanding
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
LoRA: Low-Rank Adaptation of Large Language Models
Finetuned Language Models Are Zero-Shot Learners
Multitask Prompted Training Enables Zero-Shot Task Generalization
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Training language models to follow instructions with human feedback
Training Compute-Optimal Large Language Models
PaLM: Scaling Language Modeling with Pathways
Large Language Models are Zero-Shot Reasoners
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Emergent Abilities of Large Language Models
Scaling Instruction-Finetuned Language Models
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Self-Instruct: Aligning Language Models with Self-Generated Instructions
