bigcode-evaluation-harness

Code evaluation framework

A framework for the evaluation of autoregressive code generation language models, measuring both the accuracy and the robustness of the code they produce.
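Harnesses of this kind typically score sampled generations with the unbiased pass@k estimator popularized by the Codex paper: generate n completions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A minimal Python sketch of that estimator follows; the function name and example numbers are illustrative, not this harness's own code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of them correct) passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 pass the tests
print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")   # 0.185
print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")
```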

GitHub

818 stars
12 watching
218 forks
Language: Python
Last commit: 22 days ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
bigcode-project/starcoder2 | Trains models to generate code in multiple programming languages | 1,786
modelscope/evalscope | A framework for efficient large model evaluation and performance benchmarking | 248
flageval-baai/flageval | An evaluation toolkit and platform for assessing large models across various domains | 300
princeton-nlp/intercode | An interactive code environment framework for evaluating language agents through execution feedback | 194
bin123apple/autocoder | An AI model designed to generate and execute code automatically | 814
codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field | 685
relari-ai/continuous-eval | A framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics | 446
open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19
quantifiedcode/quantifiedcode | A code analysis and automation platform | 111
ukgovernmentbeis/inspect_ai | A framework for evaluating large language models | 615
allenai/olmo-eval | An evaluation framework for large language models | 310
quantifiedcode/python-anti-patterns | A collection of common Python coding mistakes and poor practices | 1,716
huggingface/evaluate | An evaluation framework for machine learning models and datasets, with standardized metrics and tools for comparing performance | 2,034
nvlabs/verilog-eval | An evaluation harness for generating Verilog code from natural language prompts | 179
budecosystem/code-millenials | A code generation model that produces code on par with leading models | 20