bigcode-evaluation-harness

Code evaluation framework

A framework for evaluating autoregressive code generation language models in terms of their accuracy and robustness.

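The harness executes model-generated code against unit-test benchmarks (e.g. HumanEval, MBPP) and reports functional-correctness metrics such as pass@k. As an illustration of what that metric computes, below is a minimal Python sketch of the standard unbiased pass@k estimator from Chen et al. (2021); it is not the harness's own code, and the function name and example numbers are hypothetical.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021).
        n: completions sampled per problem, c: completions passing the tests, k: budget."""
        if n - c < k:
            # Too few failing samples to fill a size-k subset: every draw of k contains a pass.
            return 1.0
        # pass@k = 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Hypothetical example: 200 samples per problem, 37 pass, evaluation budget k = 10.
    print(round(pass_at_k(200, 37, 10), 4))

Sampling more completions per problem (n) than the budget (k) reduces the variance of this estimate, which is why such evaluations typically generate many samples per task.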

GitHub

846 stars
12 watching
225 forks
Language: Python
Last commit: 3 months ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
bigcode-project/starcoder2 | Trains models to generate code in multiple programming languages | 1,808
modelscope/evalscope | A framework for efficiently evaluating and benchmarking large models | 308
flageval-baai/flageval | An evaluation toolkit and platform for assessing large models in various domains | 307
princeton-nlp/intercode | An interactive code environment framework for evaluating language agents through execution feedback | 198
bin123apple/autocoder | An AI model designed to generate and execute code automatically | 816
codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field | 690
relari-ai/continuous-eval | A framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics | 455
open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19
quantifiedcode/quantifiedcode | A code analysis and automation platform | 111
ukgovernmentbeis/inspect_ai | A framework for evaluating large language models | 669
allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326
quantifiedcode/python-anti-patterns | A collection of common Python coding mistakes and poor practices | 1,716
huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance | 2,063
nvlabs/verilog-eval | An evaluation harness for generating Verilog code from natural language prompts | 188
budecosystem/code-millenials | An open-source code generation model with a HumanEval score reported to be comparable to GPT-4 and Google's proprietary models | 20