bigcode-evaluation-harness

Code evaluation framework

A framework for evaluating autoregressive code generation language models in terms of their accuracy and robustness.

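The harness executes model-generated code against unit-test benchmarks (e.g. HumanEval, MBPP) and reports functional-correctness metrics such as pass@k. As an illustration of what that metric computes, below is a minimal Python sketch of the standard unbiased pass@k estimator from Chen et al. (2021); it is not the harness's own code, and the function name and example numbers are hypothetical.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021).
        n: completions sampled per problem, c: completions passing the tests, k: budget."""
        if n - c < k:
            # Too few failing samples to fill a size-k subset: every draw of k contains a pass.
            return 1.0
        # pass@k = 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Hypothetical example: 200 samples per problem, 37 pass, evaluation budget k = 10.
    print(round(pass_at_k(200, 37, 10), 4))

Sampling more completions per problem (n) than the budget (k) reduces the variance of this estimate, which is why such evaluations typically generate many samples per task.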

GitHub

846 stars
12 watching
225 forks
Language: Python
Last commit: 3 months ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
bigcode-project/starcoder2 | Trains models to generate code in multiple programming languages | 1,808
modelscope/evalscope | A framework for efficiently evaluating and benchmarking large models | 308
flageval-baai/flageval | An evaluation toolkit and platform for assessing large models in various domains | 307
princeton-nlp/intercode | An interactive code environment framework for evaluating language agents through execution feedback | 198
bin123apple/autocoder | An AI model designed to generate and execute code automatically | 816
codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field | 690
relari-ai/continuous-eval | A framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics | 455
open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19
quantifiedcode/quantifiedcode | A code analysis and automation platform | 111
ukgovernmentbeis/inspect_ai | A framework for evaluating large language models | 669
allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326
quantifiedcode/python-anti-patterns | A collection of common Python coding mistakes and poor practices | 1,716
huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance | 2,063
nvlabs/verilog-eval | An evaluation harness for generating Verilog code from natural language prompts | 188
budecosystem/code-millenials | An open-source code generation model with a HumanEval score reported to be comparable to GPT-4 and Google's proprietary models | 20