bigcode-evaluation-harness
Code evaluation framework
A framework for evaluating autoregressive code generation language models in terms of their accuracy and robustness.
818 stars
12 watching
218 forks
Language: Python
Last commit: 22 days ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
bigcode-project/starcoder2 | Trains models to generate code in multiple programming languages | 1,786 |
modelscope/evalscope | A framework for efficient large model evaluation and performance benchmarking. | 248 |
flageval-baai/flageval | An evaluation toolkit and platform for assessing large models in various domains | 300 |
princeton-nlp/intercode | An interactive code environment framework for evaluating language agents through execution feedback. | 194 |
bin123apple/autocoder | An AI model designed to generate and execute code automatically | 814 |
codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field. | 685 |
relari-ai/continuous-eval | Provides a comprehensive framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics | 446 |
open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
quantifiedcode/quantifiedcode | A code analysis and automation platform | 111 |
ukgovernmentbeis/inspect_ai | A framework for evaluating large language models | 615 |
allenai/olmo-eval | An evaluation framework for large language models. | 310 |
quantifiedcode/python-anti-patterns | A collection of common Python coding mistakes and poor practices | 1,716 |
huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance. | 2,034 |
nvlabs/verilog-eval | An evaluation harness for generating Verilog code from natural language prompts | 179 |
budecosystem/code-millenials | A state-of-the-art code generation model capable of producing high-quality code on par with other leading models. | 20 |