bigcode-evaluation-harness
Code evaluation framework
A framework for evaluating autoregressive code generation language models in terms of their accuracy and robustness.
846 stars
12 watching
225 forks
Language: Python
last commit: 12 months ago
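The harness scores generated code by functional correctness, typically reported as pass@k on benchmarks such as HumanEval and MBPP. As an illustration only, the sketch below implements the standard unbiased pass@k estimator from Chen et al. (2021) that harnesses of this kind compute; the `pass_at_k` helper and the sample `results` data are assumptions for the example, not code from the repository.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: samples that passed the problem's unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical per-problem results as (samples generated, samples passing).
results = [(20, 3), (20, 0), (20, 12)]
score = sum(pass_at_k(n, c, k=10) for n, c in results) / len(results)
print(f"pass@10 = {score:.3f}")
```

Averaging the per-problem estimates, as above, mirrors how pass@k is conventionally reported across a benchmark.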
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | Trains models to generate code in multiple programming languages | 1,808 |
| | A framework for efficiently evaluating and benchmarking large models | 308 |
| | An evaluation toolkit and platform for assessing large models in various domains | 307 |
| | An interactive code environment framework for evaluating language agents through execution feedback | 198 |
| | An AI model designed to generate and execute code automatically | 816 |
| | An evaluation suite for assessing foundation models in the DevOps field | 690 |
| | Provides a comprehensive framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics | 455 |
| | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19 |
| | A code analysis and automation platform | 111 |
| | A framework for evaluating large language models | 669 |
| | A framework for evaluating language models on NLP tasks | 326 |
| | A collection of common Python coding mistakes and poor practices | 1,716 |
| | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance | 2,063 |
| | An evaluation harness for generating Verilog code from natural language prompts | 188 |
| | A state-of-the-art open-source code generation model with a HumanEval score comparable to GPT-4 and Google's proprietary models | 20 |