evals
Model evaluation framework
Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
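As a rough illustration of the registry-driven workflow (a minimal sketch based on the public Evals documentation; the sample file name and eval id below are hypothetical), a basic match-style eval only needs a JSONL file of samples plus a registry entry, after which it can be run with the bundled `oaieval` CLI:

```python
import json

# Minimal sketch of a samples file for a basic "match"-style eval.
# Assumes the JSONL sample format described in the Evals documentation:
# each line carries an "input" chat transcript and an "ideal" answer.
# The file name and eval id used here are illustrative, not from the repo.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]

with open("capital_cities.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# A registry YAML entry (under the framework's registry directory) would then
# point an eval id at this samples file, after which the eval could be run
# with the CLI, e.g.:  oaieval gpt-3.5-turbo capital-cities
```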
19 stars
1 watching
3 forks
Language: Python
Last commit: almost 2 years ago
Linked from 1 awesome list
Related projects:
| Repository | Description | Stars |
|---|---|---|
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques. | 2,059 |
| modelscope/evalscope | A framework for efficiently evaluating and benchmarking large models. | 308 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks. | 326 |
| huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance. | 2,063 |
| prometheus-eval/prometheus-eval | An open-source framework that enables language model evaluation using Prometheus and GPT-4. | 820 |
| flageval-baai/flageval | An evaluation toolkit and platform for assessing large models in various domains. | 307 |
| aiverify-foundation/llm-evals-catalogue | A collaborative catalogue of LLM evaluation frameworks and papers. | 13 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods. | 535 |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance. | 1,650 |
| cloud-cv/evalai | A platform for comparing and evaluating AI and machine learning algorithms at scale. | 1,779 |
| codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field. | 690 |
| evolvinglmms-lab/lmms-eval | An evaluation framework and toolset for efficiently assessing large multimodal models, aimed at accelerating their development. | 2,164 |
| open-compass/vlmevalkit | An evaluation toolkit for large vision-language models. | 1,514 |
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning. | 459 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,350 |