evals

Model evaluation framework

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
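
For context, the project's documentation describes custom evals as subclasses of `evals.Eval` that are registered in a YAML registry and run with the `oaieval` CLI. The sketch below follows that documented pattern; the class name, dataset path, and sample fields are illustrative assumptions, and exact helper signatures should be checked against the repository.

```python
import random

import evals
import evals.metrics


class ArithmeticMatch(evals.Eval):
    """Illustrative eval: prompt the model with a problem and check for an exact answer."""

    def __init__(self, test_jsonl, **kwargs):
        super().__init__(**kwargs)
        # The dataset path is assumed to be supplied via the registry YAML entry.
        self.test_jsonl = test_jsonl

    def eval_sample(self, sample, rng: random.Random):
        # Each JSONL line is assumed to look like {"problem": "2+2=", "answer": "4"}.
        result = self.completion_fn(prompt=sample["problem"], max_tokens=4)
        sampled = result.get_completions()[0]

        # Record a "match" event comparing the sampled text to the expected answer.
        evals.record_and_check_match(
            prompt=sample["problem"],
            sampled=sampled,
            expected=sample["answer"],
        )

    def run(self, recorder):
        test_samples = evals.get_jsonl(self.test_jsonl)
        self.eval_all_samples(recorder, test_samples)  # invokes eval_sample per sample
        return {
            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
        }
```

Once registered, an eval like this would typically be invoked from the command line as `oaieval <model> <eval-name>`, per the project README.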

GitHub

19 stars
1 watching
3 forks
Language: Python
Last commit: over 1 year ago
Linked from 1 awesome list

Related projects:

Repository | Description | Stars
openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939
modelscope/evalscope | A framework for efficient large model evaluation and performance benchmarking. | 248
allenai/olmo-eval | An evaluation framework for large language models. | 311
huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance. | 2,034
prometheus-eval/prometheus-eval | An open-source framework that enables language model evaluation using Prometheus and GPT-4. | 796
flageval-baai/flageval | An evaluation toolkit and platform for assessing large models in various domains. | 300
aiverify-foundation/llm-evals-catalogue | A collaborative catalogue of Large Language Model evaluation frameworks and papers. | 14
declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods. | 528
hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance. | 1,636
cloud-cv/evalai | A platform for comparing and evaluating AI and machine learning algorithms at scale. | 1,771
codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field. | 685
evolvinglmms-lab/lmms-eval | Tools and an evaluation suite for large multimodal models. | 2,058
open-compass/vlmevalkit | A toolkit for evaluating large vision-language models on various benchmarks and datasets. | 1,343
allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning. | 437
maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,349