evals

Model evaluation framework

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
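To give a sense of the kind of workflow the framework automates, below is a minimal sketch of an exact-match evaluation loop: read benchmark samples from a JSONL file, request a completion for each prompt, and compare it to the reference answer. This is an illustration only, not the evals API itself; the sample format (an "input" chat prompt plus an "ideal" answer), the model name, the file name, and the helper function are assumptions made for the example.

```python
# Illustrative sketch only: a simple exact-match eval loop, not the evals framework's API.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_match_eval(samples_path: str, model: str = "gpt-4o-mini") -> float:
    """Score a model by exact match between its completion and the ideal answer."""
    correct = 0
    total = 0
    with open(samples_path) as f:
        for line in f:
            # Assumed sample shape: {"input": [...chat messages...], "ideal": "..."}
            sample = json.loads(line)
            response = client.chat.completions.create(
                model=model,
                messages=sample["input"],
                temperature=0,
            )
            completion = response.choices[0].message.content.strip()
            correct += int(completion == sample["ideal"])
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    print(f"accuracy: {run_match_eval('samples.jsonl'):.2%}")
```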

GitHub

19 stars
1 watching
3 forks
Language: Python
Last commit: almost 2 years ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques | 2,059 |
| modelscope/evalscope | A framework for efficiently evaluating and benchmarking large models | 308 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326 |
| huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance | 2,063 |
| prometheus-eval/prometheus-eval | An open-source framework that enables language model evaluation using Prometheus and GPT-4 | 820 |
| flageval-baai/flageval | An evaluation toolkit and platform for assessing large models in various domains | 307 |
| aiverify-foundation/llm-evals-catalogue | A collaborative catalogue of LLM evaluation frameworks and papers | 13 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods | 535 |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance | 1,650 |
| cloud-cv/evalai | A platform for comparing and evaluating AI and machine learning algorithms at scale | 1,779 |
| codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field | 690 |
| evolvinglmms-lab/lmms-eval | Tools and an evaluation framework for accelerating the development of large multimodal models by providing an efficient way to assess their performance | 2,164 |
| open-compass/vlmevalkit | An evaluation toolkit for large vision-language models | 1,514 |
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning | 459 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,350 |