evals

Benchmarking framework

Evals is a framework for evaluating large language models (LLMs) and LLM systems, together with an open-source registry of benchmarks.

GitHub

15k stars

264 watching

3k forks

Language: Python

Last commit: 10 months ago

Linked from 4 awesome lists

Related projects:

Repository	Description	Stars
open-evals/evals	A framework for evaluating OpenAI models and an open-source registry of benchmarks.	19
openai/simple-evals	Evaluates language models using standardized benchmarks and prompting techniques.	2,059
explodinggradients/ragas	A toolkit for evaluating and optimizing Large Language Model applications with objective metrics, test data generation, and seamless integrations.	7,598
eleutherai/lm-evaluation-harness	Provides a unified framework to test generative language models on various evaluation tasks.	7,200
confident-ai/deepeval	A framework for evaluating large language models	4,003
ianarawjo/chainforge	An environment for battle-testing prompts to Large Language Models (LLMs) to evaluate response quality and performance.	2,413
allenai/olmo-eval	A framework for evaluating language models on NLP tasks	326
prometheus-eval/prometheus-eval	An open-source framework that enables language model evaluation using Prometheus and GPT-4.	820
openai/baselines	High-quality implementations of reinforcement learning algorithms for research and development purposes	15,885
evidentlyai/evidently	An observability framework for evaluating and monitoring the performance of machine learning models and data pipelines	5,519
relari-ai/continuous-eval	Provides a comprehensive framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics	455
hegelai/prompttools	A set of tools for testing and evaluating natural language processing models and vector databases.	2,731
open-compass/vlmevalkit	An evaluation toolkit for large vision-language models	1,514
mlgroupjlu/llm-eval-survey	A repository of papers and resources for evaluating large language models.	1,450
giskard-ai/giskard	Automates the detection of performance, bias, and security issues in AI applications	4,125