evals

Benchmarking framework

Evals is a framework for evaluating large language models (LLMs) and LLM-based systems, together with an open-source registry of benchmarks.
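For orientation, here is a rough sketch of what driving the framework looks like in practice. This is not taken from the page above: the JSONL field names (`input`, `ideal`) and the `oaieval` command are assumptions based on the upstream project's documented basic-eval format, and the file and eval names are purely illustrative.

```python
# Sketch (assumed layout): build a tiny exact-match dataset in the JSONL
# format that openai/evals' basic match-style evals are documented to read,
# i.e. an "input" chat transcript plus an "ideal" reference answer.
import json
from pathlib import Path

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the capital city only."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with the capital city only."},
            {"role": "user", "content": "What is the capital of Japan?"},
        ],
        "ideal": "Tokyo",
    },
]

out_path = Path("capital_cities.jsonl")
with out_path.open("w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

print(f"wrote {len(samples)} samples to {out_path}")

# Once the dataset is referenced from a registry YAML entry, a run would be
# launched from the command line, e.g.:
#   oaieval gpt-3.5-turbo <your-eval-name>
```

The "registry of benchmarks" in the description refers to this kind of packaged dataset-plus-configuration: the repository ships many such evals that can be run or extended in the same way.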

GitHub

15k stars
262 watching
3k forks
Language: Python
Last commit: about 2 months ago
Linked from 4 awesome lists


Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
| explodinggradients/ragas | A toolkit for evaluating and optimizing Large Language Model applications with data-driven insights. | 7,233 |
| eleutherai/lm-evaluation-harness | Provides a unified framework to test generative language models on various evaluation tasks. | 6,970 |
| confident-ai/deepeval | A framework for evaluating large language models. | 3,669 |
| ianarawjo/chainforge | An environment for battle-testing prompts to Large Language Models (LLMs) to evaluate response quality and performance. | 2,334 |
| allenai/olmo-eval | An evaluation framework for large language models. | 310 |
| prometheus-eval/prometheus-eval | An open-source framework that enables language model evaluation using Prometheus and GPT-4. | 796 |
| openai/baselines | High-quality implementations of reinforcement learning algorithms for research and development purposes. | 15,810 |
| evidentlyai/evidently | An observability framework for evaluating and monitoring the performance of machine learning models and data pipelines. | 5,391 |
| relari-ai/continuous-eval | Provides a comprehensive framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics. | 446 |
| hegelai/prompttools | A set of tools for testing and evaluating natural language processing models and vector databases. | 2,708 |
| open-compass/vlmevalkit | A toolkit for evaluating large vision-language models on various benchmarks and datasets. | 1,343 |
| mlgroupjlu/llm-eval-survey | A repository of papers and resources for evaluating large language models. | 1,433 |
| giskard-ai/giskard | Automates detection and evaluation of performance, bias, and security issues in AI applications. | 4,071 |