evals
Benchmarking framework
Evals is a framework for evaluating large language models (LLMs) and LLM systems, and an open-source registry of benchmarks.
15k stars
262 watching
3k forks
Language: Python
Last commit: about 2 months ago
Linked from 4 awesome lists
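For context, evals are typically launched through the package's `oaieval` command line tool; the Python sketch below simply shells out to it. The model name and eval name are illustrative placeholders, so treat this as a hedged sketch of usage rather than canonical documentation.

```python
# Hedged sketch: run one eval from the registry via the `oaieval` CLI that
# the evals package installs. The completion function (model) and eval name
# below are illustrative; substitute entries that exist in your registry.
import subprocess

result = subprocess.run(
    ["oaieval", "gpt-3.5-turbo", "test-match"],  # <completion_fn> <eval>
    capture_output=True,
    text=True,
)
# The CLI reports per-sample results and an aggregate score; the exact
# output format depends on the eval and the framework version.
print(result.stdout)
print(result.stderr)
```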
Related projects:
| Repository | Description | Stars |
|---|---|---|
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
| explodinggradients/ragas | A toolkit for evaluating and optimizing Large Language Model applications with data-driven insights. | 7,233 |
| eleutherai/lm-evaluation-harness | Provides a unified framework to test generative language models on various evaluation tasks. | 6,970 |
| confident-ai/deepeval | A framework for evaluating large language models. | 3,669 |
| ianarawjo/chainforge | An environment for battle-testing prompts to large language models (LLMs) and evaluating response quality and performance. | 2,334 |
| allenai/olmo-eval | An evaluation framework for large language models. | 310 |
| prometheus-eval/prometheus-eval | An open-source framework for language model evaluation using Prometheus and GPT-4. | 796 |
| openai/baselines | High-quality implementations of reinforcement learning algorithms for research and development. | 15,810 |
| evidentlyai/evidently | An observability framework for evaluating and monitoring the performance of machine learning models and data pipelines. | 5,391 |
| relari-ai/continuous-eval | A comprehensive framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics. | 446 |
| hegelai/prompttools | A set of tools for testing and evaluating natural language processing models and vector databases. | 2,708 |
| open-compass/vlmevalkit | A toolkit for evaluating large vision-language models on various benchmarks and datasets. | 1,343 |
| mlgroupjlu/llm-eval-survey | A repository of papers and resources for evaluating large language models. | 1,433 |
| giskard-ai/giskard | Automates detection and evaluation of performance, bias, and security issues in AI applications. | 4,071 |