evals
Benchmarking framework
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
15k stars
264 watching
3k forks
Language: Python
Last commit: about 1 year ago
Linked from 4 awesome lists
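As a rough illustration of the registry workflow, the sketch below prepares a small dataset in the JSONL format used by the simple match-style eval templates (a chat-formatted `input` plus an `ideal` answer). The file path and eval name are illustrative placeholders, not taken from the repository; consult the repo's own docs for the exact registry layout.

```python
"""Minimal sketch: building a samples.jsonl for a match-style eval.

Assumptions (not taken from the repo verbatim): the dataset location
"my_eval/samples.jsonl" and the eval name are hypothetical; the
"input"/"ideal" keys follow the documented basic match format.
"""
import json
from pathlib import Path

# Each line of the JSONL file is one sample: a chat-style "input"
# (the prompt sent to the model) and an "ideal" completion that the
# match template compares the model's answer against.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
]

out_path = Path("my_eval/samples.jsonl")  # hypothetical location
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# After registering the dataset in a registry YAML entry (see the
# repo's examples), the eval would typically be run from the CLI with
# something like: oaieval gpt-3.5-turbo <your-eval-name>
```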
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
| | Evaluates language models using standardized benchmarks and prompting techniques. | 2,059 |
| | A toolkit for evaluating and optimizing Large Language Model applications with objective metrics, test data generation, and seamless integrations. | 7,598 |
| | Provides a unified framework to test generative language models on various evaluation tasks. | 7,200 |
| | A framework for evaluating large language models. | 4,003 |
| | An environment for battle-testing prompts to Large Language Models (LLMs) to evaluate response quality and performance. | 2,413 |
| | A framework for evaluating language models on NLP tasks. | 326 |
| | An open-source framework that enables language model evaluation using Prometheus and GPT-4. | 820 |
| | High-quality implementations of reinforcement learning algorithms for research and development purposes. | 15,885 |
| | An observability framework for evaluating and monitoring the performance of machine learning models and data pipelines. | 5,519 |
| | Provides a comprehensive framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics. | 455 |
| | A set of tools for testing and evaluating natural language processing models and vector databases. | 2,731 |
| | An evaluation toolkit for large vision-language models. | 1,514 |
| | A repository of papers and resources for evaluating large language models. | 1,450 |
| | Automates the detection of performance, bias, and security issues in AI applications. | 4,125 |