LLM-Evals-Catalogue

A collaborative catalogue of LLM evaluation frameworks, benchmarks and papers.

This repository stems from our paper, “Cataloguing LLM Evaluations”, and serves as a living, collaborative catalogue of LLM evaluation frameworks, benchmarks and papers.

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| mlgroupjlu/llm-eval-survey | A repository of papers and resources for evaluating large language models. | 1,450 |
| relari-ai/continuous-eval | A comprehensive framework for evaluating LLM applications and pipelines with customizable metrics. | 455 |
| h2oai/h2o-llm-eval | An evaluation framework for large language models with an Elo rating system and A/B testing capabilities (see the Elo sketch after this table). | 50 |
| huggingface/lighteval | An all-in-one toolkit for evaluating large language models (LLMs) across multiple backends. | 879 |
| evolvinglmms-lab/lmms-eval | Tools and an evaluation framework for accelerating the development of large multimodal models by providing an efficient way to assess their performance. | 2,164 |
| modelscope/evalscope | A framework for efficiently evaluating and benchmarking large models. | 308 |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
| freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks. | 56 |
| aifeg/benchlmm | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models. | 84 |
| mlabonne/llm-autoeval | A tool to automate the evaluation of large language models in Google Colab using various benchmarks and custom parameters. | 566 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks. | 326 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods. | 535 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline. | 22 |
| psycoy/mixeval | An evaluation suite and dynamic data release platform for large language models. | 230 |
| volcengine/verl | A flexible RL training framework designed for large language models. | 427 |
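
The h2oai/h2o-llm-eval entry above mentions Elo ratings for pairwise A/B comparisons between models. As a generic illustration of how such a rating update works (this is a minimal sketch of the standard Elo formula, not the API of any framework in this catalogue; the model names and K-factor below are hypothetical):

```python
# Minimal sketch of Elo-style pairwise scoring for LLM A/B comparisons.
# Generic illustration only; not tied to any framework listed above.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison (score_a: 1.0 win, 0.5 tie, 0.0 loss for A)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

if __name__ == "__main__":
    # Hypothetical models, both starting at 1000.
    ratings = {"model-a": 1000.0, "model-b": 1000.0}
    # Outcomes of pairwise judgements: 1.0 = model-a preferred, 0.0 = model-b preferred.
    for outcome in [1.0, 1.0, 0.0, 1.0, 0.5]:
        ratings["model-a"], ratings["model-b"] = update_elo(
            ratings["model-a"], ratings["model-b"], outcome
        )
    print(ratings)
```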