LLM-Evals-Catalogue

A collaborative catalogue of LLM evaluation frameworks, benchmarks and papers.

This repository stems from our paper, “Cataloguing LLM Evaluations”, and serves as a living, collaborative catalogue of LLM evaluation frameworks, benchmarks and papers.

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| mlgroupjlu/llm-eval-survey | A repository of papers and resources for evaluating large language models. | 1,450 |
| relari-ai/continuous-eval | A comprehensive framework for evaluating LLM applications and pipelines with customizable metrics. | 455 |
| h2oai/h2o-llm-eval | An evaluation framework for large language models with an Elo rating system and A/B testing capabilities (see the Elo sketch after this table). | 50 |
| huggingface/lighteval | An all-in-one toolkit for evaluating large language models (LLMs) across multiple backends. | 879 |
| evolvinglmms-lab/lmms-eval | Tools and an evaluation framework for accelerating the development of large multimodal models by providing an efficient way to assess their performance. | 2,164 |
| modelscope/evalscope | A framework for efficiently evaluating and benchmarking large models. | 308 |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
| freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks. | 56 |
| aifeg/benchlmm | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models. | 84 |
| mlabonne/llm-autoeval | A tool to automate the evaluation of large language models in Google Colab using various benchmarks and custom parameters. | 566 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks. | 326 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods. | 535 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline. | 22 |
| psycoy/mixeval | An evaluation suite and dynamic data release platform for large language models. | 230 |
| volcengine/verl | A flexible RL training framework designed for large language models. | 427 |
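
The h2oai/h2o-llm-eval entry above mentions Elo ratings for pairwise A/B comparisons between models. As a generic illustration of how such a rating update works (this is a minimal sketch of the standard Elo formula, not the API of any framework in this catalogue; the model names and K-factor below are hypothetical):

```python
# Minimal sketch of Elo-style pairwise scoring for LLM A/B comparisons.
# Generic illustration only; not tied to any framework listed above.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison (score_a: 1.0 win, 0.5 tie, 0.0 loss for A)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

if __name__ == "__main__":
    # Hypothetical models, both starting at 1000.
    ratings = {"model-a": 1000.0, "model-b": 1000.0}
    # Outcomes of pairwise judgements: 1.0 = model-a preferred, 0.0 = model-b preferred.
    for outcome in [1.0, 1.0, 0.0, 1.0, 0.5]:
        ratings["model-a"], ratings["model-b"] = update_elo(
            ratings["model-a"], ratings["model-b"], outcome
        )
    print(ratings)
```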