h2o-LLM-eval
LLM evaluator
An evaluation framework for large language models with an Elo rating system and A/B testing capabilities
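For context, an Elo leaderboard of this kind is typically driven by pairwise A/B comparisons between models: each head-to-head result nudges the two ratings toward the observed outcome. The sketch below illustrates the standard Elo update; the function names, K-factor, and starting ratings are illustrative assumptions, not taken from the h2o-LLM-eval code.

```python
# Minimal sketch of an Elo update for pairwise model comparisons.
# Constants and names are assumptions for illustration only.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one A/B comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins one comparison.
print(update_elo(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```

A larger K-factor makes the leaderboard react faster to new comparisons at the cost of noisier ratings; frameworks of this type usually tune it or average over many battles.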
50 stars
37 watching
1 fork
Language: Jupyter Notebook
last commit: 3 months ago

Related projects:

Repository | Description | Stars |
---|---|---|
allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326 |
mlgroupjlu/llm-eval-survey | A repository of papers and resources for evaluating large language models | 1,450 |
h2oai/mli-resources | Provides tools and techniques for interpreting machine learning models | 483 |
relari-ai/continuous-eval | Provides a comprehensive framework for evaluating Large Language Model (LLM) applications and pipelines with customizable metrics | 455 |
h2oai/h2o-3 | An in-memory machine learning platform that supports various algorithms and provides tools for building, deploying, and scaling machine learning models | 6,950 |
evolvinglmms-lab/lmms-eval | Tools and evaluation framework for accelerating the development of large multimodal models by providing an efficient way to assess their performance | 2,164 |
h2oai/article-information-2019 | A framework for building and evaluating machine learning systems with high accuracy and interpretability, particularly in human-centered applications | 13 |
declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods | 535 |
aiverify-foundation/llm-evals-catalogue | A collaborative catalogue of LLM evaluation frameworks and papers | 13 |
h2oai/h2o-2 | An analytics engine that provides fast and scalable predictive modeling capabilities for big data | 2,224 |
mlabonne/llm-autoeval | A tool to automate the evaluation of large language models in Google Colab using various benchmarks and custom parameters | 566 |
tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models | 1,568 |
h2oai/h2o-flow | An interactive computing environment for machine learning and data analysis | 134 |
huggingface/lighteval | An all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends | 879 |
chenllliang/mmevalpro | A benchmarking framework for evaluating Large Multimodal Models by providing rigorous metrics and an efficient evaluation pipeline | 22 |