evaluate

Model Evaluator

An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance.

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

GitHub: 2k stars, 48 watching, 258 forks
Language: Python
Last commit: 2 months ago
Linked from 2 awesome lists

Topics: evaluation, machine-learning
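
As a rough illustration of the library's intended workflow, the sketch below loads a metric from the Hub and computes it on a few toy predictions; the metric choice and the sample values are illustrative, not taken from this page.

```python
# Minimal sketch of the 🤗 Evaluate workflow (assumes `pip install evaluate`);
# the "accuracy" metric and the toy data below are illustrative placeholders.
import evaluate

# Load a metric by name; "accuracy" is one of the built-in metrics on the Hub.
accuracy = evaluate.load("accuracy")

# Compare model predictions against reference labels.
result = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0],
)
print(result)  # e.g. {'accuracy': 0.75}
```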

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| modelscope/evalscope | A framework for efficient large model evaluation and performance benchmarking | 248 |
| huggingface/lighteval | A toolkit for evaluating large language models across multiple backends | 804 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests | 1,939 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline | 22 |
| edublancas/sklearn-evaluation | A tool for evaluating and visualizing machine learning model performance | 3 |
| allenai/olmo-eval | An evaluation framework for large language models | 310 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods | 528 |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 188 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,347 |
| evolvinglmms-lab/lmms-eval | Tools and evaluation suite for large multimodal models | 2,058 |
| tsb0601/mmvp | An evaluation framework for multimodal language models' visual capabilities using image and question benchmarks | 288 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models | 1,526 |
| mlabonne/llm-autoeval | A tool to automate the evaluation of large language models in Google Colab using various benchmarks and custom parameters | 558 |
| stanford-crfm/helm | A framework to evaluate and compare language models by analyzing their performance on various tasks | 1,947 |