simple-evals

Model Evaluator

A library for evaluating language models using standardized prompts and benchmarking tests.
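
For context, a standardized-prompt evaluation usually boils down to running a fixed set of prompts through a model and grading the completions against references. The sketch below illustrates that pattern only; the names (`run_eval`, `PROMPT_TEMPLATE`, the model callable) are hypothetical placeholders and are not simple-evals' actual API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical sketch of a standardized-prompt benchmark loop.
# None of these names come from simple-evals; they only illustrate the idea.

PROMPT_TEMPLATE = "Answer with a single letter.\n\nQuestion: {question}\nAnswer:"

def run_eval(
    model: Callable[[str], str],          # model(prompt) -> completion
    examples: Sequence[Tuple[str, str]],  # (question, reference answer) pairs
) -> float:
    """Score a model on a fixed set of prompts and return accuracy."""
    correct = 0
    for question, reference in examples:
        prompt = PROMPT_TEMPLATE.format(question=question)
        answer = model(prompt).strip()
        # Exact-match grading; real benchmarks often use regex or model-based grading.
        correct += int(answer.lower().startswith(reference.lower()))
    return correct / len(examples)

if __name__ == "__main__":
    # Toy model that always answers "B", just to show the call pattern.
    accuracy = run_eval(lambda prompt: "B", [("2 + 2 = ? (A) 3 (B) 4", "B")])
    print(f"accuracy: {accuracy:.2f}")
```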

GitHub

Stars: 2k
Watchers: 28
Forks: 165
Language: Python
Last commit: 24 days ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
| allenai/olmo-eval | An evaluation framework for large language models. | 311 |
| huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance. | 2,034 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods. | 528 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,526 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline. | 22 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,349 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers. | 483 |
| ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types. | 708 |
| open-compass/vlmevalkit | A toolkit for evaluating large vision-language models on various benchmarks and datasets. | 1,343 |
| pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models on embodied decision-making tasks. | 100 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation. | 188 |
| evolvinglmms-lab/lmms-eval | Tools and an evaluation suite for large multimodal models. | 2,058 |
| edublancas/sklearn-evaluation | A tool for evaluating and visualizing machine learning model performance. | 3 |
| dtcenter/metplus | A Python scripting infrastructure for evaluating and visualizing meteorological model performance. | 98 |