FlagEval

Model evaluation framework

FlagEval is an evaluation toolkit and platform for assessing large AI foundation models across a range of domains and tasks.
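
FlagEval's own Python API is not documented in this listing, so the snippet below is only a generic, minimal sketch of what an evaluation toolkit of this kind automates: running a model over benchmark items and aggregating a metric. BenchmarkItem, DummyModel, and evaluate are hypothetical stand-ins for illustration, not FlagEval code.

# Minimal, generic sketch of a model-evaluation loop (NOT FlagEval's actual API).
# BenchmarkItem, DummyModel, and evaluate() are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    prompt: str
    reference: str


class DummyModel:
    """Stand-in for a real foundation model; returns a canned answer."""

    def generate(self, prompt: str) -> str:
        return "Paris" if "capital of France" in prompt else "unknown"


def evaluate(model: DummyModel, items: list[BenchmarkItem]) -> float:
    """Exact-match accuracy over a list of benchmark items."""
    correct = sum(model.generate(it.prompt).strip() == it.reference for it in items)
    return correct / len(items)


if __name__ == "__main__":
    benchmark = [
        BenchmarkItem("What is the capital of France?", "Paris"),
        BenchmarkItem("What is 2 + 2?", "4"),
    ]
    print(f"accuracy = {evaluate(DummyModel(), benchmark):.2f}")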

GitHub

307 stars
13 watching
27 forks
Language: Python
Last commit: 6 months ago
Linked from 1 awesome list

Related projects:

Repository Description Stars
flagai-open/aquila2 Provides pre-trained language models and tools for fine-tuning and evaluation 439
allenai/olmo-eval A framework for evaluating language models on NLP tasks 326
huggingface/lighteval An all-in-one toolkit for evaluating large language models (LLMs) across multiple backends 879
open-evals/evals A framework for evaluating OpenAI models and an open-source registry of benchmarks 19
modelscope/evalscope A framework for efficiently evaluating and benchmarking large models 308
huggingface/evaluate An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance 2,063
openai/simple-evals Evaluates language models using standardized benchmarks and prompting techniques 2,059
psycoy/mixeval An evaluation suite and dynamic data release platform for large language models 230
stanford-crfm/helm A framework to evaluate and compare language models by analyzing their performance on various tasks 1,981
baaivision/emu A multimodal generative model framework 1,672
chenllliang/mmevalpro A benchmarking framework for evaluating large multimodal models by providing rigorous metrics and an efficient evaluation pipeline 22
declare-lab/instruct-eval An evaluation framework for large language models trained with instruction tuning methods 535
aiverify-foundation/llm-evals-catalogue A collaborative catalogue of LLM evaluation frameworks and papers 13
ukgovernmentbeis/inspect_ai A framework for evaluating large language models 669
maluuba/nlg-eval A toolset for evaluating and comparing natural language generation models 1,350