C-Eval

Evaluation suite

An evaluation suite of multiple-choice questions across diverse disciplines for foundation models, with tools for assessing model performance.

Official GitHub repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023].

2k stars
15 watching
78 forks
Language: Python
Last commit: about 1 year ago
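Multiple-choice suites like C-Eval typically score a model by comparing its chosen option letter (A/B/C/D) against the gold answer and reporting accuracy. As a minimal illustrative sketch (the function name `score_mcq` is hypothetical and not part of the C-Eval codebase), that scoring step might look like:

```python
def score_mcq(predictions, answers):
    """Accuracy for multiple-choice predictions.

    predictions, answers: equal-length lists of option letters
    such as "A", "B", "C", "D" (case-insensitive).
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must be the same length")
    if not answers:
        return 0.0
    # Count exact letter matches, ignoring case and stray whitespace.
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Example: 3 of 4 predictions match the gold answers -> accuracy 0.75.
acc = score_mcq(["A", "B", "C", "D"], ["A", "B", "C", "A"])
```

The real harness additionally handles prompt construction and answer extraction from model output; this sketch covers only the final comparison.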

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models | 75 |
| ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types | 708 |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19 |
| psycoy/mixeval | An evaluation suite and dynamic data release platform for large language models | 224 |
| dfki-nlp/gevalm | Evaluates German transformer language models with syntactic agreement tests | 7 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 188 |
| culiver/space | A framework for evaluating the contribution of individual clients in federated learning systems | 6 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,347 |
| pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance in embodied decision-making tasks | 100 |
| kentcdodds/preval.macro | A build-time code evaluation tool for JavaScript | 127 |
| mshukor/evalign-icl | Evaluates and improves large multimodal models through in-context learning | 20 |
| yuweihao/mm-vet | Evaluates the capabilities of large multimodal models using a set of diverse tasks and metrics | 267 |
| codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field | 685 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests | 1,939 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline | 22 |