ceval

Evaluation suite

An evaluation suite of multiple-choice questions for foundation models across many disciplines, with tools for assessing model performance; a minimal scoring sketch follows the repository stats below.

The official GitHub repository for C-Eval, a Chinese evaluation suite for foundation models (NeurIPS 2023).

GitHub stats:
2k stars
14 watching
79 forks
Language: Python
Last commit: about 1 year ago
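
Usage: C-Eval distributes its multiple-choice questions per subject, and each question carries four options (A through D) plus a labeled answer on the validation split (test-split answers are withheld for the leaderboard). The snippet below is a minimal sketch of scoring a model on one subject, assuming the Hugging Face mirror `ceval/ceval-exam`; `answer_question` is a hypothetical stand-in for an actual model call.

```python
# Minimal sketch: accuracy on one C-Eval subject's validation split.
# Assumes the Hugging Face mirror "ceval/ceval-exam" (per-subject configs);
# answer_question is a hypothetical placeholder, not part of C-Eval.
from datasets import load_dataset


def answer_question(question: str, choices: dict[str, str]) -> str:
    """Hypothetical model call: return one of 'A', 'B', 'C', 'D'."""
    raise NotImplementedError("plug in your model here")


def evaluate_subject(subject: str = "computer_network") -> float:
    # Each subject is its own dataset configuration; the val split
    # includes the "answer" column needed for self-served scoring.
    ds = load_dataset("ceval/ceval-exam", subject, split="val")
    correct = 0
    for row in ds:
        choices = {k: row[k] for k in ("A", "B", "C", "D")}
        if answer_question(row["question"], choices) == row["answer"]:
            correct += 1
    return correct / len(ds)
```

To report a suite-level number, the same loop would be repeated over all subject configurations and the per-subject accuracies averaged.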

Related projects:

| Repository | Description | Stars |
|---|---|---|
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models. | 85 |
| ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types. | 714 |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
| psycoy/mixeval | An evaluation suite and dynamic data release platform for large language models. | 230 |
| dfki-nlp/gevalm | Evaluates German transformer language models with syntactic agreement tests. | 7 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation. | 187 |
| culiver/space | A framework for evaluating the contribution of individual clients in federated learning systems. | 7 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,350 |
| pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance in embodied decision-making tasks. | 99 |
| kentcdodds/preval.macro | A build-time code evaluation tool for JavaScript. | 127 |
| mshukor/evalign-icl | Evaluating and improving large multimodal models through in-context learning. | 21 |
| yuweihao/mm-vet | Evaluates the capabilities of large multimodal models using a set of diverse tasks and metrics. | 274 |
| codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field. | 690 |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques. | 2,059 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline. | 22 |