C-Eval

Evaluation suite

An evaluation suite of multiple-choice questions across diverse disciplines for foundation models, with tools for assessing model performance.

Official GitHub repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023].

2k stars
15 watching
78 forks
Language: Python
Last commit: about 1 year ago
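Multiple-choice suites like C-Eval typically score a model by comparing its chosen option letter (A/B/C/D) against the gold answer and reporting accuracy. As a minimal illustrative sketch (the function name `score_mcq` is hypothetical and not part of the C-Eval codebase), that scoring step might look like:

```python
def score_mcq(predictions, answers):
    """Accuracy for multiple-choice predictions.

    predictions, answers: equal-length lists of option letters
    such as "A", "B", "C", "D" (case-insensitive).
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must be the same length")
    if not answers:
        return 0.0
    # Count exact letter matches, ignoring case and stray whitespace.
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Example: 3 of 4 predictions match the gold answers -> accuracy 0.75.
acc = score_mcq(["A", "B", "C", "D"], ["A", "B", "C", "A"])
```

The real harness additionally handles prompt construction and answer extraction from model output; this sketch covers only the final comparison.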

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models | 75 |
| ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types | 708 |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19 |
| psycoy/mixeval | An evaluation suite and dynamic data release platform for large language models | 224 |
| dfki-nlp/gevalm | Evaluates German transformer language models with syntactic agreement tests | 7 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 188 |
| culiver/space | A framework for evaluating the contribution of individual clients in federated learning systems | 6 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,347 |
| pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance in embodied decision-making tasks | 100 |
| kentcdodds/preval.macro | A build-time code evaluation tool for JavaScript | 127 |
| mshukor/evalign-icl | Evaluates and improves large multimodal models through in-context learning | 20 |
| yuweihao/mm-vet | Evaluates the capabilities of large multimodal models using a set of diverse tasks and metrics | 267 |
| codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field | 685 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests | 1,939 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline | 22 |