ceval
Evaluation suite
An evaluation suite of multiple-choice questions spanning various disciplines, with tools for assessing the performance of foundation models.
Official GitHub repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
2k stars
14 watching
79 forks
Language: Python
Last commit: about 1 year ago
Related projects:
Repository | Description | Stars |
---|---|---|
princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models. | 85 |
ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types | 714 |
open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks. | 19 |
psycoy/mixeval | An evaluation suite and dynamic data release platform for large language models | 230 |
dfki-nlp/gevalm | Evaluates German transformer language models with syntactic agreement tests | 7 |
obss/jury | A comprehensive toolkit for evaluating NLP experiments offering automated metrics and efficient computation. | 187 |
culiver/space | A framework for evaluating contribution of individual clients in federated learning systems. | 7 |
maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,350 |
pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance in embodied decision-making tasks | 99 |
kentcdodds/preval.macro | A build-time code evaluation tool for JavaScript | 127 |
mshukor/evalign-icl | Evaluating and improving large multimodal models through in-context learning | 21 |
yuweihao/mm-vet | Evaluates the capabilities of large multimodal models using a set of diverse tasks and metrics | 274 |
codefuse-ai/codefuse-devops-eval | An evaluation suite for assessing foundation models in the DevOps field. | 690 |
openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques. | 2,059 |
chenllliang/mmevalpro | A benchmarking framework for evaluating Large Multimodal Models by providing rigorous metrics and an efficient evaluation pipeline. | 22 |