GAOKAO-Bench
Model testing framework
An evaluation framework that uses questions from the Chinese GAOKAO (national college entrance examination) as a dataset to assess the capabilities of large language models.
565 stars
4 watching
42 forks
Language: Python
last commit: 10 months ago

Related projects:
| Repository | Description | Stars |
|---|---|---|
| felixgithub2017/mmcu | Measures the understanding of massive multitask Chinese datasets using large language models | 87 |
| flagai-open/aquila2 | Provides pre-trained language models and tools for fine-tuning and evaluation | 439 |
| freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks | 56 |
| zjunlp/knowlm | A framework for training and utilizing large language models with knowledge augmentation capabilities | 1,251 |
| opengvlab/lamm | A framework and benchmark for training and evaluating multi-modal large language models, supporting the development of AI agents for seamless human-machine interaction | 305 |
| aifeg/benchlmm | An open-source benchmark for evaluating the cross-style visual capabilities of large multimodal models | 84 |
| openlmlab/openchinesellama | An incrementally pre-trained Chinese large language model based on LLaMA-7B | 234 |
| matthewhammer/motoko-bigtest | A testing framework for long-running tests written in a domain-specific language | 12 |
| open-compass/lawbench | Evaluates the legal knowledge of large language models with a custom benchmark | 273 |
| letmeno1/aki | A cross-platform desktop testing framework that automates GUI interactions via accessibility APIs and the JNA library | 34 |
| google/paxml | A framework for configuring and running machine learning experiments on top of JAX | 461 |
| johnsnowlabs/langtest | A tool for testing and evaluating large language models, focused on AI safety and model assessment | 506 |
| pku-yuangroup/video-bench | Benchmarks the video-understanding capabilities of large language models | 121 |
| yuliang-liu/monkey | An end-to-end image captioning system built on large multimodal models, with tools for training, inference, and demos | 1,849 |
| yuweihao/mm-vet | Evaluates the capabilities of large multimodal models across a diverse set of tasks and metrics | 274 |