M3Exam

LM Benchmark

A benchmark for evaluating large language models in multiple languages and formats

Data and code for the paper "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models"

GitHub

92 stars
9 watching
12 forks
Language: Python
last commit: over 1 year ago
Topics: ai-education, chatgpt, evaluation, gpt-4, large-language-models, llms, multilingual, multimodal

Related projects:

| Repository | Description | Stars |
|---|---|---|
| damo-nlp-mt/polylm | A polyglot large language model designed to address limitations in current LLM research and provide better multilingual instruction-following capability | 76 |
| damo-nlp-sg/llm-zoo | A collection of information about various large language models used in natural language processing | 272 |
| qcri/llmebench | A benchmarking framework for large language models | 80 |
| aifeg/benchlmm | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models | 83 |
| pleisto/yuren-baichuan-7b | A multimodal large language model that integrates natural language and visual capabilities, with fine-tuning for various tasks | 72 |
| junyangwang0410/amber | An LLM-free benchmark suite for evaluating hallucination in MLLMs across various tasks and dimensions | 93 |
| ray-project/llmperf | A tool for evaluating the performance of large language model APIs | 641 |
| mlgroupjlu/llm-eval-survey | A repository of papers and resources for evaluating large language models | 1,433 |
| deeplangai/lingowhale-8b | An open bilingual (Chinese-English) LLM based on the LingoWhale model, trained on a large dataset of high-quality text and fine-tuned for specific tasks such as conversation generation | 134 |
| km1994/llmsninestorydemontower | Exploring various LLMs and their applications in natural language processing and related areas | 1,798 |
| bobazooba/xllm | A tool for training and fine-tuning large language models using advanced techniques | 380 |
| bilibili/index-1.9b | A lightweight, multilingual language model with a long context length | 904 |
| damoebius/haxebench | A benchmarking project comparing the performance of different programming languages and their compiled outputs in various formats | 52 |
| deepseek-ai/deepseek-moe | A large language model with improved efficiency and performance compared to similar models | 1,006 |
| ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315 |