M3Exam
LM Benchmark
A benchmark for evaluating large language models across multiple languages, modalities, and educational levels
Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models"
93 stars
9 watching
12 forks
Language: Python
last commit: over 1 year ago
Topics: ai-education, chatgpt, evaluation, gpt-4, large-language-models, llms, multilingual, multimodal
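The benchmark's questions are multiple-choice and the repository's evaluation code is written in Python. As a rough, minimal sketch of how a benchmark like this is typically scored (this is not the repository's actual API; the JSON schema, file name, and `ask_model` callable are all hypothetical assumptions):

```python
import json
import string

def load_questions(path: str) -> list[dict]:
    # Hypothetical schema: [{"question": str, "options": [str, ...], "answer": int}, ...]
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def build_prompt(item: dict) -> str:
    # Render one question as a zero-shot multiple-choice prompt.
    lines = [item["question"]]
    for letter, option in zip(string.ascii_uppercase, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def score(items: list[dict], ask_model) -> float:
    # `ask_model` is any user-supplied callable mapping a prompt string to a model reply.
    correct = 0
    for item in items:
        reply = ask_model(build_prompt(item)).strip()
        predicted = reply[:1].upper()  # first character of the reply, taken as the chosen letter
        expected = string.ascii_uppercase[item["answer"]]
        correct += predicted == expected
    return correct / len(items)

# Usage (hypothetical file name and model wrapper):
# accuracy = score(load_questions("english_questions.json"), my_llm)
```

Per-language or per-level accuracy would then come from running `score` once per data split.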
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| | A polyglot large language model designed to address limitations in current LLM research and provide better multilingual instruction-following capability | 77 |
| | A collection of information about various large language models used in natural language processing | 272 |
| | A benchmarking framework for large language models | 81 |
| | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models | 84 |
| | A multimodal large language model that integrates natural-language and visual capabilities, with fine-tuning for various tasks | 73 |
| | An LLM-free benchmark suite for evaluating hallucination in MLLMs across various tasks and dimensions | 98 |
| | A tool for evaluating the performance of large language model APIs | 678 |
| | A repository of papers and resources for evaluating large language models | 1,450 |
| | An open bilingual LLM built on the LingoWhale model, trained on a large dataset of high-quality Chinese and English text and fine-tuned for tasks such as conversation generation | 134 |
| | Exploring various LLMs and their applications in natural language processing and related areas | 1,854 |
| | A tool for training and fine-tuning large language models using advanced techniques | 387 |
| | A lightweight, multilingual language model with a long context length | 920 |
| | A benchmarking project comparing the performance of different programming languages and their compiled outputs in various formats | 52 |
| | A large language model with improved efficiency and performance compared to similar models | 1,024 |
| | A benchmark for evaluating large language models' ability to process multimodal input | 322 |