M3Exam
LM Benchmark
A benchmark for evaluating large language models across multiple languages and formats.
Data and code for the paper "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models".
92 stars
9 watching
12 forks
Language: Python
Last commit: over 1 year ago
Topics: ai-education, chatgpt, evaluation, gpt-4, large-language-models, llms, multilingual, multimodal
Related projects:
Repository | Description | Stars |
---|---|---|
damo-nlp-mt/polylm | A polyglot large language model designed to address limitations in current LLM research and provide better multilingual instruction-following capability. | 76 |
damo-nlp-sg/llm-zoo | A collection of information about various large language models used in natural language processing | 272 |
qcri/llmebench | A benchmarking framework for large language models | 80 |
aifeg/benchlmm | An open-source benchmarking framework for evaluating cross-style visual capability of large multimodal models | 83 |
pleisto/yuren-baichuan-7b | A multi-modal large language model that integrates natural language and visual capabilities with fine-tuning for various tasks | 72 |
junyangwang0410/amber | An LLM-free benchmark suite for evaluating hallucination in multimodal LLMs across various tasks and dimensions | 93 |
ray-project/llmperf | A tool for evaluating the performance of large language model APIs | 641 |
mlgroupjlu/llm-eval-survey | A repository of papers and resources for evaluating large language models | 1,433 |
deeplangai/lingowhale-8b | An open bilingual LLM based on the LingoWhale model, trained on a large corpus of high-quality text and fine-tuned for tasks such as conversation generation | 134 |
km1994/llmsninestorydemontower | Exploring various LLMs and their applications in natural language processing and related areas | 1,798 |
bobazooba/xllm | A tool for training and fine-tuning large language models using advanced techniques | 380 |
bilibili/index-1.9b | A lightweight, multilingual language model with a long context length | 904 |
damoebius/haxebench | A benchmarking project comparing the performance of different programming languages and their compiled outputs in various formats | 52 |
deepseek-ai/deepseek-moe | A Mixture-of-Experts large language model with improved efficiency and performance compared to models of similar scale | 1,006 |
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315 |