M3Exam
LM Benchmark
A benchmark for evaluating large language models across multiple languages, modalities, and educational levels
Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models"
93 stars
9 watching
12 forks
Language: Python
last commit: over 1 year ago
Topics: ai-education, chatgpt, evaluation, gpt-4, large-language-models, llms, multilingual, multimodal
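The benchmark's questions are multiple-choice and the repository's evaluation code is written in Python. As a rough, minimal sketch of how a benchmark like this is typically scored (this is not the repository's actual API; the JSON schema, file name, and `ask_model` callable are all hypothetical assumptions):

```python
import json
import string

def load_questions(path: str) -> list[dict]:
    # Hypothetical schema: [{"question": str, "options": [str, ...], "answer": int}, ...]
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def build_prompt(item: dict) -> str:
    # Render one question as a zero-shot multiple-choice prompt.
    lines = [item["question"]]
    for letter, option in zip(string.ascii_uppercase, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def score(items: list[dict], ask_model) -> float:
    # `ask_model` is any user-supplied callable mapping a prompt string to a model reply.
    correct = 0
    for item in items:
        reply = ask_model(build_prompt(item)).strip()
        predicted = reply[:1].upper()  # first character of the reply, taken as the chosen letter
        expected = string.ascii_uppercase[item["answer"]]
        correct += predicted == expected
    return correct / len(items)

# Usage (hypothetical file name and model wrapper):
# accuracy = score(load_questions("english_questions.json"), my_llm)
```

Per-language or per-level accuracy would then come from running `score` once per data split.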
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| | A polyglot large language model designed to address limitations in current LLM research and provide better multilingual instruction-following capability | 77 |
| | A collection of information about various large language models used in natural language processing | 272 |
| | A benchmarking framework for large language models | 81 |
| | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models | 84 |
| | A multimodal large language model that integrates natural-language and visual capabilities, with fine-tuning for various tasks | 73 |
| | An LLM-free benchmark suite for evaluating hallucination in MLLMs across various tasks and dimensions | 98 |
| | A tool for evaluating the performance of large language model APIs | 678 |
| | A repository of papers and resources for evaluating large language models | 1,450 |
| | An open bilingual LLM built on the LingoWhale model, trained on a large dataset of high-quality Chinese and English text and fine-tuned for tasks such as conversation generation | 134 |
| | Exploring various LLMs and their applications in natural language processing and related areas | 1,854 |
| | A tool for training and fine-tuning large language models using advanced techniques | 387 |
| | A lightweight, multilingual language model with a long context length | 920 |
| | A benchmarking project comparing the performance of different programming languages and their compiled outputs in various formats | 52 |
| | A large language model with improved efficiency and performance compared to similar models | 1,024 |
| | A benchmark for evaluating large language models' ability to process multimodal input | 322 |