CharXiv
Chart eval
An evaluation suite for assessing chart understanding in multimodal large language models.
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
85 stars
3 watching
9 forks
Language: Python
Last commit: 4 months ago
Topics: benchmark, chart-understanding, machine-learning, multimodal, vision-language-model
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | An evaluation suite providing multiple-choice questions for foundation models across various disciplines, with tools for assessing model performance | 1,650 |
| | Evaluates foundation models on human-centric tasks with diverse exams and question types | 714 |
| | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance on embodied decision-making tasks | 99 |
| | A platform for comparing and evaluating AI and machine learning algorithms at scale | 1,779 |
| | A toolset for evaluating and comparing natural language generation models | 1,350 |
| | Evaluates and improves large multimodal models through in-context learning | 21 |
| | A benchmark for evaluating large language models' ability to process multimodal input | 322 |
| | An open-source benchmarking project that evaluates large multimodal models' code-generation capabilities via visually grounded chart-to-code conversion | 95 |
| | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 187 |
| | Develops a large-scale dataset and benchmark for training multimodal chart-understanding models using large language models | 87 |
| | Evaluates and aligns the values of Chinese large language models with safety and responsibility standards | 481 |
| | An evaluation toolkit for large vision-language models | 1,514 |
| | Evaluates the legal knowledge of large language models using a custom benchmarking framework | 273 |
| | An API for evaluating the quality of text prompts for large language models (LLMs) based on perplexity estimation | 43 |
| | Evaluates language models using standardized benchmarks and prompting techniques | 2,059 |