CharXiv
Chart eval
An evaluation suite for assessing chart understanding in multimodal large language models.
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
75 stars
3 watching
8 forks
Language: Python
Last commit: about 1 month ago
Topics: benchmark, chart-understanding, machine-learning, multimodal, vision-language-model
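CharXiv pairs chart images drawn from arXiv papers with descriptive and reasoning questions and scores a model's free-form answers. The snippet below is a minimal sketch of that style of evaluation using the OpenAI Python client; the model name, image path, example item, and exact-match scoring are illustrative assumptions, not the repository's actual pipeline, which defines its own prompts, data files, and grading.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read a chart image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def ask_chart_question(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one chart image plus one question to a vision-language model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
                {"type": "text", "text": question + " Answer with a single word or number."},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Hypothetical example item; CharXiv ships its own question/answer files.
item = {
    "image": "images/example_chart.png",
    "question": "What is the peak value of the solid line?",
    "answer": "42",
}

prediction = ask_chart_question(item["image"], item["question"])
# Naive exact-match scoring for illustration; a real benchmark would typically
# use a more tolerant grader (answer normalization or an LLM judge).
correct = prediction.lower() == item["answer"].lower()
print(f"prediction={prediction!r} correct={correct}")
```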
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance. | 1,636 |
| ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types. | 708 |
| pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance in embodied decision-making tasks. | 100 |
| cloud-cv/evalai | A platform for comparing and evaluating AI and machine learning algorithms at scale. | 1,771 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,347 |
| mshukor/evalign-icl | Evaluates and improves large multimodal models through in-context learning. | 20 |
| ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input. | 315 |
| chartmimic/chartmimic | An open-source benchmark that evaluates large multimodal models' code-generation capabilities via visually grounded chart-to-code conversion. | 94 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation. | 188 |
| fuxiaoliu/mmc | A large-scale dataset and benchmark for training multimodal chart-understanding models using large language models. | 84 |
| x-plug/cvalues | Evaluates and aligns the values of Chinese large language models with safety and responsibility standards. | 477 |
| open-compass/vlmevalkit | A toolkit for evaluating large vision-language models on various benchmarks and datasets. | 1,343 |
| open-compass/lawbench | Evaluates the legal knowledge of large language models using a custom benchmarking framework. | 267 |
| krrishdholakia/betterprompt | An API for evaluating the quality of text prompts for large language models (LLMs) based on perplexity estimation. | 38 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |