CharXiv

Chart eval

An evaluation suite for assessing chart understanding in multimodal large language models.

[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

GitHub: princeton-nlp/CharXiv

85 stars

3 watching

9 forks

Language: Python

Last commit: over 1 year ago

benchmark, chart-understanding, machine-learning, multimodal, vision-language-model

charxiv.github.io/

Related projects:

Repository	Description	Stars
hkust-nlp/ceval	An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance	1,650
ruixiangcui/agieval	Evaluates foundation models on human-centric tasks with diverse exams and question types	714
pkunlp-icler/pca-eval	An open-source benchmark and evaluation tool for assessing multimodal large language models' performance in embodied decision-making tasks	99
cloud-cv/evalai	A platform for comparing and evaluating AI and machine learning algorithms at scale	1,779
maluuba/nlg-eval	A toolset for evaluating and comparing natural language generation models	1,350
mshukor/evalign-icl	Evaluating and improving large multimodal models through in-context learning	21
ailab-cvc/seed-bench	A benchmark for evaluating large language models' ability to process multimodal input	322
chartmimic/chartmimic	An open-source benchmarking project that evaluates large multimodal models' code generation capabilities via visually-grounded chart-to-code conversion	95
obss/jury	A comprehensive toolkit for evaluating NLP experiments offering automated metrics and efficient computation	187
fuxiaoliu/mmc	Develops a large-scale dataset and benchmark for training multimodal chart understanding models using large language models	87
x-plug/cvalues	Evaluates and aligns the values of Chinese large language models with safety and responsibility standards	481
open-compass/vlmevalkit	An evaluation toolkit for large vision-language models	1,514
open-compass/lawbench	Evaluates the legal knowledge of large language models using a custom benchmarking framework	273
krrishdholakia/betterprompt	An API for evaluating the quality of text prompts used in Large Language Models (LLMs) based on perplexity estimation	43
openai/simple-evals	Evaluates language models using standardized benchmarks and prompting techniques	2,059