CharXiv

Chart eval

An evaluation suite for assessing chart understanding in multimodal large language models.

[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
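Benchmarks like CharXiv are typically consumed by looping over (chart image, question, answer) examples, querying a multimodal model, and scoring its responses. The snippet below is a minimal, hypothetical sketch of that loop; `ask_model`, the JSON field names, and the file name are placeholders, not CharXiv's actual API or data schema.

```python
import json

def ask_model(image_path: str, question: str) -> str:
    """Placeholder: call any multimodal LLM here and return its text answer."""
    raise NotImplementedError

def evaluate(benchmark_file: str) -> float:
    # Assumed schema: a JSON list of {"image": ..., "question": ..., "answer": ...}
    with open(benchmark_file) as f:
        examples = json.load(f)
    correct = 0
    for ex in examples:
        prediction = ask_model(ex["image"], ex["question"])
        # Exact-match scoring for simplicity; real chart benchmarks often use
        # more tolerant answer matching, sometimes with an LLM judge.
        correct += int(prediction.strip().lower() == ex["answer"].strip().lower())
    return correct / len(examples)

if __name__ == "__main__":
    print(f"Accuracy: {evaluate('charxiv_val.json'):.3f}")  # illustrative file name
```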

GitHub

75 stars
3 watching
8 forks
Language: Python
Last commit: about 1 month ago
Topics: benchmark, chart-understanding, machine-learning, multimodal, vision-language-model

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models across disciplines, with tools for assessing model performance | 1,636 |
| ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types | 708 |
| pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance on embodied decision-making tasks | 100 |
| cloud-cv/evalai | A platform for comparing and evaluating AI and machine learning algorithms at scale | 1,771 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,347 |
| mshukor/evalign-icl | Evaluating and improving large multimodal models through in-context learning | 20 |
| ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315 |
| chartmimic/chartmimic | An open-source benchmark that evaluates large multimodal models' code generation capabilities via visually grounded chart-to-code conversion | 94 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 188 |
| fuxiaoliu/mmc | A large-scale dataset and benchmark for training multimodal chart understanding models using large language models | 84 |
| x-plug/cvalues | Evaluates and aligns the values of Chinese large language models with safety and responsibility standards | 477 |
| open-compass/vlmevalkit | A toolkit for evaluating large vision-language models on various benchmarks and datasets | 1,343 |
| open-compass/lawbench | Evaluates the legal knowledge of large language models using a custom benchmarking framework | 267 |
| krrishdholakia/betterprompt | An API for evaluating the quality of prompts for large language models (LLMs) based on perplexity estimation | 38 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmark tests | 1,939 |