AMBER

MLLM benchmark

An LLM-free benchmark suite for evaluating hallucination in MLLMs across various tasks and dimensions

An LLM-free Multi-dimensional Benchmark for Multi-modal Hallucination Evaluation

GitHub: 98 stars, 1 watching, 2 forks
Language: Python
Last commit: about 1 year ago
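What "LLM-free" means in practice: model outputs are scored by matching them against human annotations rather than by prompting a judge LLM. The snippet below is a minimal sketch of that idea as a toy CHAIR-style object-hallucination rate; the vocabulary, function name, and scoring rule are illustrative assumptions, not AMBER's actual code.

```python
# Toy sketch of LLM-free hallucination scoring: compare objects mentioned
# in a model's caption against the image's human-annotated object set.
# OBJECT_VOCAB and hallucination_rate are hypothetical, for illustration only.

OBJECT_VOCAB = {"dog", "cat", "frisbee", "car", "tree"}  # toy object vocabulary

def hallucination_rate(caption: str, annotated: set[str]) -> float:
    """Fraction of vocabulary objects mentioned in the caption that are
    absent from the image's annotated objects (higher = more hallucination)."""
    words = set(caption.lower().replace(".", " ").split())
    mentioned = OBJECT_VOCAB & words
    if not mentioned:
        return 0.0
    return len(mentioned - annotated) / len(mentioned)

# Example: the caption invents a frisbee the annotators never saw.
print(hallucination_rate("A dog chases a frisbee under a tree.",
                         annotated={"dog", "tree"}))  # -> 0.333...
```

Because the check is pure string matching against annotations, it is deterministic and cheap to run, which is the main appeal of LLM-free evaluation.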

Related projects:

- x-plug/mplug-halowl (82 stars): Evaluates and mitigates hallucinations in multimodal large language models
- junyangwang0410/haelm (17 stars): A framework for detecting hallucinations in large language models
- damo-nlp-sg/m3exam (93 stars): A benchmark for evaluating large language models across multiple languages and formats
- vectara/hallucination-leaderboard (1,281 stars): Compares how often large language models hallucinate when summarizing short documents
- freedomintelligence/mllm-bench (56 stars): Evaluates and compares the performance of multimodal large language models on various tasks
- aifeg/benchlmm (84 stars): An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models
- multimodal-art-projection/omnibench (15 stars): Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously
- tianyi-lab/hallusionbench (259 stars): An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy
- ailab-cvc/seed-bench (322 stars): A benchmark for evaluating large language models' ability to process multimodal input
- uw-madison-lee-lab/cobsat (30 stars): A benchmarking framework and dataset for evaluating large language models on text-to-image tasks
- km1994/llmsninestorydemontower (1,854 stars): Explores various LLMs and their applications in natural language processing and related areas
- bradyfu/woodpecker (617 stars): A method for correcting hallucinations in multimodal large language models without retraining
- oval-group/mlogger (127 stars): A lightweight logger for machine learning experiments
- pleisto/yuren-baichuan-7b (73 stars): A multimodal large language model that integrates natural language and visual capabilities, with fine-tuning for various tasks
- szilard/benchm-ml (1,874 stars): A benchmark for evaluating machine learning algorithms' performance on large datasets