AMBER

MLLM benchmark

An LLM-free benchmark suite for evaluating hallucinations in MLLMs across a variety of tasks and dimensions

An LLM-free Multi-dimensional Benchmark for Multi-modal Hallucination Evaluation

GitHub

93 stars
1 watching
2 forks
Language: Python
Last commit: 10 months ago
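
AMBER's defining feature is that scoring requires no judge LLM: model outputs are compared directly against human annotations. The sketch below illustrates that idea in a minimal form. The metric names CHAIR, Cover, and Hal follow the AMBER paper, but the toy object extraction, vocabulary, and function names here are simplified assumptions for illustration, not the repository's actual pipeline.

```python
# Illustrative sketch of an LLM-free hallucination check (not AMBER's actual
# code): score a caption against human-annotated ground-truth objects, so no
# judge LLM is needed. Names and extraction logic are hypothetical.

def extract_objects(caption: str, vocabulary: set[str]) -> set[str]:
    """Naive object extraction: intersect caption tokens with a known
    object vocabulary. AMBER itself uses richer annotations and matching."""
    tokens = {tok.strip(".,!?").lower() for tok in caption.split()}
    return tokens & vocabulary

def hallucination_scores(caption: str, truth: set[str], vocabulary: set[str]) -> dict:
    mentioned = extract_objects(caption, vocabulary)
    hallucinated = mentioned - truth  # mentioned but absent from the image
    return {
        # CHAIR-style: fraction of mentioned objects that are hallucinated
        "CHAIR": len(hallucinated) / len(mentioned) if mentioned else 0.0,
        # Cover-style: fraction of ground-truth objects the caption mentions
        "Cover": len(mentioned & truth) / len(truth) if truth else 0.0,
        # Hal-style: does the response contain any hallucinated object?
        "Hal": bool(hallucinated),
    }

# Toy usage with a made-up vocabulary and annotation.
vocab = {"dog", "cat", "frisbee", "grass", "car"}
truth = {"dog", "frisbee", "grass"}
print(hallucination_scores("A dog and a cat play with a frisbee.", truth, vocab))
# -> {'CHAIR': 0.333..., 'Cover': 0.666..., 'Hal': True}
```

Matching against fixed annotations like this is what makes the benchmark "LLM-free": it avoids the cost and the judge-model bias of LLM-based evaluation.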

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| x-plug/mplug-halowl | Evaluates and mitigates hallucinations in multimodal large language models | 79 |
| junyangwang0410/haelm | A framework for detecting hallucinations in large language models | 17 |
| damo-nlp-sg/m3exam | A benchmark for evaluating large language models in multiple languages and formats | 92 |
| vectara/hallucination-leaderboard | Compares how often large language models hallucinate when summarizing short documents | 1,236 |
| freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks | 55 |
| aifeg/benchlmm | An open-source benchmarking framework for evaluating the cross-style visual capability of large multimodal models | 83 |
| multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 14 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243 |
| ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315 |
| uw-madison-lee-lab/cobsat | A benchmarking framework and dataset for evaluating large language models on text-to-image tasks | 28 |
| km1994/llmsninestorydemontower | Explores various LLMs and their applications in natural language processing and related areas | 1,798 |
| bradyfu/woodpecker | A method to correct hallucinations in multimodal large language models during text generation | 611 |
| oval-group/mlogger | A lightweight logger for machine learning experiments | 127 |
| pleisto/yuren-baichuan-7b | A multimodal large language model that integrates natural language and visual capabilities, with fine-tuning for various tasks | 72 |
| szilard/benchm-ml | A benchmark for evaluating machine learning algorithms' performance on large datasets | 1,869 |