AMBER
MLLM benchmark
An LLM-free benchmark suite for evaluating hallucinations in MLLMs across multiple tasks and dimensions
An LLM-free Multi-dimensional Benchmark for Multi-modal Hallucination Evaluation
93 stars
1 watching
2 forks
Language: Python
last commit: 10 months ago

Related projects:
Repository | Description | Stars |
---|---|---|
x-plug/mplug-halowl | Evaluates and mitigates hallucinations in multimodal large language models | 79 |
junyangwang0410/haelm | A framework for detecting hallucinations in large language models | 17 |
damo-nlp-sg/m3exam | A benchmark for evaluating large language models in multiple languages and formats | 92 |
vectara/hallucination-leaderboard | Evaluates and compares how often large language models hallucinate when summarizing documents | 1,236 |
freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks | 55 |
aifeg/benchlmm | An open-source benchmarking framework for evaluating cross-style visual capability of large multimodal models | 83 |
multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 14 |
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243 |
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315 |
uw-madison-lee-lab/cobsat | Provides a benchmarking framework and dataset for evaluating the performance of large language models in text-to-image tasks | 28 |
km1994/llmsninestorydemontower | Exploring various LLMs and their applications in natural language processing and related areas | 1,798 |
bradyfu/woodpecker | A method to correct hallucinations in multimodal large language models during text generation | 611 |
oval-group/mlogger | A lightweight logger for machine learning experiments | 127 |
pleisto/yuren-baichuan-7b | A multi-modal large language model that integrates natural language and visual capabilities with fine-tuning for various tasks | 72 |
szilard/benchm-ml | A benchmark for evaluating machine learning algorithms' performance on large datasets | 1,869 |