Multi-Modality-Arena

Model arena

An evaluation platform for comparing multi-modality models on visual question-answering tasks

Chatbot Arena meets multi-modality! Multi-Modality Arena lets you benchmark vision-language models side by side, with images as inputs. It supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more.
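The core arena mechanic (anonymous pairwise battles with user voting, as in Chatbot Arena) can be sketched in plain Python. This is an illustrative sketch only; the `Battle` class and the model callables below are hypothetical and not the project's actual API:

```python
import random
from dataclasses import dataclass, field


@dataclass
class Battle:
    """Pairs two vision-language models anonymously and tallies user votes."""

    # name -> callable(image_path, question) -> answer string
    models: dict
    votes: dict = field(default_factory=dict)

    def ask(self, image_path: str, question: str) -> dict:
        # Sample two distinct models; the user sees answers "A"/"B", not names.
        left, right = random.sample(list(self.models), 2)
        self._pairing = {"A": left, "B": right}
        return {
            "A": self.models[left](image_path, question),
            "B": self.models[right](image_path, question),
        }

    def vote(self, choice: str) -> str:
        # Reveal which model the user preferred and record the win.
        winner = self._pairing[choice]
        self.votes[winner] = self.votes.get(winner, 0) + 1
        return winner
```

A round would then look like `battle.ask("photo.jpg", "What animal is this?")` followed by `battle.vote("A")`; aggregating votes over many rounds yields a leaderboard.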

GitHub

478 stars
6 watching
36 forks
Language: Python
Last commit: 9 months ago
Topics: chat, chatbot, chatgpt, gradio, large-language-models, llms, multi-modality, vision-language-model, vqa

Related projects:

| Repository | Description | Stars |
|---|---|---|
| open-mmlab/multimodal-gpt | Trains a multimodal chatbot that combines visual and language instructions to generate responses | 1,478 |
| opengvlab/lamm | A framework and benchmark for training and evaluating multi-modal large language models, enabling AI agents capable of seamless human-machine interaction | 305 |
| mbzuai-oryx/groundinglmm | An end-to-end trained model that generates natural language responses integrated with object segmentation masks for interactive visual conversations | 797 |
| openbmb/viscpm | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,098 |
| mbzuai-oryx/video-chatgpt | A video conversation model that generates meaningful conversations about videos using large vision and language models | 1,246 |
| nvlabs/eagle | Develops high-resolution multimodal LLMs by combining vision encoders and various input resolutions | 549 |
| xverse-ai/xverse-v-13b | A large multimodal model for visual question answering, trained on 2.1B image-text pairs and 8.2M instruction sequences | 78 |
| multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 15 |
| open-compass/mmbench | A collection of benchmarks to evaluate the multi-modal understanding capability of large vision-language models | 168 |
| tsb0601/mmvp | An evaluation framework for multimodal language models' visual capabilities using image and question benchmarks | 296 |
| mlpc-ucsd/bliva | A multimodal LLM designed to handle text-rich visual questions | 270 |
| ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 585 |
| ucsc-vlaa/sight-beyond-text | An implementation of a multimodal LLM training paradigm to enhance truthfulness and ethics in language models | 19 |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| mlo-lab/muvi | A software framework for multi-view latent variable modeling with domain-informed structured sparsity | 27 |