BLIVA

VQA model

A multimodal LLM designed to handle text-rich visual questions

(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
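
Per the paper, BLIVA augments InstructBLIP-style learned query embeddings with additionally projected patch embeddings from the vision encoder, and feeds both to the LLM together with the text tokens. The PyTorch sketch below is a minimal, self-contained illustration of that assembly step only, assuming randomly initialized stand-in modules; the class, method, and parameter names are hypothetical and do not come from this repository.

```python
# Conceptual sketch of BLIVA-style input assembly (illustrative only):
# a Q-Former-style module produces a small set of learned query tokens,
# the vision encoder's patch embeddings are also projected directly into
# the LLM embedding space, and both are prepended to the text embeddings.
import torch
import torch.nn as nn


class BlivaStyleVisualAssembler(nn.Module):
    """Toy stand-in showing how query and patch embeddings are combined."""

    def __init__(self, vision_dim=1024, qformer_dim=768, llm_dim=4096,
                 num_queries=32):
        super().__init__()
        # Learned queries that cross-attend to image features (Q-Former role).
        # The real Q-Former also conditions on the instruction text,
        # which this sketch omits.
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        self.cross_attn = nn.MultiheadAttention(
            qformer_dim, num_heads=8,
            kdim=vision_dim, vdim=vision_dim, batch_first=True)
        # Separate projections into the LLM's embedding space.
        self.query_proj = nn.Linear(qformer_dim, llm_dim)
        self.patch_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_feats, text_embeds):
        # patch_feats: (batch, num_patches, vision_dim) from a ViT encoder
        # text_embeds: (batch, seq_len, llm_dim) from the LLM's token embedder
        batch = patch_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        query_out, _ = self.cross_attn(queries, patch_feats, patch_feats)
        visual_tokens = torch.cat(
            [self.query_proj(query_out),     # Q-Former-style query tokens
             self.patch_proj(patch_feats)],  # additional raw patch tokens
            dim=1)
        # The combined visual tokens are prepended to the text embeddings.
        return torch.cat([visual_tokens, text_embeds], dim=1)


if __name__ == "__main__":
    assembler = BlivaStyleVisualAssembler()
    patches = torch.randn(2, 257, 1024)    # e.g. ViT-L patch features
    text = torch.randn(2, 16, 4096)        # e.g. LLaMA token embeddings
    print(assembler(patches, text).shape)  # torch.Size([2, 305, 4096])
```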

GitHub

270 stars
12 watching
28 forks
Language: Python
Last commit: 9 months ago
Topics: blip2, bliva, chatbot, instruction-tuning, llama, llm, lora, multimodal, visual-language-learning

Related projects:

Repository | Description | Stars
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 585
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 353
ucsc-vlaa/sight-beyond-text | An implementation of a multimodal LLM training paradigm to enhance truthfulness and ethics in language models | 19
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 322
dvlab-research/llama-vid | A video-capable language model that uses large language models to process visual and text features from videos | 748
gt-vision-lab/vqa_lstm_cnn | A Visual Question Answering model using a deeper LSTM and normalized CNN architecture | 377
vita-mllm/vita | A large multimodal language model designed to process and analyze video, image, text, and audio inputs in real time | 1,005
akirafukui/vqa-mcb | A software framework for training and deploying multimodal visual question answering models using compact bilinear pooling | 222
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717
jnhwkim/nips-mrn-vqa | A neural network model that answers visual questions by combining question and image features in a residual learning framework | 39
milvlg/prophet | An implementation of a two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering | 270
openbmb/viscpm | A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages | 1,098
nvlabs/eagle | Develops high-resolution multimodal LLMs by combining vision encoders and various input resolutions | 549
lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336