BLIVA
VQA model
A multimodal LLM designed to handle text-rich visual questions
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
270 stars
12 watching
28 forks
Language: Python
last commit: 10 months ago
Topics: blip2, bliva, chatbot, instruction-tuning, llama, llm, lora, multimodal, visual-language-learning
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
|  | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
|  | An implementation of a multimodal language model with capabilities for comprehension and generation | 585 |
|  | An all-in-one demo for interactive image processing and generation | 353 |
|  | An implementation of a multimodal LLM training paradigm to enhance truthfulness and ethics in language models | 19 |
|  | A benchmark for evaluating large language models' ability to process multimodal input | 322 |
|  | A video-language model that uses large language models to process visual and text features extracted from videos | 748 |
|  | A Visual Question Answering model using a deeper LSTM and a normalized CNN architecture | 377 |
|  | A large multimodal language model designed to process and analyze video, image, text, and audio inputs in real time | 1,005 |
|  | A software framework for training and deploying multimodal visual question answering models using compact bilinear pooling | 222 |
|  | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717 |
|  | A neural network model that answers visual questions by combining question and image features in a residual learning framework | 39 |
|  | An implementation of a two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering | 270 |
|  | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,098 |
|  | High-resolution multimodal LLMs built by combining vision encoders with various input resolutions | 549 |
|  | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |