BLIVA

VQA model

A multimodal LLM designed to handle text-rich visual questions

(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions

GitHub

269 stars
12 watching
28 forks
Language: Python
Last commit: 7 months ago
Topics: blip2, bliva, chatbot, instruction-tuning, llama, llm, lora, multimodal, visual-language-learning
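
To make the description above concrete, here is a minimal inference sketch for asking BLIVA a question about a text-rich image. It assumes the repository exposes a LAVIS-style `load_model_and_preprocess` loader, in line with its BLIP-2/InstructBLIP lineage; the model name `bliva_vicuna`, the `vicuna7b` model type, the `generate` call, and the example image path are assumptions, so the repository README remains the authoritative reference for the exact API.

```python
# Hypothetical usage sketch, assuming a LAVIS-style API (not verified against the repo).
import torch
from PIL import Image

from bliva.models import load_model_and_preprocess  # assumed loader name

device = "cuda" if torch.cuda.is_available() else "cpu"

# Model name and type are illustrative; the repo ships Vicuna- and FlanT5-based variants.
model, vis_processors, _ = load_model_and_preprocess(
    name="bliva_vicuna", model_type="vicuna7b", is_eval=True, device=device
)

# Preprocess an example text-rich image (e.g., a receipt) into a batch of size 1.
raw_image = Image.open("receipt.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Ask a question whose answer requires reading text embedded in the image.
answer = model.generate({"image": image, "prompt": "What is the total amount on this receipt?"})
print(answer)
```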

Related projects:

Repository | Description | Stars
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 294
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 576
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 351
ucsc-vlaa/sight-beyond-text | An official implementation of a research paper exploring multi-modal training as a way to improve language models' truthfulness and ethics | 19
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315
dvlab-research/llama-vid | A video language model that represents each frame with a small number of tokens so large language models can handle long videos | 733
gt-vision-lab/vqa_lstm_cnn | A visual question answering model using a deeper LSTM and a normalized CNN architecture | 376
vita-mllm/vita | A large multimodal language model designed to process and analyze video, image, text, and audio inputs in real time | 961
akirafukui/vqa-mcb | A software framework for training and deploying multimodal visual question answering models using compact bilinear pooling | 222
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704
jnhwkim/nips-mrn-vqa | A neural network model that answers visual questions by combining question and image features in a residual learning framework | 39
milvlg/prophet | A two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering | 267
openbmb/viscpm | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,089
nvlabs/eagle | High-resolution multimodal LLMs built by combining multiple vision encoders at various input resolutions | 539
lxtgh/omg-seg | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,300