BLIVA
VQA model
A multimodal LLM designed to handle text-rich visual questions
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
270 stars
12 watching
28 forks
Language: Python
last commit: 9 months ago
Topics: blip2, bliva, chatbot, instruction-tuning, llama, llm, lora, multimodal, visual-language-learning
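Since BLIVA is distributed as a Python package built in the LAVIS style, querying it on a single text-rich image likely follows the usual load-and-generate pattern. The sketch below is illustrative only: the `load_model_and_preprocess` entry point and the `name`/`model_type` strings are assumptions modeled on LAVIS conventions, not confirmed BLIVA API, so check the repository README for the exact identifiers.

```python
import torch
from PIL import Image

# Assumed LAVIS-style entry point; the actual module path and model names
# may differ -- consult the BLIVA README before running.
from bliva.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained model and its matching image preprocessor
# (the name/model_type strings here are assumptions).
model, vis_processors, _ = load_model_and_preprocess(
    name="bliva_vicuna", model_type="vicuna7b", is_eval=True, device=device
)

# Prepare one text-rich image (e.g. a sign, receipt, or document photo).
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Ask a question about the text in the image.
answer = model.generate({"image": image, "prompt": "What does the sign say?"})
print(answer)
```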
Related projects:
Repository | Description | Stars
---|---|---
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 585 |
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 353 |
ucsc-vlaa/sight-beyond-text | An implementation of a multimodal LLM training paradigm to enhance truthfulness and ethics in language models | 19 |
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 322 |
dvlab-research/llama-vid | A video-language model that pairs large language models with compact per-frame visual tokens so they can process long videos efficiently | 748
gt-vision-lab/vqa_lstm_cnn | A Visual Question Answering model using a deeper LSTM and normalized CNN architecture. | 377 |
vita-mllm/vita | A large multimodal language model designed to process and analyze video, image, text, and audio inputs in real time. | 1,005
akirafukui/vqa-mcb | A software framework for training and deploying multimodal visual question answering models using compact bilinear pooling. | 222 |
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717 |
jnhwkim/nips-mrn-vqa | A neural network model that answers visual questions by combining question and image features in a residual learning framework. | 39
milvlg/prophet | An implementation of a two-stage framework designed to prompt large language models with answer heuristics for knowledge-based visual question answering tasks. | 270 |
openbmb/viscpm | A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages | 1,098 |
nvlabs/eagle | Develops high-resolution multimodal LLMs using a mixture of vision encoders with varying input resolutions | 549
lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,336 |