BLIVA
VQA model
A multimodal LLM designed to handle text-rich visual questions
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
269 stars
12 watching
28 forks
Language: Python
last commit: 7 months ago
Topics: blip2, bliva, chatbot, instruction-tuning, llama, llm, lora, multimodal, visual-language-learning
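BLIVA extends the InstructBLIP family by feeding the language model both learned query embeddings and encoded patch embeddings, which helps with text-heavy images such as documents, receipts, and screenshots. As a rough illustration of what a text-rich VQA query looks like, the sketch below assumes a LAVIS-style `load_model_and_preprocess` interface; the model name, `model_type`, and call signatures are placeholders rather than the repository's confirmed API, so check its README for the real entry points.

```python
# Hypothetical usage sketch: querying a multimodal VQA model such as BLIVA.
# The loader name and model identifiers below are assumptions (LAVIS-style),
# not the repository's confirmed interface.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed LAVIS-style loader; BLIVA builds on the InstructBLIP/LAVIS stack,
# but the exact name/model_type strings here are placeholders.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",   # placeholder: swap in the BLIVA model name
    model_type="vicuna7b",
    is_eval=True,
    device=device,
)

raw_image = Image.open("receipt.jpg").convert("RGB")  # a text-rich image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Ask a question that requires reading text embedded in the image.
answer = model.generate({"image": image, "prompt": "What is the total amount due?"})
print(answer)
```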
Related projects:
Repository | Description | Stars |
---|---|---|
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 294 |
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 576 |
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 351 |
ucsc-vlaa/sight-beyond-text | Official implementation of a research paper exploring how multi-modal training enhances language models' truthfulness and ethics | 19 |
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315 |
dvlab-research/llama-vid | A video language model that builds on large language models to process visual and text features from videos | 733 |
gt-vision-lab/vqa_lstm_cnn | A Visual Question Answering model using a deeper LSTM and normalized CNN architecture. | 376 |
vita-mllm/vita | A large multimodal language model designed to process and analyze video, image, text, and audio inputs in real time. | 961 |
akirafukui/vqa-mcb | A software framework for training and deploying multimodal visual question answering models using compact bilinear pooling. | 222 |
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704 |
jnhwkim/nips-mrn-vqa | A neural network model that answers visual questions by combining question and image features in a residual learning framework. | 39 |
milvlg/prophet | An implementation of a two-stage framework designed to prompt large language models with answer heuristics for knowledge-based visual question answering tasks. | 267 |
openbmb/viscpm | A family of large multimodal models supporting multimodal conversational capabilities and text-to-image generation in multiple languages | 1,089 |
nvlabs/eagle | Develops high-resolution multimodal LLMs by combining multiple vision encoders and various input resolutions | 539 |
lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,300 |