BLIVA
VQA model
A multimodal LLM designed to handle text-rich visual questions
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
270 stars
12 watching
28 forks
Language: Python
last commit: 10 months ago
Topics: blip2, bliva, chatbot, instruction-tuning, llama, llm, lora, multimodal, visual-language-learning
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
|  | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
|  | An implementation of a multimodal language model with capabilities for comprehension and generation | 585 |
|  | An all-in-one demo for interactive image processing and generation | 353 |
|  | An implementation of a multimodal LLM training paradigm to enhance truthfulness and ethics in language models | 19 |
|  | A benchmark for evaluating large language models' ability to process multimodal input | 322 |
|  | A video-language model that uses large language models to process visual and text features extracted from videos | 748 |
|  | A Visual Question Answering model using a deeper LSTM and a normalized CNN architecture | 377 |
|  | A large multimodal language model designed to process and analyze video, image, text, and audio inputs in real time | 1,005 |
|  | A software framework for training and deploying multimodal visual question answering models using compact bilinear pooling | 222 |
|  | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717 |
|  | A neural network model that answers visual questions by combining question and image features in a residual learning framework | 39 |
|  | An implementation of a two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering | 270 |
|  | A family of large multimodal models supporting multimodal conversation and text-to-image generation in multiple languages | 1,098 |
|  | High-resolution multimodal LLMs built by combining vision encoders with various input resolutions | 549 |
|  | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |