ViP-LLaVA
Visual Prompt Model
A system designed to enable large multimodal models to understand arbitrary visual prompts
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
302 stars
5 watching
21 forks
Language: Python
last commit: over 1 year ago
Topics: chatbot, clip, cvpr2024, foundation-models, gpt-4, gpt-4-vision, llama, llama2, llava, multi-modal, vision-language, visual-prompting
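ViP-LLaVA's central idea is that the visual prompt is overlaid directly on the image pixels, so the model can be asked about whatever region a circle, box, arrow, or scribble marks. The sketch below illustrates only that composition step; the file names, coordinates, marker style, and question are illustrative assumptions, not the repository's actual inference API.

```python
# Minimal sketch of composing an "arbitrary visual prompt": the marker is
# drawn onto the image itself, so any model that accepts plain images can
# be queried about the marked region. Paths and coordinates are hypothetical.
from PIL import Image, ImageDraw

image = Image.open("example.jpg").convert("RGB")
draw = ImageDraw.Draw(image)

# Overlay a red ellipse around the region of interest; ViP-LLaVA-style
# visual prompts include boxes, circles, arrows, and scribbles.
region = (120, 80, 260, 220)  # (left, top, right, bottom), illustrative
draw.ellipse(region, outline=(255, 0, 0), width=4)

image.save("example_prompted.jpg")

# The prompted image is then paired with a text query that refers to the
# marker and passed to the model's usual image-plus-text inference path.
question = "What is the object inside the red circle?"
```

Because the prompt lives in the pixels rather than in extra coordinate tokens, no architectural change is needed for the model to attend to the marked region.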
Related projects:
| Description | Stars |
|---|---|
| An all-in-one demo for interactive image processing and generation | 353 |
| A platform for training and deploying large language and vision models that can use tools to perform tasks | 717 |
| A multimodal LLM designed to handle text-rich visual questions | 270 |
| A video-language model that builds on large language models to extract visual and text features from videos | 748 |
| Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 270 |
| A multimodal Chinese language model with visual capabilities | 429 |
| A vision-language model that uses a query transformer to encode images as visual tokens and allows a flexible choice of the number of visual tokens | 101 |
| Debiasing techniques that minimize hallucinations in large vision-language models | 75 |
| A large multimodal model built on the Llama 3 language model, designed to improve image understanding | 32 |
| An implementation of a multimodal language model with capabilities for both comprehension and generation | 585 |
| A deep learning framework for training multimodal models with vision and language capabilities | 1,299 |
| A two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering | 270 |
| A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 246 |
| A multimodal AI model that enables real-world vision-language understanding applications | 2,145 |
| An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |