ViP-LLaVA
Visual Prompt Model
A system that enables large multimodal models to understand arbitrary visual prompts, such as bounding boxes, arrows, or scribbles drawn directly on an image
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
294 stars
5 watching
21 forks
Language: Python
last commit: 4 months ago
Topics: chatbot, clip, cvpr2024, foundation-models, gpt-4, gpt-4-vision, llama, llama2, llava, multi-modal, vision-language, visual-prompting
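ViP-LLaVA's central idea is that the visual prompt lives in pixel space: a mark is drawn directly onto the image, and the annotated image is paired with a question that refers to the mark. The sketch below illustrates that idea with Pillow; the helper function, coordinates, colors, and file name are illustrative assumptions, not code from this repository.

```python
# Minimal sketch of pixel-space visual prompting, the idea behind ViP-LLaVA.
# All names and values here are illustrative, not taken from the repository.
from PIL import Image, ImageDraw


def add_visual_prompt(image, bbox, color="red", width=4):
    """Draw an ellipse around a region of interest, directly in pixel space."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    draw.ellipse(bbox, outline=color, width=width)
    return annotated


# Self-contained demo on a blank canvas; in practice, load a real photo.
image = Image.new("RGB", (336, 336), "white")
prompted = add_visual_prompt(image, bbox=(100, 100, 220, 220))
prompted.save("prompted.jpg")

# The annotated image is then paired with a referring question, e.g.:
question = "What is the object inside the red ellipse?"
```

Because the prompt is composited into the pixels rather than encoded as text coordinates, any mark a human can draw (a box, arrow, circle, or scribble) reaches the model through the same image input.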
Related projects:
Repository | Description | Stars |
---|---|---|
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 351 |
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704 |
mlpc-ucsd/bliva | A multimodal LLM designed to handle text-rich visual questions | 269 |
dvlab-research/llama-vid | A video language model that represents each frame with compact visual and text tokens, allowing large language models to understand long videos | 733 |
vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 269 |
airaria/visual-chinese-llama-alpaca | A multimodal Chinese language model with visual capabilities | 424 |
gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens | 97 |
yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 71 |
360cvgroup/360vl | A large multimodal model built on the Llama3 language model, designed to improve image understanding capabilities | 30 |
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 576 |
nvlabs/prismer | A deep learning framework for training multimodal models with vision and language capabilities | 1,298 |
milvlg/prophet | An implementation of a two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering | 267 |
baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 230 |
deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,077 |
lxtgh/omg-seg | An end-to-end model for multiple visual perception and reasoning tasks, using a single encoder, decoder, and large language model | 1,300 |