ViP-LLaVA
Visual Prompt Model
A system that enables large multimodal models to understand arbitrary visual prompts, such as bounding boxes, arrows, or scribbles drawn directly on an image
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
294 stars
5 watching
21 forks
Language: Python
last commit: 4 months ago
Topics: chatbot, clip, cvpr2024, foundation-models, gpt-4, gpt-4-vision, llama, llama2, llava, multi-modal, vision-language, visual-prompting
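ViP-LLaVA's central idea is that the visual prompt lives in pixel space: a mark is drawn directly onto the image, and the annotated image is paired with a question that refers to the mark. The sketch below illustrates that idea with Pillow; the helper function, coordinates, colors, and file name are illustrative assumptions, not code from this repository.

```python
# Minimal sketch of pixel-space visual prompting, the idea behind ViP-LLaVA.
# All names and values here are illustrative, not taken from the repository.
from PIL import Image, ImageDraw


def add_visual_prompt(image, bbox, color="red", width=4):
    """Draw an ellipse around a region of interest, directly in pixel space."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    draw.ellipse(bbox, outline=color, width=width)
    return annotated


# Self-contained demo on a blank canvas; in practice, load a real photo.
image = Image.new("RGB", (336, 336), "white")
prompted = add_visual_prompt(image, bbox=(100, 100, 220, 220))
prompted.save("prompted.jpg")

# The annotated image is then paired with a referring question, e.g.:
question = "What is the object inside the red ellipse?"
```

Because the prompt is composited into the pixels rather than encoded as text coordinates, any mark a human can draw (a box, arrow, circle, or scribble) reaches the model through the same image input.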
Related projects:
Repository | Description | Stars |
---|---|---|
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 351 |
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704 |
mlpc-ucsd/bliva | A multimodal LLM designed to handle text-rich visual questions | 269 |
dvlab-research/llama-vid | A video language model that represents each frame with compact visual and text tokens, allowing large language models to understand long videos | 733 |
vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 269 |
airaria/visual-chinese-llama-alpaca | A multimodal Chinese language model with visual capabilities | 424 |
gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens | 97 |
yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 71 |
360cvgroup/360vl | A large multimodal model built on the Llama3 language model, designed to improve image understanding capabilities | 30 |
ailab-cvc/seed | An implementation of a multimodal language model with capabilities for comprehension and generation | 576 |
nvlabs/prismer | A deep learning framework for training multimodal models with vision and language capabilities | 1,298 |
milvlg/prophet | An implementation of a two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering | 267 |
baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 230 |
deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,077 |
lxtgh/omg-seg | An end-to-end model for multiple visual perception and reasoning tasks, using a single encoder, decoder, and large language model | 1,300 |