Prismer

Vision-Language Model

A deep learning framework for training multi-modal models with vision and language capabilities.

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

GitHub: 1k stars · 16 watching · 75 forks
Language: Python
Last commit: 10 months ago
Topics: image-captioning, language-model, multi-modal-learning, multi-task-learning, vision-and-language, vision-language-model, vqa
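The paper's title points to the core idea: language generation is conditioned on features drawn from multiple pre-trained vision experts. As a rough illustration of that pattern, the PyTorch sketch below fuses several expert feature maps through per-expert projectors and lets a small decoder cross-attend to the result. All class names, dimensions, and the fusion-by-sum choice are assumptions made for illustration; they are not the repository's actual API or architecture.

```python
# Minimal sketch of multi-task expert fusion for a vision-language model.
# All module names and dimensions are illustrative assumptions, not Prismer's API.
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    """Projects and merges feature maps from several frozen task experts."""
    def __init__(self, expert_dims, hidden_dim=256):
        super().__init__()
        # One lightweight 1x1-conv projector per expert; the experts themselves
        # would stay frozen and are represented here only by their output maps.
        self.projectors = nn.ModuleList(
            nn.Conv2d(dim, hidden_dim, kernel_size=1) for dim in expert_dims
        )

    def forward(self, expert_feats):
        # expert_feats: list of (B, C_i, H, W) maps, one per expert.
        projected = [proj(f) for proj, f in zip(self.projectors, expert_feats)]
        return torch.stack(projected, dim=0).sum(dim=0)  # (B, hidden_dim, H, W)

class TinyVLM(nn.Module):
    """Fused expert features cross-attend into a toy language decoder."""
    def __init__(self, expert_dims, hidden_dim=256, vocab_size=1000):
        super().__init__()
        self.fusion = ExpertFusion(expert_dims, hidden_dim)
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, expert_feats, tokens):
        vis = self.fusion(expert_feats)          # (B, D, H, W)
        vis = vis.flatten(2).transpose(1, 2)     # (B, H*W, D) visual tokens
        txt = self.token_embed(tokens)           # (B, T, D) text tokens
        out = self.decoder(tgt=txt, memory=vis)  # text cross-attends to vision
        return self.lm_head(out)                 # (B, T, vocab_size) logits

# Toy usage: two "experts" (e.g., RGB features and a depth-map encoding).
model = TinyVLM(expert_dims=[64, 32])
feats = [torch.randn(2, 64, 14, 14), torch.randn(2, 32, 14, 14)]
tokens = torch.randint(0, 1000, (2, 8))
logits = model(feats, tokens)
print(logits.shape)  # torch.Size([2, 8, 1000])
```

In Prismer itself, the experts and the vision/language backbones are pre-trained and largely frozen, with only lightweight components trained; this toy version only mirrors the overall data flow, not the actual training setup.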

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,077 |
| shizhediao/davinci | An implementation of generative vision-language models for multimodal learning that can be fine-tuned for various applications | 43 |
| baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 230 |
| nvlabs/eagle | Develops high-resolution multimodal LLMs by combining multiple vision encoders and input resolutions | 539 |
| opengvlab/visionllm | A large language model designed to process and generate visual information | 915 |
| meituan-automl/mobilevlm | A vision-language model designed for mobile devices, combining a lightweight downsample projector with pre-trained language models | 1,039 |
| dvlab-research/lisa | A system that uses large language models to generate segmentation masks for images from complex queries and world knowledge | 1,861 |
| yiren-jian/blitext | Develops and trains models for vision-language learning with decoupled language pre-training | 24 |
| gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens, allowing a flexible choice of the number of visual tokens | 97 |
| vlf-silkie/vlfeedback | An annotated preference dataset and training framework for improving large vision-language models | 85 |
| byungkwanlee/collavo | A PyTorch implementation of an enhanced vision-language model | 93 |
| zhourax/vega | A multimodal task and dataset for assessing vision-language models' ability to handle interleaved image-text inputs | 33 |
| dvlab-research/llama-vid | A video-oriented model that uses large language models to extract visual and text features from videos | 733 |
| 360cvgroup/360vl | A large multi-modal model built on the Llama3 language model to improve image understanding | 30 |
| vishaal27/sus-x | An open-source project proposing a method to adapt large-scale vision-language models with minimal resources and no fine-tuning | 94 |