prismer
Vision-Language Model
A deep learning framework for training multi-modal models with vision and language capabilities.
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
1k stars
16 watching
75 forks
Language: Python
last commit: 10 months ago
Topics: image-captioning, language-model, multi-modal-learning, multi-task-learning, vision-and-language, vision-language-model, vqa
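As a rough sketch of the idea behind the paper (not the repository's actual API), the snippet below illustrates the "multi-task experts" pattern: outputs from frozen expert models (e.g. depth or segmentation maps) are stacked with the RGB image, encoded into patches, resampled into a small set of visual tokens, and fed to a language decoder. All module names, shapes, and hyperparameters here are hypothetical.

```python
# Illustrative sketch only -- NOT the Prismer codebase's API. A toy PyTorch
# vision-language model conditioned on frozen task-expert outputs; every
# name and dimension below is a made-up placeholder.
import torch
import torch.nn as nn


class ExpertsVLM(nn.Module):
    def __init__(self, num_experts=2, dim=256, vocab_size=1000):
        super().__init__()
        # RGB has 3 channels; each expert is assumed to emit a 1-channel map.
        in_channels = 3 + num_experts
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(in_channels, dim, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                           # (B, dim, N)
        )
        # Lightweight resampler: cross-attend learned queries to image patches.
        self.queries = nn.Parameter(torch.randn(32, dim))
        self.resampler = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Stand-in for a (normally pretrained) language decoder.
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.language_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, rgb, expert_maps, text_tokens):
        # Stack RGB with expert predictions along the channel dimension.
        x = torch.cat([rgb, expert_maps], dim=1)            # (B, 3+E, H, W)
        patches = self.vision_encoder(x).transpose(1, 2)    # (B, N, dim)
        q = self.queries.unsqueeze(0).expand(rgb.size(0), -1, -1)
        visual_tokens, _ = self.resampler(q, patches, patches)  # (B, 32, dim)
        h = self.language_decoder(self.token_embed(text_tokens), visual_tokens)
        return self.lm_head(h)                               # next-token logits


model = ExpertsVLM()
logits = model(
    rgb=torch.randn(2, 3, 224, 224),
    expert_maps=torch.randn(2, 2, 224, 224),  # e.g. depth + segmentation maps
    text_tokens=torch.randint(0, 1000, (2, 12)),
)
print(logits.shape)  # torch.Size([2, 12, 1000])
```

In the paper's actual design the experts stay frozen and only a small fraction of the network is trained, which is what keeps the approach data- and compute-efficient; the sketch above omits that detail for brevity.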
Related projects:
Repository | Description | Stars |
---|---|---|
deepseek-ai/deepseek-vl | A multimodal AI model that enables real-world vision-language understanding applications | 2,077 |
shizhediao/davinci | An implementation of generative vision-language models that can be fine-tuned for a range of multimodal learning tasks. | 43 |
baaivision/eve | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 230 |
nvlabs/eagle | Develops high-resolution multimodal LLMs by combining multiple vision encoders and supporting varied input resolutions | 539 |
opengvlab/visionllm | A large language model designed to process and generate visual information | 915 |
meituan-automl/mobilevlm | An implementation of a vision language model designed for mobile devices, utilizing a lightweight downsample projector and pre-trained language models. | 1,039 |
dvlab-research/lisa | A system that uses large language models to generate segmentation masks for images based on complex queries and world knowledge. | 1,861 |
yiren-jian/blitext | Develops and trains models for vision-language learning with decoupled language pre-training | 24 |
gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens. | 97 |
vlf-silkie/vlfeedback | An annotated preference dataset and training framework for improving large vision language models. | 85 |
byungkwanlee/collavo | Develops a PyTorch implementation of an enhanced vision language model | 93 |
zhourax/vega | Develops a multimodal task and dataset to assess vision-language models' ability to handle interleaved image-text inputs. | 33 |
dvlab-research/llama-vid | A video language model that uses large language models to generate visual and text features from videos | 733 |
360cvgroup/360vl | A large multi-modal model developed using the Llama3 language model, designed to improve image understanding capabilities. | 30 |
vishaal27/sus-x | An open-source project proposing a method to adapt large-scale vision-language models with minimal resources and no fine-tuning required. | 94 |