MQT-LLaVA
Visual encoder
A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens.
[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
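The description above says the number of visual tokens can be chosen flexibly. The following is a minimal sketch of one way a query-based encoder can allow that, assuming the nested design that the "Matryoshka" name suggests: a smaller token budget uses a prefix of the same latent queries. It is not the repository's code. The names `cross_attend`, `encode_image`, and `num_tokens` are hypothetical, the toy sizes are arbitrary, and random vectors stand in for learned queries and patch features. The sketch also omits any interaction between queries, which a real query transformer may include.

```python
import math
import random


def cross_attend(queries, patches):
    """Single-head dot-product cross-attention: each query pools the patch features."""
    dim = len(patches[0])
    tokens = []
    for q in queries:
        # Scaled similarity between this query and every image patch.
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # The visual token is the attention-weighted average of the patches.
        tokens.append([
            sum(w * p[i] for w, p in zip(weights, patches)) / total
            for i in range(dim)
        ])
    return tokens


def encode_image(patches, latent_queries, num_tokens):
    """Hypothetical helper: keep only the first `num_tokens` latent queries."""
    if not 1 <= num_tokens <= len(latent_queries):
        raise ValueError("num_tokens must be between 1 and len(latent_queries)")
    return cross_attend(latent_queries[:num_tokens], patches)


rng = random.Random(0)
DIM, NUM_PATCHES, MAX_QUERIES = 8, 16, 4  # toy sizes, chosen only for illustration
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_QUERIES)]
patches = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

few = encode_image(patches, latent_queries, 2)   # small token budget
many = encode_image(patches, latent_queries, 4)  # full token budget
print(len(few), len(many))
```

In this simplified version the queries do not interact, so the tokens produced with a small budget are exactly the first tokens produced with a larger one. The token count is therefore a setting that can be changed per call without changing the encoder.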
101 stars
13 watching
11 forks
Language: Python
Last commit: 8 months ago
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| | An implementation of a vision transformer architecture designed for high-resolution image encoding with multiple efficient attention mechanisms | 243 |
| | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299 |
| | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |
| | A large language model designed to process and generate visual information | 956 |
| | An image-based language model that uses large language models to generate visual and text features from videos | 748 |
| | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717 |
| | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks | 2,023 |
| | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets | 259 |
| | This project presents an optimization technique for large-scale image models to reduce computational requirements while maintaining performance | 106 |
| | Debiasing techniques to minimize hallucinations in large visual language models | 75 |
| | An implementation of Pyramid Vision Transformers for image classification, object detection, and semantic segmentation tasks | 1,745 |
| | An all-in-one demo for interactive image processing and generation | 353 |
| | A project that generates visual representations tailored for general visual reasoning, leveraging hierarchical scene descriptions and instance-level world knowledge | 14 |
| | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 314 |