MQT-LLaVA
Visual encoder
A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens.
[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
101 stars
13 watching
11 forks
Language: Python
last commit: 7 months ago

Related projects:
Repository | Description | Stars |
---|---|---|
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
microsoft/vision-longformer | An implementation of a vision transformer architecture designed for high-resolution image encoding with multiple efficient attention mechanisms | 243 |
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299 |
lxtgh/omg-seg | An end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336 |
opengvlab/visionllm | A large language model designed to process and generate visual information | 956 |
dvlab-research/llama-vid | An image-based language model that uses large language models to generate visual and text features from videos | 748 |
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717 |
pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks | 2,023 |
salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets | 259 |
alibaba/conv-llava | An optimization technique for large-scale image models that reduces computational requirements while maintaining performance | 106 |
yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 75 |
whai362/pvt | An implementation of Pyramid Vision Transformers for image classification, object detection, and semantic segmentation tasks | 1,745 |
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 353 |
lavi-lab/visual-table | A project that generates visual representations tailored for general visual reasoning, leveraging hierarchical scene descriptions and instance-level world knowledge | 14 |
byungkwanlee/moai | Improves performance on vision-language tasks by integrating computer vision capabilities into large language models | 314 |