MQT-LLaVA

Visual encoder

A vision-language model that uses a query transformer to encode each image as a set of visual tokens, with the number of tokens chosen flexibly at inference time.

[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
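The idea of a flexible visual-token budget can be sketched in a few lines. The snippet below is a minimal illustration, not the repository's implementation. It assumes a Matryoshka-style scheme in which a fixed bank of latent queries cross-attends to the image features and only the first `num_tokens` queries are kept. The names `cross_attend`, `encode_image`, and `latent_queries` are hypothetical, the attention has a single head with no learned projections, and random vectors stand in for real image features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Each query attends over all image features and yields one visual token.

    Simplification (assumption): single head, no learned projections.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    tokens = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        # Weighted average of the feature vectors, one output dimension at a time.
        tokens.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return tokens


def encode_image(features, latent_queries, num_tokens):
    """Keep only the first `num_tokens` latent queries (hypothetical helper)."""
    if not 1 <= num_tokens <= len(latent_queries):
        raise ValueError("num_tokens must be between 1 and len(latent_queries)")
    return cross_attend(latent_queries[:num_tokens], features)


# Toy data: 16 image patches and a bank of 6 latent queries, all of width 8.
rng = random.Random(0)
DIM, NUM_PATCHES, MAX_QUERIES = 8, 16, 6
features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_QUERIES)]

for budget in (1, 2, 4, 6):
    tokens = encode_image(features, latent_queries, budget)
    print(f"budget={budget}: {len(tokens)} visual tokens of width {len(tokens[0])}")
```

In this simplified sketch each query attends independently, so the tokens produced with a small budget are exactly a prefix of those produced with a larger one. Whether the real model has the same property depends on details not covered here.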

GitHub

101 stars
13 watching
11 forks
Language: Python
Last commit: 7 months ago

Related projects:

Repository | Description | Stars
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302
microsoft/vision-longformer | An implementation of a vision transformer architecture designed for high-resolution image encoding with multiple efficient attention mechanisms | 243
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299
lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,336
opengvlab/visionllm | A large language model designed to process and generate visual information | 956
dvlab-research/llama-vid | An image-based language model that uses large language models to generate visual and text features from videos | 748
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 717
pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks | 2,023
salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets | 259
alibaba/conv-llava | An optimization technique for large-scale image models that reduces computational requirements while maintaining performance | 106
yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 75
whai362/pvt | An implementation of Pyramid Vision Transformers for image classification, object detection, and semantic segmentation tasks | 1,745
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 353
lavi-lab/visual-table | A project that generates visual representations tailored for general visual reasoning, leveraging hierarchical scene descriptions and instance-level world knowledge | 14
byungkwanlee/moai | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 314