MQT-LLaVA

Visual encoder

A vision-language model that uses a query transformer to encode each image as a set of visual tokens, where the number of tokens can be chosen flexibly at inference time.

[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
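
The idea in the description above, a set of queries that turns image features into a selectable number of visual tokens, can be illustrated with a minimal sketch. This is not the repository's implementation. It assumes a single cross-attention step with no learned projections, random stand-in weights, and illustrative sizes (576 patch features, 256 queries, width 64). The names `encode_visual_tokens`, `image_features`, `queries`, and `num_tokens` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encode_visual_tokens(image_features, queries, num_tokens):
    """Cross-attend the first `num_tokens` queries to the image features.

    image_features: (N, d) array of patch features (assumed shape).
    queries:        (M, d) array of query vectors (random here, learned in practice).
    Returns a (num_tokens, d) array of visual tokens.
    """
    if not 1 <= num_tokens <= queries.shape[0]:
        raise ValueError("num_tokens must be between 1 and the number of queries")
    q = queries[:num_tokens]  # keep only a prefix of the queries
    scores = q @ image_features.T / np.sqrt(q.shape[-1])  # (num_tokens, N)
    weights = softmax(scores, axis=-1)  # each query attends over all patches
    return weights @ image_features  # (num_tokens, d)


rng = np.random.default_rng(0)
image_features = rng.normal(size=(576, 64))  # illustrative sizes only
queries = rng.normal(size=(256, 64))

tokens_16 = encode_visual_tokens(image_features, queries, 16)
tokens_64 = encode_visual_tokens(image_features, queries, 64)
print(tokens_16.shape, tokens_64.shape)
```

In this simplified setting each query attends to the image independently, so the 16-token output equals the first 16 rows of the 64-token output. A smaller token budget is a nested prefix of a larger one, which is where the "Matryoshka" name comes from.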

GitHub

97 stars
13 watching
11 forks
Language: Python
last commit: 5 months ago

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 294 |
| microsoft/vision-longformer | An implementation of a vision transformer architecture designed for high-resolution image encoding with multiple efficient attention mechanisms | 241 |
| nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,298 |
| lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model | 1,300 |
| opengvlab/visionllm | A large language model designed to process and generate visual information | 915 |
| dvlab-research/llama-vid | An image-based language model that uses large language models to generate visual and text features from videos | 733 |
| llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704 |
| pku-yuangroup/moe-llava | Develops a neural network architecture for multi-modal learning with large vision-language models | 1,980 |
| salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets | 258 |
| alibaba/conv-llava | This project presents an optimization technique for large-scale image models to reduce computational requirements while maintaining performance | 104 |
| yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 71 |
| whai362/pvt | An implementation of Pyramid Vision Transformers for image classification, object detection, and semantic segmentation tasks | 1,728 |
| llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 351 |
| lavi-lab/visual-table | A project that generates visual representations tailored for general visual reasoning, leveraging hierarchical scene descriptions and instance-level world knowledge | 14 |
| byungkwanlee/moai | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 311 |