MQT-LLaVA
Visual encoder
A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens.
[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
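The description above can be illustrated with a minimal sketch. It assumes the flexible token count comes from keeping only a prefix of a fixed set of learned latent queries, as the "Matryoshka" name suggests, and then cross-attending those queries to the image patch features. All names here (`select_nested_queries`, `cross_attend`, `encode_image`) and the sizes are hypothetical and are not taken from the MQT-LLaVA codebase. Random vectors stand in for learned queries and real patch features.

```python
import math
import random


def select_nested_queries(queries, m):
    """Keep only the first m latent queries (a nested, Matryoshka-style prefix)."""
    if not 1 <= m <= len(queries):
        raise ValueError("m must be between 1 and the number of queries")
    return queries[:m]


def cross_attend(queries, patches):
    """Single-head dot-product cross-attention: each query yields one visual token."""
    dim = len(patches[0])
    tokens = []
    for q in queries:
        # Scaled similarity between this query and every image patch.
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # The visual token is the attention-weighted average of the patches.
        tokens.append([
            sum(w * p[i] for w, p in zip(weights, patches)) / total
            for i in range(dim)
        ])
    return tokens


def encode_image(patches, queries, m):
    """Encode patch features into exactly m visual tokens."""
    return cross_attend(select_nested_queries(queries, m), patches)


# Illustrative sizes only (assumptions, not the model's real dimensions).
rng = random.Random(0)
DIM, NUM_PATCHES, MAX_QUERIES = 8, 16, 6
patches = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_QUERIES)]

few = encode_image(patches, queries, m=2)   # small visual-token budget
many = encode_image(patches, queries, m=6)  # full visual-token budget
```

In this simplified single-layer setting each token depends only on its own query, so the tokens produced with a small budget are a prefix of those produced with a larger one. That is why one set of queries can serve several budgets here.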
97 stars
13 watching
11 forks
Language: Python
last commit: 5 months ago
Related projects:
Repository | Description | Stars |
---|---|---|
wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 294 |
microsoft/vision-longformer | An implementation of a vision transformer architecture designed for high-resolution image encoding with multiple efficient attention mechanisms | 241 |
nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,298 |
lxtgh/omg-seg | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,300 |
opengvlab/visionllm | A large language model designed to process and generate visual information | 915 |
dvlab-research/llama-vid | An image-based language model that uses large language models to generate visual and text features from videos | 733 |
llava-vl/llava-plus-codebase | A platform for training and deploying large language and vision models that can use tools to perform tasks | 704 |
pku-yuangroup/moe-llava | Develops a neural network architecture for multi-modal learning with large vision-language models | 1,980 |
salt-nlp/llavar | An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets. | 258 |
alibaba/conv-llava | This project presents an optimization technique for large-scale image models to reduce computational requirements while maintaining performance. | 104 |
yfzhang114/llava-align | Debiasing techniques to minimize hallucinations in large visual language models | 71 |
whai362/pvt | An implementation of Pyramid Vision Transformers for image classification, object detection, and semantic segmentation tasks | 1,728 |
llava-vl/llava-interactive-demo | An all-in-one demo for interactive image processing and generation | 351 |
lavi-lab/visual-table | A project that generates visual representations tailored for general visual reasoning, leveraging hierarchical scene descriptions and instance-level world knowledge. | 14 |
byungkwanlee/moai | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 311 |