MQT-LLaVA

Visual encoder

A vision-language model that uses a query transformer to encode each image as a set of visual tokens, with the number of tokens chosen flexibly at inference time.

[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
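
The description above can be illustrated with a toy sketch. This is not the repository's code. It assumes the nested, prefix-style selection suggested by the "Matryoshka" name: a fixed bank of latent queries is kept, and only the first `num_tokens` of them cross-attend to the image features. The names `cross_attention`, `encode_image`, `latent_queries`, and `image_features`, along with all sizes, are hypothetical and chosen only for illustration. The real model would use learned queries and a multi-layer transformer.

```python
import math
import random


def cross_attention(queries, features):
    """Single-head dot-product cross-attention: each query attends over all image features."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every image feature.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Each output token is the attention-weighted average of the image features.
        outputs.append([
            sum(w / total * f[i] for w, f in zip(weights, features))
            for i in range(dim)
        ])
    return outputs


def encode_image(latent_queries, image_features, num_tokens):
    """Keep only the first `num_tokens` latent queries, then cross-attend to the image."""
    if not 1 <= num_tokens <= len(latent_queries):
        raise ValueError("num_tokens must be between 1 and the number of latent queries")
    return cross_attention(latent_queries[:num_tokens], image_features)


# Toy sizes (assumptions): 16 latent queries, 32 image patches, 8-dimensional features.
rng = random.Random(0)
DIM, MAX_TOKENS, NUM_PATCHES = 8, 16, 32
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_TOKENS)]
image_features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# The same image can be encoded under different visual-token budgets.
for budget in (2, 8, 16):
    tokens = encode_image(latent_queries, image_features, budget)
    print(f"budget={budget}: {len(tokens)} visual tokens of dimension {len(tokens[0])}")
```

In this single-layer sketch each query attends independently, so the tokens produced under a small budget are exactly a prefix of those produced under a larger one. That nesting is what lets one set of weights serve several token budgets.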

101 stars

13 watching

11 forks

Language: Python

Last commit: about 2 years ago

Related projects:

Repository	Description	Stars
wisconsinaivision/vip-llava	A system designed to enable large multimodal models to understand arbitrary visual prompts	302
microsoft/vision-longformer	An implementation of a vision transformer architecture designed for high-resolution image encoding with multiple efficient attention mechanisms	243
nvlabs/prismer	A deep learning framework for training multi-modal models with vision and language capabilities.	1,299
lxtgh/omg-seg	Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model.	1,336
opengvlab/visionllm	A large language model designed to process and generate visual information	956
dvlab-research/llama-vid	A vision-language model that represents each video frame with two tokens (a context token and a content token) so that large language models can process long videos	748
llava-vl/llava-plus-codebase	A platform for training and deploying large language and vision models that can use tools to perform tasks	717
pku-yuangroup/moe-llava	A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks	2,023
salt-nlp/llavar	An open-source project that enhances visual instruction tuning for text-rich image understanding by integrating GPT-4 models with multimodal datasets.	259
alibaba/conv-llava	An optimization technique for large-scale image models that reduces computational requirements while maintaining performance	106
yfzhang114/llava-align	Debiasing techniques to minimize hallucinations in large visual language models	75
whai362/pvt	An implementation of Pyramid Vision Transformers for image classification, object detection, and semantic segmentation tasks	1,745
llava-vl/llava-interactive-demo	An all-in-one demo for interactive image processing and generation	353
lavi-lab/visual-table	A project that generates visual representations tailored for general visual reasoning, leveraging hierarchical scene descriptions and instance-level world knowledge.	14
byungkwanlee/moai	Improves performance of vision language tasks by integrating computer vision capabilities into large language models	314