VCoder

Perception adapter

An adapter that improves multimodal large language models on object-level perception tasks by adding auxiliary perception modalities

VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023 / CVPR 2024


266 stars
9 watching
15 forks
Language: Python
Last commit: 9 months ago

Related projects:

lhoyer/mic: An unsupervised domain adaptation method that uses contextual information to improve performance on visual recognition tasks (271 stars)
shi-labs/gfr-dsod: Improving object detection from scratch via gated feature reuse (65 stars)
vchitect/vbench: A benchmark suite for evaluating the performance of video generative models (643 stars)
roboflow/maestro: A tool that streamlines fine-tuning of multimodal models for vision-language tasks (1,415 stars)
wasidennis/adaptsegnet: A deep learning approach for adapting semantic segmentation models from one domain to another (851 stars)
vision-cair/longvu: An AI system designed to understand and describe long-form video content (329 stars)
yiyangzhou/lure: Analyzes and mitigates object hallucination in large vision-language models to improve their accuracy and reliability (136 stars)
yunxinli/lingcloud: Enhances language models with human-like "eyes" to improve visual comprehension and interaction with the external world (48 stars)
gordonhu608/mqt-llava: A vision-language model that uses a query transformer to encode images as visual tokens and allows a flexible choice of the number of visual tokens (101 stars)
byungkwanlee/moai: Improves performance on vision-language tasks by integrating computer vision capabilities into large language models (314 stars)
vlf-silkie/vlfeedback: An annotated preference dataset and training framework for improving large vision-language models (88 stars)
tianyi-lab/hallusionbench: An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy (259 stars)
thecodrr/vspeech: Provides an interface to Mozilla's TensorFlow-based DeepSpeech speech-to-text library using V bindings (49 stars)
cvondrick/vatic: Tools for efficiently scaling up video annotation using crowdsourced marketplaces (609 stars)
sergioburdisso/pyss3: A Python package implementing an interpretable machine learning model for text classification, with visualization tools (336 stars)