VCoder

Perception adapter

An adapter that improves multimodal large language models on object-level perception tasks by using auxiliary perception modalities

VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023 / CVPR 2024

GitHub

261 stars
9 watching
15 forks
Language: Python
Last commit: 7 months ago

Related projects:

Repository | Description | Stars
lhoyer/mic | An unsupervised domain adaptation method that uses contextual information to improve performance on visual recognition tasks | 269
shi-labs/gfr-dsod | Improving Object Detection from Scratch via Gated Feature Reuse | 65
vchitect/vbench | A tool for evaluating and benchmarking video generative models in computer vision and artificial intelligence | 576
roboflow/maestro | A tool to streamline fine-tuning of multimodal models for vision-language tasks | 1,386
wasidennis/adaptsegnet | A deep learning-based approach to adapting semantic segmentation models from one domain to another | 849
vision-cair/longvu | An artificial intelligence system designed to understand and describe long-form video content | 270
yiyangzhou/lure | Analyzing and mitigating object hallucination in large vision-language models to improve their accuracy and reliability | 134
yunxinli/lingcloud | An approach to enhance large language models by incorporating visual information using human-like eyes | 48
gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens | 97
byungkwanlee/moai | Improves performance on vision-language tasks by integrating computer vision capabilities into large language models | 311
vlf-silkie/vlfeedback | An annotated preference dataset and training framework for improving large vision-language models | 85
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243
thecodrr/vspeech | Provides an interface to Mozilla's DeepSpeech TensorFlow-based Speech-to-Text library using V bindings | 50
cvondrick/vatic | Tools for efficiently scaling up video annotation using crowdsourced marketplaces | 607
sergioburdisso/pyss3 | A Python package implementing an interpretable machine learning model for text classification with visualization tools | 336