VCoder
Perception adapter
An adapter that improves large language models on object-level perception tasks by using auxiliary perception modalities
VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023 / CVPR 2024
266 stars
9 watching
15 forks
Language: Python
Last commit: 9 months ago
Related projects:
Repository | Description | Stars |
---|---|---|
lhoyer/mic | An unsupervised domain adaptation method that uses contextual information to improve performance on visual recognition tasks | 271 |
shi-labs/gfr-dsod | Improving Object Detection from Scratch via Gated Feature Reuse | 65 |
vchitect/vbench | A benchmark suite for evaluating the performance of video generative models | 643 |
roboflow/maestro | A tool to streamline fine-tuning of multimodal models for vision-language tasks | 1,415 |
wasidennis/adaptsegnet | This project implements a deep learning-based approach to adapt semantic segmentation models from one domain to another. | 851 |
vision-cair/longvu | An artificial intelligence system designed to understand and describe long-form video content | 329 |
yiyangzhou/lure | Analyzing and mitigating object hallucination in large vision-language models to improve their accuracy and reliability. | 136 |
yunxinli/lingcloud | Enhances language models by incorporating human-like eyes to improve visual comprehension and interaction with the external world | 48 |
gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens. | 101 |
byungkwanlee/moai | Improves performance on vision-language tasks by integrating computer vision capabilities into large language models | 314 |
vlf-silkie/vlfeedback | An annotated preference dataset and training framework for improving large vision language models. | 88 |
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 259 |
thecodrr/vspeech | Provides an interface to Mozilla's DeepSpeech TensorFlow-based Speech-to-Text library using V bindings. | 49 |
cvondrick/vatic | Tools for efficiently scaling up video annotation using crowdsourced marketplaces. | 609 |
sergioburdisso/pyss3 | A Python package implementing an interpretable machine learning model for text classification with visualization tools | 336 |