VCoder
Perception adapter
An adapter that improves multimodal large language models on object-level perception tasks by using auxiliary perception modalities
VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023 / CVPR 2024
261 stars
9 watching
15 forks
Language: Python
Last commit: 7 months ago

Related projects:
Repository | Description | Stars |
---|---|---|
lhoyer/mic | An unsupervised domain adaptation method that uses contextual information to improve performance on visual recognition tasks | 269 |
shi-labs/gfr-dsod | Improving Object Detection from Scratch via Gated Feature Reuse | 65 |
vchitect/vbench | A tool for evaluating and benchmarking video generative models in computer vision and artificial intelligence | 576 |
roboflow/maestro | A tool to streamline fine-tuning of multimodal models for vision-language tasks | 1,386 |
wasidennis/adaptsegnet | This project implements a deep learning-based approach to adapt semantic segmentation models from one domain to another. | 849 |
vision-cair/longvu | An artificial intelligence system designed to understand and describe long-form video content | 270 |
yiyangzhou/lure | Analyzing and mitigating object hallucination in large vision-language models to improve their accuracy and reliability. | 134 |
yunxinli/lingcloud | An approach to enhance large language models by incorporating visual information using human-like eyes | 48 |
gordonhu608/mqt-llava | A vision-language model that uses a query transformer to encode images as visual tokens and allows flexible choice of the number of visual tokens. | 97 |
byungkwanlee/moai | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 311 |
vlf-silkie/vlfeedback | An annotated preference dataset and training framework for improving large vision language models. | 85 |
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243 |
thecodrr/vspeech | Provides an interface to Mozilla's DeepSpeech TensorFlow-based Speech-to-Text library using V bindings. | 50 |
cvondrick/vatic | Tools for efficiently scaling up video annotation using crowdsourced marketplaces. | 607 |
sergioburdisso/pyss3 | A Python package implementing an interpretable machine learning model for text classification with visualization tools | 336 |