CoLLaVO
Vision Language Model
Develops a PyTorch implementation of an enhanced vision language model
[ACL 2024 Findings] Official PyTorch Implementation code for realizing the technical part of CoLLaVO: Crayon Large Language and Vision mOdel to significantly improve zero-shot vision language performances
93 stars
5 watching
13 forks
Language: Python
last commit: 8 months ago Related projects:
Repository | Description | Stars |
---|---|---|
| Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 314 |
| A PyTorch toolbox for supporting research and development of domain adaptation, generalization, and semi-supervised learning methods in computer vision. | 1,236 |
| An implementation of Mamba-based traversal of rationale to improve performance of numerous vision language models. | 102 |
| An implementation of semantic image synthesis via adversarial learning using PyTorch | 145 |
| A Python package for building and deploying computer vision models with PyTorch | 614 |
| Improves large vision-language models' ability to accurately describe images by combining global and local attention mechanisms. | 18 |
| Develops and trains models for vision-language learning with decoupled language pre-training | 24 |
| An efficient framework for end-to-end learning on image-text and video-text tasks | 709 |
| A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 246 |
| This project provides an official PyTorch implementation of a method to interpret and edit vision-language representations to mitigate hallucinations in image captions. | 46 |
| This project integrates visual knowledge into large language models to improve their capabilities and reduce hallucinations. | 124 |
| Implementing a unified modal learning framework for generative vision-language models | 43 |
| PyTorch implementation of video captioning, combining deep learning and computer vision techniques. | 402 |
| An open-source project that enables the transfer of language understanding to vision capabilities through long context processing. | 347 |
| A deep learning framework for training multi-modal models with vision and language capabilities. | 1,299 |