CoLLaVO
Vision Language Model
A PyTorch implementation of an enhanced vision language model
[ACL 2024 Findings] Official PyTorch implementation of CoLLaVO: Crayon Large Language and Vision mOdel, which aims to significantly improve zero-shot vision language performance
93 stars
5 watching
13 forks
Language: Python
last commit: over 1 year ago

Related projects:

| Repository | Description | Stars |
|---|---|---|
| | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 314 |
| | A PyTorch toolbox for supporting research and development of domain adaptation, generalization, and semi-supervised learning methods in computer vision. | 1,236 |
| | An implementation of Mamba-based traversal of rationale to improve performance of numerous vision language models. | 102 |
| | An implementation of semantic image synthesis via adversarial learning using PyTorch | 145 |
| | A Python package for building and deploying computer vision models with PyTorch | 614 |
| | Improves large vision-language models' ability to accurately describe images by combining global and local attention mechanisms. | 18 |
| | Develops and trains models for vision-language learning with decoupled language pre-training | 24 |
| | An efficient framework for end-to-end learning on image-text and video-text tasks | 709 |
| | A PyTorch implementation of an encoder-free vision-language model that can be fine-tuned for various tasks and modalities | 246 |
| | An official PyTorch implementation of a method to interpret and edit vision-language representations to mitigate hallucinations in image captions. | 46 |
| | Integrates visual knowledge into large language models to improve their capabilities and reduce hallucinations. | 124 |
| | Implements a unified modal learning framework for generative vision-language models | 43 |
| | PyTorch implementation of video captioning, combining deep learning and computer vision techniques. | 402 |
| | An open-source project that enables the transfer of language understanding to vision capabilities through long context processing. | 347 |
| | A deep learning framework for training multi-modal models with vision and language capabilities. | 1,299 |