Chat-UniVi
Visual unification framework
A framework for unified visual representation in image and video understanding models, enabling efficient training of large language models on multimodal data.
[CVPR 2024 Highlight] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
895 stars
7 watching
43 forks
Language: Python
Last commit: about 1 year ago
Topics: image-understanding, large-language-models, video-understanding, vision-language-model
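Chat-UniVi's core idea is to represent both images and videos with a small set of dynamic visual tokens produced by merging (clustering) redundant patch embeddings before they are fed to the language model. The sketch below illustrates that token-merging idea with a plain k-means loop; the function name `merge_visual_tokens` and the use of k-means are illustrative assumptions, not the repository's actual implementation (which uses a parameter-free clustering scheme).

```python
import numpy as np

def merge_visual_tokens(tokens: np.ndarray, num_clusters: int, iters: int = 10) -> np.ndarray:
    """Merge N patch tokens into num_clusters representative tokens.

    tokens: (N, D) array of patch embeddings.
    Returns a (num_clusters, D) array of merged tokens (cluster means).
    NOTE: simplified k-means stand-in for illustration only.
    """
    rng = np.random.default_rng(0)
    # Initialize cluster centers from distinct randomly chosen tokens.
    centers = tokens[rng.choice(len(tokens), size=num_clusters, replace=False)]
    for _ in range(iters):
        # Assign every patch token to its nearest center.
        dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Each merged token is the mean of the patches assigned to it.
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

# 196 patch tokens (a 14x14 grid) of dim 8, merged down to 16 visual tokens.
patches = np.random.default_rng(1).normal(size=(196, 8))
merged = merge_visual_tokens(patches, num_clusters=16)
print(merged.shape)  # (16, 8)
```

Because the same merging operates over patches within a frame and across video frames, one model can consume images and videos with a shared, compact token budget.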
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | Extending pretrained models to handle multiple modalities by aligning language and video representations | 751 |
| | Evaluates and benchmarks large language models' video understanding capabilities | 121 |
| | Improves performance of vision language tasks by integrating computer vision capabilities into large language models | 314 |
| | A unified framework for training large language models to understand and generate visual content | 544 |
| | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299 |
| | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| | Develops an object detection model that gives vision-language models richer visual object and concept representations | 350 |
| | Develops a multimodal task and dataset to assess vision-language models' ability to handle interleaved image-text inputs. | 33 |
| | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks | 2,023 |
| | A deep learning framework for iteratively decomposing vision and language reasoning via large language models | 32 |
| | Implementing a unified modal learning framework for generative vision-language models | 43 |
| | Integrates visual knowledge into large language models to improve their capabilities and reduce hallucinations | 124 |
| | An end-to-end image captioning system that uses large multi-modal models and provides tools for training, inference, and demo usage | 1,849 |
| | PyTorch implementation of guided visual search mechanism for multimodal LLMs | 541 |
| | An unsupervised deep learning framework for translating images between different modalities | 1,994 |