Chat-UniVi
Visual unification framework
A framework for unified visual representation in image and video understanding models, enabling efficient training of large language models on multimodal data.
[CVPR 2024 Highlight] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
895 stars
7 watching
43 forks
Language: Python
Last commit: 4 months ago
Topics: image-understanding, large-language-models, video-understanding, vision-language-model
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | Extending pretraining models to handle multiple modalities by aligning language and video representations | 751 |
| | Evaluates and benchmarks large language models' video understanding capabilities | 121 |
| | Improves performance of vision-language tasks by integrating computer vision capabilities into large language models | 314 |
| | A unified framework for training large language models to understand and generate visual content | 544 |
| | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299 |
| | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| | A project that improves visual representations in vision-language models by developing an object detection model for richer visual object and concept representations | 350 |
| | Develops a multimodal task and dataset to assess vision-language models' ability to handle interleaved image-text inputs | 33 |
| | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks | 2,023 |
| | A deep learning framework for iteratively decomposing vision and language reasoning via large language models | 32 |
| | Implements a unified modal learning framework for generative vision-language models | 43 |
| | Integrates visual knowledge into large language models to improve their capabilities and reduce hallucinations | 124 |
| | An end-to-end image captioning system that uses large multi-modal models and provides tools for training, inference, and demo usage | 1,849 |
| | PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 541 |
| | An unsupervised deep learning framework for translating images between different modalities | 1,994 |