Chat-UniVi
Visual unification framework
A framework for unified visual representation in image and video understanding models, enabling efficient training of large language models on multimodal data.
[CVPR 2024 Highlight] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
895 stars
7 watching
43 forks
Language: Python
Last commit: 3 months ago
Topics: image-understanding, large-language-models, video-understanding, vision-language-model
Related projects:
| Repository | Description | Stars |
|---|---|---|
| pku-yuangroup/languagebind | Extends pretrained models to multiple modalities by aligning language and video representations | 751 |
| pku-yuangroup/video-bench | Evaluates and benchmarks large language models' video understanding capabilities | 121 |
| byungkwanlee/moai | Improves vision-language task performance by integrating computer vision capabilities into large language models | 314 |
| jy0205/lavit | A unified framework for training large language models to understand and generate visual content | 544 |
| nvlabs/prismer | A deep learning framework for training multi-modal models with vision and language capabilities | 1,299 |
| wisconsinaivision/vip-llava | A system that enables large multimodal models to understand arbitrary visual prompts | 302 |
| pzzhang/vinvl | Improves visual representations in vision-language models via an object detection model that produces richer visual object and concept representations | 350 |
| zhourax/vega | A multimodal task and dataset for assessing vision-language models' ability to handle interleaved image-text inputs | 33 |
| pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve multi-modal learning performance | 2,023 |
| hxyou/idealgpt | A deep learning framework that iteratively decomposes vision-and-language reasoning via large language models | 32 |
| shizhediao/davinci | A unified-modal learning framework for generative vision-language models | 43 |
| jiutian-vl/jiutian-lion | Integrates visual knowledge into large language models to improve their capabilities and reduce hallucinations | 124 |
| yuliang-liu/monkey | An end-to-end image captioning system built on large multi-modal models, with tools for training, inference, and demos | 1,849 |
| penghao-wu/vstar | A PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 541 |
| mingyuliutw/unit | An unsupervised deep learning framework for translating images between different modalities | 1,994 |