Chat-UniVi

Visual unification framework

A framework for unified visual representation in image and video understanding models, enabling efficient training of large language models on multimodal data.

[CVPR 2024 Highlight 🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

GitHub

895 stars
7 watching
43 forks
Language: Python
last commit: 3 months ago
Topics: image-understanding, large-language-models, video-understanding, vision-language-model

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| pku-yuangroup/languagebind | Extends pretrained models to multiple modalities by aligning language and video representations | 751 |
| pku-yuangroup/video-bench | Evaluates and benchmarks large language models' video understanding capabilities | 121 |
| byungkwanlee/moai | Improves performance on vision-language tasks by integrating computer vision capabilities into large language models | 314 |
| jy0205/lavit | A unified framework for training large language models to understand and generate visual content | 544 |
| nvlabs/prismer | A deep learning framework for training multimodal models with vision and language capabilities | 1,299 |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| pzzhang/vinvl | Improves visual representations in vision-language models through an object detection model that produces richer visual object and concept representations | 350 |
| zhourax/vega | A multimodal task and dataset for assessing vision-language models' ability to handle interleaved image-text inputs | 33 |
| pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve performance on multimodal learning tasks | 2,023 |
| hxyou/idealgpt | A deep learning framework for iteratively decomposing vision and language reasoning via large language models | 32 |
| shizhediao/davinci | A unified modal learning framework for generative vision-language models | 43 |
| jiutian-vl/jiutian-lion | Integrates visual knowledge into large language models to improve their capabilities and reduce hallucinations | 124 |
| yuliang-liu/monkey | An end-to-end image captioning system built on large multimodal models, with tools for training, inference, and demos | 1,849 |
| penghao-wu/vstar | PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 541 |
| mingyuliutw/unit | An unsupervised deep learning framework for translating images between different modalities | 1,994 |