Chat-UniVi

Visual unification framework

A framework for unified visual representation in image and video understanding models, enabling efficient training of large language models on multimodal data.

[CVPR 2024 Highlight 🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

GitHub

895 stars
7 watching
43 forks
Language: Python
last commit: 3 months ago
Topics: image-understanding, large-language-models, video-understanding, vision-language-model

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| pku-yuangroup/languagebind | Extends pretrained models to multiple modalities by aligning language and video representations | 751 |
| pku-yuangroup/video-bench | Evaluates and benchmarks large language models' video understanding capabilities | 121 |
| byungkwanlee/moai | Improves performance on vision-language tasks by integrating computer vision capabilities into large language models | 314 |
| jy0205/lavit | A unified framework for training large language models to understand and generate visual content | 544 |
| nvlabs/prismer | A deep learning framework for training multimodal models with vision and language capabilities | 1,299 |
| wisconsinaivision/vip-llava | A system designed to enable large multimodal models to understand arbitrary visual prompts | 302 |
| pzzhang/vinvl | Improves visual representations in vision-language models through an object detection model that produces richer visual object and concept representations | 350 |
| zhourax/vega | A multimodal task and dataset for assessing vision-language models' ability to handle interleaved image-text inputs | 33 |
| pku-yuangroup/moe-llava | A large vision-language model using a mixture-of-experts architecture to improve performance on multimodal learning tasks | 2,023 |
| hxyou/idealgpt | A deep learning framework for iteratively decomposing vision and language reasoning via large language models | 32 |
| shizhediao/davinci | A unified modal learning framework for generative vision-language models | 43 |
| jiutian-vl/jiutian-lion | Integrates visual knowledge into large language models to improve their capabilities and reduce hallucinations | 124 |
| yuliang-liu/monkey | An end-to-end image captioning system built on large multimodal models, with tools for training, inference, and demos | 1,849 |
| penghao-wu/vstar | PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 541 |
| mingyuliutw/unit | An unsupervised deep learning framework for translating images between different modalities | 1,994 |