Video-LLaVA

Visual Reasoning Library

This project enables large language models to perform visual reasoning on images and videos simultaneously by learning a united visual representation that is aligned before projection into the language model.

[EMNLP 2024] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
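A rough sketch of the "alignment before projection" idea named in the paper title: the image and video encoders (LanguageBind) already map both modalities into one aligned embedding space, so a single shared projector can lift either modality into the LLM's token space. The class name, layer sizes, and two-layer MLP shape below are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class SharedProjector(nn.Module):
    """Illustrative shared projector: because image and video features are
    pre-aligned (LanguageBind), one module can serve both modalities."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP, the common LLaVA-style projector shape (assumed here).
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vision_dim), from either the
        # image encoder or the frame-flattened video encoder.
        return self.mlp(visual_tokens)

projector = SharedProjector()
image_tokens = torch.randn(1, 256, 1024)      # one image's patch features
video_tokens = torch.randn(1, 8 * 256, 1024)  # 8 video frames, flattened
print(projector(image_tokens).shape)  # torch.Size([1, 256, 4096])
print(projector(video_tokens).shape)  # torch.Size([1, 2048, 4096])
```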

GitHub

3k stars
28 watching
219 forks
Language: Python
last commit: about 2 months ago
Topics: instruction-tuning, large-vision-language-model, multi-modal
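For a quick end-to-end example, the sketch below runs Video-LLaVA through the Hugging Face transformers port (VideoLlavaProcessor and VideoLlavaForConditionalGeneration with the LanguageBind/Video-LLaVA-7B-hf checkpoint) rather than the repository's own inference scripts; the video path and prompt are placeholders.

```python
import av
import numpy as np
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

MODEL_ID = "LanguageBind/Video-LLaVA-7B-hf"
model = VideoLlavaForConditionalGeneration.from_pretrained(MODEL_ID)
processor = VideoLlavaProcessor.from_pretrained(MODEL_ID)

# Sample 8 evenly spaced RGB frames from a video file (placeholder path).
container = av.open("sample_video.mp4")
total = container.streams.video[0].frames
wanted = set(np.linspace(0, total - 1, num=8).astype(int).tolist())
frames = [
    frame.to_ndarray(format="rgb24")
    for i, frame in enumerate(container.decode(video=0))
    if i in wanted
]
clip = np.stack(frames)  # shape (8, H, W, 3)

# Video-LLaVA's chat format; <video> marks where the frames are inserted.
prompt = "USER: <video>What is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```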

Related projects:

| Repository | Description | Stars |
|---|---|---|
| haotian-liu/llava | A system that uses large language and vision models to generate and process visual instructions | 20,232 |
| llava-vl/llava-next | Develops large multimodal models for various computer vision tasks, including image and video analysis | 2,872 |
| pku-yuangroup/languagebind | Extends pretrained models to multiple modalities by aligning language and video representations | 723 |
| damo-nlp-sg/video-llama | An audio-visual language model designed to understand and respond to video content with improved instruction-following capabilities | 2,802 |
| pku-yuangroup/moe-llava | Develops a mixture-of-experts architecture for multi-modal learning with large vision-language models | 1,980 |
| yfzhang114/llava-align | Debiasing techniques that minimize hallucinations in large visual language models | 71 |
| open-mmlab/mmaction2 | A comprehensive video understanding toolbox and benchmark with a modular design, supporting tasks such as action recognition, localization, and retrieval | 4,296 |
| dvlab-research/mgm | An open-source framework for training large language models with vision capabilities | 3,211 |
| opengvlab/llama-adapter | A method for efficiently and accurately fine-tuning language models to follow instructions | 5,754 |
| hiyouga/llama-factory | A unified platform for fine-tuning many large language models with a variety of training approaches | 34,436 |
| luodian/otter | A multi-modal model built for improved instruction following and in-context learning, using large-scale architectures and diverse training datasets | 3,563 |
| x-plug/mplug-owl | Develops multimodal large language models that understand visual and video content and generate human-like responses | 2,321 |
| pku-yuangroup/video-bench | Evaluates and benchmarks the video-understanding capabilities of large language models | 117 |
| sgl-project/sglang | A framework for serving large language and vision models with an efficient runtime and a flexible interface | 6,082 |
| pku-yuangroup/chat-univi | A framework for unified visual representation across image and video understanding, enabling efficient training of large language models on multimodal data | 847 |