Video-LLaVA
Visual Reasoning Library
This project enables large language models to perform visual reasoning on images and videos simultaneously by learning a united visual representation that is aligned before projection into the language model.
[EMNLP 2024] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
3k stars
28 watching
219 forks
Language: Python
last commit: about 2 months ago
Topics: instruction-tuning, large-vision-language-model, multi-modal
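Since the model reasons over images and videos in one prompt, a quick way to try it is the Hugging Face transformers port. Below is a minimal sketch, assuming a recent transformers release with `VideoLlavaForConditionalGeneration`, the `LanguageBind/Video-LLaVA-7B-hf` checkpoint, and PyAV for frame decoding; `sample.jpg`, `sample.mp4`, and the `sample_frames` helper are illustrative placeholders, not part of this repository.

```python
# Minimal sketch: joint image + video reasoning with the Hugging Face port.
# Assumes `transformers` with Video-LLaVA support, plus `av` (PyAV) and `pillow`.
import av
import numpy as np
from PIL import Image
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"  # assumed checkpoint name
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

def sample_frames(path, num_frames=8):
    """Uniformly sample frames as an (N, H, W, 3) uint8 array.

    Assumes the container reports its frame count; fine for a sketch.
    """
    container = av.open(path)
    total = container.streams.video[0].frames
    wanted = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
    frames = [f.to_ndarray(format="rgb24")
              for i, f in enumerate(container.decode(video=0)) if i in wanted]
    return np.stack(frames)

# One prompt can mix both modalities because they share a projection.
prompt = ("USER: <image>\nWhat is shown in the image? "
          "USER: <video>\nAnd what happens in the video? ASSISTANT:")
inputs = processor(text=prompt,
                   images=Image.open("sample.jpg"),
                   videos=sample_frames("sample.mp4"),
                   return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Because both modalities pass through the same projection into the language model, a single prompt can interleave `<image>` and `<video>` tokens; this shared, pre-aligned visual space is what the "united representation" in the paper title refers to.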
Related projects:
| Repository | Description | Stars |
|---|---|---|
| haotian-liu/LLaVA | A system that uses large language and vision models to generate and process visual instructions | 20,232 |
| LLaVA-VL/LLaVA-NeXT | Develops large multimodal models for computer vision tasks, including image and video analysis | 2,872 |
| PKU-YuanGroup/LanguageBind | Extends pretrained models to multiple modalities by aligning each modality's representation with language | 723 |
| DAMO-NLP-SG/Video-LLaMA | An audio-visual language model designed to understand and respond to video content, with improved instruction following | 2,802 |
| PKU-YuanGroup/MoE-LLaVA | A mixture-of-experts architecture for multi-modal learning with large vision-language models | 1,980 |
| yfzhang114/LLaVA-Align | Debiasing techniques that reduce hallucinations in large vision-language models | 71 |
| open-mmlab/mmaction2 | A comprehensive video understanding toolbox and benchmark with a modular design, supporting tasks such as action recognition, localization, and retrieval | 4,296 |
| dvlab-research/MGM | An open-source framework for training large language models with vision capabilities | 3,211 |
| OpenGVLab/LLaMA-Adapter | An implementation of an efficient, accurate method for fine-tuning language models to follow instructions | 5,754 |
| hiyouga/LLaMA-Factory | A unified platform for fine-tuning many large language models with a variety of training approaches | 34,436 |
| Luodian/Otter | A multi-modal model built for instruction following and in-context learning, using large-scale architectures and diverse training datasets | 3,563 |
| X-PLUG/mPLUG-Owl | A family of modularized multimodal large language models that understand visual and video content | 2,321 |
| PKU-YuanGroup/Video-Bench | Evaluates and benchmarks the video understanding capabilities of large language models | 117 |
| sgl-project/sglang | A framework for serving large language models and vision models with an efficient runtime and a flexible interface | 6,082 |
| PKU-YuanGroup/Chat-UniVi | A unified visual representation for image and video understanding, enabling efficient training of large language models on multimodal data | 847 |