Video-LLaVA
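Video understanding
A large vision-language model for image and video understanding that aligns image and video features into a unified visual representation before projecting them into the language model.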
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
3k stars
29 watching
219 forks
Language: Python
Last commit: 11 months ago
Tags: instruction-tuning, large-vision-language-model, multi-modal
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A system that uses large language and vision models to generate and process visual instructions | 20,683 |
| | Develops large multimodal models for various computer vision tasks including image and video analysis | 3,099 |
| | Extending pretraining models to handle multiple modalities by aligning language and video representations | 751 |
| | An audio-visual language model designed to understand and respond to video content with improved instruction-following capabilities | 2,842 |
| | A large vision-language model using a mixture-of-experts architecture to improve performance on multi-modal learning tasks | 2,023 |
| | Debiasing techniques to minimize hallucinations in large visual language models | 75 |
| | A comprehensive video understanding toolbox and benchmark with modular design, supporting various tasks such as action recognition, localization, and retrieval. | 4,360 |
| | An open-source framework for training large language models with vision capabilities. | 3,229 |
| | An implementation of a method for fine-tuning language models to follow instructions with high efficiency and accuracy | 5,775 |
| | A tool for efficiently fine-tuning large language models across multiple architectures and methods. | 36,219 |
| | A multi-modal AI model developed for improved instruction-following and in-context learning, utilizing large-scale architectures and various training datasets. | 3,570 |
| | Develops large language models that can understand and generate human-like visual and video content | 2,365 |
| | Evaluates and benchmarks large language models' video understanding capabilities | 121 |
| | A fast serving framework for large language models and vision language models. | 6,551 |
| | A framework for unified visual representation in image and video understanding models, enabling efficient training of large language models on multimodal data. | 895 |