VideoLLaMA2
Video processor
An audio-visual language model designed to advance spatial-temporal modeling and audio understanding in video processing.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
957 stars
11 watching
62 forks
Language: Python
last commit: 3 months ago
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | Transforms video content into a long document containing visual and audio information that can be used for chat or other applications. | 545 |
| | Converts music represented by a GNU LilyPond file into a video containing a horizontally scrolling music staff synchronized with audio rendering. | 158 |
| | An image-based language model that uses large language models to generate visual and text features from videos. | 748 |
| | A video large language model designed for fine-grained comprehension and localization in videos, with a custom Temporal Perception Module for improved temporal modeling. | 58 |
| | Enables text-to-video generation using a combination of pixel and latent diffusion models. | 1,110 |
| | A collection of information about various large language models used in natural language processing. | 272 |
| | A video conversation model that generates meaningful conversations about videos using large vision and language models. | 1,246 |
| | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs. | 270 |
| | An implementation of a deep learning model to generate videos with dynamic scenes. | 15 |
| | A comprehensive toolkit for high-performance video generation and processing. | 1,819 |
| | An offline video assistant system powered by large language models and computer vision techniques. | 210 |
| | Develops an end-to-end model for multiple visual perception and reasoning tasks using a single encoder, decoder, and large language model. | 1,336 |
| | Develops an AI model for long-term video understanding. | 254 |
| | Unofficial implementation of a deep learning model to generate or modify video content. | 191 |
| | A command-line interface to generate textual datasets with large language models. | 293 |