VTimeLLM
Video Moment LLM
A PyTorch-based video LLM designed to understand and reason about video moments in terms of their time boundaries (when events start and end).
[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
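For a sense of what reasoning about "time boundaries" looks like in practice: a grounding query yields a start/end span that the caller maps back to seconds using the video duration. The sketch below is a minimal illustration only; the `parse_span` helper and the assumption that the model expresses boundaries as normalized indices (here 0–99) scaled by video duration are hypothetical, not the repository's documented API or output format.

```python
import re

def parse_span(answer: str, duration_s: float, num_slots: int = 100) -> tuple[float, float]:
    """Map a grounding answer like 'from 12 to 47' to seconds.

    Assumes boundaries are normalized indices in [0, num_slots), a common
    convention in temporal-grounding video LLMs; check the actual VTimeLLM
    output format before relying on this.
    """
    m = re.search(r"from\s+(\d+)\s+to\s+(\d+)", answer)
    if m is None:
        raise ValueError(f"no time span found in {answer!r}")
    start, end = (int(g) * duration_s / num_slots for g in m.groups())
    return start, end

# A 300-second video and a hypothetical model answer:
print(parse_span("The person opens the door from 12 to 47.", 300.0))
# -> (36.0, 141.0)
```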
231 stars · 2 watching · 11 forks · Language: Python · Last commit: 8 months ago

Related projects:
| Description | Stars |
| --- | --- |
| A video large language model designed for fine-grained comprehension and localization in videos, with a custom Temporal Perception Module for improved temporal modeling | 58 |
| An AI model for long-term video understanding | 254 |
| Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 270 |
| An efficient framework for end-to-end learning on image-text and video-text tasks | 709 |
| A tool to evaluate video language models' ability to understand and describe video content | 91 |
| Code and tools for learning joint text-video embeddings using the HowTo100M dataset | 254 |
| A PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 541 |
| An implementation of a spatio-temporal convolutional LSTM module for video autoencoders with differentiable memory | 293 |
| A PyTorch-based framework for training large language models in parallel on multiple devices | 679 |
| An audio-visual language model designed to advance spatial-temporal modeling and audio understanding in video processing | 957 |
| A collection of miscellaneous PyTorch implementations covering various machine learning concepts and techniques | 468 |
| An open-source PyTorch implementation of the Mixture-of-Embeddings-Experts model for video-text retrieval | 118 |
| An open-source implementation of a vision-language instructed large language model | 513 |
| A video-language model that uses large language models to generate visual and textual features from videos | 748 |
| A PyTorch implementation of a video question answering system based on the TVQA dataset | 172 |