VTimeLLM

Video Moment LLM

A PyTorch-based video LLM designed to understand and reason about video moments in terms of their time boundaries, that is, when each moment starts and ends.

[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
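
To make "time boundaries" concrete, here is a minimal sketch of one common way to represent a video moment as a start/end pair and to compare a predicted moment against a reference using temporal intersection over union. The `Moment` class and `temporal_iou` function are hypothetical names chosen for illustration; they are not taken from the VTimeLLM codebase and say nothing about how the model itself is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Moment:
    """A video moment as start/end times in seconds (hypothetical type)."""
    start: float
    end: float

    def __post_init__(self):
        # Reject intervals whose end precedes their start.
        if self.end < self.start:
            raise ValueError("end must not precede start")


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    # Two zero-length moments have no union; treat their overlap as 0.
    return intersection / union if union > 0 else 0.0


# Illustrative values only, not results from the paper.
predicted = Moment(start=10.0, end=20.0)
reference = Moment(start=15.0, end=25.0)
print(temporal_iou(predicted, reference))
```

A score of 1.0 means the two moments coincide exactly, and 0.0 means they do not overlap at all.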

231 stars
2 watching
11 forks
Language: Python
Last commit: 7 months ago

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| dcdmllm/momentor | A video large language model designed for fine-grained comprehension and localization in videos, with a custom Temporal Perception Module for improved temporal modeling | 58 |
| boheumd/ma-lmm | An AI model for long-term video understanding | 254 |
| vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 270 |
| jayleicn/clipbert | An efficient framework for end-to-end learning on image-text and video-text tasks | 709 |
| llyx97/tempcompass | A tool to evaluate video language models' ability to understand and describe video content | 91 |
| antoine77340/howto100m | Code and tools for learning joint text-video embeddings using the HowTo100M dataset | 254 |
| penghao-wu/vstar | PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 541 |
| viorik/convlstm | An implementation of a spatio-temporal convolutional LSTM module for video autoencoders with differentiable memory | 293 |
| volcengine/vescale | A PyTorch-based framework for training large language models in parallel on multiple devices | 679 |
| damo-nlp-sg/videollama2 | An audio-visual language model designed to advance spatial-temporal modeling and audio understanding in video processing | 957 |
| t-vi/pytorch-tvmisc | A collection of miscellaneous PyTorch implementations covering various machine learning concepts and techniques | 468 |
| antoine77340/mixture-of-embedding-experts | An open-source PyTorch implementation of the Mixture-of-Embeddings-Experts model for video-text retrieval tasks | 118 |
| luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 513 |
| dvlab-research/llama-vid | An image-based language model that uses large language models to generate visual and text features from videos | 748 |
| jayleicn/tvqa | PyTorch implementation of a video question answering system based on the TVQA dataset | 172 |