VTimeLLM

Video Moment LLM

A PyTorch-based video LLM designed to understand and reason about video moments and their time boundaries.

[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".

GitHub

Stars: 225
Watching: 2
Forks: 11
Language: Python
Last commit: 5 months ago

Related projects:

| Repository | Description | Stars |
|---|---|---|
| dcdmllm/momentor | A video Large Language Model designed for fine-grained comprehension and localization in videos with a custom Temporal Perception Module for improved temporal modeling | 54 |
| boheumd/ma-lmm | This project develops an AI model for long-term video understanding | 244 |
| vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 269 |
| jayleicn/clipbert | An efficient framework for end-to-end learning on image-text and video-text tasks | 704 |
| llyx97/tempcompass | A tool to evaluate video language models' ability to understand and describe video content | 84 |
| antoine77340/howto100m | Provides code and tools for learning joint text-video embeddings using the HowTo100M dataset | 250 |
| penghao-wu/vstar | PyTorch implementation of guided visual search mechanism for multimodal LLMs | 527 |
| viorik/convlstm | An implementation of a spatio-temporal convolutional LSTM module for video autoencoders with differentiable memory | 292 |
| volcengine/vescale | A PyTorch-based framework for training large language models in parallel on multiple devices | 663 |
| damo-nlp-sg/videollama2 | An audio-visual language model designed to understand and generate video content | 871 |
| t-vi/pytorch-tvmisc | A collection of utilities and tools for building and improving deep learning models in PyTorch | 468 |
| antoine77340/mixture-of-embedding-experts | An open-source implementation of the Mixture-of-Embeddings-Experts model in PyTorch for video-text retrieval tasks | 118 |
| luogen1996/lavin | An open-source implementation of a vision-language instructed large language model | 508 |
| dvlab-research/llama-vid | An image-based language model that uses large language models to generate visual and text features from videos | 733 |
| jayleicn/tvqa | PyTorch implementation of video question answering system based on TVQA dataset | 172 |