VTimeLLM
Video Moment LLM
A PyTorch-based video LLM designed to understand and reason about video moments in terms of their time boundaries (when events start and end).
[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
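For a sense of what reasoning about "time boundaries" looks like in practice: a grounding query yields a start/end span that the caller maps back to seconds using the video duration. The sketch below is a minimal illustration only; the `parse_span` helper and the assumption that the model expresses boundaries as normalized indices (here 0–99) scaled by video duration are hypothetical, not the repository's documented API or output format.

```python
import re

def parse_span(answer: str, duration_s: float, num_slots: int = 100) -> tuple[float, float]:
    """Map a grounding answer like 'from 12 to 47' to seconds.

    Assumes boundaries are normalized indices in [0, num_slots), a common
    convention in temporal-grounding video LLMs; check the actual VTimeLLM
    output format before relying on this.
    """
    m = re.search(r"from\s+(\d+)\s+to\s+(\d+)", answer)
    if m is None:
        raise ValueError(f"no time span found in {answer!r}")
    start, end = (int(g) * duration_s / num_slots for g in m.groups())
    return start, end

# A 300-second video and a hypothetical model answer:
print(parse_span("The person opens the door from 12 to 47.", 300.0))
# -> (36.0, 141.0)
```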
231 stars · 2 watching · 11 forks · Language: Python · Last commit: 8 months ago

Related projects:
| Description | Stars |
| --- | --- |
| A video large language model designed for fine-grained comprehension and localization in videos, with a custom Temporal Perception Module for improved temporal modeling | 58 |
| An AI model for long-term video understanding | 254 |
| Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 270 |
| An efficient framework for end-to-end learning on image-text and video-text tasks | 709 |
| A tool to evaluate video language models' ability to understand and describe video content | 91 |
| Code and tools for learning joint text-video embeddings using the HowTo100M dataset | 254 |
| A PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 541 |
| An implementation of a spatio-temporal convolutional LSTM module for video autoencoders with differentiable memory | 293 |
| A PyTorch-based framework for training large language models in parallel on multiple devices | 679 |
| An audio-visual language model designed to advance spatial-temporal modeling and audio understanding in video processing | 957 |
| A collection of miscellaneous PyTorch implementations covering various machine learning concepts and techniques | 468 |
| An open-source PyTorch implementation of the Mixture-of-Embeddings-Experts model for video-text retrieval | 118 |
| An open-source implementation of a vision-language instructed large language model | 513 |
| A video-language model that uses large language models to generate visual and textual features from videos | 748 |
| A PyTorch implementation of a video question answering system based on the TVQA dataset | 172 |