VTimeLLM
Video Moment LLM
A PyTorch-based video LLM designed to understand and reason about video moments in terms of their time boundaries.
[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
225 stars · 2 watching · 11 forks · Language: Python · Last commit: 5 months ago

Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| dcdmllm/momentor | A video large language model for fine-grained comprehension and localization in videos, with a custom Temporal Perception Module for improved temporal modeling | 54 |
| boheumd/ma-lmm | A memory-augmented large multimodal model for long-term video understanding | 244 |
| vpgtrans/vpgtrans | Transfers visual prompt generators across large language models to reduce training costs and enable customization of multimodal LLMs | 269 |
| jayleicn/clipbert | An efficient framework for end-to-end learning on image-text and video-text tasks | 704 |
| llyx97/tempcompass | A benchmark for evaluating video language models' ability to understand and describe video content | 84 |
| antoine77340/howto100m | Code and tools for learning joint text-video embeddings using the HowTo100M dataset | 250 |
| penghao-wu/vstar | PyTorch implementation of a guided visual search mechanism for multimodal LLMs | 527 |
| viorik/convlstm | An implementation of a spatio-temporal convolutional LSTM module for video autoencoders with differentiable memory | 292 |
| volcengine/vescale | A PyTorch-based framework for training large language models in parallel across multiple devices | 663 |
| damo-nlp-sg/videollama2 | An audio-visual language model designed to understand video and audio content | 871 |
| t-vi/pytorch-tvmisc | A collection of utilities and tools for building and improving deep learning models in PyTorch | 468 |
| antoine77340/mixture-of-embedding-experts | An open-source PyTorch implementation of the Mixture-of-Embedding-Experts model for video-text retrieval tasks | 118 |
| luogen1996/lavin | An open-source implementation of a vision-language instruction-tuned large language model | 508 |
| dvlab-research/llama-vid | A video language model that compresses each frame into a small number of tokens so that large language models can process long videos | 733 |
| jayleicn/tvqa | PyTorch implementation of a video question answering system based on the TVQA dataset | 172 |