awesome-video-text-retrieval
Video-text retrieval techniques
A curated list of deep learning resources and papers for video-text retrieval.
Awesome Video-Text Retrieval by Deep Learning / Implementations | |||
Repository | Stars | Last commit | Notes |
hybrid_space | 87 | almost 2 years ago | |
dual_encoding | 155 | almost 2 years ago | |
w2vvpp | 28 | 4 months ago | |
Mixture-of-Embedding-Experts | 118 | over 4 years ago | |
howto100m | 250 | over 4 years ago | |
collaborative | 336 | almost 2 years ago | |
hgr | 209 | over 4 years ago | |
coot | 288 | about 2 years ago | |
mmt | 258 | about 1 month ago | |
ClipBERT | 704 | over 1 year ago | |
jsfusion | 31 | about 6 years ago | |
w2vv | 69 | almost 5 years ago | (Keras) |
Extracting CNN features from video frames by MXNet | 31 | over 2 years ago | |
Awesome Video-Text Retrieval by Deep Learning / Papers / 2023 | |||
[paper] | CLIPPING: Distilling CLIP-Based Models with a Student Base for Video-Language Retrieval. CVPR, 2023 | ||
[paper] | SViTT: Temporal Learning of Sparse Video-Text Transformers. CVPR, 2023 | ||
[paper] | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval. CVPR, 2023 | ||
[paper] | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models. CVPR, 2023 | ||
[paper] | All in One: Exploring Unified Video-Language Pre-Training. CVPR, 2023 | ||
[paper] | IMAGEBIND: One Embedding Space To Bind Them All. CVPR, 2023 | ||
[paper] | VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval. CVPR, 2023 | ||
[paper] | LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling. CVPR, 2023 | ||
[paper] | Clover: Towards a Unified Video-Language Alignment and Fusion Model. CVPR, 2023 | ||
[paper] | Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning. CVPR, 2023 | ||
[paper] | CNVid-3.5M: Build, Filter, and Pre-train the Large-scale Public Chinese Video-text Dataset. CVPR, 2023 | ||
[paper] | Cali-NCE: Boosting Cross-Modal Video Representation Learning With Calibrated Alignment. CVPR Workshop, 2023 | ||
[paper] | Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval. TCSVT, 2023 | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / 2022 | |||
[homepage] | Partially Relevant Video Retrieval. ACM Multimedia, 2022 | ||
[paper] | Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning. ACM Multimedia, 2022 | ||
[paper] | Learn to Understand Negation in Video Retrieval. ACM Multimedia, 2022 | ||
[paper] | A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval. ACM Multimedia, 2022 | ||
[paper] | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval. ACM Multimedia, 2022 | ||
[paper] | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. ECCV, 2022 | ||
[paper] | TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval. ECCV, 2022 | ||
[paper] | Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. TCSVT, 2022 | ||
[paper] | Align and Prompt: Video-and-Language Pre-training with Entity Prompts. CVPR, 2022 | ||
[paper] | Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval. CVPR, 2022 | ||
[paper] | Bridging Video-text Retrieval with Multiple Choice Questions. CVPR, 2022 | ||
[paper] | Temporal Alignment Networks for Long-term Video. CVPR, 2022 | ||
[paper] | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval. CVPR, 2022 | ||
[paper] | LGDN: Language-Guided Denoising Network for Video-Language Modeling. NeurIPS, 2022 | ||
[paper] | Animating Images to Transfer CLIP for Video-Text Retrieval. SIGIR, 2022 | ||
[paper] | CenterCLIP: Token Clustering for Efficient Text-Video Retrieval. SIGIR, 2022 | ||
[paper] | Cross-Modal Discrete Representation Learning. ACL, 2022 | ||
[paper] | Masking Modalities for Cross-modal Video Retrieval. WACV, 2022 | ||
[paper] | Visual Consensus Modeling for Video-Text Retrieval. AAAI, 2022 | ||
[paper] | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss. AAAI, 2022 | ||
[paper] | Many Hands Make Light Work: Transferring Knowledge from Auxiliary Tasks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2022 | ||
[paper] | Exposing the Limits of Video-Text Models through Contrast Sets. NAACL, 2022 | ||
[paper] | Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval. TOMM, 2022 | ||
[paper] | LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval. arXiv:2207.04858, 2022 | ||
[paper] | A CLIP-Hitchhiker's Guide to Long Video Retrieval. arXiv:2205.08508, 2022 | ||
[paper] | CLIP2TV: Align, Match and Distill for Video-Text Retrieval. arXiv:2111.05610, 2022 | ||
[paper] | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations. arXiv:2204.03382, 2022 | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / 2021 | |||
[paper] | Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021 | ||
[paper] | Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021 | ||
[paper] | Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling. CVPR, 2021 | ||
[paper] | On Semantic Similarity in Video Retrieval. CVPR, 2021 | ||
[paper] | Learning the Best Pooling Strategy for Visual Semantic Embedding. CVPR, 2021 | ||
[paper] | T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval. CVPR, 2021 | ||
[paper] | Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers. CVPR, 2021 | ||
[paper] | Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval. CVPR, 2021 | ||
[paper] | Multimodal Clustering Networks for Self-Supervised Learning from Unlabeled Videos. ICCV, 2021 | ||
[paper] | TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval. ICCV, 2021 | ||
[paper] | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment. ICCV, 2021 | ||
[paper] | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. ICCV, 2021 | ||
[paper] | COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-Training for Vision-Language Representation. ICCV, 2021 | ||
[paper] | CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising. ACM Multimedia, 2021 | ||
[paper] | HANet: Hierarchical Alignment Networks for Video-Text Retrieval. ACM Multimedia, 2021 | ||
[paper] | Progressive Semantic Matching for Video-Text Retrieval. ACM Multimedia, 2021 | ||
[paper] | Fine-grained Cross-modal Alignment Network for Text-Video Retrieval. ACM Multimedia, 2021 | ||
[paper] | Meta Self-Paced Learning for Cross-Modal Matching. ACM Multimedia, 2021 | ||
[paper] | Support-set Bottlenecks for Video-text Representation Learning. ICLR, 2021 | ||
[paper] | Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing, 2021 | ||
[paper] | Spatial-temporal Graphs for Cross-modal Text2Video Retrieval. IEEE Transactions on Multimedia, 2021 | ||
[paper] | Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval. Neurocomputing, 2021 | ||
[paper] | Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. SIGIR, 2021 | ||
[paper] | Improving Video Retrieval by Adaptive Margin. SIGIR, 2021 | ||
[paper] | Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment. IJCAI, 2021 | ||
[paper] | Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval. AAAI, 2021 | ||
[paper] | What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval. ICME, 2021 | ||
[paper] | Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval. ICME, 2021 | ||
[paper] | Semantic-Preserving Metric Learning for Video-Text Retrieval. IEEE International Conference on Image Processing, 2021 | ||
[paper] | Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval. ICMR, 2021 | ||
[paper] | HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval. arXiv:2103.15049, 2021 | ||
[paper] | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv:2104.11178, 2021 | ||
[paper] | CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv:2106.11097, 2021 | ||
[paper] | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv:2104.08860, 2021 | ||
[paper] | Align and Prompt: Video-and-Language Pre-training with Entity Prompts. arXiv:2112.09583, 2021 | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / 2020 | |||
[paper] | Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. SIGIR, 2020 | ||
[paper] | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. NeurIPS, 2020 | ||
[paper] | Multi-modal Transformer for Video Retrieval. ECCV, 2020 | ||
[paper] | SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia, 2020 | ||
[paper] | Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2020 | ||
[paper] | Interclass-Relativity-Adaptive Metric Learning for Cross-Modal Matching and Beyond. IEEE Transactions on Multimedia, 2020 | ||
[paper] | Interpretable Embedding for Ad-Hoc Video Search. ACM Multimedia, 2020 | ||
[paper] | Exploiting Visual Semantic Reasoning for Video-Text Retrieval. IJCAI, 2020 | ||
[paper] | Universal Weighting Metric Learning for Cross-Modal Retrieval. CVPR, 2020 | ||
[paper] | Action Modifiers: Learning from Adverbs in Instructional Videos. CVPR, 2020 | ||
[paper] | Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR, 2020 | ||
[paper] | ActBERT: Learning Global-Local Video-Text Representations. CVPR, 2020 | ||
[paper] | End-to-End Learning of Visual Representations From Uncurated Instructional Videos. CVPR, 2020 | ||
[paper] | Stacked Convolutional Deep Encoding Network For Video-Text Retrieval. ICME, 2020 | ||
[paper] | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv:2002.06353, 2020 | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / 2019 | |||
[paper] | Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019 | ||
[paper] | Polysemous visual-semantic embedding for cross-modal retrieval. CVPR, 2019 | ||
[paper] | Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019 | ||
[paper] | A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019 | ||
[paper] | W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019 | ||
[paper] | Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019 | ||
[paper] | From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019 | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / 2018 | |||
[paper] | Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 2018 | ||
[paper] | Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018 | ||
[paper] | A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018 | ||
[paper] | Find and focus: Retrieve and localize video events with natural language queries. ECCV, 2018 | ||
[paper] | Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018 | ||
[paper] | Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018 | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / Before | |||
[paper] | End-to-end concept word detection for video captioning, retrieval, and question answering. CVPR, 2017 | ||
[paper] | Learning joint representations of videos and sentences with web image search. ECCV Workshop, 2016 | ||
[paper] | Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. AAAI, 2015 | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / Ad-hoc Video Search | |||
TRECVID | For papers targeting ad-hoc video search in the context of TRECVID, please refer to the TRECVID link. | ||
Awesome Video-Text Retrieval by Deep Learning / Papers / Other Related | |||
[paper] | AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. Interspeech, 2021 | ||
[paper] | Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv preprint arXiv:2001.05691, 2020 | ||
Awesome Video-Text Retrieval by Deep Learning / Datasets | |||
[paper] | Chen and Dolan. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011 | ||
[paper] | Xu et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016 | ||
[paper] | Li et al. TGIF: A new dataset and benchmark on animated GIF description. CVPR, 2016 | ||
[paper] | Awad et al. TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking. TRECVID Workshop, 2016 | ||
[paper] | Rohrbach et al. Movie description. IJCV, 2017 | ||
[paper] | Krishna et al. Dense-captioning events in videos. ICCV, 2017 | ||
[paper] | Hendricks et al. Localizing Moments in Video with Natural Language. ICCV, 2017 | ||
[paper] | Miech et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019 | ||
[paper] | Wang et al. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019 |