Awesome Action Recognition: / Action Recognition and Video Understanding / Summary posts |
Deep Learning for Videos: A 2018 Guide to Action Recognition | | | Summary of major landmark action recognition research papers till 2018 |
Literature Survey: Human Action Recognition | | | Brief human action recognition literature survey of work published between 2014 and 2019 |
Awesome Action Recognition: / Action Recognition and Video Understanding / Video Representation |
Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition | | | J. Choi et al., NeurIPS2019 |
SlowFast Networks for Video Recognition | | | C. Feichtenhofer et al., ICCV2019 |
Large-scale weakly-supervised pre-training for video action recognition | | | D. Ghadiyaram et al., arXiv2019 |
Video Classification with Channel-Separated Convolutional Networks | | | D. Tran et al., arXiv2019 |
DistInit: Learning Video Representations without a Single Labeled Video | | | R. Girdhar et al., arXiv2019 |
SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition | | | B. Korbar et al., arXiv2019 |
Video Action Transformer Network | | | R. Girdhar et al., CVPR2019 |
Learning Correspondence from the Cycle-consistency of Time | | | X. Wang et al., CVPR2019 |
Representation Flow for Action Recognition | | | AJ. Piergiovanni and M. S. Ryoo, CVPR2019 |
Collaborative Spatiotemporal Feature Learning for Video Action Recognition | | | C. Li et al., CVPR2019 |
Learning Video Representations from Correspondence Proposals | | | X. Liu et al., CVPR2019 |
Timeception for Complex Action Recognition | | | N. Hussein et al., CVPR2019 |
The Visual Centrifuge: Model-Free Layered Video Representations | | | J.-B. Alayrac et al., CVPR2019 |
Long-Term Feature Banks for Detailed Video Understanding | | | C.-Y. Wu et al., CVPR2019 |
Temporal Relational Reasoning in Videos | | | B. Zhou et al., ECCV2018 |
Action Recognition Zoo | 244 | over 5 years ago | Code for popular action recognition models, written in PyTorch and verified on the Something-Something dataset |
Videos as Space-Time Region Graphs | | | X. Wang and A. Gupta, ECCV2018 |
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? | | | K. Hara et al., CVPR2019 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | | | D. Tran et al., CVPR2018 |
Attend and Interact: Higher-Order Object Interactions for Video Understanding | | | CY. Ma et al., CVPR 2018 |
Non-Local Neural Networks | | | X. Wang et al., CVPR2018 |
Rethinking Spatiotemporal Feature Learning For Video Understanding | | | S. Xie et al., arXiv2017 |
ConvNet Architecture Search for Spatiotemporal Feature Learning | | | D. Tran et al, arXiv2017. Note: Aka Res3D. In the repository, C3D-v1.1 is the Res3D implementation |
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks | | | Z. Qiu et al, ICCV2017 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | | | J. Carreira et al, CVPR2017 |
Learning Spatiotemporal Features with 3D Convolutional Networks | | | D. Tran et al, ICCV2015. Note: Aka C3D. Note that the official Caffe does not support the Python wrapper |
Deep Temporal Linear Encoding Networks | | | A. Diba et al, CVPR2017 |
Temporal Convolutional Networks: A Unified Approach to Action Segmentation and Detection | | | C. Lea et al, CVPR 2017 |
Long-term Temporal Convolutions | | | G. Varol et al, TPAMI2017 |
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | | | L. Wang et al, arXiv 2016 |
Convolutional Two-Stream Network Fusion for Video Action Recognition | | | C. Feichtenhofer et al, CVPR2016 |
Two-Stream Convolutional Networks for Action Recognition in Videos | | | K. Simonyan and A. Zisserman, NIPS2014 |
Temporal Recurrent Networks for Online Action Detection | | | M. Xu et al, ICCV2019 |
Long Short-Term Transformer for Online Action Detection | | | M. Xu et al, NeurIPS2021 |
[3D ResNet PyTorch] | 3,912 | almost 4 years ago | |
[PyTorch Video Research] | 533 | over 5 years ago | |
[M-PACT: Michigan Platform for Activity Classification in Tensorflow] | 107 | over 5 years ago | |
[Inflated models on PyTorch] | 148 | over 3 years ago | |
[I3D models transferred from Tensorflow to PyTorch] | 529 | 6 months ago | |
[A Two Stream Baseline on Kinetics dataset] | 42 | almost 6 years ago | |
[MMAction] | 1,864 | over 2 years ago | |
[MMAction2] | 4,315 | 4 months ago | |
[PySlowFast] | 6,652 | 7 days ago | |
[Decord] | 1,906 | 5 months ago | Efficient video reader for python |
[I3D models converted from Tensorflow to Core ML] | 24 | over 4 years ago | |
[Extract frame and optical-flow from videos, #docker] | 133 | over 2 years ago | |
[NVIDIA-DALI, video loading pipelines] | | | |
[NVIDIA optical-flow SDK] | | | |
Awesome Action Recognition: / Action Recognition and Video Understanding / Action Classification |
Guided Weak Supervision for Action Recognition with Scarce Data to Assess Skills of Children with Autism | | | P. Pandey et al, AAAI 2020 |
Neural Graph Matching Networks for Fewshot 3D Action Recognition | | | M. Guo et al., ECCV2018 |
Temporal 3D ConvNets using Temporal Transition Layer | | | A. Diba et al., CVPRW2018 |
Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification | | | A. Diba et al., arXiv2017 |
Attentional Pooling for Action Recognition | | | R. Girdhar and D. Ramanan, NIPS2017 |
Fully Context-Aware Video Prediction | | | Byeon et al, arXiv2017 |
Hidden Two-Stream Convolutional Networks for Action Recognition | | | Y. Zhu et al, arXiv2017 |
Dynamic Image Networks for Action Recognition | | | H. Bilen et al, CVPR2016 |
Long-term Recurrent Convolutional Networks for Visual Recognition and Description | | | J. Donahue et al, CVPR2015 |
Describing Videos by Exploiting Temporal Structure | | | L. Yao et al, ICCV2015. Note: From the same group as the RCN paper "Delving Deeper into Convolutional Networks for Learning Video Representations" |
Two-Stream SR-CNNs for Action Recognition in Videos | | | L. Wang et al, BMVC2016 |
Real-time Action Recognition with Enhanced Motion Vector CNNs | | | B. Zhang et al, CVPR2016 |
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors | | | L. Wang et al, CVPR2015 |
Awesome Action Recognition: / Action Recognition and Video Understanding / Skeleton-Based Action Classification |
Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition | | | M. Li et al., CVPR2019 |
An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition | | | C. Si et al., CVPR2019 |
View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition | | | P. Zhang et al., TPAMI2019 |
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition | | | S. Yan et al., AAAI2018 |
Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition | | | Y. Tang et al., CVPR2018 |
Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation | | | C. Li et al., IJCAI2018 |
Part-based Graph Convolutional Network for Action Recognition | | | K. Thakkar et al., BMVC2018 |
Awesome Action Recognition: / Action Recognition and Video Understanding / Temporal Action Detection |
Rethinking the Faster R-CNN Architecture for Temporal Action Localization | | | Yu-Wei Chao et al., CVPR2018 |
Weakly Supervised Action Localization by Sparse Temporal Pooling Network | | | Phuc Nguyen et al., CVPR 2018 |
Temporal Deformable Residual Networks for Action Segmentation in Videos | | | P. Lei and S. Todorovic, CVPR2018 |
End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos | | | Shyamal Buch et al., BMVC 2017 |
Cascaded Boundary Regression for Temporal Action Detection | | | Jiyang Gao et al., BMVC 2017 |
Temporal Tessellation: A Unified Approach for Video Analysis | | | Kaufman et al., ICCV2017 |
Temporal Action Detection with Structured Segment Networks | | | Y. Zhao et al., ICCV2017 |
Temporal Context Network for Activity Localization in Videos | | | X. Dai et al., ICCV2017 |
Detecting the Moment of Completion: Temporal Models for Localising Action Completion | | | F. Heidarivincheh et al., arXiv2017 |
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos | | | Z. Shou et al, CVPR2017 |
SST: Single-Stream Temporal Action Proposals | | | S. Buch et al, CVPR2017 |
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection | | | H. Xu et al, arXiv2017 |
DAPs: Deep Action Proposals for Action Understanding | | | V. Escorcia et al, ECCV2016 |
Online Action Detection using Joint Classification-Regression Recurrent Neural Networks | | | Y. Li et al, ECCV2016. Note: RGB-D Action Detection |
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs | | | Z. Shou et al, CVPR2016. Note: Aka S-CNN |
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos | | | F. Heilbron et al, CVPR2016. Note: Depends on , aka SparseProp |
Actionness Estimation Using Hybrid Fully Convolutional Networks | | | L. Wang et al, CVPR2016. Note: The code is not a complete version. It contains only a demo, not the training code |
Learning Activity Progression in LSTMs for Activity Detection and Early Detection | | | S. Ma et al, CVPR2016 |
End-to-end Learning of Action Detection from Frame Glimpses in Videos | | | S. Yeung et al, CVPR2016. Note: This method uses reinforcement learning |
Fast Action Proposals for Human Action Detection and Search | | | G. Yu and J. Yuan, CVPR2015. Note: code for FAP is NOT available online. Note: Aka FAP |
Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting | | | P. Mettes et al, ICMR2015 |
Action localization in videos through context walk | | | K. Soomro et al, ICCV2015 |
Awesome Action Recognition: / Action Recognition and Video Understanding / Spatio-Temporal Action Detection |
A Better Baseline for AVA | | | R. Girdhar et al., ActivityNet Workshop, CVPR2018 |
Real-Time End-to-End Action Detection with Two-Stream Networks | | | A. El-Nouby and G. Taylor, arXiv2018 |
Human Action Localization with Sparse Spatial Supervision | | | P. Weinzaepfel et al., arXiv2017 |
Unsupervised Action Discovery and Localization in Videos | | | K. Soomro and M. Shah, ICCV2017 |
Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions | | | P. Mettes and C. G. M. Snoek, ICCV2017 |
Action Tubelet Detector for Spatio-Temporal Action Localization | | | V. Kalogeiton et al, ICCV2017 |
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos | | | R. Hou et al, ICCV2017 |
Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection | | | M. Zolfaghari et al, ICCV2017 |
TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal | | | H. Zhu et al., ICCV2017 |
Online Real-time Multiple Spatiotemporal Action Localisation and Prediction | | | G. Singh et al, ICCV2017 |
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture | | | S. Saha et al, ICCV2017 |
Am I Done? Predicting Action Progress in Videos | | | F. Becattini et al, BMVC2017 |
Generic Tubelet Proposals for Action Localization | | | J. He et al, arXiv2017 |
Incremental Tube Construction for Human Action Detection | | | H. S. Behl et al, arXiv2017 |
Multi-region two-stream R-CNN for action detection | | | X. Peng and C. Schmid, ECCV2016 |
Spot On: Action Localization from Pointly-Supervised Proposals | | | P. Mettes et al, ECCV2016 |
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos | | | S. Saha et al, BMVC2016 |
Learning to track for spatio-temporal action localization | | | P. Weinzaepfel et al, ICCV2015 |
Action detection by implicit intentional motion clustering | | | W. Chen and J. Corso, ICCV2015 |
Finding Action Tubes | | | G. Gkioxari and J. Malik, CVPR2015 |
APT: Action localization proposals from dense trajectories | | | J. Gemert et al, BMVC2015 |
Spatio-Temporal Object Detection Proposals | | | D. Oneata et al, ECCV2014 |
Action localization with tubelets from motion | | | M. Jain et al, CVPR2014 |
Spatiotemporal deformable part models for action detection | | | Y. Tian et al, CVPR2013 |
Action localization in videos through context walk | | | K. Soomro et al, ICCV2015 |
Fast Action Proposals for Human Action Detection and Search | | | G. Yu and J. Yuan, CVPR2015. Note: code for FAP is NOT available online. Note: Aka FAP |
Awesome Action Recognition: / Action Recognition and Video Understanding / Ego-Centric Action Recognition |
Actor and Observer: Joint Modeling of First and Third-Person Videos | | | G. Sigurdsson et al., CVPR2018 |
Awesome Action Recognition: / Action Recognition and Video Understanding / Miscellaneous |
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment | | | P. Parmar and B. T. Morris, CVPR2019 |
PathTrack: Fast Trajectory Annotation with Path Supervision | | | S. Manen et al., ICCV2017 |
CortexNet: a Generic Network Family for Robust Visual Temporal Representations | | | A. Canziani and E. Culurciello, arXiv2017 |
Slicing Convolutional Neural Network for Crowd Video Understanding | | | J. Shao et al, CVPR2016 |
Two-Stream (RGB and Flow) pretrained model weights | 26 | about 8 years ago | |
Awesome Action Recognition: / Action Recognition and Video Understanding / Action Recognition Datasets |
Video Dataset Overview from Antoine Miech | | | |
HACS | | | |
Moments in Time | | | |
AVA | | | , , for missing videos |
Kinetics | | | |
OOPS | | | A dataset of unintentional action
COIN | | | A large-scale dataset for comprehensive instructional video analysis
YouTube-8M | | | |
YouTube-BB | | | |
DALY | | | Daily Action Localization in YouTube videos. Note: Weakly supervised action detection dataset. Annotations consist of the start and end time of each action, with one bounding box per action per video
20BN-JESTER | | | |
ActivityNet | | | Note: They provide a download script and evaluation code |
Charades | | | |
Charades-Ego | | | Dataset of aligned first-person and third-person videos
EPIC-Kitchens | | | First-person videos recorded in kitchens. Note: They provide download scripts and a Python library
Sports-1M | | | Large scale action recognition dataset |
THUMOS14 | | | Note: It overlaps with dataset |
THUMOS15 | | | Note: It overlaps with dataset |
HOLLYWOOD2 | | | |
UCF-101 | | | There are also some pre-computed spatiotemporal action detection
UCF-50 | | | |
UCF-Sports | | | Note: The train/test split link on the official website is broken. Instead, you can download it from
HMDB | | | |
J-HMDB | | | |
LIRIS-HARL | | | |
KTH | | | |
MSR Action | | | Note: It overlaps with dataset
Sports Videos in the Wild | | | |
NTU RGB+D | 759 | almost 3 years ago | |
Mixamo Mocap Dataset | | | |
UWA3D Multiview Activity II Dataset | | | |
Northwestern-UCLA Dataset | | | |
SYSU 3D Human-Object Interaction Dataset | | | |
MEVA (Multiview Extended Video with Activities) Dataset | | | |
Awesome Action Recognition: / Action Recognition and Video Understanding / Video Annotation |
Efficiently scaling up crowdsourced video annotation | | | C. Vondrick et al., IJCV2013 |
The Design and Implementation of ViPER | | | D. Mihalcik and D. Doermann, Technical report |
VoTT: Visual Object Tagging Tool | 4,320 | almost 3 years ago | Modern app to annotate objects in videos and images. It facilitates the development of an end-to-end machine learning pipeline encompassing the annotation/export/import of assets. It can run as a native app or via the web |
VIA: VGG Image Annotator | | | Simple and standalone manual annotation web-app for image, audio and video. It runs in the web browser and does not require any installation or setup |
Awesome Action Recognition: / Object Recognition / Object Detection |
Deformable Convolutional Networks | | | J. Dai et al., ICCV2017 |
Detectron | 26,276 | about 1 year ago | Open Source Object Detection Framework from Facebook AI Research. Includes Mask R-CNN, FPN, etc. Caffe2 implementation |
Mask R-CNN | | | K. He et al. State-of-the-art object detection/instance segmentation algorithm |
Faster R-CNN | | | S. Ren et al, NIPS2015. State-of-the-art object detector |
YOLO | | | J. Redmon et al, CVPR2016. Fast object detector |
YOLO9000 | | | J. Redmon and A. Farhadi, CVPR2017. State-of-the-art object detector which can detect 9000 objects in realtime |
SSD | | | W. Liu et al, ECCV2016. State-of-the-art object detector with realtime processing speed |
RetinaNet | | | Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár, Facebook AI Research (FAIR), ICCV 2017. State-of-the-art object detector with realtime processing speed |
Awesome Action Recognition: / Object Recognition / Video Object Detection |
Detect to Track and Track to Detect | 553 | over 6 years ago | C. Feichtenhofer et al., ICCV2017 |
Flow-Guided Feature Aggregation for Video Object Detection | 723 | about 3 years ago | X. Zhu et al., ICCV2017. Note: Aka FGFA |
Awesome Action Recognition: / Object Recognition / Video Object Detection Datasets |
ImageNet VID | | | |
YouTube-8M | | | |
YouTube-BB | | | |
Awesome Action Recognition: / Pose Estimation / Pose Estimation |
AlphaPose | 8,065 | 7 months ago | PyTorch based realtime and accurate pose estimation and tracking tool from SJTU |
Detect-and-Track: Efficient Pose Estimation in Videos | | | R. Girdhar et al., arXiv2017 |
OpenPose Library | 31,387 | 4 months ago | Caffe based realtime pose estimation library from CMU |
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields | | | Z. Cao et al, CVPR2017. depends on the - Earlier version of OpenPose from CMU |
DensePose | | | Dense human pose estimation in the wild, implemented in the Detectron framework |
MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network | | | M. Kocabas et al, ECCV2018 |
DeepLabCut: markerless pose estimation of user-defined body parts with deep learning | | | A. Mathis et al, Nature Neuroscience 2018 |
Awesome Action Recognition: / Competitions / Competitions |
ActEV (Activities in Extended Video) | | | Activity detection in security camera videos. Runs through 2021. Hosted by NIST |