awesome-action-recognition
Action Recognition Resources
A curated collection of resources and research papers on action recognition and video understanding techniques.
A curated list of action recognition and related area resources
4k stars
207 watching
724 forks
last commit: about 2 years ago
Linked from 3 awesome lists
action-classification, action-detection, action-recognition, activity-recognition, activity-understanding, awesome, awesome-list, object-recognition, pose-estimation, video-processing, video-recognition, video-understanding
Awesome Action Recognition: / Action Recognition and Video Understanding / Summary posts | |||
Deep Learning for Videos: A 2018 Guide to Action Recognition | Summary of major landmark action recognition research papers till 2018 | ||
Literature Survey: Human Action Recognition | Brief human action recognition literature survey of work published between 2014 and 2019 | ||
Awesome Action Recognition: / Action Recognition and Video Understanding / Video Representation | |||
Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition | J. Choi et al., NeurIPS2019 | ||
SlowFast Networks for Video Recognition | C. Feichtenhofer et al., ICCV2019 | ||
Large-scale weakly-supervised pre-training for video action recognition | D. Ghadiyaram et al., arXiv2019 | ||
Video Classification with Channel-Separated Convolutional Networks | D. Tran et al., arXiv2019 | ||
DistInit: Learning Video Representations without a Single Labeled Video | R. Girdhar et al., arXiv2019 | ||
SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition | B. Korbar et al., arXiv2019 | ||
Video Action Transformer Network | R. Girdhar et al., CVPR2019 | ||
Learning Correspondence from the Cycle-consistency of Time | X. Wang et al., CVPR2019 | ||
Representation Flow for Action Recognition | AJ. Piergiovanni and M. S. Ryoo, CVPR2019 | ||
Collaborative Spatiotemporal Feature Learning for Video Action Recognition | C. Li et al., CVPR2019 | ||
Learning Video Representations from Correspondence Proposals | X. Liu et al., CVPR2019 | ||
Timeception for Complex Action Recognition | N. Hussein et al., CVPR2019 | ||
The Visual Centrifuge: Model-Free Layered Video Representations | J.-B. Alayrac et al., CVPR2019 | ||
Long-Term Feature Banks for Detailed Video Understanding | C.-Y. Wu et al., CVPR2019 | ||
Temporal Relational Reasoning in Videos | B. Zhou et al., ECCV2018 | ||
Action Recognition Zoo | 244 | about 6 years ago | Code for popular action recognition models, written in PyTorch and verified on the Something-Something dataset |
Videos as Space-Time Region Graphs | X. Wang and A. Gupta, ECCV2018 | ||
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? | K. Hara et al., CVPR2019 | ||
A Closer Look at Spatiotemporal Convolutions for Action Recognition | D. Tran et al., CVPR2018 | ||
Attend and Interact: Higher-Order Object Interactions for Video Understanding | CY. Ma et al., CVPR 2018 | ||
Non-Local Neural Networks | X. Wang et al., CVPR2018 | ||
Rethinking Spatiotemporal Feature Learning For Video Understanding | S. Xie et al., arXiv2017 | ||
ConvNet Architecture Search for Spatiotemporal Feature Learning | D. Tran et al, arXiv2017. Note: Aka Res3D. In the repository, C3D-v1.1 is the Res3D implementation | ||
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks | Z. Qiu et al, ICCV2017 | ||
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | J. Carreira et al, CVPR2017 | ||
Learning Spatiotemporal Features with 3D Convolutional Networks | D. Tran et al, ICCV2015. Note: Aka C3D. Note that the official Caffe version does not support a Python wrapper | ||
Deep Temporal Linear Encoding Networks | A. Diba et al, CVPR2017 | ||
Temporal Convolutional Networks: A Unified Approach to Action Segmentation and Detection | C. Lea et al, CVPR 2017 | ||
Long-term Temporal Convolutions | G. Varol et al, TPAMI2017 | ||
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | L. Wang et al, arXiv 2016 | ||
Convolutional Two-Stream Network Fusion for Video Action Recognition | C. Feichtenhofer et al, CVPR2016 | ||
Two-Stream Convolutional Networks for Action Recognition in Videos | K. Simonyan and A. Zisserman, NIPS2014 | ||
Temporal Recurrent Networks for Online Action Detection | M. Xu et al, ICCV2019 | ||
Long Short-Term Transformer for Online Action Detection | M. Xu et al, NeurIPS2021 | ||
[3D ResNet PyTorch] | 3,920 | over 4 years ago | |
[PyTorch Video Research] | 533 | almost 6 years ago | |
[M-PACT: Michigan Platform for Activity Classification in Tensorflow] | 107 | about 6 years ago | |
[Inflated models on PyTorch] | 148 | about 4 years ago | |
[I3D models transferred from Tensorflow to PyTorch] | 532 | about 1 year ago | |
[A Two Stream Baseline on Kinetics dataset] | 42 | over 6 years ago | |
[MMAction] | 1,863 | about 3 years ago | |
[MMAction2] | 4,360 | 10 months ago | |
[PySlowFast] | 6,680 | 6 months ago | |
[Decord] | 1,923 | 11 months ago | Efficient video reader for Python |
[I3D models converted from Tensorflow to Core ML] | 24 | almost 5 years ago | |
[Extract frame and optical-flow from videos, #docker] | 133 | almost 3 years ago | |
[NVIDIA-DALI, video loading pipelines] | |||
[NVIDIA optical-flow SDK] | |||
Awesome Action Recognition: / Action Recognition and Video Understanding / Action Classification | |||
Guided Weak Supervision for Action Recognition with Scarce Data to Assess Skills of Children with Autism | P. Pandey et al, AAAI 2020 | ||
Neural Graph Matching Networks for Fewshot 3D Action Recognition | M. Guo et al., ECCV2018 | ||
Temporal 3D ConvNets using Temporal Transition Layer | A. Diba et al., CVPRW2018 | ||
Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification | A. Diba et al., arXiv2017 | ||
Attentional Pooling for Action Recognition | R. Girdhar and D. Ramanan, NIPS2017 | ||
Fully Context-Aware Video Prediction | Byeon et al, arXiv2017 | ||
Hidden Two-Stream Convolutional Networks for Action Recognition | Y. Zhu et al, arXiv2017 | ||
Dynamic Image Networks for Action Recognition | H. Bilen et al, CVPR2016 | ||
Long-term Recurrent Convolutional Networks for Visual Recognition and Description | J. Donahue et al, CVPR2015 | ||
Describing Videos by Exploiting Temporal Structure | L. Yao et al, ICCV2015. Note: from the same group as the RCN paper "Delving Deeper into Convolutional Networks for Learning Video Representations" | ||
Two-Stream SR-CNNs for Action Recognition in Videos | L. Wang et al, BMVC2016 | ||
Real-time Action Recognition with Enhanced Motion Vector CNNs | B. Zhang et al, CVPR2016 | ||
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors | L. Wang et al, CVPR2015 | ||
Awesome Action Recognition: / Action Recognition and Video Understanding / Skeleton-Based Action Classification | |||
Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition | M. Li et al., CVPR2019 | ||
An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition | C. Si et al., CVPR2019 | ||
View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition | P. Zhang et al., TPAMI2019 | ||
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition | S. Yan et al., AAAI2018 | ||
Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition | Y. Tang et al., CVPR2018 | ||
Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation | C. Li et al., IJCAI2018 | ||
Part-based Graph Convolutional Network for Action Recognition | K. Thakkar et al., BMVC2018 | ||
Awesome Action Recognition: / Action Recognition and Video Understanding / Temporal Action Detection | |||
Rethinking the Faster R-CNN Architecture for Temporal Action Localization | Yu-Wei Chao et al., CVPR2018 | ||
Weakly Supervised Action Localization by Sparse Temporal Pooling Network | Phuc Nguyen et al., CVPR 2018 | ||
Temporal Deformable Residual Networks for Action Segmentation in Videos | P. Lei and S. Todorovic, CVPR2018 | ||
End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos | Shyamal Buch et al., BMVC 2017 | ||
Cascaded Boundary Regression for Temporal Action Detection | Jiyang Gao et al., BMVC 2017 | ||
Temporal Tessellation: A Unified Approach for Video Analysis | Kaufman et al., ICCV2017 | ||
Temporal Action Detection with Structured Segment Networks | Y. Zhao et al., ICCV2017 | ||
Temporal Context Network for Activity Localization in Videos | X. Dai et al., ICCV2017 | ||
Detecting the Moment of Completion: Temporal Models for Localising Action Completion | F. Heidarivincheh et al., arXiv2017 | ||
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos | Z. Shou et al, CVPR2017 | ||
SST: Single-Stream Temporal Action Proposals | S. Buch et al, CVPR2017 | ||
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection | H. Xu et al, arXiv2017 | ||
DAPs: Deep Action Proposals for Action Understanding | V. Escorcia et al, ECCV2016 | ||
Online Action Detection using Joint Classification-Regression Recurrent Neural Networks | Y. Li et al, ECCV2016. Note: RGB-D Action Detection | ||
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs | Z. Shou et al, CVPR2016. Note: Aka S-CNN | ||
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos | F. Heilbron et al, CVPR2016. Note: Depends on , aka SparseProp | ||
Actionness Estimation Using Hybrid Fully Convolutional Networks | L. Wang et al, CVPR2016. Note: The code is not a complete version. It only contains a demo, not training | ||
Learning Activity Progression in LSTMs for Activity Detection and Early Detection | S. Ma et al, CVPR2016 | ||
End-to-end Learning of Action Detection from Frame Glimpses in Videos | S. Yeung et al, CVPR2016. Note: This method uses reinforcement learning | ||
Fast Action Proposals for Human Action Detection and Search | G. Yu and J. Yuan, CVPR2015. Note: code for FAP is NOT available online. Note: Aka FAP | ||
Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting | P. Mettes et al, ICMR2015 | ||
Action localization in videos through context walk | K. Soomro et al, ICCV2015 | ||
Awesome Action Recognition: / Action Recognition and Video Understanding / Spatio-Temporal Action Detection | |||
A Better Baseline for AVA | R. Girdhar et al., ActivityNet Workshop, CVPR2018 | ||
Real-Time End-to-End Action Detection with Two-Stream Networks | A. El-Nouby and G. Taylor, arXiv2018 | ||
Human Action Localization with Sparse Spatial Supervision | P. Weinzaepfel et al., arXiv2017 | ||
Unsupervised Action Discovery and Localization in Videos | K. Soomro and M. Shah, ICCV2017 | ||
Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions | P. Mettes and C. G. M. Snoek, ICCV2017 | ||
Action Tubelet Detector for Spatio-Temporal Action Localization | V. Kalogeiton et al, ICCV2017 | ||
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos | et al, ICCV2017 | ||
Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection | M. Zolfaghari et al, ICCV2017 | ||
TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal | H. Zhu et al., ICCV2017 | ||
Online Real time Multiple Spatiotemporal Action Localisation and Prediction | et al, ICCV2017 | ||
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture | S. Saha et al, ICCV2017 | ||
Am I Done? Predicting Action Progress in Videos | F. Becattini et al, BMVC2017 | ||
Generic Tubelet Proposals for Action Localization | J. He et al, arXiv2017 | ||
Incremental Tube Construction for Human Action Detection | H. S. Behl et al, arXiv2017 | ||
Multi-region two-stream R-CNN for action detection | and C. Schmid. ECCV2016 | ||
Spot On: Action Localization from Pointly-Supervised Proposals | P. Mettes et al, ECCV2016 | ||
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos | S. Saha et al, BMVC2016 | ||
Learning to track for spatio-temporal action localization | P. Weinzaepfel et al. ICCV2015 | ||
Action detection by implicit intentional motion clustering | W. Chen and J. Corso, ICCV2015 | ||
Finding Action Tubes | G. Gkioxari and J. Malik CVPR2015 | ||
APT: Action localization proposals from dense trajectories | J. Gemert et al, BMVC2015 | ||
Spatio-Temporal Object Detection Proposals | D. Oneata et al, ECCV2014 | ||
Action localization with tubelets from motion | M. Jain et al, CVPR2014 | ||
Spatiotemporal deformable part models for action detection | et al, CVPR2013 | ||
Action localization in videos through context walk | K. Soomro et al, ICCV2015 | ||
Fast Action Proposals for Human Action Detection and Search | G. Yu and J. Yuan, CVPR2015. Note: code for FAP is NOT available online. Note: Aka FAP | ||
Awesome Action Recognition: / Action Recognition and Video Understanding / Ego-Centric Action Recognition | |||
Actor and Observer: Joint Modeling of First and Third-Person Videos | G. Sigurdsson et al., CVPR2018 | ||
Awesome Action Recognition: / Action Recognition and Video Understanding / Miscellaneous | |||
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment | P. Parmar and B. T. Morris, CVPR2019 | ||
PathTrack: Fast Trajectory Annotation with Path Supervision | S. Manen et al., ICCV2017 | ||
CortexNet: a Generic Network Family for Robust Visual Temporal Representations | A. Canziani and E. Culurciello - arXiv2017 | ||
Slicing Convolutional Neural Network for Crowd Video Understanding | J. Shao et al, CVPR2016 | ||
Two-Stream (RGB and Flow) pretrained model weights | 26 | over 8 years ago | |
Awesome Action Recognition: / Action Recognition and Video Understanding / Action Recognition Datasets | |||
Video Dataset Overview from Antoine Miech | |||
HACS | |||
Moments in Time | |||
AVA | , , for missing videos | ||
Kinetics | |||
OOPS | A dataset of unintentional action | ||
COIN | A large-scale dataset for comprehensive instructional video analysis | ||
YouTube-8M | |||
YouTube-BB | |||
DALY | Daily Action Localization in Youtube videos. Note: Weakly supervised action detection dataset. Annotations consist of the start and end time of each action, with one bounding box per action per video | ||
20BN-JESTER | |||
ActivityNet | Note: They provide a download script and evaluation code | ||
Charades | |||
Charades-Ego | First person and third person video aligned dataset | ||
EPIC-Kitchens | First person videos recorded in kitchens. Note: they provide download scripts and a Python library | ||
Sports-1M | Large scale action recognition dataset | ||
THUMOS14 | Note: It overlaps with dataset | ||
THUMOS15 | Note: It overlaps with dataset | ||
HOLLYWOOD2 | |||
UCF-101 | , , and , and . And there are also some pre-computed spatiotemporal action detection | ||
UCF-50 | |||
UCF-Sports | , note: the train/test split link in the official website is broken. Instead, you can download it from | ||
HMDB | |||
J-HMDB | |||
LIRIS-HARL | |||
KTH | |||
MSR Action | Note: It overlaps with dataset | ||
Sports Videos in the Wild | |||
NTU RGB+D | 763 | over 3 years ago | |
Mixamo Mocap Dataset | |||
UWA3D Multiview Activity II Dataset | |||
Northwestern-UCLA Dataset | |||
SYSU 3D Human-Object Interaction Dataset | |||
MEVA (Multiview Extended Video with Activities) Dataset | |||
Awesome Action Recognition: / Action Recognition and Video Understanding / Video Annotation | |||
Efficiently scaling up crowdsourced video annotation | C. Vondrick et al., IJCV2013 | ||
The Design and Implementation of ViPER | D. Mihalcik and D. Doermann, Technical report | ||
VoTT: Visual Object Tagging Tool | 4,331 | over 3 years ago | Modern app to annotate objects in videos and images. It facilitates the development of an end-to-end machine learning pipeline encompassing the annotation/export/import of assets. It can run as a native app or via the web |
VIA: VGG Image Annotator | Simple and standalone manual annotation web-app for image, audio and video. It runs in the web browser and does not require any installation or setup | ||
Awesome Action Recognition: / Object Recognition / Object Detection | |||
Deformable Convolutional Networks | J. Dai et al., ICCV2017 | ||
Detectron | 26,295 | over 1 year ago | Open Source Object Detection Framework from Facebook AI Research. Includes Mask R-CNN, FPN, and more. Caffe2 implementation |
Mask R-CNN | K. He et al - State-of-the-art object detection/instance segmentation algorithm | ||
Faster R-CNN | S. Ren et al, NIPS2015 - State-of-the-art object detector | ||
YOLO | J. Redmon et al, CVPR2016 - Fast object detector | ||
YOLO9000 | J. Redmon and A. Farhadi, CVPR2017 - State-of-the-art object detector which can detect 9000 objects in realtime | ||
SSD | W. Liu et al, ECCV2016 - State-of-the-art object detector with realtime processing speed | ||
RetinaNet | Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár, Facebook AI Research (FAIR), ICCV 2017 - State-of-the-art object detector with realtime processing speed | ||
Awesome Action Recognition: / Object Recognition / Video Object Detection | |||
Detect to Track and Track to Detect | 553 | almost 7 years ago | C. Feichtenhofer et al., ICCV2017 |
Flow-Guided Feature Aggregation for Video Object Detection | 724 | over 3 years ago | X. Zhu et al., ICCV2017. Note: Aka FGFA |
Awesome Action Recognition: / Object Recognition / Video Object Detection Datasets | |||
ImageNet VID | |||
YouTube-8M | |||
YouTube-BB | |||
Awesome Action Recognition: / Pose Estimation / Pose Estimation | |||
AlphaPose | 8,084 | about 1 year ago | PyTorch based realtime and accurate pose estimation and tracking tool from SJTU |
Detect-and-Track: Efficient Pose Estimation in Videos | R. Girdhar et al., arXiv2017 | ||
OpenPose Library | 31,463 | 10 months ago | Caffe based realtime pose estimation library from CMU |
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields | Z. Cao et al, CVPR2017. depends on the - Earlier version of OpenPose from CMU | ||
DensePose | Dense human pose estimation in the wild, implemented in the Detectron framework | ||
MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network | M. Kocabas et al, ECCV2018 | ||
DeepLabCut: markerless pose estimation of user-defined body parts with deep learning | A. Mathis et al, Nature Neuroscience 2018 | ||
Awesome Action Recognition: / Competitions / Competitions | |||
ActEV (Activities in Extended Video) | Activity detection in security camera videos. Runs through 2021. Hosted by NIST |
More related projects:
- open-mmlab/mmdetection
- open-mmlab/mmcv
- open-mmlab/mmdetection3d
- open-mmlab/mmsegmentation
- open-mmlab/mmpose
- open-mmlab/mmgeneration
- open-mmlab/mmengine
- open-mmlab/mmdeploy
- open-mmlab/mmhuman3d
- terrychenism/deformable-convnets
- msracver/deformable-convnets
- msracver/fcis
- msracver/relation-networks-for-object-detection
- huaizhengzhang/awsome-deep-learning-for-video-analysis
- skyhehe123/sa-ssd