Awesome-Transformer-Attention

Transformer papers

A comprehensive collection of papers, code, and related resources for understanding vision transformers and attention mechanisms in computer vision and deep learning.

An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, codes, and related websites

GitHub

5k stars

130 watching

490 forks

last commit: over 1 year ago

Linked from 2 awesome lists

attention-mechanism, attention-mechanisms, awesome-list, computer-vision, deep-learning, detr, papers, self-attention, transformer, transformer-architecture, transformer-awesome, transformer-cv, transformer-models, transformer-with-cv, transformers, vision-transformer, visual-transformer, vit

Ultimate-Awesome-Transformer-Attention / Overview
Multi-Modality
Ultimate-Awesome-Transformer-Attention / Overview / Multi-Modality
Visual Captioning
Visual Question Answering
Visual Grounding
Multi-Modal Representation Learning
Multi-Modal Retrieval
Multi-Modal Generation
Prompt Learning/Tuning
Visual Document Understanding
Other Multi-Modal Tasks
Ultimate-Awesome-Transformer-Attention / Overview
Other High-level Vision Tasks
Ultimate-Awesome-Transformer-Attention / Overview / Other High-level Vision Tasks
Point Cloud / 3D
Pose Estimation
Tracking
Re-ID
Face
Scene Graph
Neural Architecture Search
Ultimate-Awesome-Transformer-Attention / Overview
Transfer / X-Supervised / X-Shot / Continual Learning
Low-level Vision Tasks
Ultimate-Awesome-Transformer-Attention / Overview / Low-level Vision Tasks
Image Restoration
Video Restoration
Inpainting / Completion / Outpainting
Image Generation
Video Generation
Transfer / Translation / Manipulation
Other Low-Level Tasks
Ultimate-Awesome-Transformer-Attention / Overview
Reinforcement Learning
Ultimate-Awesome-Transformer-Attention / Overview / Reinforcement Learning
Navigation
Other RL Tasks
Ultimate-Awesome-Transformer-Attention / Overview
Medical
Ultimate-Awesome-Transformer-Attention / Overview / Medical
Medical Segmentation
Medical Classification
Medical Detection
Medical Reconstruction
Medical Low-Level Vision
Medical Vision-Language
Medical Others
Ultimate-Awesome-Transformer-Attention / Overview
Other Tasks
Attention Mechanisms in Vision/NLP
Ultimate-Awesome-Transformer-Attention / Overview / Attention Mechanisms in Vision/NLP
Attention for Vision
NLP
Both
Others
Ultimate-Awesome-Transformer-Attention / Survey
Paper			"A Survey on Multimodal Large Language Models for Autonomous Driving", WACVW, 2024 ( ). [ ][ ]
Paper			"Efficient Multimodal Large Language Models: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"From Sora What We Can See: A Survey of Text-to-Video Generation", arXiv, 2024 ( ). [ ][ ]
Paper			"When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models", arXiv, 2024 ( ). [ ][ ]
Paper			"Foundation Models for Video Understanding: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"Vision Mamba: A Comprehensive Survey and Taxonomy", arXiv, 2024 ( ). [ ][ ]
Paper			"Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, 2024 ( ). [ ][ ]
Paper			"Video Diffusion Models: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras", arXiv, 2024 ( ). [ ]
Paper			"Hallucination of Multimodal Large Language Models: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"A Survey on Vision Mamba: Models, Applications and Challenges", arXiv, 2024 ( ). [ ][ ]
Paper			"State Space Model for New-Generation Network Alternative to Transformers: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions", arXiv, 2024 ( ). [ ]
Paper			"From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models", arXiv, 2024 ( ). [ ][ ]
Paper			"Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey", arXiv, 2024 ( ). [ ]
Paper			"Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation", arXiv, 2024 ( ). [ ]
Paper			"Controllable Generation with Text-to-Image Diffusion Models: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models", arXiv, 2024 ( ). [ ][ ]
Paper			"Large Multimodal Agents: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey", arXiv, 2024 ( ). [ ][ ]
Paper			"Vision-Language Navigation with Embodied Intelligence: A Survey", arXiv, 2024 ( ). [ ]
Paper			"The (R)Evolution of Multimodal Large Language Models: A Survey", arXiv, 2024 ( ). [ ]
Paper			"Masked Modeling for Self-supervised Representation Learning on Vision and Beyond", arXiv, 2024 ( ). [ ][ ]
Paper			"Transformer for Object Re-Identification: A Survey", arXiv, 2024 ( ). [ ]
Paper			"Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities", arXiv, 2024 ( ). [ ][ ]
Paper			"MM-LLMs: Recent Advances in MultiModal Large Language Models", arXiv, 2024 ( ). [ ]
Paper			"From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities", arXiv, 2024 ( ). [ ]
Paper			"A Survey on Hallucination in Large Vision-Language Models", arXiv, 2024 ( ). [ ]
Paper			"A Survey for Foundation Models in Autonomous Driving", arXiv, 2024 ( ). [ ]
Paper			"A Survey on Transformer Compression", arXiv, 2024 ( ). [ ]
Paper			"Vision + Language Applications: A Survey", CVPRW, 2023 ( ). [ ][ ]
Paper			"Multimodal Learning With Transformers: A Survey", TPAMI, 2023 ( ). [ ]
Paper			"A Survey of Visual Transformers", TNNLS, 2023 ( ). [ ][ ]
Paper			"Video Understanding with Large Language Models: A Survey", arXiv, 2023 ( ). [ ][ ]
Paper			"Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey", arXiv, 2023 ( ). [ ]
Paper			"A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook", arXiv, 2023 ( ). [ ][ ]
Paper			"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", arXiv, 2023 ( ). [ ][ ]
Paper			"Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey", arXiv, 2023 ( ). [ ]
Paper			"Explainability of Vision Transformers: A Comprehensive Review and New Perspectives", arXiv, 2023 ( ). [ ]
Paper			"Vision-Language Instruction Tuning: A Review and Analysis", arXiv, 2023 ( ). [ ][ ]
Paper			"Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability", arXiv, 2023 ( ). [ ]
Paper			"Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey", arXiv, 2023 ( ). [ ][ ]
Paper			"A Survey on Video Diffusion Models", arXiv, 2023 ( ). [ ][ ]
Paper			"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv, 2023 ( ). [ ]
Paper			"Multimodal Foundation Models: From Specialists to General-Purpose Assistants", arXiv, 2023 ( ). [ ]
Paper			"Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art", arXiv, 2023 ( ). [ ]
Paper			"RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023 ( ). [ ]
Paper			"A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023 ( ). [ ]
Paper			"From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023 ( ). [ ]
Paper			"Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023 ( ). [ ][ ]
Paper			"A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023 ( ). [ ]
Paper			"Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023 ( ). [ ]
Paper			"A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023 ( ). [ ]
Paper			"Transformers in Reinforcement Learning: A Survey", arXiv, 2023 ( ). [ ]
Paper			"Vision Language Transformers: A Survey", arXiv, 2023 ( ). [ ]
Paper			"Towards Open Vocabulary Learning: A Survey", arXiv, 2023 ( ). [ ][ ]
Paper			"Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023 ( ). [ ]
Paper			"A Survey on Multimodal Large Language Models", arXiv, 2023 ( ). [ ][ ]
Paper			"2D Object Detection with Transformers: A Review", arXiv, 2023 ( ). [ ]
Paper			"Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023 ( ). [ ]
Paper			"Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023 ( ). [ ]
Paper			"Visual Tuning", arXiv, 2023 ( ). [ ]
Paper			"Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023 ( ). [ ]
Paper			"Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023 ( ). [ ]
Paper			"A Review of Deep Learning for Video Captioning", arXiv, 2023 ( ). [ ]
Paper			"Transformer-Based Visual Segmentation: A Survey", arXiv, 2023 ( ). [ ][ ]
Paper			"Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023 ( ). [ ][ ]
Paper			"Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023 ( ). [ ]
Paper			"Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023 ( ). [ ]
Paper			"Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023 ( ). [ ][ ]
Paper			"Efficiency 360: Efficient Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			"Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 ( ). [ ]
Paper			"Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023 ( ). [ ][ ]
Paper			"A Survey on Visual Transformer", TPAMI, 2022 ( ). [ ]
Paper			"Attention mechanisms in computer vision: A survey", Computational Visual Media, 2022 ( ). [ ][ ][ ]
Paper			"A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 ( ). [ ]
Paper			"Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 ( ). [ ]
Paper			"Vision Transformers in Medical Imaging: A Review", arXiv, 2022 ( ). [ ]
Paper			"A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 ( ). [ ]
Paper			"Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 ( ). [ ]
Paper			"Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 ( ). [ ]
Paper			"Vision Transformers for Action Recognition: A Survey", arXiv, 2022 ( ). [ ]
Paper			"VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 ( ). [ ]
Paper			"Transformers in Remote Sensing: A Survey", arXiv, 2022 ( ). [ ][ ]
Paper			"Medical image analysis based on transformer: A Review", arXiv, 2022 ( ). [ ]
Paper			"3D Vision with Transformers: A Survey", arXiv, 2022 ( ). [ ][ ]
Paper			"Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 ( ). [ ]
Paper			"Transformers in Medical Imaging: A Survey", arXiv, 2022 ( ). [ ][ ]
Paper			"Multimodal Learning with Transformers: A Survey", arXiv, 2022 ( ). [ ]
Paper			"Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 ( ). [ ]
Paper			"Transformers in 3D Point Clouds: A Survey", arXiv, 2022 ( ). [ ]
Paper			"A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 ( ). [ ]
Paper			"Efficient Transformers: A Survey", arXiv, 2022 ( ). [ ]
Paper			"Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 ( ). [ ]
Paper			"Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 ( ). [ ]
Paper			"Video Transformers: A Survey", arXiv, 2022 ( ). [ ]
Paper			"Transformers in Medical Image Analysis: A Review", arXiv, 2022 ( ). [ ]
Paper			"Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 ( ). [ ]
Paper			"Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 ( ). [ ]
Paper			"Image Captioning In the Transformer Age", arXiv, 2022 ( ). [ ][ ]
Paper			"Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 ( ). [ ]
Paper			"Transformers in Vision: A Survey", ACM Computing Surveys, 2021 ( ). [ ]
Paper			"Survey: Transformer based Video-Language Pre-training", arXiv, 2021 ( ). [ ]
Paper			"A Survey of Transformers", arXiv, 2021 ( ). [ ]
Paper			"Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Replace Conv w/ Attention
Paper			: "Local Relation Networks for Image Recognition", ICCV, 2019 ( ). [ ][ ]
Paper			: "Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019 ( ). [ ][ ][ ]
Paper			: "Axial Attention in Multidimensional Transformers", arXiv, 2019 ( ). [ ][ ]
Paper			: "Exploring Self-attention for Image Recognition", CVPR, 2020 ( ). [ ][ ]
Paper			: "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020 ( ). [ ][ ]
Paper			: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 ( ). [ ][ ]
Paper			: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 ( ). [ ][ ]
Paper			: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 ( ). [ ][ ]
Paper			: "Vision Transformers with Hierarchical Attention", arXiv, 2022 ( ). [ ][ ]
Paper			: "Attention Augmented Convolutional Networks", ICCV, 2019 ( ). [ ][ ][ ]
Paper			: "Global Context Networks", ICCVW, 2019 (& TPAMI 2020) ( ). [ ][ ]
Paper			: "LambdaNetworks: Modeling long-range Interactions without Attention", ICLR, 2021 ( ). [ ][ ][ ]
Paper			: "Bottleneck Transformers for Visual Recognition", CVPR, 2021 ( ). [ ][ ][ ]
Paper			: "Gaussian Context Transformer", CVPR, 2021 ( ). [ ]
Paper			: "CoAtNet: Marrying Convolution and Attention for All Data Sizes", NeurIPS, 2021 ( ). [ ]
Paper			: "On the Integration of Self-Attention and Convolution", CVPR, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Vision Transformer
Paper			: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 ( ). [ ][ ][ ][ ]
Paper			: "Perceiver: General Perception with Iterative Attention", ICML, 2021 ( ). [ ][ ]
Paper			: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 ( ). [ ][ ]
Paper			: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 ( ). [ ][ ]
Paper			: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 ( ). [ ][ ]
Paper			: "Going deeper with Image Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 ( ). [ ][ ][ ]
Paper			: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 ( ). [ ][ ]
Paper			: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 ( ). [ ]
Paper			: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 ( ). [ ][ ]
Paper			: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 ( ). [ ]
Paper			: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 ( ). [ ]
Paper			: "Transformer in Transformer", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper			: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Aggregating Nested Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 ( ). [ ]
Paper			: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 ( ). [ ]
Paper			: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 ( ). [ ]
Paper			: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 ( ). [ ]
Paper			: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 ( ). [ ][ ]
Paper			: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 ( ). [ ][ ]
Paper			: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 ( ). [ ][ ]
Paper			: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 ( ). [ ][ ]
Paper			: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 ( ). [ ]
Paper			: "Scaling Vision Transformers", CVPR, 2022 ( ). [ ]
Paper			: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 ( ). [ ][ ]
Paper			: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 ( ). [ ][ ]
Paper			: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 ( ). [ ][ ]
Paper			: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 ( ). [ ][ ]
Paper			: "Vision Transformer with Deformable Attention", CVPR, 2022 ( ). [ ][ ]
Paper			: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 ( ). [ ][ ]
Paper			: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 ( ). [ ][ ]
Paper			: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 ( ). [ ][ ]
Paper			: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 ( ). [ ][ ]
Paper			: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 ( ). [ ][ ]
Paper			: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 ( ). [ ]
Paper			: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 ( ). [ ][ ]
Paper			: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 ( ). [ ][ ]
Paper			: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 ( ). [ ][ ]
Paper			: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 ( ). [ ]
Paper			: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 ( ). [ ]
Paper			: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 ( ). [ ]
Paper			: "Peripheral Vision Transformer", NeurIPS, 2022 ( ). [ ]
Paper			: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 ( ). [ ]
Paper			: "Hierarchical Perceiver", arXiv, 2022 ( ). [ ]
Paper			: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 ( ). [ ]
Paper			: "Neighborhood Attention Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "Adaptive Split-Fusion Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 ( ). [ ]
Paper			: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Dual Vision Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 ( ). [ ]
Paper			: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Grafting Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 ( ). [ ]
Paper			: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 ( ). [ ][ ]
Paper			: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 ( ). [ ][ ]
Paper			: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 ( ). [ ][ ]
Paper			: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 ( ). [ ][ ]
Paper			: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 ( ). [ ][ ]
Paper			: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 ( ). [ ][ ]
Paper			: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 ( ). [ ][ ]
Paper			: "Vision Transformer with Super Token Sampling", CVPR, 2023 ( ). [ ]
Paper			: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 ( ). [ ][ ]
Paper			: "Global Context Vision Transformers", ICML, 2023 ( ). [ ][ ]
Paper			: "MAGNETO: A Foundation Transformer", ICML, 2023 ( ). [ ]
Paper			: "Fcaformer: Forward Cross Attention in Hybrid Vision Transformer", ICCV, 2023 ( ). [ ][ ]
Paper			: "Scale-Aware Modulation Meet Transformer", ICCV, 2023 ( ). [ ][ ]
Paper			: "FLatten Transformer: Vision Transformer using Focused Linear Attention", ICCV, 2023 ( ). [ ][ ]
Paper			: "Revisiting Vision Transformer from the View of Path Ensemble", ICCV, 2023 ( ). [ ]
Paper			: "SG-Former: Self-guided Transformer with Evolving Token Reallocation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?", ICCV, 2023 ( ). [ ]
Paper			: "LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization", ICCV, 2023 ( ). [ ][ ]
Paper			: "Scratching Visual Transformer's Back with Uniform Attention", ICCV, 2023 ( ). [ ]
Paper			: "Fully Attentional Networks with Self-emerging Token Labeling", ICCV, 2023 ( ). [ ][ ]
Paper			: "ClusterFormer: Clustering As A Universal Visual Learner", NeurIPS, 2023 ( ). [ ]
Paper			: "Scattering Vision Transformer: Spectral Mixing Matters", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 ( ). [ ][ ]
Paper			: "Vision Transformer with Quadrangle Attention", arXiv, 2023 ( ). [ ][ ]
Paper			: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 ( ). [ ]
Paper			: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 ( ). [ ]
Paper			: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 ( ). [ ]
Paper			: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 ( ). [ ]
Paper			: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", NeurIPS, 2023 ( ). [ ]
Paper			: "DAT++: Spatially Dynamic Vision Transformer with Deformable Attention", arXiv, 2023 ( ). [ ][ ]
Paper			: "Replacing softmax with ReLU in Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "RMT: Retentive Networks Meet Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "Vision Transformers Need Registers", arXiv, 2023 ( ). [ ]
Paper			: "Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words", arXiv, 2023 ( ). [ ]
Paper			: "EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention", arXiv, 2023 ( ). [ ]
Paper			: "ViR: Vision Retention Networks", arXiv, 2023 ( ). [ ]
Paper			: "Window Attention is Bugged: How not to Interpolate Position Embeddings", arXiv, 2023 ( ). [ ]
Paper			: "FMViT: A multiple-frequency mixing Vision Transformer", arXiv, 2023 ( ). [ ][ ]
Paper			: "Advancing Vision Transformers with Group-Mix Attention", arXiv, 2023 ( ). [ ][ ]
Paper			: "Perceptual Group Tokenizer: Building Perception with Iterative Grouping", arXiv, 2023 ( ). [ ]
Paper			: "SCHEME: Scalable Channel Mixer for Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "Agent Attention: On the Integration of Softmax and Linear Attention", arXiv, 2023 ( ). [ ][ ]
Paper			: "ViTamin: Designing Scalable Vision Models in the Vision-Language Era", CVPR, 2024 ( ). [ ][ ]
Paper			: "HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs", TPAMI, 2024 ( ). [ ]
Paper			: "SPFormer: Enhancing Vision Transformer with Superpixel Representation", arXiv, 2024 ( ). [ ]
Paper			: "A Manifold Representation of the Key in Vision Transformers", arXiv, 2024 ( ). [ ]
Paper			: "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers", arXiv, 2024 ( ). [ ]
Paper			: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks", arXiv, 2024 ( ). [ ][ ]
Paper			: "xT: Nested Tokenization for Larger Context in Large Images", arXiv, 2024 ( ). [ ]
Paper			: "ACC-ViT: Atrous Convolution's Comeback in Vision Transformers", arXiv, 2024 ( ). [ ]
Paper			: "ViTAR: Vision Transformer with Any Resolution", arXiv, 2024 ( ). [ ]
Paper			: "Adapting LLaMA Decoder to Vision Transformer", arXiv, 2024 ( ). [ ]
Paper			: "Training data-efficient image transformers & distillation through attention", ICML, 2021 ( ). [ ][ ]
Paper			: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 ( ). [ ][ ]
Paper			: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 ( ). [ ]
Paper			: "Vision Transformer with Progressive Sampling", ICCV, 2021 ( ). [ ]
Paper			: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 ( ). [ ][ ]
Paper			: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 ( ). [ ][ ]
Paper			: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 ( ). [ ][ ]
Paper			: "Visformer: The Vision-friendly Transformer", ICCV, 2021 ( ). [ ][ ]
Paper			: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 ( ). [ ][ ]
Paper			: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper			: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Adder Attention for Vision Transformer", NeurIPS, 2021 ( ). [ ]
Paper			: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper			: "IA-RED²: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "Vision Transformers with Patch Diversification", arXiv, 2021 ( ). [ ][ ]
Paper			: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 ( ). [ ]
Paper			: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 ( ). [ ]
Paper			: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 ( ). [ ]
Paper			: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Go Wider Instead of Deeper", arXiv, 2021 ( ). [ ]
Paper			: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 ( ). [ ]
Paper			: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 ( ). [ ]
Paper			: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 ( ). [ ][ ]
Paper			: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 ( ). [ ][ ]
Paper			: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 ( ). [ ][ ]
Paper			: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 ( ). [ ][ ]
Paper			: "QuadTree Attention for Vision Transformers", ICLR, 2022 ( ). [ ][ ]
Paper			: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 ( ). [ ][ ]
Paper			: "Learned Queries for Efficient Local Attention", CVPR, 2022 ( ). [ ][ ]
Paper			: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 ( ). [ ][ ]
Paper			: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 ( ). [ ]
Paper			: "Reversible Vision Transformers", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 ( ). [ ]
Paper			: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 ( ). [ ]
Paper			: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "Sliced Recursive Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "Self-slimmed Vision Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 ( ). [ ]
Paper			: "M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 ( ). [ ]
Paper			: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 ( ). [ ]
Paper			: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 ( ). [ ]
Paper			: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 ( ). [ ]
Paper			: "Coarse-to-Fine Vision Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 ( ). [ ]
Paper			: "SepViT: Separable Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Super Vision Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 ( ). [ ][ ]
Paper			: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 ( ). [ ][ ]
Paper			: "Vicinity Vision Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "Softmax-free Linear Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 ( ). [ ]
Paper			: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 ( ). [ ]
Paper			: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 ( ). [ ]
Paper			: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 ( ). [ ]
Paper			: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Dilated Neighborhood Attention Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 ( ). [ ][ ]
Paper			: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 ( ). [ ]
Paper			: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 ( ). [ ]
Paper			: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 ( ). [ ][ ]
Paper			: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 ( ). [ ]
Paper			: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 ( ). [ ]
Paper			: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 ( ). [ ]
Paper			: "Token Merging: Your ViT But Faster", ICLR, 2023 ( ). [ ][ ]
Paper			: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 ( ). [ ][ ]
Paper			: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 ( ). [ ][ ]
Paper			: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 ( ). [ ][ ]
Paper			: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 ( ). [ ][ ]
Paper			: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 ( ). [ ][ ]
Paper			: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 ( ). [ ]
Paper			: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 ( ). [ ]
Paper			: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 ( ). [ ]
Paper			: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 ( ). [ ]
Paper			: "Learned Thresholds Token Merging and Pruning for Vision Transformers", ICMLW, 2023 ( ). [ ][ ][ ]
Paper			: "Make A Long Image Short: Adaptive Token Length for Vision Transformers", ECML PKDD, 2023 ( ). [ ]
Paper			: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", ICCV, 2023 ( ). [ ][ ]
Paper			: "MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention", ICCV, 2023 ( ). [ ][ ]
Paper			: "Masked Spiking Transformer", ICCV, 2023 ( ). [ ]
Paper			: "Rethinking Vision Transformers for MobileNet Size and Speed", ICCV, 2023 ( ). [ ][ ]
Paper			: "DiffRate: Differentiable Compression Rate for Efficient Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", ICCV, 2023 ( ). [ ]
Paper			: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", ICCV, 2023 ( ). [ ][ ]
Paper			: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", ICCV, 2023 ( ). [ ][ ]
Paper			: "Which Tokens to Use? Investigating Token Reduction in Vision Transformers", ICCVW, 2023 ( ). [ ][ ][ ]
Paper			: "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer", ACMMM, 2023 ( ). [ ]
Paper			: "Efficient Low-rank Backpropagation for Vision Transformer Adaptation", NeurIPS, 2023 ( ). [ ]
Paper			: "Lightweight Vision Transformer with Bidirectional Interaction", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", NeurIPS, 2023 ( ). [ ]
Paper			: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 ( ). [ ]
Paper			: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 ( ). [ ][ ]
Paper			: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 ( ). [ ][ ]
Paper			: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 ( ). [ ][ ]
Paper			: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 ( ). [ ]
Paper			: "Multi-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 ( ). [ ]
Paper			: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 ( ). [ ]
Paper			: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 ( ). [ ]
Paper			: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 ( ). [ ]
Paper			: "MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "DiT: Efficient Vision Transformers with Dynamic Token Routing", arXiv, 2023 ( ). [ ][ ]
Paper			: "Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts", arXiv, 2023 ( ). [ ]
Paper			: "PPT: Token Pruning and Pooling for Efficient Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "MatFormer: Nested Transformer for Elastic Inference", arXiv, 2023 ( ). [ ]
Paper			: "Bootstrapping SparseFormers from Vision Foundation Models", arXiv, 2023 ( ). [ ][ ]
Paper			: "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation", WACV, 2024 ( ). [ ][ ]
Paper			: "Token Fusion: Bridging the Gap between Token Pruning and Token Merging", WACV, 2024 ( ). [ ]
Paper			: "Cached Transformers: Improving Transformers with Differentiable Memory Cache", AAAI, 2024 ( ). [ ]
Paper			: "LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition", AAAI, 2024 ( ). [ ][ ]
Paper			: "Efficient Modulation for Vision Networks", ICLR, 2024 ( ). [ ][ ]
Paper			: "MLP Can Be A Good Transformer Learner", CVPR, 2024 ( ). [ ][ ]
Paper			: "SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization", ICML, 2024 ( ). [ ][ ]
Paper			: "When Do We Not Need Larger Vision Models?", arXiv, 2024 ( ). [ ][ ]
Paper			: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 ( ). [ ][ ]
Paper			: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 ( ). [ ][ ]
Paper			: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 ( ). [ ]
Paper			: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 ( ). [ ][ ]
Paper			: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 ( ). [ ]
Paper			: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 ( ). [ ][ ]
Paper			: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 ( ). [ ]
Paper			: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 ( ). [ ]
Paper			: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 ( ). [ ][ ]
Paper			: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Inception Transformer", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 ( ). [ ]
Paper			: "Convolutional Xformers for Vision", arXiv, 2022 ( ). [ ][ ]
Paper			: "Patches Are All You Need?", arXiv, 2022 ( ). [ ][ ]
Paper			: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 ( ). [ ][ ]
Paper			: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 ( ). [ ][ ]
Paper			: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 ( ). [ ]
Paper			: "MetaFormer Baselines for Vision", arXiv, 2022 ( ). [ ][ ]
Paper			: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 ( ). [ ][ ]
Paper			: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 ( ). [ ]
Paper			: "Visual Attention Network", arXiv, 2022 ( ). [ ][ ]
Paper			: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 ( ). [ ][ ]
Paper			: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 ( ). [ ][ ]
Paper			: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 ( ). [ ][ ]
Paper			: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 ( ). [ ][ ]
Paper			: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 ( ). [ ][ ]
Paper			: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", ICCV, 2023 ( ). [ ][ ]
Paper			: "SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers", ICCVW, 2023 ( ). [ ]
Paper			: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 ( ). [ ][ ]
Paper			: "RepViT: Revisiting Mobile CNN From ViT Perspective", arXiv, 2023 ( ). [ ][ ]
Paper			: "Interpret Vision Transformers as ConvNets with Dynamic Convolutions", arXiv, 2023 ( ). [ ]
Paper			: "UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer", AAAI, 2024 ( ). [ ]
Paper			: "Generative Pretraining From Pixels", ICML, 2020 ( ). [ ][ ]
Paper			: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 ( ). [ ][ ]
Paper			: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 ( ). [ ]
Paper			: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 ( ). [ ]
Paper			: "SiT: Self-supervised Vision Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 ( ). [ ]
Paper			: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 ( ). [ ]
Paper			: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 ( ). [ ][ ]
Paper			: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 ( ). [ ]
Paper			: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 ( ). [ ][ ]
Paper			: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 ( ). [ ]
Paper			: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 ( ). [ ][ ]
Paper			: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 ( ). [ ][ ]
Paper			: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 ( ). [ ]
Paper			: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 ( ). [ ]
Paper			: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 ( ). [ ]
Paper			: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 ( ). [ ]
Paper			: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 ( ). [ ][ ]
Paper			: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 ( ). [ ][ ]
Paper			: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 ( ). [ ][ ]
Paper			: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 ( ). [ ]
Paper			: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 ( ). [ ][ ]
Paper			: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 ( ). [ ][ ]
Paper			: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 ( ). [ ][ ]
Paper			: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 ( ). [ ]
Paper			: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 ( ). [ ][ ]
Paper			: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 ( ). [ ][ ]
Paper			: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 ( ). [ ]
Paper			: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 ( ). [ ][ ]
Paper			: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 ( ). [ ]
Paper			: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 ( ). [ ]
Paper			: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 ( ). [ ][ ][ ]
Paper			: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 ( ). [ ]
Paper			: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 ( ). [ ]
Paper			: "DeiT III: Revenge of the ViT", arXiv, 2022 ( ). [ ]
Paper			: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 ( ). [ ][ ]
Paper			: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 ( ). [ ][ ]
Paper			: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 ( ). [ ][ ]
Paper			: "GMML is All you Need", arXiv, 2022 ( ). [ ][ ]
Paper			: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 ( ). [ ]
Paper			: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 ( ). [ ][ ]
Paper			: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 ( ). [ ]
Paper			: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 ( ). [ ]
Paper			: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 ( ). [ ][ ][ ]
Paper			: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 ( ). [ ][ ]
Paper			: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 ( ). [ ][ ]
Paper			: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 ( ). [ ][ ]
Paper			: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 ( ). [ ]
Paper			: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 ( ). [ ]
Paper			: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "Location-Aware Self-Supervised Transformers", arXiv, 2022 ( ). [ ]
Paper			: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 ( ). [ ][ ]
Paper			: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 ( ). [ ][ ]
Paper			: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 ( ). [ ][ ]
Paper			: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 ( ). [ ]
Paper			: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 ( ). [ ]
Paper			: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 ( ). [ ][ ]
Paper			: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 ( ). [ ]
Paper			: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 ( ). [ ]
Paper			: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 ( ). [ ]
Paper			: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 ( ). [ ][ ]
Paper			: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 ( ). [ ][ ]
Paper			: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 ( ). [ ][ ]
Paper			: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 ( ). [ ][ ]
Paper			: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 ( ). [ ][ ]
Paper			: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 ( ). [ ][ ]
Paper			: "DropKey for Vision Transformer", CVPR, 2023 ( ). [ ]
Paper			: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 ( ). [ ][ ]
Paper			: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 ( ). [ ]
Paper			: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 ( ). [ ][ ]
Paper			: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 ( ). [ ][ ]
Paper			: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 ( ). [ ][ ]
Paper			: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 ( ). [ ]
Paper			: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 ( ). [ ][ ]
Paper			: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 ( ). [ ][ ]
Paper			: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 ( ). [ ]
Paper			: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 ( ). [ ][ ]
Paper			: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 ( ). [ ]
Paper			: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 ( ). [ ][ ]
Paper			: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 ( ). [ ][ ]
Paper			: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 ( ). [ ][ ]
Paper			: "Stitchable Neural Networks", CVPR, 2023 ( ). [ ][ ]
Paper			: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 ( ). [ ][ ]
Paper			: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 ( ). [ ]
Paper			: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 ( ). [ ][ ]
Paper			: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 ( ). [ ][ ]
Paper			: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 ( ). [ ][ ]
Paper			: "DreamTeacher: Pretraining Image Backbones with Deep Generative Models", ICCV, 2023 ( ). [ ][ ]
Paper			: "Pre-training Vision Transformers with Very Limited Synthesized Images", ICCV, 2023 ( ). [ ][ ]
Paper			: "Improving Pixel-based MIM by Reducing Wasted Modeling Capability", ICCV, 2023 ( ). [ ][ ]
Paper			: "Token-Label Alignment for Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "SMMix: Self-Motivated Image Mixing for Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "Diffusion Models as Masked Autoencoders", ICCV, 2023 ( ). [ ][ ]
Paper			: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", ICCV, 2023 ( ). [ ][ ]
Paper			: "Teaching CLIP to Count to Ten", ICCV, 2023 ( ). [ ]
Paper			: "Perceptual Grouping in Vision-Language Models", ICCV, 2023 ( ). [ ]
Paper			: "CiT: Curation in Training for Effective Vision-Language Data", ICCV, 2023 ( ). [ ][ ]
Paper			: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", ICCV, 2023 ( ). [ ]
Paper			: "EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones", ICCV, 2023 ( ). [ ][ ]
Paper			: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Improving CLIP Training with Language Rewrites", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "DesCo: Learning Object Recognition with Rich Language Descriptions", NeurIPS, 2023 ( ). [ ]
Paper			: "Stable and low-precision training for large-scale vision-language models", NeurIPS, 2023 ( ). [ ]
Paper			: "Image Captioners Are Scalable Vision Learners Too", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Does Visual Pretraining Help End-to-End Reasoning?", NeurIPS, 2023 ( ). [ ]
Paper			: "An Inverse Scaling Law for CLIP Training", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Towards In-context Scene Understanding", NeurIPS, 2023 ( ). [ ]
Paper			: "RevColV2: Exploring Disentangled Representations in Masked Image Modeling", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Improving Multimodal Datasets with Image Captioning", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ]
Paper			: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 ( ). [ ]
Paper			: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 ( ). [ ]
Paper			: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 ( ). [ ]
Paper			: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 ( ). [ ]
Paper			: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 ( ). [ ]
Paper			: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 ( ). [ ]
Paper			: "Improved baselines for vision-language pre-training", arXiv, 2023 ( ). [ ]
Paper			: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 ( ). [ ]
Paper			: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 ( ). [ ]
Paper			: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 ( ). [ ]
Paper			: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 ( ). [ ][ ]
Paper			: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 ( ). [ ]
Paper			: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 ( ). [ ][ ]
Paper			: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 ( ). [ ][ ]
Paper			: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 ( ). [ ][ ]
Paper			: "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", arXiv, 2023 ( ). [ ]
Paper			: "Predicting masked tokens in stochastic locations improves masked image modeling", arXiv, 2023 ( ). [ ]
Paper			: "From Sparse to Soft Mixtures of Experts", arXiv, 2023 ( ). [ ]
Paper			: "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Masked Image Residual Learning for Scaling Deeper Vision Transformers", NeurIPS, 2023 ( ). [ ]
Paper			: "Investigating the Limitation of CLIP Models: The Worst-Performing Categories", arXiv, 2023 ( ). [ ]
Paper			: "Longer-range Contextualized Masked Autoencoder", arXiv, 2023 ( ). [ ]
Paper			: "SILC: Improving Vision Language Pretraining with Self-Distillation", arXiv, 2023 ( ). [ ]
Paper			: "CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement", arXiv, 2023 ( ). [ ]
Paper			: "Object Recognition as Next Token Prediction", arXiv, 2023 ( ). [ ][ ]
Paper			: "Scaling Laws of Synthetic Images for Model Training ... for Now", arXiv, 2023 ( ). [ ][ ]
Paper			: "Learning Vision from Models Rivals Learning Vision from Data", arXiv, 2023 ( ). [ ][ ]
Paper			: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 ( ). [ ]
Paper			: "Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders", WACV, 2024 ( ). [ ][ ]
Paper			: "Neural Clustering based Visual Representation Learning", CVPR, 2024 ( ). [ ]
Paper			: "EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training", TPAMI, 2024 ( ). [ ][ ]
Paper			: "Denoising Vision Transformers", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "Scalable Pre-training of Large Autoregressive Image Models", arXiv, 2024 ( ). [ ][ ]
Paper			: "Deconstructing Denoising Diffusion Models for Self-Supervised Learning", arXiv, 2024 ( ). [ ]
Paper			: "Rethinking Patch Dependence for Masked Autoencoders", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "Learning and Leveraging World Models in Visual Representation Learning", arXiv, 2024 ( ). [ ]
Paper			: "Can Generative Models Improve Self-Supervised Representation Learning?", arXiv, 2024 ( ). [ ]
Paper			: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 ( ). [ ]
Paper			: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 ( ). [ ]
Paper			: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 ( ). [ ][ ]
Paper			: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 ( ). [ ]
Paper			: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 ( ). [ ]
Paper			: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 ( ). [ ]
Paper			: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 ( ). [ ]
Paper			: "Vision Transformers are Robust Learners", AAAI, 2022 ( ). [ ][ ]
Paper			: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 ( ). [ ][ ]
Paper			: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 ( ). [ ]
Paper			: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 ( ). [ ][ ]
Paper			: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 ( ). [ ]
Paper			: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 ( ). [ ]
Paper			: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 ( ). [ ]
Paper			: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 ( ). [ ]
Paper			: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 ( ). [ ]
Paper			: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Towards Robust Vision Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 ( ). [ ]
Paper			: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 ( ). [ ][ ]
Paper			: "Understanding The Robustness in Vision Transformers", ICML, 2022 ( ). [ ][ ]
Paper			: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 ( ). [ ][ ]
Paper			: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 ( ). [ ][ ]
Paper			: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 ( ). [ ]
Paper			: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 ( ). [ ]
Paper			: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 ( ). [ ]
Paper			: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 ( ). [ ]
Paper			: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 ( ). [ ]
Paper			: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 ( ). [ ]
Paper			: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 ( ). [ ]
Paper			: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 ( ). [ ]
Paper			: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 ( ). [ ]
Paper			: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Federated Adversarial Training with Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Backdoor Attacks on Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 ( ). [ ]
Paper			: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 ( ). [ ]
Paper			: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 ( ). [ ]
Paper			: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Attacking Compressed Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Visual Prompting for Adversarial Robustness", arXiv, 2022 ( ). [ ]
Paper			: "Curved Representation Space of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 ( ). [ ]
Paper			: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 ( ). [ ]
Paper			: "Revisiting adapters with adversarial training", ICLR, 2023 ( ). [ ]
Paper			: "Budgeted Training for Vision Transformer", ICLR, 2023 ( ). [ ]
Paper			: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 ( ). [ ][ ]
Paper			: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 ( ). [ ][ ]
Paper			: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 ( ). [ ][ ]
Paper			: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 ( ). [ ]
Paper			: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 ( ). [ ]
Paper			: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 ( ). [ ]
Paper			: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 ( ). [ ]
Paper			: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 ( ). [ ]
Paper			: "Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting", ICCV, 2023 ( ). [ ][ ]
Paper			: "Efficiently Robustify Pre-trained Models", ICCV, 2023 ( ). [ ]
Paper			: "Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients", ICCV, 2023 ( ). [ ]
Paper			: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", ICCV, 2023 ( ). [ ][ ]
Paper			: "Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks", BMVC, 2023 ( ). [ ]
Paper			: "RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias", BMVC, 2023 ( ). [ ]
Paper			: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 ( ). [ ]
Paper			: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 ( ). [ ]
Paper			: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 ( ). [ ][ ]
Paper			: "Robustifying Token Attention for Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 ( ). [ ]
Paper			: "Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers", CVPR, 2024 ( ). [ ][ ]
Paper			: "Safety of Multimodal Large Language Models on Images and Text", arXiv, 2024 ( ). [ ]
Paper			: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 ( ). [ ]
Paper			: "Visual Transformer Pruning", arXiv, 2021 ( ). [ ]
Paper			: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 ( ). [ ]
Paper			: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 ( ). [ ][ ]
Paper			: "Unified Visual Transformer Compression", ICLR, 2022 ( ). [ ][ ]
Paper			: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 ( ). [ ][ ]
Paper			: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 ( ). [ ]
Paper			: "Towards Accurate Post-Training Quantization for Vision Transformer", ACMMM, 2022 ( ). [ ]
Paper			: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 ( ). [ ][ ]
Paper			: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 ( ). [ ]
Paper			: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 ( ). [ ]
Paper			: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 ( ). [ ]
Paper			: "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 ( ). [ ]
Paper			: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 ( ). [ ]
Paper			: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 ( ). [ ]
Paper			: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers", CVPR, 2023 ( ). [ ][ ]
Paper			: "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023 ( ). [ ]
Paper			: "X-Pruner: eXplainable Pruning for Vision Transformers", CVPR, 2023 ( ). [ ][ ]
Paper			: "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023 ( ). [ ]
Paper			: "Global Vision Transformer Pruning with Hessian-Aware Saliency", CVPR, 2023 ( ). [ ]
Paper			: "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models", CVPRW, 2023 ( ). [ ][ ]
Paper			: "Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023 ( ). [ ][ ]
Paper			: "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", ICML, 2023 ( ). [ ][ ]
Paper			: "COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models", ICML, 2023 ( ). [ ][ ]
Paper			: "Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "BiViT: Extremely Compressed Binary Vision Transformer", ICCV, 2023 ( ). [ ]
Paper			: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", ICCV, 2023 ( ). [ ][ ]
Paper			: "RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", EMNLP, 2023 ( ). [ ][ ]
Paper			: "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023 ( ). [ ]
Paper			: "Bi-ViT: Pushing the Limit of Vision Transformer Quantization", arXiv, 2023 ( ). [ ]
Paper			: "BinaryViT: Towards Efficient and Accurate Binary Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers", arXiv, 2023 ( ). [ ]
Paper			: "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", arXiv, 2023 ( ). [ ]
Paper			: "Variation-aware Vision Transformer Quantization", arXiv, 2023 ( ). [ ][ ]
Paper			: "Data-independent Module-aware Pruning for Hierarchical Vision Transformers", ICLR, 2024 ( ). [ ][ ]
Paper			: "MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer", CVPR, 2024 ( ). [ ][ ]
Paper			: "Dense Vision Transformer Compression with Few Samples", CVPR, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Attention-Free
Paper			: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 ( ). [ ][ ]
Paper			: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 ( ). [ ]
Paper			: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 ( ). [ ][ ]
Paper			: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 ( ). [ ]
Paper			: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 ( ). [ ]
Paper			: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 ( ). [ ][ ]
Paper			: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 ( ). [ ]
Paper			: "S²-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 ( ). [ ]
Paper			: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 ( ). [ ][ ]
Paper			: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 ( ). [ ]
Paper			: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 ( ). [ ]
Paper			: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 ( ). [ ][ ]
Paper			: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 ( ). [ ]
Paper			: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 ( ). [ ][ ][ ][ ]
Paper			: "Pay Attention to MLPs", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "S²-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 ( ). [ ]
Paper			: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 ( ). [ ][ ]
Paper			: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 ( ). [ ][ ]
Paper			: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 ( ). [ ][ ]
Paper			: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 ( ). [ ][ ]
Paper			: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 ( ). [ ]
Paper			: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 ( ). [ ]
Paper			: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 ( ). [ ]
Paper			: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 ( ). [ ]
Paper			: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 ( ). [ ][ ]
Paper			: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 ( ). [ ][ ]
Paper			: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 ( ). [ ][ ]
Paper			: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 ( ). [ ]
Paper			: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 ( ). [ ]
Paper			: "Adaptive Frequency Filters As Efficient Global Token Mixers", ICCV, 2023 ( ). [ ]
Paper			: "Strip-MLP: Efficient Token Interaction for Vision MLP", ICCV, 2023 ( ). [ ][ ]
Paper			: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 ( ). [ ][ ]
Paper			: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 ( ). [ ][ ]
Paper			: "A ConvNet for the 2020s", CVPR, 2022 ( ). [ ][ ]
Paper			: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "Focal Modulation Networks", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 ( ). [ ][ ][ ]
Paper			: "S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces", NeurIPS, 2022 ( ). [ ]
Paper			: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 ( ). [ ]
Paper			: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 ( ). [ ]
Paper			: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 ( ). [ ]
Paper			: "Image as Set of Points", ICLR, 2023 ( ). [ ][ ]
Paper			: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 ( ). [ ][ ]
Paper			: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 ( ). [ ][ ]
Paper			: "SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 ( ). [ ][ ]
Paper			: "ConvNets Match Vision Transformers at Scale", arXiv, 2023 ( ). [ ]
Paper			: "VMamba: Visual State Space Model", arXiv, 2024 ( ). [ ][ ]
Paper			: "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", arXiv, 2024 ( ). [ ][PyTorch]
Paper			: "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures", arXiv, 2024 ( ). [ ][ ]
Paper			: "LocalMamba: Visual State Space Model with Windowed Selective Scan", arXiv, 2024 ( ). [ ][ ]
Paper			: "SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series", arXiv, 2024 ( ). [ ][ ]
Paper			: "PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition", arXiv, 2024 ( ). [ ][ ]
Paper			: "EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba", arXiv, 2024 ( ). [ ][ ]
Paper			: "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs", arXiv, 2024 ( ). [ ]
Paper			: "MambaOut: Do We Really Need Mamba for Vision?", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Analysis for Transformer
Paper			: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 ( ). [ ][ ][ ]
Paper			: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 ( ). [ ][ ]
Paper			: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 ( ). [ ]
Paper			: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 ( ). [ ]
Paper			: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 ( ). [ ]
Paper			: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 ( ). [ ]
Paper			: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 ( ). [ ]
Paper			: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 ( ). [ ]
Paper			: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 ( ). [ ]
Paper			: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 ( ). [ ][ ]
Paper			: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 ( ). [ ][ ][ ]
Paper			: "How Do Vision Transformers Work?", ICLR, 2022 ( ). [ ][ ]
Paper			: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 ( ). [ ][ ]
Paper			: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 ( ). [ ]
Paper			: "Three things everyone should know about Vision Transformers", ECCV, 2022 ( ). [ ]
Paper			: "Vision Transformers provably learn spatial structure", NeurIPS, 2022 ( ). [ ]
Paper			: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 ( ). [ ][ ]
Paper			: "Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers", CVPR, 2023 ( ). [ ][ ]
Paper			: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 ( ). [ ]
Paper			: "Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems", arXiv, 2022 ( ). [ ]
Paper			: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 ( ). [ ][ ]
Paper			: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 ( ). [ ][ ]
Paper			: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 ( ). [ ][ ]
Paper			: "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 ( ). [ ]
Paper			: "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 ( ). [ ][ ]
Paper			: "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 ( ). [ ]
Paper			: "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 ( ). [ ]
Paper			: "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 ( ). [ ][ ]
Paper			: "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 ( ). [ ]
Paper			: "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 ( ). [ ]
Paper			: "Understanding Masked Autoencoders via Hierarchical Latent Variable Models", CVPR, 2023 ( ). [ ]
Paper			: "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Masked Autoencoding Does Not Help Natural Language Supervision at Scale", CVPR, 2023 ( ). [ ]
Paper			: "On Data Scaling in Masked Image Modeling", CVPR, 2023 ( ). [ ][ ]
Paper			: "Revealing the Dark Secrets of Masked Image Modeling", CVPR, 2023 ( ). [ ]
Paper			: "VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking", CVPRW, 2023 ( ). [ ][ ]
Paper			: "A Multidimensional Analysis of Social Biases in Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "Analyzing Vision Transformers for Image Classification in Class Embedding Space", NeurIPS, 2023 ( ). [ ]
Paper			: "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Are Vision Transformers More Data Hungry Than Newborn Visual Systems?", NeurIPS, 2023 ( ). [ ]
Paper			: "AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "AttentionViz: A Global View of Transformer Attention", arXiv, 2023 ( ). [ ][ ]
Paper			: "Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields", arXiv, 2023 ( ). [ ]
Paper			: "Reviving Shift Equivariance in Vision Transformers", arXiv, 2023 ( ). [ ]
Paper			: "ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer", arXiv, 2023 ( ). [ ]
Paper			: "Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems", arXiv, 2023 ( ). [ ]
Paper			: "A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis", arXiv, 2023 ( ). [ ][ ]
Paper			: "Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention", AAAI, 2024 ( ). [ ][ ]
Paper			: "Can Transformers Capture Spatial Relations between Objects?", ICLR, 2024 ( ). [ ][ ][ ]
Paper			: "Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer", CVPR, 2024 ( ). [ ]
Paper			: "On the Faithfulness of Vision Transformer Explanations", CVPR, 2024 ( ). [ ]
Paper			: "A Decade's Battle on Dataset Bias: Are We There Yet?", arXiv, 2024 ( ). [ ][ ]
Paper			: "LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Object Detection / General:
Paper			: "detrex: Benchmarking Detection Transformers", arXiv, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Object Detection / CNN-based backbone:
Paper			: "End-to-End Object Detection with Transformers", ECCV, 2020 ( ). [ ][ ]
Paper			: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 ( ). [ ][ ]
Paper			: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 ( ). [ ][ ]
Paper			: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 ( ). [ ][ ]
Paper			: "Conditional DETR for Fast Training Convergence", ICCV, 2021 ( ). [ ]
Paper			: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 ( ). [ ]
Paper			: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 ( ). [ ]
Paper			: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 ( ). [ ]
Paper			: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 ( ). [ ][ ]
Paper			: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 ( ). [ ]
Paper			: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 ( ). [ ]
Paper			: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 ( ). [ ]
Paper			: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 ( ). [ ][ ]
Paper			: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 ( ). [ ]
Paper			: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 ( ). [ ][ ]
Paper			: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 ( ). [ ][ ]
Paper			: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 ( ). [ ][ ]
Paper			: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 ( ). [ ][ ]
Paper			: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 ( ). [ ][ ]
Paper			: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 ( ). [ ][ ]
Paper			: "DESTR: Object Detection With Split Transformer", CVPR, 2022 ( ). [ ]
Paper			: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 ( ). [ ]
Paper			: "Towards Data-Efficient Detection Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 ( ). [ ]
Paper			: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 ( ). [ ]
Paper			: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper			: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 ( ). [ ]
Paper			: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 ( ). [ ][ ]
Paper			: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 ( ). [ ]
Paper			: "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 ( ). [ ]
Paper			: "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 ( ). [ ]
Paper			: "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 ( ). [ ]
Paper			: "D³ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 ( ). [ ][ ]
Paper			: "NMS Strikes Back", arXiv, 2022 ( ). [ ][ ]
Paper			: "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 ( ). [ ][ ]
Paper			: "Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR", CVPR, 2023 ( ). [ ][ ]
Paper			: "Dense Distinct Query for End-to-End Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "Siamese DETR", CVPR, 2023 ( ). [ ][ ]
Paper			: "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", CVPR, 2023 ( ). [ ]
Paper			: "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023 ( ). [ ][ ]
Paper			: "DETRs with Hybrid Matching", CVPR, 2023 ( ). [ ][ ]
Paper			: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", CVPR, 2023 ( ). [ ][ ]
Paper			: "Enhanced Training of Query-Based Object Detection via Selective Query Recollection", CVPR, 2023 ( ). [ ][ ]
Paper			: "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation", ICML, 2023 ( ). [ ]
Paper			: "SpeedDETR: Speed-aware Transformers for End-to-end Object Detection", ICML, 2023 ( ). [ ]
Paper			: "AlignDet: Aligning Pre-training and Fine-tuning in Object Detection", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Less is More: Focus Attention for Efficient DETR", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "DETR Doesn't Need Multi-Scale or Locality Design", ICCV, 2023 ( ). [ ][ ]
Paper			: "ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "Detection Transformer with Stable Matching", ICCV, 2023 ( ). [ ][ ]
Paper			: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", ICCV, 2023 ( ). [ ][ ]
Paper			: "DETRs with Collaborative Hybrid Assignments Training", ICCV, 2023 ( ). [ ][ ]
Paper			: "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", ICCV, 2023 ( ). [ ]
Paper			: "Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection", ICCV, 2023 ( ). [ ]
Paper			: "StageInteractor: Query-based Object Detector with Cross-stage Interaction", ICCV, 2023 ( ). [ ]
Paper			: "Rank-DETR for High Quality Object Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Cal-DETR: Calibrated Detection Transformer", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer", arXiv, 2023 ( ). [ ][ ]
Paper			: "FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "DETRs Beat YOLOs on Real-time Object Detection", arXiv, 2023 ( ). [ ]
Paper			: "Align-DETR: Improving DETR with Simple IoU-aware BCE loss", arXiv, 2023 ( ). [ ][ ]
Paper			: "Box-DETR: Understanding and Boxing Conditional Spatial Queries", arXiv, 2023 ( ). [ ][ ]
Paper			: "Enhancing Your Trained DETRs with Box Refinement", arXiv, 2023 ( ). [ ][ ]
Paper			: "Revisiting DETR Pre-training for Object Detection", arXiv, 2023 ( ). [ ]
Paper			: "Gen2Det: Generate to Detect", arXiv, 2023 ( ). [ ]
Paper			: "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions", CVPR, 2024 ( ). [ ][ ]
Paper			: "Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement", CVPR, 2024 ( ). [ ][ ]
Paper			: "MS-DETR: Efficient DETR Training with Mixed Supervision", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Object Detection / Transformer-based backbone:
Paper			: "Toward Transformer-Based Object Detection", arXiv, 2020 ( ). [ ]
Paper			: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 ( ). [ ]
Paper			: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 ( ). [ ]
Paper			: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 ( ). [ ][ ]
Paper			: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 ( ). [ ]
Paper			: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 ( ). [ ]
Paper			: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 ( ). [ ]
Paper			: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 ( ). [ ]
Paper			: "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 ( ). [ ]
Paper			: "D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 ( ). [ ]
Paper			: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 ( ). [ ][ ]
Paper			: "SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / 3D Object Detection
Paper			: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 ( ). [ ][ ]
Paper			: "3D Object Detection with Pointformer", arXiv, 2020 ( ). [ ]
Paper			: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 ( ). [ ][ ]
Paper			: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Voxel Transformer for 3D Object Detection", ICCV, 2021 ( ). [ ]
Paper			: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 ( ). [ ][ ][ ]
Paper			: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 ( ). [ ]
Paper			: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 ( ). [ ][ ]
Paper			: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 ( ). [ ][ ]
Paper			: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 ( ). [ ]
Paper			: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 ( ). [ ]
Paper			: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 ( ). [ ]
Paper			: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 ( ). [ ]
Paper			: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 ( ). [ ][ ]
Paper			: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 ( ). [ ]
Paper			: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper			: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper			: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 ( ). [ ][ ]
Paper			: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 ( ). [ ]
Paper			: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 ( ). [ ]
Paper			: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 ( ). [ ]
Paper			: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 ( ). [ ]
Paper			: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 ( ). [ ]
Paper			: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 ( ). [ ][ ]
Paper			: "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 ( ). [ ]
Paper			: "PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "OcTr: Octree-based Transformer for 3D Object Detection", CVPR, 2023 ( ). [ ]
Paper			: "MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer", CVPR, 2023 ( ). [ ]
Paper			: "PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer", CVPR, 2023 ( ). [ ][ ]
Paper			: "ConQueR: Query Contrast Voxel-DETR for 3D Object Detection", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets", CVPR, 2023 ( ). [ ][ ]
Paper			: "AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers", CVPR, 2023 ( ). [ ][ ]
Paper			: "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training", CVPR, 2023 ( ). [ ][ ]
Paper			: "FocalFormer3D: Focusing on Hard Instance for 3D Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "Efficient Transformer-based 3D Object Detection with Dynamic Token Halting", ICCV, 2023 ( ). [ ]
Paper			: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", ICCV, 2023 ( ). [ ][ ]
Paper			: "Object as Query: Lifting any 2D Object Detector to 3D Detection", ICCV, 2023 ( ). [ ]
Paper			: "An Empirical Analysis of Range for 3D Object Detection", ICCVW, 2023 ( ). [ ]
Paper			: "Uni3DETR: Unified 3D Detection Transformer", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection", arXiv, 2023 ( ). [ ][Code (in construction)]
Paper			: "V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection", arXiv, 2023 ( ). [ ][ ]
Paper			: "3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection", arXiv, 2023 ( ). [ ][ ]
Paper			: "Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection", AAAI, 2024 ( ). [ ]
Paper			: "MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection", ICLR, 2024 ( ). [ ][ ]
Paper			: "Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors", CVPR, 2024 ( ). [ ]
Paper			: "ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention", arXiv, 2024 ( ). [ ][ ]
Paper			: "MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Multi-Modal Detection
Paper			: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 ( ). [ ][ ]
Paper			: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 ( ). [ ][ ][ ]
Paper			: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 ( ). [ ]
Paper			: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 ( ). [ ][ ]
Paper			: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 ( ). [ ]
Paper			: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 ( ). [ ]
Paper			: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 ( ). [ ]
Paper			: "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 ( ). [ ][ ]
Paper			: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", ICLR, 2023 ( ). [ ][ ]
Paper			: "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", CVPR, 2023 ( ). [ ]
Paper			: "OmniLabel: A Challenging Benchmark for Language-Based Object Detection", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Multi-Modal Classifiers for Open-Vocabulary Object Detection", ICML, 2023 ( ). [ ][ ][ ]
Paper			: "CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Contextual Object Detection with Multimodal Large Language Models", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / HOI Detection
Paper			: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 ( ). [ ][ ]
Paper			: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 ( ). [ ][ ]
Paper			: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 ( ). [ ]
Paper			: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 ( ). [ ]
Paper			: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 ( ). [ ][ ]
Paper			: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 ( ). [ ]
Paper			: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 ( ). [ ][ ]
Paper			: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 ( ). [ ]
Paper			: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 ( ). [ ]
Paper			: "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection", CVPR, 2022 ( ). [ ][ ]
Paper			: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 ( ). [ ][ ]
Paper			: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 ( ). [ ]
Paper			: "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 ( ). [ ]
Paper			: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 ( ). [ ]
Paper			: "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models", CVPR, 2023 ( ). [ ][ ]
Paper			: "ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework", CVPR, 2023 ( ). [ ]
Paper			: "Category Query Learning for Human-Object Interaction Classification", CVPR, 2023 ( ). [ ][ ]
Paper			: "Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection", ICCV, 2023 ( ). [ ]
Paper			: "Exploring Predicate Visual Context in Detecting of Human-Object Interactions", ICCV, 2023 ( ). [ ][ ]
Paper			: "Agglomerative Transformer for Human-Object Interaction Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "RLIPv2: Fast Scaling of Relational Language-Image Pre-training", ICCV, 2023 ( ). [ ][ ]
Paper			: "EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding", ICCV, 2023 ( ). [ ][ ]
Paper			: "Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Neural-Logic Human-Object Interaction Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels", arXiv, 2023 ( ). [ ]
Paper			: "Disentangled Pre-training for Human-Object Interaction Detection", CVPR, 2024 ( ). [ ][ ]
Paper			: "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision", arXiv, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Salient Object Detection
Paper			: "Visual Saliency Transformer", ICCV, 2021 ( ). [ ]
Paper			: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 ( ). [ ]
Paper			: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 ( ). [ ][ ]
Paper			: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 ( ). [ ]
Paper			: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 ( ). [ ]
Paper			: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 ( ). [ ]
Paper			: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 ( ). [ ]
Paper			: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 ( ). [ ]
Paper			: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 ( ). [ ][ ]
Paper			: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 ( ). [ ]
Paper			: "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection", ACMMM, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / X-supervised:
Paper			: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 ( ). [ ][ ]
Paper			: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 ( ). [ ]
Paper			: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 ( ). [ ][ ]
Paper			: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 ( ). [ ][ ][ ]
Paper			: "Semi-DETR: Semi-Supervised Object Detection With Detection Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Object Discovery from Motion-Guided Tokens", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Cut and Learn for Unsupervised Object Detection and Instance Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames", ICML, 2023 ( ). [ ]
Paper			: "MOST: Multiple Object localization with Self-supervised Transformers for object discovery", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Generative Prompt Model for Weakly Supervised Object Localization", ICCV, 2023 ( ). [ ][ ]
Paper			: "Spatial-Aware Token for Weakly Supervised Object Localization", ICCV, 2023 ( ). [ ][ ]
Paper			: "ALWOD: Active Learning for Weakly-Supervised Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper	52	about 2 years ago	: "HASSOD: Hierarchical Adaptive Self-Supervised Object Detection", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers", arXiv, 2023 ( ). [ ]
Paper			: "R-MAE: Regions Meet Masked Autoencoders", arXiv, 2023 ( ). [ ]
Paper			: "SimDETR: Simplifying self-supervised pretraining for DETR", arXiv, 2023 ( ). [ ]
Paper			: "Unsupervised Universal Image Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers", CVPR, 2024 ( ). [ ][ ]
Paper			: "Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection", CVPR, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / X-Shot Object Detection:
Paper			: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 ( ). [ ]
Paper			: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 ( ). [ ][ ]
Paper			: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 ( ). [ ]
Paper			: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 ( ). [ ]
Paper			: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 ( ). [ ]
Paper			: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper			: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 ( ). [ ]
Paper			: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 ( ). [ ]
Paper			: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", ICCV, 2023 ( ). [ ]
Paper			: "Meta-ZSDETR: Zero-shot DETR with Meta-learning", ICCV, 2023 ( ). [ ]
Paper			: "Revisiting Few-Shot Object Detection with Vision-Language Models", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Open-World/Vocabulary:
Paper			: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 ( ). [ ][ ]
Paper			: "RegionCLIP: Region-based Language-Image Pretraining", CVPR, 2022 ( ). [ ][ ]
Paper			: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 ( ). [ ]
Paper			: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 ( ). [ ]
Paper			: "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 ( ). [ ][ ][ ]
Paper			: "P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 ( ). [ ]
Paper			: "Aligning Bag of Regions for Open-Vocabulary Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "CapDet: Unifying Dense Captioning and Open-World Detection Pretraining", CVPR, 2023 ( ). [ ]
Paper			: "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching", CVPR, 2023 ( ). [ ][ ]
Paper			: "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment", CVPR, 2023 ( ). [ ]
Paper			: "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers", CVPR, 2023 ( ). [ ]
Paper			: "CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "Learning to Detect and Segment for Open Vocabulary Object Detection", CVPR, 2023 ( ). [ ]
Paper			: "Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "Open-vocabulary Attribute Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "OvarNet: Towards Open-vocabulary Object Attribute Recognition", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Annealing-Based Label-Transfer Learning for Open World Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "PROB: Probabilistic Objectness for Open World Object Detection", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Random Boxes Are Open-world Object Detectors", ICCV, 2023 ( ). [ ][ ]
Paper			: "Cascade-DETR: Delving into High-Quality Universal Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment", ICCV, 2023 ( ). [ ][ ]
Paper			: "V3Det: Vast Vocabulary Visual Detection Dataset", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Scaling Open-Vocabulary Object Detection", NeurIPS, 2023 ( ). [ ]
Paper			: "Multi-modal Queried Object Detection in the Wild", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", arXiv, 2023 ( ). [ ]
Paper			: "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning", arXiv, 2023 ( ). [ ]
Paper			: "Three ways to improve feature alignment for open vocabulary detection", arXiv, 2023 ( ). [ ]
Paper			: "Open-Vocabulary Object Detection using Pseudo Caption Labels", arXiv, 2023 ( ). [ ]
Paper			: "Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ]
Paper			: "LOWA: Localize Objects in the Wild with Attributes", arXiv, 2023 ( ). [ ]
Paper			: "Open-Vocabulary Object Detection via Scene Graph Discovery", arXiv, 2023 ( ). [ ]
Paper			: "Improving Pseudo Labels for Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ]
Paper			: "Detect Every Thing with Few Examples", arXiv, 2023 ( ). [ ][ ]
Paper			: "CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction", arXiv, 2023 ( ). [ ][ ]
Paper			: "DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ][ ]
Paper			: "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection", arXiv, 2023 ( ). [ ]
Paper			: "Recognize Any Regions", arXiv, 2023 ( ). [ ][ ]
Paper			: "Language-conditioned Detection Transformer", arXiv, 2023 ( ). [ ][ ]
Paper			: "Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ]
Paper			: "Open World Object Detection in the Era of Foundation Models", arXiv, 2023 ( ). [ ][ ]
Paper			: "LP-OVOD: Open-Vocabulary Object Detection by Linear Probing", WACV, 2024 ( ). [ ]
Paper			: "ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection", WACV, 2024 ( ). [ ]
Paper			: "Weakly Supervised Open-Vocabulary Object Detection", AAAI, 2024 ( ). [ ][ ]
Paper			: "CLIM: Contrastive Language-Image Mosaic for Region Representation", AAAI, 2024 ( ). [ ][ ]
Paper			: "Semi-supervised Open-World Object Detection", AAAI, 2024 ( ). [ ][ ]
Paper			: "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors", ICLR, 2024 ( ). [ ]
Paper			: "Generative Region-Language Pretraining for Open-Ended Object Detection", CVPR, 2024 ( ). [ ][ ]
Paper			: "DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection", CVPR, 2024 ( ). [ ]
Paper			: "Retrieval-Augmented Open-Vocabulary Object Detection", CVPR, 2024 ( ). [ ][ ]
Paper			: "SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection", CVPR, 2024 ( ). [ ]
Paper			: "An Open and Comprehensive Pipeline for Unified Object Grounding and Detection", arXiv, 2024 ( ). [ ][ ]
Paper			: "YOLO-World: Real-Time Open-Vocabulary Object Detection", arXiv, 2024 ( ). [ ][ ]
Paper			: "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Pedestrian Detection:
Paper			: "DETR for Crowd Pedestrian Detection", arXiv, 2020 ( ). [ ][ ]
Paper			: "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 ( ). [ ]
Paper			: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 ( ). [ ][ ]
Paper			: "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision", CVPR, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Lane Detection:
Paper			: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 ( ). [ ][ ]
Paper			: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 ( ). [ ][ ]
Paper			: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 ( ). [ ]
Paper			: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 ( ). [ ]
Paper			: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 ( ). [ ][ ]
Paper			: "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 ( ). [ ]
Paper			: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 ( ). [ ]
Paper			: "LATR: 3D Lane Detection from Monocular Images with Transformer", ICCV, 2023 ( ). [ ][ ]
Paper			: "End to End Lane detection with One-to-Several Transformer", arXiv, 2023 ( ). [ ][ ]
Paper			: "Lane2Seq: Towards Unified Lane Detection via Sequence Generation", CVPR, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Object Localization:
Paper			: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 ( ). [ ]
Paper			: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 ( ). [ ]
Paper			: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 ( ). [ ][ ]
Paper			: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 ( ). [ ][ ]
Paper			: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 ( ). [ ]
Paper			: "CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation", ICML, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Relation Detection:
Paper			: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 ( ). [ ]
Paper			: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 ( ). [ ]
Paper			: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 ( ). [ ][ ]
Paper			: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 ( ). [ ]
Paper			: "Unified Visual Relationship Detection with Vision and Language Models", ICCV, 2023 ( ). [ ][ ]
Paper			: "Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models", NeurIPS, 2023 ( ). [ ]
Paper			: "Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Anomaly Detection:
Paper			: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 ( ). [ ]
Paper			: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 ( ). [ ]
Paper			: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 ( ). [ ]
Paper			: "WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "Multimodal Industrial Anomaly Detection via Hybrid Fusion", CVPR, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Cross-Domain:
Paper			: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 ( ). [ ]
Paper			: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 ( ). [ ]
Paper			: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 ( ). [ ]
Paper			: "DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection", CVPR, 2023 ( ). [ ]
Paper			: "DA-DETR: Domain Adaptive Detection Transformer with Information Fusion", CVPR, 2023 ( ). [ ]
Paper			: "CLIP the Gap: A Single Domain Generalization Approach for Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Co-Salient Object Detection:
Paper			: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Oriented Object Detection:
Paper			: "Oriented Object Detection with Transformer", arXiv, 2021 ( ). [ ]
Paper			: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 ( ). [ ]
Paper			: "ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer", arXiv, 2023 ( ). [ ][ ]
Paper			: "RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Multiview Detection:
Paper			: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Polygon Detection:
Paper			: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Drone-view:
Paper			: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 ( ). [ ]
Paper			: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Infrared:
Paper			: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 ( ). [ ]
Paper			: "MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Text Detection:
Paper			: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 ( ). [ ][ ]
Paper			: "Text Spotting Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 ( ). [ ]
Paper			: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 ( ). [ ]
Paper			: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 ( ). [ ]
Paper			: "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 ( ). [ ]
Paper			: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI, 2023 ( ). [ ][ ]
Paper			: "Turning a CLIP Model into a Scene Text Detector", CVPR, 2023 ( ). [ ][ ]
Paper			: "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting", CVPR, 2023 ( ). [ ][ ]
Paper			: "ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer", ICCV, 2023 ( ). [ ][ ]
Paper			: "PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer", ACMMM, 2023 ( ). [ ]
Paper			: "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting", arXiv, 2023 ( ). [ ][ ]
Paper			: "Turning a CLIP Model into a Scene Text Spotter", arXiv, 2023 ( ). [ ][ ]
Paper			: "SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis", CVPR, 2024 ( ). [ ]
Paper			: "SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Change Detection:
Paper			: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 ( ). [ ][ ]
Paper			: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Edge Detection:
Paper			: "EDTER: Edge Detection with Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Person Search:
Paper			: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 ( ). [ ][ ]
Paper			: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Manipulation Detection:
Paper			: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Mirror Detection:
Paper			: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Shadow Detection:
Paper			: "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", CVPR, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Keypoint Detection:
Paper			: "From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Continual Learning:
Paper			: "Continual Detection Transformer for Incremental Object Detection", CVPR, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Visual Query Detection/Localization:
Paper			: "Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization", CVPR, 2023 ( ). [ ][ ]
Paper			: "Single-Stage Visual Query Localization in Egocentric Videos", NeurIPS, 2023 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Task-Driven Object Detection:
Paper			: "CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection", ICCV, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Diffusion:
Paper			: "DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Text-image Alignment for Diffusion-based Perception", arXiv, 2023 ( ). [ ][ ]
Paper			: "InstaGen: Enhancing Object Detection by Training on Synthetic Dataset", arXiv, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Semantic Segmentation
Paper			: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 ( ). [ ][ ][ ]
Paper			: "TrSeg: Transformer for semantic segmentation", PRL, 2021 ( ). [ ][ ]
Paper			: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 ( ). [ ][ ]
Paper			: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 ( ). [ ][ ]
Paper			: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 ( ). [ ][ ]
Paper			: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 ( ). [ ]
Paper			: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 ( ). [ ]
Paper			: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 ( ). [ ][ ]
Paper			: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 ( ). [ ]
Paper			: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 ( ). [ ]
Paper			: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 ( ). [ ]
Paper			: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 ( ). [ ][ ]
Paper			: "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 ( ). [ ]
Paper			: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 ( ). [ ]
Paper			: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 ( ). [ ][ ]
Paper			: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 ( ). [ ]
Paper			: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 ( ). [ ][ ][ ]
Paper			: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 ( ). [ ][ ]
Paper			: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 ( ). [ ][ ][ ]
Paper			: "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 ( ). [ ]
Paper			: "Probabilistic Prompt Learning for Dense Prediction", CVPR, 2023 ( ). [ ]
Paper			: "AutoFocusFormer: Image Segmentation off the Grid", CVPR, 2023 ( ). [ ]
Paper			: "Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Transformer Scale Gate for Semantic Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "Dynamic Focus-aware Positional Queries for Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper			: "Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper			: "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "AiluRus: A Scalable ViT Framework for Dense Prediction", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "Dynamic Token-Pass Transformers for Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Category Feature Transformer for Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Superpixel Transformers for Efficient Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation", AAAI, 2024 ( ). [ ][ ]
Paper			: "Region-Based Representations Revisited", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Depth Estimation
Paper			: "Vision Transformers for Dense Prediction", ICCV, 2021 ( ). [ ][ ]
Paper			: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 ( ). [ ][ ]
Paper			: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 ( ). [ ][ ]
Paper			: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAPP, 2022 ( ). [ ]
Paper			: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 ( ). [ ]
Paper			: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 ( ). [ ]
Paper			: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 ( ). [ ]
Paper			: "Depth Estimation with Simplified Transformer", CVPRW, 2022 ( ). [ ]
Paper			: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 ( ). [ ][ ]
Paper			: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 ( ). [ ][ ]
Paper			: "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 ( ). [ ]
Paper			: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 ( ). [ ]
Paper			: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 ( ). [ ][ ]
Paper			: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 ( ). [ ][ ]
Paper			: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 ( ). [ ]
Paper			: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 ( ). [ ]
Paper			: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 ( ). [ ][ ]
Paper			: "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 ( ). [ ]
Paper			: "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 ( ). [ ]
Paper			: "Lightweight Monocular Depth Estimation via Token-Sharing Transformer", ICRA, 2023 ( ). [ ]
Paper			: "CompletionFormer: Depth Completion with Convolutions and Vision Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", CVPR, 2023 ( ). [ ][ ]
Paper			: "EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation", ICCV, 2023 ( ). [ ]
Paper			: "Towards Zero-Shot Scale-Aware Monocular Depth Estimation", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Win-Win: Training High-Resolution Vision Transformers from Two Windows", arXiv, 2023 ( ). [ ]
Paper			: "Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation", WACV, 2024 ( ). [ ]
Paper			: "DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions", CVPR, 2024 ( ). [ ]
Paper			: "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data", arXiv, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Object Segmentation
Paper			: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 ( ). [ ][ ]
Paper			: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 ( ). [ ][ ]
Paper			: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 ( ). [ ]
Paper			: "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Any-X/Every-X:
Paper			: "Segment Anything", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Segment Everything Everywhere All at Once", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Segment Anything in High Quality", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "An Empirical Study on the Robustness of the Segment Anything Model (SAM)", arXiv, 2023 ( ). [ ]
Paper			: "A Comprehensive Survey on Segment Anything Model for Vision and Beyond", arXiv, 2023 ( ). [ ]
Paper			: "SAD: Segment Any RGBD", arXiv, 2023 ( ). [ ][ ]
Paper			: "A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering", arXiv, 2023 ( ). [ ]
Paper			: "Robustness of SAM: Segment Anything Under Corruptions and Beyond", arXiv, 2023 ( ). [ ]
Paper			: "Fast Segment Anything", arXiv, 2023 ( ). [ ][ ]
Paper			: "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications", arXiv, 2023 ( ). [ ][ ]
Paper			: "Semantic-SAM: Segment and Recognize Anything at Any Granularity", arXiv, 2023 ( ). [ ][ ]
Paper			: "Follow Anything: Open-set detection, tracking, and following in real-time", arXiv, 2023 ( ). [ ]
Paper			: "Visual In-Context Prompting", arXiv, 2023 ( ). [ ][ ]
Paper			: "Stable Segment Anything Model", arXiv, 2023 ( ). [ ][ ]
Paper			: "EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything", arXiv, 2023 ( ). [ ]
Paper			: "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "RepViT-SAM: Towards Real-Time Segmenting Anything", arXiv, 2023 ( ). [ ][ ]
Paper			: "0.1% Data Makes Segment Anything Slim", arXiv, 2023 ( ). [ ][ ]
Paper			: "Interfacing Foundation Models' Embeddings", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "SqueezeSAM: User-friendly mobile interactive segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Tokenize Anything via Prompting", arXiv, 2023 ( ). [ ][ ]
Paper			: "MobileSAMv2: Faster Segment Anything to Everything", arXiv, 2023 ( ). [ ][ ]
Paper			: "TinySAM: Pushing the Envelope for Efficient Segment Anything Model", arXiv, 2023 ( ). [ ][ ]
Paper			: "Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model", ICLR, 2024 ( ). [ ][ ]
Paper			: "Personalize Segment Anything Model with One Shot", ICLR, 2024 ( ). [ ][ ]
Paper			: "VRP-SAM: SAM with Visual Reference Prompt", CVPR, 2024 ( ). [ ]
Paper			: "Unsegment Anything by Simulating Deformation", CVPR, 2024 ( ). [ ][ ]
Paper			: "ASAM: Boosting Segment Anything Model with Adversarial Tuning", CVPR, 2024 ( ). [ ][ ][ ]
Paper			: "PTQ4SAM: Post-Training Quantization for Segment Anything", CVPR, 2024 ( ). [ ][ ]
Paper			: "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model", arXiv, 2024 ( ). [ ]
Paper			: "Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "Learning to Prompt Segment Anything Models", arXiv, 2024 ( ). [ ]
Paper			: "RAP-SAM: Towards Real-Time All-Purpose Segment Anything", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper			: "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks", arXiv, 2024 ( ). [ ][ ]
Paper			: "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss", arXiv, 2024 ( ). [ ][ ]
Paper			: "DeiSAM: Segment Anything with Deictic Prompting", arXiv, 2024 ( ). [ ]
Paper			: "CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM", arXiv, 2024 ( ). [ ][ ]
Paper			: "Part-aware Personalized Segment Anything Model for Patient-Specific Segmentation", arXiv, 2024 ( ). [ ]
Paper			: "Practical Region-level Attack against Segment Anything Models", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Vision-Language:
Paper			: "Language-driven Semantic Segmentation", ICLR, 2022 ( ). [ ][ ]
Paper			: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 ( ). [ ][ ]
Paper			: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "Extract Free Dense Labels from CLIP", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency", ICLR, 2023 ( ). [ ][ ]
Paper			: "LMSeg: Language-guided Multi-dataset Segmentation", ICLR, 2023 ( ). [ ]
Paper			: "VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations", ICRA, 2023 ( ). [ ][ ]
Paper			: "Generalized Decoding for Pixel, Image, and Language", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "IFSeg: Image-free Semantic Segmentation via Vision-Language Model", CVPR, 2023 ( ). [ ][ ]
Paper			: "Delving into Shape-aware Zero-shot Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Learning Mask-aware CLIP Representations for Zero-Shot Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ][ ]
Paper			: "ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts", arXiv, 2023 ( ). [ ]
Paper			: "SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery", arXiv, 2023 ( ). [ ]
Paper			: "Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "[CLS] Token is All You Need for Zero-Shot Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding", arXiv, 2023 ( ). [ ]
Paper			: "Grounding Everything: Emerging Localization Properties in Vision-Language Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation", AAAI, 2024 ( ). [ ][ ]
Paper			: "Annotation Free Semantic Segmentation with Vision Foundation Models", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Open-World/Vocabulary:
Paper			: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 ( ). [ ]
Paper			: "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 ( ). [ ][ ]
Paper			: "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels", ECCV, 2022 ( ). [ ]
Paper			: "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 ( ). [ ][ ]
Paper			: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", CVPR, 2023 ( ). [ ][ ]
Paper			: "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations", CVPR, 2023 ( ). [ ][ ]
Paper			: "FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "Side Adapter Network for Open-Vocabulary Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", CVPR, 2023 ( ). [ ]
Paper			: "Open-Vocabulary Universal Image Segmentation with MaskCLIP", ICML, 2023 ( ). [ ][ ]
Paper			: "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", ICML, 2023 ( ). [ ][ ]
Paper			: "Exploring Transformers for Open-world Instance Segmentation", ICCV, 2023 ( ). [ ]
Paper			: "Open-vocabulary Object Segmentation with Diffusion Models", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning", ICCV, 2023 ( ). [ ][ ]
Paper			: "Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "A Simple Framework for Open-Vocabulary Segmentation and Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "Open-vocabulary Panoptic Segmentation with Embedding Modulation", ICCV, 2023 ( ). [ ]
Paper			: "Global Knowledge Calibration for Fast Open-Vocabulary Segmentation", ICCV, 2023 ( ). [ ]
Paper			: "Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only", ICCV, 2023 ( ). [ ]
Paper			: "MasQCLIP for Open-Vocabulary Universal Image Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Going Denser with Open-Vocabulary Part Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network", ICCV, 2023 ( ). [ ][ ]
Paper			: "MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper			: "OV-PARTS: Towards Open-Vocabulary Part Segmentation", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ]
Paper			: "Hierarchical Open-vocabulary Universal Image Segmentation", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation", NeurIPS, 2023 ( ). [ ]
Paper			: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Diffusion Models for Zero-Shot Open-Vocabulary Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "Unified Open-Vocabulary Dense Visual Prediction", arXiv, 2023 ( ). [ ]
Paper			: "CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free", arXiv, 2023 ( ). [ ]
Paper			: "Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion", arXiv, 2023 ( ). [ ][ ]
Paper			: "Towards Open-Ended Visual Recognition with Large Language Model", arXiv, 2023 ( ). [ ][ ]
Paper			: "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models", arXiv, 2023 ( ). [ ]
Paper			: "SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference", arXiv, 2023 ( ). [ ]
Paper			: "Towards Granularity-adjusted Pixel-level Semantic Annotation", arXiv, 2023 ( ). [ ]
Paper			: "Boosting Segment Anything Model Towards Open-Vocabulary Learning", arXiv, 2023 ( ). [ ][ ]
Paper			: "Open-Vocabulary Segmentation with Semantic-Assisted Calibration", arXiv, 2023 ( ). [ ][ ]
Paper			: "Self-Guided Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "OpenSD: Unified Open-Vocabulary Segmentation and Detection", arXiv, 2023 ( ). [ ]
Paper			: "CLIP-DINOiser: Teaching CLIP a few DINO tricks", arXiv, 2023 ( ). [ ][ ]
Paper			: "TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation", CVPR, 2024 ( ). [ ]
Paper			: "Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation", CVPR, 2024 ( ). [ ][ ]
Paper			: "Exploring Simple Open-Vocabulary Semantic Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper			: "PosSAM: Panoptic Open-vocabulary Segment Anything", arXiv, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / LLM-based:
Paper			: "LISA: Reasoning Segmentation via Large Language Model", arXiv, 2023 ( ). [ ][ ]
Paper			: "PixelLM: Pixel Reasoning with Large Multimodal Model", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Pixel Aligned Language Models", arXiv, 2023 ( ). [ ][ ]
Paper			: "GSVA: Generalized Segmentation via Multimodal Large Language Models", arXiv, 2023 ( ). [ ]
Paper			: "An Improved Baseline for Reasoning Segmentation with Large Language Model", arXiv, 2023 ( ). [ ]
Paper			: "GROUNDHOG: Grounding Large Language Models to Holistic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model", arXiv, 2024 ( ). [ ][ ]
Paper			: "Empowering Segmentation Ability to Multi-modal Large Language Models", arXiv, 2024 ( ). [ ]
Paper			: "LaSagnA: Language-based Segmentation Assistant for Complex Queries", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Universal Segmentation:
Paper			: "K-Net: Towards Unified Image Segmentation", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "MP-Former: Mask-Piloted Transformer for Image Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "OneFormer: One Transformer to Rule Universal Image Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Universal Instance Perception as Object Discovery and Retrieval", CVPR, 2023 ( ). [ ][ ]
Paper			: "CLUSTSEG: Clustering for Universal Segmentation", ICML, 2023 ( ). [ ]
Paper			: "DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model", NeurIPS, 2023 ( ). [ ]
Paper			: "DFormer: Diffusion-guided Transformer for Universal Image Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task", arXiv, 2023 ( ). [ ]
Paper			: "Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "PolyMaX: General Dense Prediction with Mask Transformer", WACV, 2024 ( ). [ ][ ]
Paper			: "PEM: Prototype-based Efficient MaskFormer for Image Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "OMG-Seg: Is One Model Good Enough For All Segmentation?", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision", arXiv, 2024 ( ). [ ][ ]
Paper			: "Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Multi-Modal:
Paper			: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 ( ). [ ]
Paper			: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "Delivering Arbitrary-Modal Semantic Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Panoptic Segmentation:
Paper			: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 ( ). [ ][ ]
Paper			: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 ( ). [ ]
Paper			: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 ( ). [ ]
Paper			: "Panoptic SegFormer", CVPR, 2022 ( ). [ ][ ]
Paper			: "k-means Mask Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper			: "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "You Only Segment Once: Towards Real-Time Panoptic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "A Generalist Framework for Panoptic Segmentation of Images and Videos", ICCV, 2023 ( ). [ ][ ]
Paper			: "Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning", ICCV, 2023 ( ). [ ][ ]
Paper			: "ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning", CVPR, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Instance Segmentation:
Paper			: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 ( ). [ ]
Paper			: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 ( ). [ ]
Paper			: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Vision Transformers Are Good Mask Auto-Labelers", CVPR, 2023 ( ). [ ][ ]
Paper			: "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt", CVPR, 2023 ( ). [ ]
Paper			: "X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion", ICML, 2023 ( ). [ ][ ]
Paper			: "DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Mask Frozen-DETR: High Quality Instance Segmentation with One GPU", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Optical Flow:
Paper			: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 ( ). [ ][ ]
Paper			: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 ( ). [ ][ ]
Paper			: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 ( ). [ ][ ]
Paper			: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 ( ). [ ][ ]
Paper			: "TransFlow: Transformer as Flow Learner", CVPR, 2023 ( ). [ ]
Paper			: "FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation", CVPR, 2023 ( ). [ ]
Paper			: "FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Panoramic Semantic Segmentation:
Paper			: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation", IJCAI, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / X-Shot:
Paper			: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 ( ). [ ]
Paper			: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 ( ). [ ]
Paper			: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 ( ). [ ]
Paper			: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 ( ). [ ]
Paper			: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 ( ). [ ]
Paper			: "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 ( ). [ ][ ]
Paper			: "SegGPT: Segmenting Everything In Context", ICCV, 2023 ( ). [ ][ ]
Paper			: "Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "Multi-Modal Prototypes for Open-Set Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Few-Shot Panoptic Segmentation With Foundation Models", arXiv, 2023 ( ). [ ][ ]
Paper			: "Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach", CVPR, 2024 ( ). [ ]
Paper			: "Explore In-Context Segmentation via Latent Diffusion Models", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / X-Supervised:
Paper			: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Cross Language Image Matching for Weakly Supervised Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 ( ). [ ]
Paper			: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 ( ). [ ][ ][ ]
Paper			: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper			: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper			: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 ( ). [ ]
Paper			: "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Token Contrast for Weakly-Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "DPF: Learning Dense Prediction Fields with Weak Supervision", CVPR, 2023 ( ). [ ][ ]
Paper			: "SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization", CVPR, 2023 ( ). [ ]
Paper			: "A Simple Framework for Text-Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport", ICCV, 2023 ( ). [ ][ ]
Paper			: "BoxSnake: Polygonal Instance Segmentation with Box Supervision", ICCV, 2023 ( ). [ ]
Paper			: "Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation", ACMMM, 2023 ( ). [ ][ ]
Paper			: "Bridging Semantic Gaps for Language-Supervised Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Label-efficient Segmentation via Affinity Propagation", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "PaintSeg: Training-free Segmentation via Painting", NeurIPS, 2023 ( ). [ ]
Paper			: "SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Towards Universal Vision-language Omni-supervised Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems", arXiv, 2023 ( ). [ ]
Paper			: "Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "MIMIC: Masked Image Modeling with Image Correspondences", arXiv, 2023 ( ). [ ][ ]
Paper			: "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "Guided Distillation for Semi-Supervised Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding", arXiv, 2023 ( ). [ ]
Paper			: "Emergence of Segmentation with Minimalistic White-Box Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models", arXiv, 2023 ( ). [ ]
Paper			: "Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Foundation Model Assisted Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance", arXiv, 2023 ( ). [ ][ ]
Paper			: "Progressive Uncertain Feature Self-reinforcement for Weakly Supervised Semantic Segmentation", AAAI, 2024 ( ). [ ][ ]
Paper			: "FeatUp: A Model-Agnostic Framework for Features at Any Resolution", ICLR, 2024 ( ). [ ]
Paper			: "The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models", ICLR, 2024 ( ). [ ][ ]
Paper			: "Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation", arXiv, 2024 ( ). [ ]
Paper			: "WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition", arXiv, 2024 ( ). [ ][ ]
Paper			: "Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper			: "CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Cross-Domain:
Paper			: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration", CVPR, 2023 ( ). [ ]
Paper			: "MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation", CVPR, 2023 ( ). [ ][ ]
Paper			: "CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Prompting Diffusion Representations for Cross-Domain Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Continual Learning:
Paper			: "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class", CVPR, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Crack Detection:
Paper			: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Camouflaged/Concealed Object:
Paper			: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 ( ). [ ][ ]
Paper			: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 ( ). [ ][ ]
Paper			: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping", NeurIPS, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Background Separation:
Paper			: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Scene Understanding:
Paper			: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 ( ). [ ]
Paper			: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 ( ). [ ][ ]
Paper			: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / 3D Segmentation:
Paper			: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 ( ). [ ]
Paper			: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 ( ). [ ][ ]
Paper			: "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 ( ). [ ]
Paper			: "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 ( ). [ ]
Paper			: "VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion", CVPR, 2023 ( ). [ ][ ]
Paper			: "GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds", CVPR, 2023 ( ). [ ][ ]
Paper			: "RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving", CVPR, 2023 ( ). [ ][ ]
Paper			: "Heat Diffusion based Multi-scale and Geometric Structure-aware Transformer for Mesh Segmentation", CVPR, 2023 ( ). [ ]
Paper			: "MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving", CVPR, 2023 ( ). [ ][ ]
Paper			: "See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data", ICCV, 2023 ( ). [ ]
Paper			: "SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper			: "Mask-Attention-Free Transformer for 3D Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase", ICCV, 2023 ( ). [ ][ ]
Paper			: "2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision", ICCV, 2023 ( ). [ ]
Paper			: "CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion", ICCV, 2023 ( ). [ ]
Paper			: "Efficient 3D Semantic Segmentation with Superpoint Transformer", ICCV, 2023 ( ). [ ][ ]
Paper			: "SATR: Zero-Shot Semantic Segmentation of 3D Shapes", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "3D Indoor Instance Segmentation in an Open-World", NeurIPS, 2023 ( ). [ ]
Paper			: "Segment Anything in 3D with NeRFs", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Position-Guided Point Cloud Panoptic Segmentation Transformer", arXiv, 2023 ( ). [ ][ ]
Paper			: "UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes", arXiv, 2023 ( ). [ ][ ]
Paper			: "Towards Label-free Scene Understanding by Vision Foundation Models", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Dynamic Clustering Transformer Network for Point Cloud Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Symphonize 3D Semantic Scene Completion with Contextual Instance Queries", arXiv, 2023 ( ). [ ][ ]
Paper			: "Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks", arXiv, 2023 ( ). [ ][ ]
Paper			: "When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision", arXiv, 2023 ( ). [ ]
Paper			: "SAM-guided Unsupervised Domain Adaptation for 3D Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "OneFormer3D: One Transformer for Unified Point Cloud Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Segment Any 3D Gaussians", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "SANeRF-HQ: Segment Anything for NeRF in High Quality", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "SAM-guided Graph Cut for 3D Instance Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "SAI3D: Segment Any Instance in 3D Scenes", arXiv, 2023 ( ). [ ]
Paper			: "Rethinking Few-shot 3D Point Cloud Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception", CVPR, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Multi-Task:
Paper			: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 ( ). [ ][ ]
Paper			: "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 ( ). [ ]
Paper			: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 ( ). [ ]
Paper			: "DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction", AAAI, 2023 ( ). [ ][ ]
Paper			: "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 ( ). [ ][ ]
Paper			: "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token", ICCV, 2023 ( ). [ ][ ]
Paper			: "InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding", arXiv, 2023 ( ). [ ]
Paper			: "Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction", arXiv, 2023 ( ). [ ][ ]
Paper			: "Sub-token ViT Embedding via Stochastic Resonance Transformers", arXiv, 2023 ( ). [ ]
Paper			: "Multi-Task Dense Prediction via Mixture of Low-Rank Experts", CVPR, 2024 ( ). [ ]
Paper			: "ODIN: A Single Model for 2D and 3D Perception", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Forecasting:
Paper			: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / LiDAR:
Paper			: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 ( ). [ ][ ][ ]
Paper			: "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 ( ). [ ]
Paper			: "Lidar Panoptic Segmentation and Tracking without Bells and Whistles", IROS, 2023 ( ). [ ][ ]
Paper			: "4D-Former: Multimodal 4D Panoptic Segmentation", CoRL, 2023 ( ). [ ][ ]
Paper			: "MASK4D: Mask Transformer for 4D Panoptic Segmentation", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Co-Segmentation:
Paper			: "ReCo: Retrieve and Co-segment for Zero-shot Transfer", NeurIPS, 2022 ( ). [ ][ ][ ]
Paper			: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 ( ). [ ][ ][ ]
Paper			: "LCCo: Lending CLIP to Co-Segmentation", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Top-Down Semantic Segmentation:
Paper			: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Surface Normal:
Paper			: "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Applications:
Paper			: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Diffusion:
Paper			: "Unleashing Text-to-Image Diffusion Models for Visual Perception", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion", arXiv, 2023 ( ). [ ]
Paper			: "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter", arXiv, 2023 ( ). [ ]
Paper			: "From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models", arXiv, 2023 ( ). [ ]
Paper			: "A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Low-Level Structure Segmentation:
Paper			: "Explicit Visual Prompting for Low-Level Structure Segmentations", CVPR, 2023 ( ). [ ][ ]
Paper			: "Explicit Visual Prompting for Universal Foreground Segmentations", arXiv, 2023 ( ). [ ][ ]
Paper			: "EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Zero-Guidance Segmentation:
Paper			: "Zero-guidance Segmentation Using Zero Segment Labels", arXiv, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Part Segmentation:
Paper			: "Towards Open-World Segmentation of Parts", CVPR, 2023 ( ). [ ][ ]
Paper			: "PartDistillation: Learning Parts from Instance Segmentation", CVPR, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Entity Segmentation:
Paper			: "AIMS: All-Inclusive Multi-Level Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "SOHES: Self-supervised Open-world Hierarchical Entity Segmentation", ICLR, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Evaluation:
Paper			: "Robustness Analysis on Foundational Segmentation Models", arXiv, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Interactive Segmentation:
Paper			: "InterFormer: Real-time Interactive Image Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "SimpleClick: Interactive Image Segmentation with Simple Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper			: "Interactive Image Segmentation with Cross-Modality Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "MFP: Making Full Use of Probability Maps for Interactive Image Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper			: "GraCo: Granularity-Controllable Interactive Segmentation", CVPR, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Amodal Segmentation:
Paper			: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 ( ). [ ][ ]
Paper			: "Coarse-to-Fine Amodal Segmentation with Shape Prior", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Amodal Ground Truth and Completion in the Wild", arXiv, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Anomaly Segmentation:
Paper			: "Unmasking Anomalies in Road-Scene Segmentation", ICCV, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / In-Context Segmentation:
Paper			: "SegIC: Unleashing the Emergent Correspondence for In-Context Segmentation", arXiv, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / RGB mainly:
Paper			: "Video Action Transformer Network", CVPR, 2019 ( ). [ ][ ]
Paper			: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 ( ). [ ]
Paper			: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 ( ). [ ][ ]
Paper			: "Multiscale Vision Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 ( ). [ ][ ]
Paper			: "ViViT: A Video Vision Transformer", ICCV, 2021 ( ). [ ][ ]
Paper			: "Video Transformer Network", ICCVW, 2021 ( ). [ ][ ]
Paper			: "Token Shift Transformer for Video Classification", ACMMM, 2021 ( ). [ ][ ]
Paper			: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper			: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 ( ). [ ]
Paper			: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper			: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 ( ). [ ][ ]
Paper			: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 ( ). [ ]
Paper			: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 ( ). [ ]
Paper			: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 ( ). [ ]
Paper			: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 ( ). [ ][ ]
Paper			: "Video Swin Transformer", CVPR, 2022 ( ). [ ][ ]
Paper			: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 ( ). [ ][ ]
Paper			: "Deformable Video Transformer", CVPR, 2022 ( ). [ ]
Paper			: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 ( ). [ ]
Paper			: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 ( ). [ ][ ]
Paper			: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 ( ). [ ]
Paper			: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 ( ). [ ][ ]
Paper			: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 ( ). [ ][ ]
Paper			: "Multiview Transformers for Video Recognition", CVPR, 2022 ( ). [ ][ ]
Paper			: "Object-Region Video Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 ( ). [ ][ ]
Paper			: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 ( ). [ ][ ]
Paper			: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 ( ). [ ][ ]
Paper			: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 ( ). [ ][ ]
Paper			: "Turbo Training with Token Dropout", BMVC, 2022 ( ). [ ]
Paper			: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 ( ). [ ]
Paper			: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 ( ). [ ][ ]
Paper			: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 ( ). [ ]
Paper			: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 ( ). [ ]
Paper			: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 ( ). [ ]
Paper			: "Linear Video Transformer with Feature Fixation", arXiv, 2022 ( ). [ ]
Paper			: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 ( ). [ ]
Paper			: "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Dual-path Adaptation from Image to Video Transformers", CVPR, 2023 ( ). [ ][ ]
Paper			: "Streaming Video Model", CVPR, 2023 ( ). [ ][ ]
Paper			: "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", CVPR, 2023 ( ). [ ]
Paper			: "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", CVPR, 2023 ( ). [ ][ ]
Paper			: "How can objects help action recognition?", CVPR, 2023 ( ). [ ]
Paper			: "Simple MViT: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 ( ). [ ]
Paper			: "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 ( ). [ ][ ]
Paper			: "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "What Can Simple Arithmetic Operations Do for Temporal Modeling?", ICCV, 2023 ( ). [ ][ ]
Paper			: "Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation", ICCV, 2023 ( ). [ ]
Paper			: "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model", ICCV, 2023 ( ). [ ][ ]
Paper			: "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition", ICCV, 2023 ( ). [ ][ ]
Paper			: "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition", ICCV, 2023 ( ). [ ][ ]
Paper			: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", ICCV, 2023 ( ). [ ][ ]
Paper			: "CAST: Cross-Attention in Space and Time for Video Action Recognition", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Learning Human Action Recognition Representations Without Real Humans", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ]
Paper			: "SVT: Supertoken Video Transformer for Efficient Video Understanding", arXiv, 2023 ( ). [ ]
Paper			: "Prompt Learning for Action Recognition", arXiv, 2023 ( ). [ ]
Paper			: "Optimizing ViViT Training: Time and Memory Reduction for Action Recognition", arXiv, 2023 ( ). [ ]
Paper			: "Temporally-Adaptive Models for Efficient Video Understanding", arXiv, 2023 ( ). [ ][ ]
Paper			: "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video", arXiv, 2023 ( ). [ ]
Paper			: "Multi-entity Video Transformers for Fine-Grained Video Representation Learning", arXiv, 2023 ( ). [ ][ ]
Paper			: "GeoDeformer: Geometric Deformable Transformer for Action Recognition", arXiv, 2023 ( ). [ ]
Paper			: "Early Action Recognition with Action Prototypes", arXiv, 2023 ( ). [ ]
Paper			: "Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition", ICLR, 2024 ( ). [ ][ ]
Paper			: "Learning Correlation Structures for Vision Transformers", CVPR, 2024 ( ). [ ]
Paper			: "VideoMamba: State Space Model for Efficient Video Understanding", arXiv, 2024 ( ). [ ][ ]
Paper			: "Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Depth:
Paper			: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Pose/Skeleton:
Paper			: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 ( ). [ ]
Paper			: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 ( ). [ ][ ]
Paper			: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 ( ). [ ]
Paper			: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 ( ). [ ]
Paper			: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 ( ). [ ][ ]
Paper			: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 ( ). [ ]
Paper			: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 ( ). [ ]
Paper			: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 ( ). [ ][ ]
Paper			: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 ( ). [ ][ ]
Paper			: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 ( ). [ ]
Paper			: "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 ( ). [ ]
Paper			: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 ( ). [ ]
Paper			: "STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper			: "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training", ICCV, 2023 ( ). [ ][ ]
Paper			: "Masked Motion Predictors are Strong 3D Action Representation Learners", ICCV, 2023 ( ). [ ][ ]
Paper			: "LAC - Latent Action Composition for Skeleton-based Action Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "SkeleTR: Towards Skeleton-based Action Recognition in the Wild", ICCV, 2023 ( ). [ ]
Paper			: "Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning", ACMMM, 2023 ( ). [ ][ ]
Paper			: "Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper			: "On the Utility of 3D Hand Poses for Action Recognition", arXiv, 2024 ( ). [ ][ ][ ]
Paper			: "SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition", arXiv, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Multi-modal:
Paper			: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 ( ). [ ]
Paper			: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 ( ). [ ]
Paper			: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 ( ). [ ][ ]
Paper			: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 ( ). [ ]
Paper			: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 ( ). [ ]
Paper			: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 ( ). [ ][ ]
Paper			: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 ( ). [ ]
Paper			: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 ( ). [ ]
Paper			: "3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition", CVPR, 2023 ( ). [ ]
Paper			: "On Uni-Modal Feature Learning in Supervised Multi-Modal Learning", ICML, 2023 ( ). [ ]
Paper			: "Multimodal Distillation for Egocentric Action Recognition", ICCV, 2023 ( ). [ ]
Paper			: "MotionBERT: Unified Pretraining for Human Motion Analysis", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "TIM: A Time Interval Machine for Audio-Visual Action Recognition", CVPR, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Group Activity:
Paper			: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 ( ). [ ]
Paper			: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 ( ). [ ]
Paper			: "Learning Group Activity Features Through Person Attribute Prediction", CVPR, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Detection/Localization
Paper			: "OadTR: Online Action Detection with Transformers", ICCV, 2021 ( ). [ ][ ]
Paper			: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 ( ). [ ][ ]
Paper			: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 ( ). [ ][ ]
Paper			: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper			: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 ( ). [ ]
Paper			: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 ( ). [ ]
Paper			: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 ( ). [ ][ ]
Paper			: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 ( ). [ ][ ]
Paper			: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 ( ). [ ]
Paper			: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 ( ). [ ]
Paper			: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 ( ). [ ][ ]
Paper			: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 ( ). [ ][ ]
Paper			: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 ( ). [ ]
Paper			: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 ( ). [ ][ ]
Paper			: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 ( ). [ ]
Paper			: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 ( ). [ ]
Paper			: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 ( ). [ ][ ]
Paper			: "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 ( ). [ ][ ]
Paper			: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 ( ). [ ][ ]
Paper			: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 ( ). [ ]
Paper			: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 ( ). [ ]
Paper			: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 ( ). [ ]
Paper			: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 ( ). [ ]
Paper			: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 ( ). [ ][ ]
Paper			: "On the Benefits of 3D Pose and Tracking for Human Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper			: "Efficient Movie Scene Detection using State-Space Transformers", CVPR, 2023 ( ). [ ]
Paper			: "Token Turing Machines", CVPR, 2023 ( ). [ ][ ]
Paper			: "Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection", CVPR, 2023 ( ). [ ]
Paper			: "Self-Feedback DETR for Temporal Action Detection", ICCV, 2023 ( ). [ ]
Paper			: "UnLoc: A Unified Framework for Video Localization Tasks", ICCV, 2023 ( ). [ ][ ]
Paper			: "Efficient Video Action Detection with Token Dropout and Context Refinement", ICCV, 2023 ( ). [ ][ ]
Paper			: "MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction", ACL, 2023 ( ). [ ][ ]
Paper			: "End-to-End Spatio-Temporal Action Localisation with Video Transformers", arXiv, 2023 ( ). [ ]
Paper			: "DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion", arXiv, 2023 ( ). [ ][ ]
Paper			: "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection", arXiv, 2023 ( ). [ ]
Paper			: "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection", arXiv, 2023 ( ). [ ]
Paper			: "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos", arXiv, 2023 ( ). [ ]
Paper			: "Towards More Practical Group Activity Detection: A New Benchmark and Model", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization", arXiv, 2023 ( ). [ ]
Paper			: "A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection", TPAMI, 2024 ( ). [ ]
Paper			: "Open-Vocabulary Spatio-Temporal Action Detection", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Prediction/Anticipation
Paper			: "Anticipative Video Transformer", ICCV, 2021 ( ). [ ][ ][ ]
Paper			: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 ( ). [ ]
Paper			: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 ( ). [ ][ ]
Paper			: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 ( ). [ ]
Paper			: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 ( ). [ ]
Paper			: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 ( ). [ ][ ]
Paper			: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 ( ). [ ]
Paper			: "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 ( ). [ ]
Paper			: "Video Prediction by Efficient Transformers", IVC, 2022 ( ). [ ][ ]
Paper			: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 ( ). [ ][ ]
Paper			: "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 ( ). [ ]
Paper			: "Latency Matters: Real-Time Action Forecasting Transformer", CVPR, 2023 ( ). [ ]
Paper			: "AdamsFormer for Spatial Action Localization in the Future", CVPR, 2023 ( ). [ ]
Paper			: "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Memory-and-Anticipation Transformer for Online Action Understanding", ICCV, 2023 ( ). [ ][ ]
Paper			: "SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM", ICCV, 2023 ( ). [ ][ ]
Paper			: "Multiscale Video Pretraining for Long-Term Activity Forecasting", arXiv, 2023 ( ). [ ]
Paper			: "DiffAnt: Diffusion Models for Action Anticipation", arXiv, 2023 ( ). [ ]
Paper			: "LALM: Long-Term Action Anticipation with Language Models", arXiv, 2023 ( ). [ ]
Paper			: "Learning from One Continuous Video Stream", arXiv, 2023 ( ). [ ]
Paper			: "Object-centric Video Representation for Long-term Action Anticipation", WACV, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Video Object Segmentation
Paper			: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 ( ). [ ]
Paper			: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 ( ). [ ][ ]
Paper			: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 ( ). [ ][ ]
Paper			: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper			: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 ( ). [ ]
Paper			: "Differentiable Soft-Masked Attention", CVPRW, 2022 ( ). [ ]
Paper			: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 ( ). [ ]
Paper			: "Decoupling Features in Hierarchical Propagation for Video Object Segmentation", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "Boosting Video Object Segmentation via Space-time Correspondence Learning", CVPR, 2023 ( ). [ ]
Paper			: "Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Scalable Video Object Segmentation with Simplified Framework", ICCV, 2023 ( ). [ ]
Paper			: "Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "LVOS: A Benchmark for Long-term Video Object Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Putting the Object Back into Video Object Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "M³T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking", arXiv, 2023 ( ). [ ]
Paper			: "Appearance-based Refinement for Object-Centric Motion Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Depth-aware Test-Time Training for Zero-shot Video Object Segmentation", CVPR, 2024 ( ). [ ][ ][ ]
Paper			: "Event-assisted Low-Light Video Object Segmentation", CVPR, 2024 ( ). [ ]
Paper			: "Point-VOS: Pointing Up Video Object Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper			: "Efficient Video Object Segmentation via Modulated Cross-Attention Memory", arXiv, 2024 ( ). [ ][ ]
Paper			: "Spatial-Temporal Multi-level Association for Video Object Segmentation", arXiv, 2024 ( ). [ ]
Paper			: "Moving Object Segmentation: All You Need Is SAM (and Flow)", arXiv, 2024 ( ). [ ][ ]
Paper			: "LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation", arXiv, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Video Instance Segmentation
Paper			: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 ( ). [ ][ ]
Paper			: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper			: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 ( ). [ ][ ]
Paper			: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 ( ). [ ]
Paper			: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper			: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 ( ). [ ][ ]
Paper			: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 ( ). [ ]
Paper			: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper			: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 ( ). [ ][ ]
Paper			: "Mask-Free Video Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos", CVPR, 2023 ( ). [ ][ ]
Paper			: "A Generalized Framework for Video Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "CTVIS: Consistent Training for Online Video Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "TCOVIS: Temporally Consistent Online Video Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "DVIS: Decoupled Video Instance Segmentation Framework", ICCV, 2023 ( ). [ ][ ]
Paper			: "TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "BoxVIS: Video Instance Segmentation with Box Annotations", arXiv, 2023 ( ). [ ][ ]
Paper			: "Video Instance Segmentation in an Open-World", arXiv, 2023 ( ). [ ][ ]
Paper			: "GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "RefineVIS: Video Instance Segmentation with Temporal Attention Refinement", arXiv, 2023 ( ). [ ]
Paper			: "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper			: "NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement", arXiv, 2023 ( ). [ ][ ]
Paper			: "OW-VISCap: Open-World Video Instance Segmentation and Captioning", arXiv, 2024 ( ). [ ][ ]
Paper			: "What is Point Supervision Worth in Video Instance Segmentation?", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Action Segmentation:
Paper			: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 ( ). [ ][ ]
Paper			: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 ( ). [ ][ ]
Paper			: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 ( ). [ ][ ]
Paper			: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 ( ). [ ][ ]
Paper			: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 ( ). [ ]
Paper			: "Enhancing Transformer Backbone for Egocentric Video Action Segmentation", CVPRW, 2023 ( ). [ ][ ]
Paper			: "How Much Temporal Long-Term Context is Needed for Action Segmentation?", ICCV, 2023 ( ). [ ][ ]
Paper			: "Diffusion Action Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Temporal Segment Transformer for Action Segmentation", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video X Segmentation:
Paper			: "Video Semantic Segmentation via Sparse Temporal Transformer", ACMMM, 2021 ( ). [ ]
Paper			: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 ( ). [ ]
Paper			: "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper			: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper			: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper			: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 ( ). [ ]
Paper			: "Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "TarViS: A Unified Approach for Target-based Video Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper			: "MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation", ICCV, 2023 ( ). [ ]
Paper			: "Tracking Anything with Decoupled Video Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper			: "Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation", BMVC, 2023 ( ). [ ][ ]
Paper			: "Mask Propagation for Efficient Video Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Segment Anything Meets Point Tracking", arXiv, 2023 ( ). [ ][ ]
Paper			: "Test-Time Training on Video Streams", arXiv, 2023 ( ). [ ][ ]
Paper			: "UniVS: Unified and Universal Video Segmentation with Prompts as Queries", CVPR, 2024 ( ). [ ][ ][ ]
Paper			: "DVIS++: Improved Decoupled Framework for Universal Video Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper			: "SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising", arXiv, 2024 ( ). [ ][ ]
Paper			: "OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Object Detection:
Paper			: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 ( ). [ ][ ]
Paper			: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 ( ). [ ]
Paper			: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 ( ). [ ]
Paper			: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 ( ). [ ]
Paper			: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper			: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 ( ). [ ]
Paper			: "Identity-Consistent Aggregation for Video Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper			: "Unsupervised Open-Vocabulary Object Localization in Videos", ICCV, 2023 ( ). [ ]
Paper			: "Context Enhanced Transformer for Single Image Object Detection", AAAI, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Dense Video Tasks (Detection + Segmentation):
Paper			: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 ( ). [ ][ ]
Paper			: "Feature Aggregated Queries for Transformer-Based Video Object Detectors", CVPR, 2023 ( ). [ ][ ]
Paper			: "Video OWL-ViT: Temporally-consistent open-world localization in video", ICCV, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Retrieval:
Paper			: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Hashing:
Paper			: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video-Language:
Paper			: "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 ( ). [ ][ ]
Paper			: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 ( ). [ ][ ][ ]
Paper			: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 ( ). [ ][ ]
Paper			: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 ( ). [ ][ ]
Paper			: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 ( ). [ ][ ]
Paper			: "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 ( ). [ ]
Paper			: "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 ( ). [ ]
Paper			: "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 ( ). [ ][ ][ ]
Paper			: "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 ( ). [ ]
Paper			: "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 ( ). [ ][ ]
Paper			: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 ( ). [ ][ ]
Paper			: "Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection", WACV, 2023 ( ). [ ]
Paper			: "Revisiting Classifier: Transferring Vision-Language Models for Video Recognition", AAAI, 2023 ( ). [ ][ ]
Paper			: "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 ( ). [ ][ ][ ]
Paper			: "Fine-tuned CLIP Models are Efficient Video Learners", CVPR, 2023 ( ). [ ][ ]
Paper			: "Learning Video Representations from Large Language Models", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Text-Visual Prompting for Efficient 2D Temporal Video Grounding", CVPR, 2023 ( ). [ ]
Paper			: "Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting", CVPR, 2023 ( ). [ ][ ]
Paper			: "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring", CVPR, 2023 ( ). [ ][ ]
Paper			: "Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization", CVPR, 2023 ( ). [ ]
Paper			: "Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models", CVPR, 2023 ( ). [ ][ ]
Paper			: "HierVL: Learning Hierarchical Video-Language Embeddings", CVPR, 2023 ( ). [ ][ ]
Paper			: "Test of Time: Instilling Video-Language Models with a Sense of Time", CVPR, 2023 ( ). [ ][ ][ ]
Paper			: "Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization", ICML, 2023 ( ). [ ][ ]
Paper			: "Implicit Temporal Modeling with Learnable Alignment for Video Recognition", ICCV, 2023 ( ). [ ][ ]
Paper			: "Towards Open-Vocabulary Video Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper			: "Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning", ICCV, 2023 ( ). [ ][ ]
Paper			: "Generative Action Description Prompts for Skeleton-based Action Recognition", ICCV, 2023 ( ). [ ][ ]
Paper			: "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", ICCV, 2023 ( ). [ ][ ]
Paper			: "Language as the Medium: Multimodal Video Classification through text only", ICCVW, 2023 ( ). [ ]
Paper			: "Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning", ACMMM, 2023 ( ). [ ]
Paper			: "Orthogonal Temporal Interpolation for Zero-Shot Video Recognition", ACMMM, 2023 ( ). [ ][ ]
Paper			: "Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Opening the Vocabulary of Egocentric Actions", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "CLIP-guided Prototype Modulating for Few-shot Action Recognition", arXiv, 2023 ( ). [ ][ ]
Paper			: "Multi-modal Prompting for Low-Shot Temporal Action Localization", arXiv, 2023 ( ). [ ]
Paper			: "VicTR: Video-conditioned Text Representations for Activity Recognition", arXiv, 2023 ( ). [ ]
Paper			: "OpenVIS: Open-vocabulary Video Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning", arXiv, 2023 ( ). [ ]
Paper			: "Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features", arXiv, 2023 ( ). [ ]
Paper			: "MSQNet: Actor-agnostic Action Recognition with Multi-modal Query", arXiv, 2023 ( ). [ ][ ]
Paper			: "Training a Large Video Model on a Single Machine in a Day", arXiv, 2023 ( ). [ ][ ]
Paper			: "Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data", arXiv, 2023 ( ). [ ][ ]
Paper			: "Videoprompter: an ensemble of foundational models for zero-shot video understanding", arXiv, 2023 ( ). [ ]
Paper			: "MM-VID: Advancing Video Understanding with GPT-4V(vision)", arXiv, 2023 ( ). [ ][ ]
Paper			: "Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding", arXiv, 2023 ( ). [ ]
Paper			: "Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning", arXiv, 2023 ( ). [ ][ ]
Paper			: "Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition", arXiv, 2023 ( ). [ ]
Paper			: "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning", arXiv, 2023 ( ). [ ][ ]
Paper			: "Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains", arXiv, 2023 ( ). [ ][ ]
Paper			: "OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition", arXiv, 2023 ( ). [ ][ ][ ]
Paper			: "Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition", arXiv, 2023 ( ). [ ]
Paper			: "EZ-CLIP: Efficient Zeroshot Video Action Recognition", arXiv, 2023 ( ). [ ][ ]
Paper			: "M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition", AAAI, 2024 ( ). [ ]
Paper			: "FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition", ICLR, 2024 ( ). [ ][ ][ ]
Paper			: "Language Model Guided Interpretable Video Action Reasoning", CVPR, 2024 ( ). [ ][ ]
Paper			: "Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper			: "ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition", arXiv, 2024 ( ). [ ]
Paper			: "Zero Shot Open-ended Video Inference", arXiv, 2024 ( ). [ ]
Paper			: "Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition", arXiv, 2024 ( ). [ ][ ]
Paper			: "CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / X-supervised Learning:
Paper			: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 ( ). [ ]
Paper			: "Self-supervised Video Transformer", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 ( ). [ ]
Paper			: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 ( ). [ ][ ]
Paper			: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 ( ). [ ]
Paper			: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "Masked Autoencoders As Spatiotemporal Learners", NeurIPS, 2022 ( ). [ ][ ]
Paper			: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 ( ). [ ]
Paper			: "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 ( ). [ ][ ][ ]
Paper			: "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos", CVPR, 2023 ( ). [ ][ ]
Paper			: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking", CVPR, 2023 ( ). [ ][ ]
Paper			: "SVFormer: Semi-supervised Video Transformer for Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper			: "OmniMAE: Single Model Masked Pretraining on Images and Videos", CVPR, 2023 ( ). [ ][ ]
Paper			: "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", CVPR, 2023 ( ). [ ][ ]
Paper			: "Masked Motion Encoding for Self-Supervised Video Representation Learning", CVPR, 2023 ( ). [ ][ ]
Paper			: "MGMAE: Motion Guided Masking for Video Masked Autoencoding", ICCV, 2023 ( ). [ ]
Paper			: "Motion-Guided Masking for Spatiotemporal Representation Learning", ICCV, 2023 ( ). [ ]
Paper			: "Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations", ICCV, 2023 ( ). [ ][ ]
Paper			: "Language-based Action Concept Spaces Improve Video Self-Supervised Learning", NeurIPS, 2023 ( ). [ ]
Paper			: "Self-supervised video pretraining yields human-aligned visual representations", NeurIPS, 2023 ( ). [ ]
Paper			: "Siamese Masked Autoencoders", NeurIPS, 2023 ( ). [ ][ ]
Paper			: "Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders", arXiv, 2023 ( ). [ ]
Paper			: "Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video", arXiv, 2023 ( ). [ ]
Paper			: "Asymmetric Masked Distillation for Pre-Training Small Foundation Models", arXiv, 2023 ( ). [ ]
Paper			: "Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation", arXiv, 2023 ( ). [ ]
Paper			: "No More Shortcuts: Realizing the Potential of Temporal Self-Supervision", AAAI, 2024 ( ). [ ][ ]
Paper			: "VideoMAC: Video Masked Autoencoders Meet ConvNets", CVPR, 2024 ( ). [ ]
Paper			: "Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention", arXiv, 2024 ( ). [ ]
Paper			: "MV2MAE: Multi-View Video Masked Autoencoders", arXiv, 2024 ( ). [ ][ ]
Paper			: "Revisiting Feature Prediction for Learning Visual Representations from Video", arXiv, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Transfer Learning/Adaptation:
Paper			: "Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling", FG, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / X-shot:
Paper			: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 ( ). [ ]
Paper			: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 ( ). [ ]
Paper			: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 ( ). [ ]
Paper			: "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper			: "Multimodal Adaptation of CLIP for Few-Shot Action Recognition", arXiv, 2023 ( ). [ ]
Paper			: "On the Importance of Spatial Relations for Few-shot Action Recognition", arXiv, 2023 ( ). [ ]
Paper			: "Few-shot Action Recognition with Captioning Foundation Models", arXiv, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Multi-Task:
Paper			: "A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives", CVPR, 2024 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Anomaly Detection:
Paper			: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 ( ). [ ]
Paper			: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", International Conference on Neural Information Processing (ICONIP), 2022 ( ). [ ]
Paper			: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 ( ). [ ][ ]
Paper			: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 ( ). [ ]
Paper			: "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", ICIP, 2023 ( ). [ ]
Paper			: "Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features", CVPR, 2023 ( ). [ ]
Paper			: "Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection", CVPR, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Relation Detection:
Paper			: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 ( ). [ ][ ]
Paper			: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 ( ). [ ][ ]
Paper			: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 ( ). [ ][ ]
Paper			: "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Saliency Prediction:
Paper			: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 ( ). [ ]
Paper			: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 ( ). [ ][ ]
Paper			: "Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper			: "CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective", CVPR, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Inpainting Detection:
Paper			: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Driver Activity:
Paper			: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 ( ). [ ]
Paper			: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 ( ). [ ]
Paper			: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Alignment:
Paper			: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Sport-related:
Paper			: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Action Counting:
Paper			: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 ( ). [ ][ ][ ]
Paper			: "PoseRAC: Pose Saliency Transformer for Repetitive Action Counting", arXiv, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Action Quality Assessment:
Paper			: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 ( ). [ ]
Paper			: "Action Quality Assessment using Transformers", arXiv, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Human Interaction:
Paper			: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Cross-Domain:
Paper			: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 ( ). [ ][ ]
Paper			: "AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation", CVPR, 2023 ( ). [ ][ ]
Paper			: "The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation", ICCV, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Multi-Camera Editing:
Paper			: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Instructional/Procedural Video:
Paper			: "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations", CVPR, 2023 ( ). [ ]
Paper			: "Procedure-Aware Pretraining for Instructional Video Understanding", CVPR, 2023 ( ). [ ][ ]
Paper			: "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos", CVPR, 2023 ( ). [ ]
Paper			: "Event-Guided Procedure Planning from Instructional Videos with Text Supervision", ICCV, 2023 ( ). [ ]
Paper			: "Pretrained Language Models as Visual Planners for Human Assistance", ICCV, 2023 ( ). [ ]
Paper			: "Learning to Ground Instructional Articles in Videos through Narrations", ICCV, 2023 ( ). [ ][ ]
Paper			: "PREGO: online mistake detection in PRocedural EGOcentric videos", CVPR, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Continual Learning:
Paper			: "PIVOT: Prompting for Video Continual Learning", CVPR, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / 3D:
Paper			: "Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos", ICCV, 2023 ( ). [ ][ ]
Paper			: "EPIC Fields: Marrying 3D Geometry and Video Understanding", NeurIPS, 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Audio-Video:
Paper			: "Audio-Visual Glance Network for Efficient Video Recognition", ICCV, 2023 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Event Camera:
Paper			: "EventTransAct: A video transformer-based framework for Event-camera based action recognition", IROS, 2023 ( ). [ ][ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Long Video:
Paper			: "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper			: "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding", arXiv, 2023 ( ). [ ]
Paper			: "Text-Conditioned Resampler For Long Form Video Understanding", arXiv, 2023 ( ). [ ]
Paper			: "Memory Consolidation Enables Long-Context Video Understanding", arXiv, 2024 ( ). [ ]
Paper			: "VideoAgent: Long-form Video Understanding with Large Language Model as Agent", arXiv, 2024 ( ). [ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Story:
Paper			: "Video Timeline Modeling For News Story Understanding", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Analysis:
Paper			: "Understanding Video Transformers via Universal Concept Discovery", arXiv, 2024 ( ). [ ][ ]
Ultimate-Awesome-Transformer-Attention / References / Online Resources:
Papers with Code
Transformer tutorial (Lucas Beyer)
CS25: Transformers United (Course @ Stanford)
The Annotated Transformer (Blog)
3D Vision with Transformers (GitHub)	409	over 1 year ago
Networks Beyond Attention (GitHub)	77	about 3 years ago
Practical Introduction to Transformers (GitHub)	214	over 2 years ago
Awesome Transformer Architecture Search (GitHub)	262	over 2 years ago
Transformer-in-Vision (GitHub)	1,324	over 2 years ago
Awesome Visual-Transformer (GitHub)	3,406	almost 3 years ago
Awesome Transformer for Vision Resources List (GitHub)	280	almost 5 years ago
Transformer-in-Computer-Vision (GitHub)	1,156	about 1 year ago
Transformer Tutorial in ICASSP 2022
