Awesome-Transformer-Attention

A comprehensive collection of papers, code, and related resources for understanding vision Transformers and attention mechanisms in computer vision and deep learning.

GitHub: 5k stars · 129 watching · 489 forks · last commit: 4 months ago · linked from 2 awesome lists

Topics: attention-mechanism, attention-mechanisms, awesome-list, computer-vision, deep-learning, detr, papers, self-attention, transformer, transformer-architecture, transformer-awesome, transformer-cv, transformer-models, transformer-with-cv, transformers, vision-transformer, visual-transformer, vit

Ultimate-Awesome-Transformer-Attention / Overview

Multi-Modality
  Visual Captioning
  Visual Question Answering
  Visual Grounding
  Multi-Modal Representation Learning
  Multi-Modal Retrieval
  Multi-Modal Generation
  Prompt Learning/Tuning
  Visual Document Understanding
  Other Multi-Modal Tasks
Other High-level Vision Tasks
  Point Cloud / 3D
  Pose Estimation
  Tracking
  Re-ID
  Face
  Scene Graph
  Neural Architecture Search
Transfer / X-Supervised / X-Shot / Continual Learning
Low-level Vision Tasks
  Image Restoration
  Video Restoration
  Inpainting / Completion / Outpainting
  Image Generation
  Video Generation
  Transfer / Translation / Manipulation
  Other Low-Level Tasks
Reinforcement Learning
  Navigation
  Other RL Tasks
Medical
  Medical Segmentation
  Medical Classification
  Medical Detection
  Medical Reconstruction
  Medical Low-Level Vision
  Medical Vision-Language
  Medical Others
Other Tasks
Attention Mechanisms in Vision/NLP
  Attention for Vision
  NLP
  Both
  Others

Ultimate-Awesome-Transformer-Attention / Survey

"A Survey on Multimodal Large Language Models for Autonomous Driving", WACVW, 2024
"Efficient Multimodal Large Language Models: A Survey", arXiv, 2024
"From Sora What We Can See: A Survey of Text-to-Video Generation", arXiv, 2024
"When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models", arXiv, 2024
"Foundation Models for Video Understanding: A Survey", arXiv, 2024
"Vision Mamba: A Comprehensive Survey and Taxonomy", arXiv, 2024
"Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, 2024
"Video Diffusion Models: A Survey", arXiv, 2024
"Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras", arXiv, 2024
"Hallucination of Multimodal Large Language Models: A Survey", arXiv, 2024
"A Survey on Vision Mamba: Models, Applications and Challenges", arXiv, 2024
"State Space Model for New-Generation Network Alternative to Transformers: A Survey", arXiv, 2024
"Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions", arXiv, 2024
"From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models", arXiv, 2024
"Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey", arXiv, 2024
"Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation", arXiv, 2024
"Controllable Generation with Text-to-Image Diffusion Models: A Survey", arXiv, 2024
"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models", arXiv, 2024
"Large Multimodal Agents: A Survey", arXiv, 2024
"Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey", arXiv, 2024
"Vision-Language Navigation with Embodied Intelligence: A Survey", arXiv, 2024
"The (R)Evolution of Multimodal Large Language Models: A Survey", arXiv, 2024
"Masked Modeling for Self-supervised Representation Learning on Vision and Beyond", arXiv, 2024
"Transformer for Object Re-Identification: A Survey", arXiv, 2024
"Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities", arXiv, 2024
"MM-LLMs: Recent Advances in MultiModal Large Language Models", arXiv, 2024
"From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities", arXiv, 2024
"A Survey on Hallucination in Large Vision-Language Models", arXiv, 2024
"A Survey for Foundation Models in Autonomous Driving", arXiv, 2024
"A Survey on Transformer Compression", arXiv, 2024
"Vision + Language Applications: A Survey", CVPRW, 2023
"Multimodal Learning With Transformers: A Survey", TPAMI, 2023
"A Survey of Visual Transformers", TNNLS, 2023
"Video Understanding with Large Language Models: A Survey", arXiv, 2023
"Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey", arXiv, 2023
"A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook", arXiv, 2023
"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", arXiv, 2023
"Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey", arXiv, 2023
"Explainability of Vision Transformers: A Comprehensive Review and New Perspectives", arXiv, 2023
"Vision-Language Instruction Tuning: A Review and Analysis", arXiv, 2023
"Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability", arXiv, 2023
"Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey", arXiv, 2023
"A Survey on Video Diffusion Models", arXiv, 2023
"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv, 2023
"Multimodal Foundation Models: From Specialists to General-Purpose Assistants", arXiv, 2023
"Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art", arXiv, 2023
"RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023
"A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023
"From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023
"Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023
"A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023
"Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023
"A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023
"Transformers in Reinforcement Learning: A Survey", arXiv, 2023
"Vision Language Transformers: A Survey", arXiv, 2023
"Towards Open Vocabulary Learning: A Survey", arXiv, 2023
"Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023
"A Survey on Multimodal Large Language Models", arXiv, 2023
"2D Object Detection with Transformers: A Review", arXiv, 2023
"Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023
"Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023
"Visual Tuning", arXiv, 2023
"Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023
"Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023
"A Review of Deep Learning for Video Captioning", arXiv, 2023
"Transformer-Based Visual Segmentation: A Survey", arXiv, 2023
"Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023
"Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023
"Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023
"Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023
"Efficiency 360: Efficient Vision Transformers", arXiv, 2023
"Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023
"Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023
"A Survey on Visual Transformer", TPAMI, 2022
"Attention mechanisms in computer vision: A survey", Computational Visual Media, 2022
"A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022
"Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022
"Vision Transformers in Medical Imaging: A Review", arXiv, 2022
"A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022
"Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022
"Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022
"Vision Transformers for Action Recognition: A Survey", arXiv, 2022
"VLP: A Survey on Vision-Language Pre-training", arXiv, 2022
"Transformers in Remote Sensing: A Survey", arXiv, 2022
"Medical image analysis based on transformer: A Review", arXiv, 2022
"3D Vision with Transformers: A Survey", arXiv, 2022
"Vision Transformers: State of the Art and Research Challenges", arXiv, 2022
"Transformers in Medical Imaging: A Survey", arXiv, 2022
"Multimodal Learning with Transformers: A Survey", arXiv, 2022
"Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022
"Transformers in 3D Point Clouds: A Survey", arXiv, 2022
"A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022
"Efficient Transformers: A Survey", arXiv, 2022
"Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022
"Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022
"Video Transformers: A Survey", arXiv, 2022
"Transformers in Medical Image Analysis: A Review", arXiv, 2022
"Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022
"Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022
"Image Captioning In the Transformer Age", arXiv, 2022
"Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022
"Transformers in Vision: A Survey", ACM Computing Surveys, 2021
"Survey: Transformer based Video-Language Pre-training", arXiv, 2021
"A Survey of Transformers", arXiv, 2021
"Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021

Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Replace Conv w/ Attention

"Local Relation Networks for Image Recognition", ICCV, 2019
"Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019
"Axial Attention in Multidimensional Transformers", arXiv, 2019
"Exploring Self-attention for Image Recognition", CVPR, 2020
"Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020
"Global Self-Attention Networks for Image Recognition", arXiv, 2020
"Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021
"Contextual Transformer Networks for Visual Recognition", CVPRW, 2021
"Vision Transformers with Hierarchical Attention", arXiv, 2022
"Attention Augmented Convolutional Networks", ICCV, 2019
"Global Context Networks", ICCVW, 2019 (& TPAMI 2020)
"LambdaNetworks: Modeling long-range Interactions without Attention", ICLR, 2021
"Bottleneck Transformers for Visual Recognition", CVPR, 2021
"Gaussian Context Transformer", CVPR, 2021
"CoAtNet: Marrying Convolution and Attention for All Data Sizes", NeurIPS, 2021
"On the Integration of Self-Attention and Convolution", CVPR, 2022
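The common idea in the papers above is to swap a convolution for self-attention computed over each pixel's local neighborhood, as in "Stand-Alone Self-Attention in Vision Models". A minimal NumPy sketch of that idea, not code from any listed paper; the single head, square window, zero padding, and identity-shaped projections are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(x, wq, wk, wv, window=3):
    """Single-head local self-attention over a 2D feature map.

    x: (H, W, C) feature map; wq/wk/wv: (C, C) projection matrices.
    Each position attends only to its (window x window) neighborhood,
    mirroring stand-alone attention blocks that replace 3x3 convolutions.
    """
    H, W, C = x.shape
    pad = window // 2
    q = x @ wq
    # zero-pad before projecting keys/values so border windows stay full-size
    k = np.pad(x, ((pad, pad), (pad, pad), (0, 0))) @ wk
    v = np.pad(x, ((pad, pad), (pad, pad), (0, 0))) @ wv
    out = np.empty_like(q)
    for i in range(H):
        for j in range(W):
            # keys/values from the local neighborhood centered at (i, j)
            kn = k[i:i + window, j:j + window].reshape(-1, C)
            vn = v[i:i + window, j:j + window].reshape(-1, C)
            attn = softmax(kn @ q[i, j] / np.sqrt(C))
            out[i, j] = attn @ vn
    return out
```

Like a convolution, the receptive field is a k×k window, but the aggregation weights are content-dependent (a softmax over query-key similarities) rather than fixed learned kernels.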

Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Vision Transformer

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021
"Perceiver: General Perception with Iterative Attention", ICML, 2021
"Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021
"Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021
"Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021
"Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021
"Going deeper with Image Transformers", ICCV, 2021
"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021
"Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021
"Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021
"DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021
"Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021
"XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021
"Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021
"Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021
"Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021
"Augmented Shortcuts for Vision Transformers", NeurIPS, 2021
"Transformer in Transformer", NeurIPS, 2021
"ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021
"DeepViT: Towards Deeper Vision Transformer", arXiv, 2021
"So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021
"All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021
"Aggregating Nested Transformers", arXiv, 2021
"KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021
"Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021
"Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021
"CAT: Cross Attention in Vision Transformer", arXiv, 2021
"Scaling Vision with Sparse Mixture of Experts", arXiv, 2021
"P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021
"PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021
"Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021
"Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021
"Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022
"Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022
"Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022
"RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022
"CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022
"Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022
"Scaling Vision Transformers", CVPR, 2022
"CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022
"MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022
"The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022
"Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022
"MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022
"Vision Transformer with Deformable Attention", CVPR, 2022
"Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022
"MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022
"NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022
"Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022
"PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022
"X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022
"ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022
"Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022
"Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022
"DaViT: Dual Attention Vision Transformers", ECCV, 2022
"ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022
"MaxViT: Multi-Axis Vision Transformer", ECCV, 2022
"VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022
"Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022
"Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022
"Peripheral Vision Transformer", NeurIPS, 2022
"Fast Vision Transformers with HiLo Attention", NeurIPS, 2022
"BViT: Broad Attention based Vision Transformer", arXiv, 2022
"O-ViT: Orthogonal Vision Transformer", arXiv, 2022
"Aggregating Global Features into Local Vision Transformer", arXiv, 2022
"BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022
"ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022
"Hierarchical Perceiver", arXiv, 2022
"Learning to Merge Tokens in Vision Transformers", arXiv, 2022
"Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022
"Neighborhood Attention Transformer", arXiv, 2022
"Adaptive Split-Fusion Transformer", arXiv, 2022
"SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022
"EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022
"Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022
"Dual Vision Transformer", arXiv, 2022
"Multi-manifold Attention for Vision Transformers", arXiv, 2022
"MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022
"Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022
"Grafting Vision Transformers", arXiv, 2022
"Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022
"The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022
"Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022
"INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022
"Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022
"GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023
"Conditional Positional Encodings for Vision Transformers", ICLR, 2023
"LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023
"BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023
"Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023
"Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023
"ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023
"Vision Transformer with Super Token Sampling", CVPR, 2023
"PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023
"Global Context Vision Transformers", ICML, 2023
"MAGNETO: A Foundation Transformer", ICML, 2023
"Fcaformer: Forward Cross Attention in Hybrid Vision Transformer", ICCV, 2023
"Scale-Aware Modulation Meet Transformer", ICCV, 2023
"FLatten Transformer: Vision Transformer using Focused Linear Attention", ICCV, 2023
"Revisiting Vision Transformer from the View of Path Ensemble", ICCV, 2023
"SG-Former: Self-guided Transformer with Evolving Token Reallocation", ICCV, 2023
"Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?", ICCV, 2023
"LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization", ICCV, 2023
"Scratching Visual Transformer's Back with Uniform Attention", ICCV, 2023
"Fully Attentional Networks with Self-emerging Token Labeling", ICCV, 2023
"ClusterFormer: Clustering As A Universal Visual Learner", NeurIPS, 2023
"Scattering Vision Transformer: Spectral Mixing Matters", NeurIPS, 2023
"CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023
"Vision Transformer with Quadrangle Attention", arXiv, 2023
"ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023
"SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023
"UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023
"CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023
"Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023
"2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023
"Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", NeurIPS, 2023
"DAT++: Spatially Dynamic Vision Transformer with Deformable Attention", arXiv, 2023
"Replacing softmax with ReLU in Vision Transformers", arXiv, 2023
"RMT: Retentive Networks Meet Vision Transformers", arXiv, 2023
"Vision Transformers Need Registers", arXiv, 2023
"Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words", arXiv, 2023
"EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention", arXiv, 2023
"ViR: Vision Retention Networks", arXiv, 2023
"Window Attention is Bugged: How not to Interpolate Position Embeddings", arXiv, 2023
"FMViT: A multiple-frequency mixing Vision Transformer", arXiv, 2023
"Advancing Vision Transformers with Group-Mix Attention", arXiv, 2023
"Perceptual Group Tokenizer: Building Perception with Iterative Grouping", arXiv, 2023
"SCHEME: Scalable Channel Mixer for Vision Transformers", arXiv, 2023
"Agent Attention: On the Integration of Softmax and Linear Attention", arXiv, 2023
"ViTamin: Designing Scalable Vision Models in the Vision-Language Era", CVPR, 2024
"HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs", TPAMI, 2024
"SPFormer: Enhancing Vision Transformer with Superpixel Representation", arXiv, 2024
"A Manifold Representation of the Key in Vision Transformers", arXiv, 2024
"Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers", arXiv, 2024
"VisionLLaMA: A Unified LLaMA Interface for Vision Tasks", arXiv, 2024
"xT: Nested Tokenization for Larger Context in Large Images", arXiv, 2024
"ACC-ViT: Atrous Convolution's Comeback in Vision Transformers", arXiv, 2024
"ViTAR: Vision Transformer with Any Resolution", arXiv, 2024
"Adapting LLaMA Decoder to Vision Transformer", arXiv, 2024
"Training data-efficient image transformers & distillation through attention", ICML, 2021
"ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021
"Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021
"Vision Transformer with Progressive Sampling", ICCV, 2021
"Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021
"CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021
"Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021
"Visformer: The Vision-friendly Transformer", ICCV, 2021
"Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021
"Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021
"Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021
"Glance-and-Gaze Vision Transformer", NeurIPS, 2021
"DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021
"ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021
"Adder Attention for Vision Transformer", NeurIPS, 2021
"SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021
"IA-RED²: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021
"LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021
"Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021
"Vision Transformers with Patch Diversification", arXiv, 2021
"Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021
"Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021
"Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021
"Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021
"Go Wider Instead of Deeper", arXiv, 2021
"Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021
"Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021
"DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021
"UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021
"Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022
"Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022
"When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022
"Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022
"QuadTree Attention for Vision Transformers", ICLR, 2022
"Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022
"Learned Queries for Efficient Local Attention", CVPR, 2022
"Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022
"A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022
"Patch Slimming for Efficient Vision Transformers", CVPR, 2022
"Reversible Vision Transformers", CVPR, 2022
"AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022
"Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022
"Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022
"EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022
"Sliced Recursive Transformer", ECCV, 2022
"Self-slimmed Vision Transformer", ECCV, 2022
"Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022
"M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022
"ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022
"Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022
"EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022
"GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022
"Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022
"TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022
"Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022
"ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022
"Coarse-to-Fine Vision Transformer", arXiv, 2022
"EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022
"SepViT: Separable Vision Transformer", arXiv, 2022
"TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022
"Super Vision Transformer", arXiv, 2022
"Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022
"SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022
"EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022
"Vicinity Vision Transformer", arXiv, 2022
"Softmax-free Linear Transformers", arXiv, 2022
"MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022
"LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022
"Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022
"Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022
"PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022
"ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022
Paper : "Dilated Neighborhood Attention Transformer", arXiv, 2022 ( ). [ ][ ]
Paper : "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 ( ). [ ][ ]
Paper : "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 ( ). [ ]
Paper : "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 ( ). [ ]
Paper : "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 ( ). [ ][ ]
Paper : "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 ( ). [ ]
Paper : "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 ( ). [ ]
Paper : "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 ( ). [ ]
Paper : "Token Merging: Your ViT But Faster", ICLR, 2023 ( ). [ ][ ]
Paper : "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 ( ). [ ][ ]
Paper : "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 ( ). [ ][ ]
Paper : "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 ( ). [ ][ ]
Paper : "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 ( ). [ ][ ]
Paper : "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 ( ). [ ][ ]
Paper : "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 ( ). [ ]
Paper : "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 ( ). [ ]
Paper : "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 ( ). [ ]
Paper : "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 ( ). [ ]
Paper : "Learned Thresholds Token Merging and Pruning for Vision Transformers", ICMLW, 2023 ( ). [ ][ ][ ]
Paper : "Make A Long Image Short: Adaptive Token Length for Vision Transformers", ECML-PKDD, 2023 ( ). [ ]
Paper : "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", ICCV, 2023 ( ). [ ][ ]
Paper : "MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention", ICCV, 2023 ( ). [ ][ ]
Paper : "Masked Spiking Transformer", ICCV, 2023 ( ). [ ]
Paper : "Rethinking Vision Transformers for MobileNet Size and Speed", ICCV, 2023 ( ). [ ][ ]
Paper : "DiffRate: Differentiable Compression Rate for Efficient Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", ICCV, 2023 ( ). [ ]
Paper : "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", ICCV, 2023 ( ). [ ][ ]
Paper : "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", ICCV, 2023 ( ). [ ][ ]
Paper : "Which Tokens to Use? Investigating Token Reduction in Vision Transformers", ICCVW, 2023 ( ). [ ][ ][ ]
Paper : "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer", ACMMM, 2023 ( ). [ ]
Paper : "Efficient Low-rank Backpropagation for Vision Transformer Adaptation", NeurIPS, 2023 ( ). [ ]
Paper : "Lightweight Vision Transformer with Bidirectional Interaction", NeurIPS, 2023 ( ). [ ][ ]
Paper : "MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", NeurIPS, 2023 ( ). [ ]
Paper : "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 ( ). [ ]
Paper : "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 ( ). [ ][ ]
Paper : "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 ( ). [ ][ ]
Paper : "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 ( ). [ ][ ]
Paper : "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 ( ). [ ]
Paper : "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 ( ). [ ]
Paper : "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 ( ). [ ]
Paper : "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 ( ). [ ]
Paper : "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 ( ). [ ]
Paper : "MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers", arXiv, 2023 ( ). [ ]
Paper : "DiT: Efficient Vision Transformers with Dynamic Token Routing", arXiv, 2023 ( ). [ ][ ]
Paper : "Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts", arXiv, 2023 ( ). [ ]
Paper : "PPT: Token Pruning and Pooling for Efficient Vision Transformers", arXiv, 2023 ( ). [ ]
Paper : "MatFormer: Nested Transformer for Elastic Inference", arXiv, 2023 ( ). [ ]
Paper : "Bootstrapping SparseFormers from Vision Foundation Models", arXiv, 2023 ( ). [ ][ ]
Paper : "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation", WACV, 2024 ( ). [ ][ ]
Paper : "Token Fusion: Bridging the Gap between Token Pruning and Token Merging", WACV, 2024 ( ). [ ]
Paper : "Cached Transformers: Improving Transformers with Differentiable Memory Cache", AAAI, 2024 ( ). [ ]
Paper : "LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition", AAAI, 2024 ( ). [ ][ ]
Paper : "Efficient Modulation for Vision Networks", ICLR, 2024 ( ). [ ][ ]
Paper : "MLP Can Be A Good Transformer Learner", CVPR, 2024 ( ). [ ][ ]
Paper : "SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization", ICML, 2024 ( ). [ ][ ]
Paper : "When Do We Not Need Larger Vision Models?", arXiv, 2024 ( ). [ ][ ]
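Many of the efficiency papers above share one primitive: score the patch tokens and keep only an informative subset (token pruning/merging, as in EViT, A-ViT, Token Merging, or PPT). A minimal NumPy sketch of score-based token pruning — the shapes and the CLS-attention scores are illustrative assumptions, not any specific paper's method:

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of patch tokens (CLS token handled separately in real models)."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(scores)[::-1][:k]  # indices of the k highest-scoring tokens
    keep = np.sort(keep)                 # preserve original spatial order
    return tokens[keep]

tokens = np.random.randn(196, 64)        # 14x14 patch tokens, embedding dim 64
scores = np.random.rand(196)             # e.g. CLS-to-patch attention weights
pruned = prune_tokens(tokens, scores, keep_ratio=0.25)
print(pruned.shape)                      # (49, 64)
```

Real methods differ mainly in where the scores come from (attention maps, learned predictors, gradients) and in whether discarded tokens are dropped, merged, or fused into a summary token.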
Paper : "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 ( ). [ ][ ]
Paper : "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 ( ). [ ][ ]
Paper : "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "Early Convolutions Help Transformers See Better", NeurIPS, 2021 ( ). [ ]
Paper : "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 ( ). [ ][ ]
Paper : "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 ( ). [ ]
Paper : "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 ( ). [ ][ ]
Paper : "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 ( ). [ ]
Paper : "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 ( ). [ ]
Paper : "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 ( ). [ ][ ]
Paper : "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 ( ). [ ][ ]
Paper : "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Inception Transformer", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 ( ). [ ]
Paper : "Convolutional Xformers for Vision", arXiv, 2022 ( ). [ ][ ]
Paper : "Patches Are All You Need?", arXiv, 2022 ( ). [ ][ ]
Paper : "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 ( ). [ ][ ]
Paper : "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 ( ). [ ][ ]
Paper : "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 ( ). [ ]
Paper : "MetaFormer Baselines for Vision", arXiv, 2022 ( ). [ ][ ]
Paper : "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 ( ). [ ][ ]
Paper : "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 ( ). [ ]
Paper : "Visual Attention Network", arXiv, 2022 ( ). [ ][ ]
Paper : "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 ( ). [ ][ ]
Paper : "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 ( ). [ ][ ]
Paper : "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 ( ). [ ][ ]
Paper : "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 ( ). [ ][ ]
Paper : "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 ( ). [ ][ ]
Paper : "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", ICCV, 2023 ( ). [ ][ ]
Paper : "SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers", ICCVW, 2023 ( ). [ ]
Paper : "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 ( ). [ ][ ]
Paper : "RepViT: Revisiting Mobile CNN From ViT Perspective", arXiv, 2023 ( ). [ ][ ]
Paper : "Interpret Vision Transformers as ConvNets with Dynamic Convolutions", arXiv, 2023 ( ). [ ]
Paper : "UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer", AAAI, 2024 ( ). [ ]
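A recurring idea in the CNN + Transformer entries above is replacing or augmenting ViT's patch-embedding stem with convolutions. For contrast, here is the plain ViT patchify-and-project step that these hybrids modify — the patch size, embedding width, and random projection matrix are illustrative assumptions:

```python
import numpy as np

def patchify(img, p=16):
    """Split an HxWxC image into non-overlapping p x p patches, each flattened to a vector."""
    H, W, C = img.shape
    gh, gw = H // p, W // p
    x = img[:gh * p, :gw * p].reshape(gh, p, gw, p, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, p * p * C)
    return x

img = np.random.rand(224, 224, 3)
patches = patchify(img)              # (196, 768): 14x14 patches of 16*16*3 values
W_embed = np.random.randn(768, 64)   # stands in for the learned linear projection
tokens = patches @ W_embed
print(tokens.shape)                  # (196, 64)
```

Hybrid designs such as LeViT or CMT replace this single large-stride projection with a stack of small strided convolutions, which injects locality and tends to stabilize training on smaller datasets.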
Paper : "Generative Pretraining From Pixels", ICML, 2020 ( ). [ ][ ]
Paper : "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 ( ). [ ][ ]
Paper : "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 ( ). [ ]
Paper : "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 ( ). [ ][ ]
Paper : "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 ( ). [ ][ ]
Paper : "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 ( ). [ ]
Paper : "SiT: Self-supervised Vision Transformer", arXiv, 2021 ( ). [ ][ ]
Paper : "Self-Supervised Learning with Swin Transformers", arXiv, 2021 ( ). [ ][ ]
Paper : "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 ( ). [ ]
Paper : "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 ( ). [ ]
Paper : "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 ( ). [ ][ ]
Paper : "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 ( ). [ ]
Paper : "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 ( ). [ ][ ]
Paper : "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 ( ). [ ]
Paper : "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 ( ). [ ][ ]
Paper : "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 ( ). [ ][ ]
Paper : "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 ( ). [ ]
Paper : "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 ( ). [ ]
Paper : "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 ( ). [ ]
Paper : "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 ( ). [ ]
Paper : "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 ( ). [ ][ ]
Paper : "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 ( ). [ ][ ]
Paper : "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 ( ). [ ][ ]
Paper : "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 ( ). [ ]
Paper : "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 ( ). [ ][ ]
Paper : "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 ( ). [ ][ ]
Paper : "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 ( ). [ ][ ]
Paper : "Training Vision Transformers with Only 2040 Images", ECCV, 2022 ( ). [ ]
Paper : "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 ( ). [ ][ ]
Paper : "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 ( ). [ ][ ]
Paper : "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 ( ). [ ]
Paper : "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 ( ). [ ][ ]
Paper : "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 ( ). [ ][ ]
Paper : "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 ( ). [ ]
Paper : "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 ( ). [ ]
Paper : "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 ( ). [ ][ ][ ]
Paper : "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 ( ). [ ]
Paper : "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 ( ). [ ]
Paper : "DeiT III: Revenge of the ViT", arXiv, 2022 ( ). [ ]
Paper : "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 ( ). [ ][ ]
Paper : "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 ( ). [ ][ ]
Paper : "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 ( ). [ ][ ]
Paper : "GMML is All you Need", arXiv, 2022 ( ). [ ][ ]
Paper : "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 ( ). [ ]
Paper : "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 ( ). [ ][ ]
Paper : "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 ( ). [ ]
Paper : "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 ( ). [ ]
Paper : "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 ( ). [ ]
Paper : "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 ( ). [ ][ ][ ]
Paper : "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 ( ). [ ][ ]
Paper : "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 ( ). [ ][ ]
Paper : "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 ( ). [ ][ ]
Paper : "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 ( ). [ ]
Paper : "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 ( ). [ ]
Paper : "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 ( ). [ ]
Paper : "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "Location-Aware Self-Supervised Transformers", arXiv, 2022 ( ). [ ]
Paper : "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 ( ). [ ][ ]
Paper : "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 ( ). [ ][ ]
Paper : "Masked Image Modeling with Denoising Contrast", ICLR, 2023 ( ). [ ][ ]
Paper : "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 ( ). [ ]
Paper : "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 ( ). [ ]
Paper : "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 ( ). [ ][ ]
Paper : "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 ( ). [ ]
Paper : "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 ( ). [ ]
Paper : "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 ( ). [ ]
Paper : "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 ( ). [ ][ ]
Paper : "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 ( ). [ ][ ]
Paper : "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 ( ). [ ][ ]
Paper : "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 ( ). [ ][ ]
Paper : "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 ( ). [ ][ ]
Paper : "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 ( ). [ ][ ]
Paper : "DropKey for Vision Transformer", CVPR, 2023 ( ). [ ]
Paper : "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 ( ). [ ][ ]
Paper : "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 ( ). [ ]
Paper : "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 ( ). [ ][ ]
Paper : "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 ( ). [ ][ ]
Paper : "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 ( ). [ ][ ]
Paper : "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 ( ). [ ]
Paper : "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 ( ). [ ][ ]
Paper : "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 ( ). [ ][ ]
Paper : "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 ( ). [ ]
Paper : "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 ( ). [ ][ ]
Paper : "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 ( ). [ ]
Paper : "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 ( ). [ ][ ]
Paper : "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 ( ). [ ][ ]
Paper : "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 ( ). [ ][ ]
Paper : "Stitchable Neural Networks", CVPR, 2023 ( ). [ ][ ]
Paper : "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 ( ). [ ][ ]
Paper : "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 ( ). [ ]
Paper : "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 ( ). [ ][ ]
Paper : "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 ( ). [ ][ ]
Paper : "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 ( ). [ ][ ]
Paper : "DreamTeacher: Pretraining Image Backbones with Deep Generative Models", ICCV, 2023 ( ). [ ][ ]
Paper : "Pre-training Vision Transformers with Very Limited Synthesized Images", ICCV, 2023 ( ). [ ][ ]
Paper : "Improving Pixel-based MIM by Reducing Wasted Modeling Capability", ICCV, 2023 ( ). [ ][ ]
Paper : "Token-Label Alignment for Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "SMMix: Self-Motivated Image Mixing for Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "Diffusion Models as Masked Autoencoders", ICCV, 2023 ( ). [ ][ ]
Paper : "The effectiveness of MAE pre-pretraining for billion-scale pretraining", ICCV, 2023 ( ). [ ][ ]
Paper : "Teaching CLIP to Count to Ten", ICCV, 2023 ( ). [ ]
Paper : "Perceptual Grouping in Vision-Language Models", ICCV, 2023 ( ). [ ]
Paper : "CiT: Curation in Training for Effective Vision-Language Data", ICCV, 2023 ( ). [ ][ ]
Paper : "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", ICCV, 2023 ( ). [ ]
Paper : "EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones", ICCV, 2023 ( ). [ ][ ]
Paper : "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Improving CLIP Training with Language Rewrites", NeurIPS, 2023 ( ). [ ][ ]
Paper : "DesCo: Learning Object Recognition with Rich Language Descriptions", NeurIPS, 2023 ( ). [ ]
Paper : "Stable and low-precision training for large-scale vision-language models", NeurIPS, 2023 ( ). [ ]
Paper : "Image Captioners Are Scalable Vision Learners Too", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Does Visual Pretraining Help End-to-End Reasoning?", NeurIPS, 2023 ( ). [ ]
Paper : "An Inverse Scaling Law for CLIP Training", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Towards In-context Scene Understanding", NeurIPS, 2023 ( ). [ ]
Paper : "RevColV2: Exploring Disentangled Representations in Masked Image Modeling", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Improving Multimodal Datasets with Image Captioning", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ]
Paper : "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 ( ). [ ]
Paper : "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 ( ). [ ]
Paper : "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 ( ). [ ]
Paper : "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 ( ). [ ]
Paper : "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 ( ). [ ]
Paper : "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 ( ). [ ]
Paper : "Improved baselines for vision-language pre-training", arXiv, 2023 ( ). [ ]
Paper : "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 ( ). [ ]
Paper : "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 ( ). [ ]
Paper : "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 ( ). [ ]
Paper : "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 ( ). [ ][ ]
Paper : "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 ( ). [ ]
Paper : "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 ( ). [ ][ ]
Paper : "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 ( ). [ ][ ]
Paper : "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 ( ). [ ][ ]
Paper : "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", arXiv, 2023 ( ). [ ]
Paper : "Predicting masked tokens in stochastic locations improves masked image modeling", arXiv, 2023 ( ). [ ]
Paper : "From Sparse to Soft Mixtures of Experts", arXiv, 2023 ( ). [ ]
Paper : "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Masked Image Residual Learning for Scaling Deeper Vision Transformers", NeurIPS, 2023 ( ). [ ]
Paper : "Investigating the Limitation of CLIP Models: The Worst-Performing Categories", arXiv, 2023 ( ). [ ]
Paper : "Longer-range Contextualized Masked Autoencoder", arXiv, 2023 ( ). [ ]
Paper : "SILC: Improving Vision Language Pretraining with Self-Distillation", arXiv, 2023 ( ). [ ]
Paper : "CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement", arXiv, 2023 ( ). [ ]
Paper : "Object Recognition as Next Token Prediction", arXiv, 2023 ( ). [ ][ ]
Paper : "Scaling Laws of Synthetic Images for Model Training ... for Now", arXiv, 2023 ( ). [ ][ ]
Paper : "Learning Vision from Models Rivals Learning Vision from Data", arXiv, 2023 ( ). [ ][ ]
Paper : "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 ( ). [ ]
Paper : "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 ( ). [ ]
Paper : "Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders", WACV, 2024 ( ). [ ][ ]
Paper : "Neural Clustering based Visual Representation Learning", CVPR, 2024 ( ). [ ]
Paper : "EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training", TPAMI, 2024 ( ). [ ][ ]
Paper : "Denoising Vision Transformers", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "Scalable Pre-training of Large Autoregressive Image Models", arXiv, 2024 ( ). [ ][ ]
Paper : "Deconstructing Denoising Diffusion Models for Self-Supervised Learning", arXiv, 2024 ( ). [ ]
Paper : "Rethinking Patch Dependence for Masked Autoencoders", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "Learning and Leveraging World Models in Visual Representation Learning", arXiv, 2024 ( ). [ ]
Paper : "Can Generative Models Improve Self-Supervised Representation Learning?", arXiv, 2024 ( ). [ ]
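Several of the pre-training entries above (MAE, SimMIM, ConvMAE, and their variants) rest on random masking of patch tokens, with the encoder seeing only the visible subset. A minimal sketch of MAE-style random masking — the mask ratio and token shapes are illustrative, not tied to any one paper:

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """Split tokens into a visible subset and a masked index set, MAE-style."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    visible_idx = np.sort(perm[:n_keep])   # encoder input positions
    masked_idx = np.sort(perm[n_keep:])    # positions the decoder must reconstruct
    return tokens[visible_idx], visible_idx, masked_idx

tokens = np.random.randn(196, 64)
visible, vis_idx, mask_idx = random_masking(tokens)
print(visible.shape, mask_idx.shape)       # (49, 64) (147,)
```

The efficiency of MAE comes from exactly this asymmetry: the heavy encoder processes only the ~25% visible tokens, while a lightweight decoder reconstructs the masked positions.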
Paper : "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 ( ). [ ]
Paper : "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 ( ). [ ]
Paper : "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 ( ). [ ][ ]
Paper : "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 ( ). [ ][ ]
Paper : "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 ( ). [ ]
Paper : "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 ( ). [ ]
Paper : "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 ( ). [ ]
Paper : "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 ( ). [ ][ ]
Paper : "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 ( ). [ ]
Paper : "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 ( ). [ ]
Paper : "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 ( ). [ ]
Paper : "Vision Transformers are Robust Learners", AAAI, 2022 ( ). [ ][ ]
Paper : "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 ( ). [ ][ ]
Paper : "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 ( ). [ ]
Paper : "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 ( ). [ ][ ]
Paper : "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 ( ). [ ]
Paper : "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 ( ). [ ]
Paper : "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 ( ). [ ]
Paper : "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 ( ). [ ]
Paper : "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 ( ). [ ]
Paper : "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Towards Robust Vision Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 ( ). [ ]
Paper : "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 ( ). [ ][ ]
Paper : "Understanding The Robustness in Vision Transformers", ICML, 2022 ( ). [ ][ ]
Paper : "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 ( ). [ ][ ]
Paper : "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 ( ). [ ][ ]
Paper : "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 ( ). [ ]
Paper : "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 ( ). [ ]
Paper : "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 ( ). [ ]
Paper : "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 ( ). [ ]
Paper : "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 ( ). [ ]
Paper : "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 ( ). [ ]
Paper : "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 ( ). [ ]
Paper : "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 ( ). [ ]
Paper : "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 ( ). [ ]
Paper : "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "Federated Adversarial Training with Transformers", arXiv, 2022 ( ). [ ]
Paper : "Backdoor Attacks on Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 ( ). [ ]
Paper : "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 ( ). [ ]
Paper : "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 ( ). [ ]
Paper : "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Attacking Compressed Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Visual Prompting for Adversarial Robustness", arXiv, 2022 ( ). [ ]
Paper : "Curved Representation Space of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 ( ). [ ]
Paper : "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 ( ). [ ]
Paper : "Revisiting adapters with adversarial training", ICLR, 2023 ( ). [ ]
Paper : "Budgeted Training for Vision Transformer", ICLR, 2023 ( ). [ ]
Paper : "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 ( ). [ ][ ]
Paper : "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 ( ). [ ][ ]
Paper : "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 ( ). [ ][ ]
Paper : "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 ( ). [ ]
Paper : "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 ( ). [ ]
Paper : "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 ( ). [ ]
Paper : "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 ( ). [ ]
Paper : "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 ( ). [ ]
Paper : "Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting", ICCV, 2023 ( ). [ ][ ]
Paper : "Efficiently Robustify Pre-trained Models", ICCV, 2023 ( ). [ ]
Paper : "Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients", ICCV, 2023 ( ). [ ]
Paper : "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", ICCV, 2023 ( ). [ ][ ]
Paper : "Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks", BMVC, 2023 ( ). [ ]
Paper : "RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias", BMVC, 2023 ( ). [ ]
Paper : "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 ( ). [ ]
Paper : "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 ( ). [ ]
Paper : "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 ( ). [ ][ ]
Paper : "Robustifying Token Attention for Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 ( ). [ ]
Paper : "SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 ( ). [ ]
Paper : "Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers", CVPR, 2024 ( ). [ ][ ]
Paper : "Safety of Multimodal Large Language Models on Images and Text", arXiv, 2024 ( ). [ ]
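The robustness papers above repeatedly build on gradient-sign perturbations as the basic adversarial primitive. As an illustrative sketch (not any specific paper's method), a single FGSM-style step can be written in a few lines; the gradients are assumed to be supplied by the caller, whereas a real attack would obtain them from a model via backpropagation:

```python
# Minimal FGSM-style perturbation sketch: x_adv = clip(x + eps * sign(grad)).
# Pure Python for illustration; `pixels` and `grads` are flat lists of floats.

def sign(v):
    # Returns 1, -1, or 0 (bool arithmetic yields ints in Python)
    return (v > 0) - (v < 0)

def fgsm_step(pixels, grads, eps=0.03):
    """Apply one fast-gradient-sign step and clip values back to [0, 1]."""
    return [min(1.0, max(0.0, x + eps * sign(g))) for x, g in zip(pixels, grads)]

adv = fgsm_step([0.5, 0.2, 0.99], [0.7, -1.2, 0.4], eps=0.1)
```

Patch attacks on ViTs (e.g. the Patch-Fool line of work) restrict such perturbations to the pixels of one or a few input patches rather than the whole image.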
Paper : "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 ( ). [ ]
Paper : "Visual Transformer Pruning", arXiv, 2021 ( ). [ ]
Paper : "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 ( ). [ ]
Paper : "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 ( ). [ ][ ]
Paper : "Unified Visual Transformer Compression", ICLR, 2022 ( ). [ ][ ]
Paper : "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 ( ). [ ][ ]
Paper : "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", FPL, 2022 ( ). [ ]
Paper : "Towards Accurate Post-Training Quantization for Vision Transformer", ACMMM, 2022 ( ). [ ]
Paper : "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 ( ). [ ][ ]
Paper : "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 ( ). [ ]
Paper : "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 ( ). [ ]
Paper : "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 ( ). [ ][ ]
Paper : "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 ( ). [ ]
Paper : "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 ( ). [ ]
Paper : "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 ( ). [ ]
Paper : "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 ( ). [ ]
Paper : "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers", CVPR, 2023 ( ). [ ][ ]
Paper : "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023 ( ). [ ]
Paper : "X-Pruner: eXplainable Pruning for Vision Transformers", CVPR, 2023 ( ). [ ][ ]
Paper : "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023 ( ). [ ]
Paper : "Global Vision Transformer Pruning with Hessian-Aware Saliency", CVPR, 2023 ( ). [ ]
Paper : "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models", CVPRW, 2023 ( ). [ ][ ]
Paper : "Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023 ( ). [ ][ ]
Paper : "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", ICML, 2023 ( ). [ ][ ]
Paper : "COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models", ICML, 2023 ( ). [ ][ ]
Paper : "Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "BiViT: Extremely Compressed Binary Vision Transformer", ICCV, 2023 ( ). [ ]
Paper : "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", ICCV, 2023 ( ). [ ][ ]
Paper : "RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", EMNLP, 2023 ( ). [ ][ ]
Paper : "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023 ( ). [ ]
Paper : "Bi-ViT: Pushing the Limit of Vision Transformer Quantization", arXiv, 2023 ( ). [ ]
Paper : "BinaryViT: Towards Efficient and Accurate Binary Vision Transformers", arXiv, 2023 ( ). [ ]
Paper : "Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers", arXiv, 2023 ( ). [ ]
Paper : "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", arXiv, 2023 ( ). [ ]
Paper : "Variation-aware Vision Transformer Quantization", arXiv, 2023 ( ). [ ][ ]
Paper : "Data-independent Module-aware Pruning for Hierarchical Vision Transformers", ICLR, 2024 ( ). [ ][ ]
Paper : "MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer", CVPR, 2024 ( ). [ ][ ]
Paper : "Dense Vision Transformer Compression with Few Samples", CVPR, 2024 ( ). [ ]
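The post-training quantization papers above (PTQ4ViT, FQ-ViT, RepQ-ViT, etc.) all refine the same baseline scheme: uniform affine quantization, which maps floats to b-bit integers via a scale and zero-point computed from the tensor's range. A minimal sketch, assuming per-tensor min/max calibration:

```python
# Uniform affine post-training quantization sketch (per-tensor, min/max range).
# Real PTQ methods improve the range estimate and handle ViT-specific outliers.

def quantize(weights, bits=8):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0   # avoid div-by-zero for constant tensors
    zero_point = round(-lo / scale)
    q = [max(0, min(2 ** bits - 1, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

w = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, s, z = quantize(w, bits=8)
w_hat = dequantize(q, s, z)   # reconstruction error is bounded by the scale
```

Much of the ViT-specific difficulty documented in these papers comes from heavy-tailed activation distributions (post-LayerNorm and post-Softmax), which make naive min/max ranges like this one wasteful.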

Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Attention-Free

Paper : "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 ( ). [ ][ ]
Paper : "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 ( ). [ ]
Paper : "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 ( ). [ ][ ]
Paper : "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 ( ). [ ]
Paper : "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 ( ). [ ]
Paper : "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 ( ). [ ][ ]
Paper : "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 ( ). [ ]
Paper : "S²-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 ( ). [ ]
Paper : "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 ( ). [ ][ ]
Paper : "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 ( ). [ ]
Paper : "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 ( ). [ ]
Paper : "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 ( ). [ ][ ]
Paper : "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 ( ). [ ]
Paper : "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 ( ). [ ][ ][ ][ ]
Paper : "Pay Attention to MLPs", NeurIPS, 2021 ( ). [ ][ ]
Paper : "S²-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 ( ). [ ]
Paper : "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 ( ). [ ][ ]
Paper : "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 ( ). [ ][ ]
Paper : "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 ( ). [ ][ ]
Paper : "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 ( ). [ ][ ]
Paper : "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 ( ). [ ]
Paper : "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 ( ). [ ]
Paper : "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 ( ). [ ]
Paper : "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 ( ). [ ]
Paper : "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 ( ). [ ][ ]
Paper : "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 ( ). [ ][ ]
Paper : "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 ( ). [ ][ ]
Paper : "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 ( ). [ ]
Paper : "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 ( ). [ ]
Paper : "Adaptive Frequency Filters As Efficient Global Token Mixers", ICCV, 2023 ( ). [ ]
Paper : "Strip-MLP: Efficient Token Interaction for Vision MLP", ICCV, 2023 ( ). [ ][ ]
Paper : "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 ( ). [ ][ ]
Paper : "MetaFormer is Actually What You Need for Vision", CVPR, 2022 ( ). [ ][ ]
Paper : "A ConvNet for the 2020s", CVPR, 2022 ( ). [ ][ ]
Paper : "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "Focal Modulation Networks", NeurIPS, 2022 ( ). [ ][ ]
Paper : "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 ( ). [ ][ ][ ]
Paper : "S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces", NeurIPS, 2022 ( ). [ ]
Paper : "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 ( ). [ ]
Paper : "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 ( ). [ ]
Paper : "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 ( ). [ ]
Paper : "Image as Set of Points", ICLR, 2023 ( ). [ ][ ]
Paper : "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 ( ). [ ][ ]
Paper : "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 ( ). [ ][ ]
Paper : "SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 ( ). [ ][ ]
Paper : "ConvNets Match Vision Transformers at Scale", arXiv, 2023 ( ). [ ]
Paper : "VMamba: Visual State Space Model", arXiv, 2024 ( ). [ ][ ]
Paper : "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", arXiv, 2024 ( ). [ ][ ]
Paper : "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures", arXiv, 2024 ( ). [ ][ ]
Paper : "LocalMamba: Visual State Space Model with Windowed Selective Scan", arXiv, 2024 ( ). [ ][ ]
Paper : "SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series", arXiv, 2024 ( ). [ ][ ]
Paper : "PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition", arXiv, 2024 ( ). [ ][ ]
Paper : "EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba", arXiv, 2024 ( ). [ ][ ]
Paper : "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs", arXiv, 2024 ( ). [ ]
Paper : "MambaOut: Do We Really Need Mamba for Vision?", arXiv, 2024 ( ). [ ][ ]
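The MLP-based backbones in this subsection (MLP-Mixer and its descendants) replace self-attention with two alternating linear maps: one applied across the token axis ("token mixing") and one across the channel axis ("channel mixing"). A toy sketch of one such block, with illustrative hand-picked weights instead of learned ones, and with the LayerNorm, GELU, and residual connections of the real architecture omitted:

```python
# Core MLP-Mixer idea: mix across tokens, then across channels.
# x is a tokens x channels matrix; w_token is tokens x tokens;
# w_channel is channels x channels. Weights here are toy values.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mixer_block(x, w_token, w_channel):
    x = matmul(w_token, x)    # token mixing: each output token blends all input tokens
    x = matmul(x, w_channel)  # channel mixing: blends features within each token
    return x

y = mixer_block([[1.0, 2.0], [3.0, 4.0]],
                w_token=[[0.5, 0.5], [0.5, 0.5]],    # toy averaging weights
                w_channel=[[1.0, 0.0], [0.0, 1.0]])  # identity
```

Variants in the list above differ mainly in how the token-mixing map is constrained (spatial shifts, axial mixing, cyclic permutations, FFT-based mixing, etc.) to recover locality or reduce parameters.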

Ultimate-Awesome-Transformer-Attention / Image Classification / Backbone / Analysis for Transformer

Paper : "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 ( ). [ ][ ][ ]
Paper : "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 ( ). [ ][ ]
Paper : "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 ( ). [ ]
Paper : "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 ( ). [ ]
Paper : "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 ( ). [ ]
Paper : "Intriguing Properties of Vision Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper : "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 ( ). [ ]
Paper : "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 ( ). [ ]
Paper : "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 ( ). [ ]
Paper : "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 ( ). [ ]
Paper : "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 ( ). [ ][ ]
Paper : "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 ( ). [ ][ ][ ]
Paper : "How Do Vision Transformers Work?", ICLR, 2022 ( ). [ ][ ]
Paper : "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 ( ). [ ][ ]
Paper : "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 ( ). [ ]
Paper : "Three things everyone should know about Vision Transformers", ECCV, 2022 ( ). [ ]
Paper : "Vision Transformers provably learn spatial structure", NeurIPS, 2022 ( ). [ ]
Paper : "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 ( ). [ ][ ]
Paper : "Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers", CVPR, 2023 ( ). [ ][ ]
Paper : "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 ( ). [ ]
Paper : "Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems", arXiv, 2022 ( ). [ ]
Paper : "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 ( ). [ ][ ]
Paper : "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 ( ). [ ][ ]
Paper : "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 ( ). [ ]
Paper : "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 ( ). [ ][ ]
Paper : "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 ( ). [ ]
Paper : "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 ( ). [ ]
Paper : "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 ( ). [ ][ ]
Paper : "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 ( ). [ ]
Paper : "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 ( ). [ ]
Paper : "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 ( ). [ ][ ]
Paper : "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 ( ). [ ]
Paper : "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 ( ). [ ]
Paper : "Understanding Masked Autoencoders via Hierarchical Latent Variable Models", CVPR, 2023 ( ). [ ]
Paper : "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Masked Autoencoding Does Not Help Natural Language Supervision at Scale", CVPR, 2023 ( ). [ ]
Paper : "On Data Scaling in Masked Image Modeling", CVPR, 2023 ( ). [ ][ ]
Paper : "Revealing the Dark Secrets of Masked Image Modeling", CVPR, 2023 ( ). [ ]
Paper : "VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking", CVPRW, 2023 ( ). [ ][ ]
Paper : "A Multidimensional Analysis of Social Biases in Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "Analyzing Vision Transformers for Image Classification in Class Embedding Space", NeurIPS, 2023 ( ). [ ]
Paper : "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Are Vision Transformers More Data Hungry Than Newborn Visual Systems?", NeurIPS, 2023 ( ). [ ]
Paper : "AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "AttentionViz: A Global View of Transformer Attention", arXiv, 2023 ( ). [ ][ ]
Paper : "Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields", arXiv, 2023 ( ). [ ]
Paper : "Reviving Shift Equivariance in Vision Transformers", arXiv, 2023 ( ). [ ]
Paper : "ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer", arXiv, 2023 ( ). [ ]
Paper : "Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems", arXiv, 2023 ( ). [ ]
Paper : "A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis", arXiv, 2023 ( ). [ ][ ]
Paper : "Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention", AAAI, 2024 ( ). [ ][ ]
Paper : "Can Transformers Capture Spatial Relations between Objects?", ICLR, 2024 ( ). [ ][ ][ ]
Paper : "Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer", CVPR, 2024 ( ). [ ]
Paper : "On the Faithfulness of Vision Transformer Explanations", CVPR, 2024 ( ). [ ]
Paper : "A Decade's Battle on Dataset Bias: Are We There Yet?", arXiv, 2024 ( ). [ ][ ]
Paper : "LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity", arXiv, 2024 ( ). [ ][ ]
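Several of the interpretability papers above visualize where a ViT attends by propagating attention through the layers. A widely used baseline for this is attention rollout: average the heads per layer, add the identity to account for residual connections, row-normalize, and multiply the per-layer maps together. A minimal sketch, assuming head-averaged attention matrices are already given:

```python
# Attention-rollout sketch: accumulate attention across layers, modeling
# the residual connection as an added identity before renormalizing rows.

def rollout(per_layer_attn):
    """per_layer_attn: list of square row-stochastic attention matrices."""
    n = len(per_layer_attn[0])
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for attn in per_layer_attn:
        aug = [[attn[i][j] + (i == j) for j in range(n)] for i in range(n)]
        aug = [[v / sum(row) for v in row] for row in aug]          # renormalize rows
        result = [[sum(aug[i][k] * result[k][j] for k in range(n)) for j in range(n)]
                  for i in range(n)]
    return result

r = rollout([[[0.5, 0.5], [0.5, 0.5]]])  # one layer of uniform attention
```

Methods like Transformer Interpretability Beyond Attention Visualization and LeGrad replace this purely attention-based accumulation with gradient- or relevance-weighted variants.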

Ultimate-Awesome-Transformer-Attention / Detection / Object Detection / General:

Paper : "detrex: Benchmarking Detection Transformers", arXiv, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Object Detection / CNN-based backbone:

Paper : "End-to-End Object Detection with Transformers", ECCV, 2020 ( ). [ ][ ]
Paper : "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 ( ). [ ][ ]
Paper : "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 ( ). [ ][ ]
Paper : "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 ( ). [ ][ ]
Paper : "Conditional DETR for Fast Training Convergence", ICCV, 2021 ( ). [ ]
Paper : "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 ( ). [ ]
Paper : "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 ( ). [ ]
Paper : "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 ( ). [ ]
Paper : "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 ( ). [ ][ ]
Paper : "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 ( ). [ ]
Paper : "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 ( ). [ ]
Paper : "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 ( ). [ ]
Paper : "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 ( ). [ ][ ]
Paper : "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 ( ). [ ]
Paper : "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 ( ). [ ][ ]
Paper : "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 ( ). [ ][ ]
Paper : "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 ( ). [ ][ ]
Paper : "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 ( ). [ ][ ]
Paper : "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 ( ). [ ][ ]
Paper : "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 ( ). [ ][ ]
Paper : "DESTR: Object Detection With Split Transformer", CVPR, 2022 ( ). [ ]
Paper : "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 ( ). [ ]
Paper : "Towards Data-Efficient Detection Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 ( ). [ ]
Paper : "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 ( ). [ ]
Paper : "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper : "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 ( ). [ ]
Paper : "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 ( ). [ ]
Paper : "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 ( ). [ ][ ]
Paper : "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 ( ). [ ]
Paper : "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 ( ). [ ]
Paper : "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 ( ). [ ]
Paper : "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 ( ). [ ]
Paper : "D³ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 ( ). [ ]
Paper : "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 ( ). [ ][ ]
Paper : "NMS Strikes Back", arXiv, 2022 ( ). [ ][ ]
Paper : "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 ( ). [ ][ ]
Paper : "Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR", CVPR, 2023 ( ). [ ][ ]
Paper : "Dense Distinct Query for End-to-End Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "Siamese DETR", CVPR, 2023 ( ). [ ][ ]
Paper : "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", CVPR, 2023 ( ). [ ]
Paper : "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023 ( ). [ ][ ]
Paper : "DETRs with Hybrid Matching", CVPR, 2023 ( ). [ ][ ]
Paper : "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", CVPR, 2023 ( ). [ ][ ]
Paper : "Enhanced Training of Query-Based Object Detection via Selective Query Recollection", CVPR, 2023 ( ). [ ][ ]
Paper : "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation", ICML, 2023 ( ). [ ]
Paper : "SpeedDETR: Speed-aware Transformers for End-to-end Object Detection", ICML, 2023 ( ). [ ]
Paper : "AlignDet: Aligning Pre-training and Fine-tuning in Object Detection", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Less is More: Focus Attention for Efficient DETR", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "DETR Doesn't Need Multi-Scale or Locality Design", ICCV, 2023 ( ). [ ][ ]
Paper : "ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation", ICCV, 2023 ( ). [ ][ ]
Paper : "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "Detection Transformer with Stable Matching", ICCV, 2023 ( ). [ ][ ]
Paper : "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", ICCV, 2023 ( ). [ ][ ]
Paper : "DETRs with Collaborative Hybrid Assignments Training", ICCV, 2023 ( ). [ ][ ]
Paper : "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", ICCV, 2023 ( ). [ ]
Paper : "Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection", ICCV, 2023 ( ). [ ]
Paper : "StageInteractor: Query-based Object Detector with Cross-stage Interaction", ICCV, 2023 ( ). [ ]
Paper : "Rank-DETR for High Quality Object Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Cal-DETR: Calibrated Detection Transformer", NeurIPS, 2023 ( ). [ ][ ]
Paper : "KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer", arXiv, 2023 ( ). [ ][ ]
Paper : "FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "DETRs Beat YOLOs on Real-time Object Detection", arXiv, 2023 ( ). [ ]
Paper : "Align-DETR: Improving DETR with Simple IoU-aware BCE loss", arXiv, 2023 ( ). [ ][ ]
Paper : "Box-DETR: Understanding and Boxing Conditional Spatial Queries", arXiv, 2023 ( ). [ ][ ]
Paper : "Enhancing Your Trained DETRs with Box Refinement", arXiv, 2023 ( ). [ ][ ]
Paper : "Revisiting DETR Pre-training for Object Detection", arXiv, 2023 ( ). [ ]
Paper : "Gen2Det: Generate to Detect", arXiv, 2023 ( ). [ ]
Paper : "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions", CVPR, 2024 ( ). [ ][ ]
Paper : "Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement", CVPR, 2024 ( ). [ ][ ]
Paper : "MS-DETR: Efficient DETR Training with Mixed Supervision", arXiv, 2024 ( ). [ ][ ]
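The common thread of the DETR family above is set prediction: each image yields a fixed set of query predictions that are matched one-to-one against ground-truth boxes by minimizing a pairwise cost (class probability plus box overlap terms), and the loss is computed on the matched pairs. A brute-force sketch of that matching step for tiny, square cost matrices; real implementations use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`):

```python
# Bipartite matching sketch for DETR-style set prediction.
# cost[i][j] is the cost of assigning prediction i to ground truth j.
# Brute force over permutations -- only viable for toy-sized inputs.
from itertools import permutations

def match(cost):
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return list(best_perm)

# prediction 0 is cheap to pair with gt 1, prediction 1 with gt 0
assignment = match([[10.0, 1.0], [2.0, 9.0]])
```

Many papers in this list (DN-DETR, Group DETR, the hybrid-matching line) target exactly the instability and sparse supervision of this one-to-one assignment during training.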

Ultimate-Awesome-Transformer-Attention / Detection / Object Detection / Transformer-based backbone:

Paper : "Toward Transformer-Based Object Detection", arXiv, 2020 ( ). [ ]
Paper : "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 ( ). [ ]
Paper : "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 ( ). [ ][ ]
Paper : "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 ( ). [ ]
Paper : "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 ( ). [ ][ ]
Paper : "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 ( ). [ ]
Paper : "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 ( ). [ ]
Paper : "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 ( ). [ ]
Paper : "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 ( ). [ ]
Paper : "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 ( ). [ ]
Paper : "D²ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 ( ). [ ]
Paper : "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 ( ). [ ][ ]
Paper : "SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / 3D Object Detection

Paper : "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 ( ). [ ][ ]
Paper : "3D Object Detection with Pointformer", arXiv, 2020 ( ). [ ]
Paper : "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 ( ). [ ][ ]
Paper : "Group-Free 3D Object Detection via Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "Voxel Transformer for 3D Object Detection", ICCV, 2021 ( ). [ ]
Paper : "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 ( ). [ ][ ][ ]
Paper : "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 ( ). [ ]
Paper : "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 ( ). [ ][ ]
Paper : "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 ( ). [ ][ ]
Paper : "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 ( ). [ ]
Paper : "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 ( ). [ ]
Paper : "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 ( ). [ ]
Paper : "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 ( ). [ ]
Paper : "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 ( ). [ ][ ]
Paper : "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 ( ). [ ]
Paper : "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper : "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper : "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 ( ). [ ][ ]
Paper : "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 ( ). [ ]
Paper : "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 ( ). [ ]
Paper : "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 ( ). [ ]
Paper : "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 ( ). [ ][ ]
Paper : "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 ( ). [ ][ ]
Paper : "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 ( ). [ ][ ]
Paper : "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper : "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 ( ). [ ][ ]
Paper : "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 ( ). [ ]
Paper : "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper : "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 ( ). [ ]
Paper : "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 ( ). [ ][ ]
Paper : "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 ( ). [ ]
Paper : "PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "OcTr: Octree-based Transformer for 3D Object Detection", CVPR, 2023 ( ). [ ]
Paper : "MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer", CVPR, 2023 ( ). [ ]
Paper : "PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer", CVPR, 2023 ( ). [ ][ ]
Paper : "ConQueR: Query Contrast Voxel-DETR for 3D Object Detection", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets", CVPR, 2023 ( ). [ ][ ]
Paper : "AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers", CVPR, 2023 ( ). [ ][ ]
Paper : "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training", CVPR, 2023 ( ). [ ][ ]
Paper : "FocalFormer3D: Focusing on Hard Instance for 3D Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "Efficient Transformer-based 3D Object Detection with Dynamic Token Halting", ICCV, 2023 ( ). [ ]
Paper : "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", ICCV, 2023 ( ). [ ][ ]
Paper : "Object as Query: Lifting any 2D Object Detector to 3D Detection", ICCV, 2023 ( ). [ ]
Paper : "An Empirical Analysis of Range for 3D Object Detection", ICCVW, 2023 ( ). [ ]
Paper : "Uni3DETR: Unified 3D Detection Transformer", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection", arXiv, 2023 ( ). [ ][ ]
Paper : "V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection", arXiv, 2023 ( ). [ ][ ]
Paper : "3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection", arXiv, 2023 ( ). [ ][ ]
Paper : "Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection", AAAI, 2024 ( ). [ ]
Paper : "MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection", ICLR, 2024 ( ). [ ][ ]
Paper : "Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors", CVPR, 2024 ( ). [ ]
Paper : "ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention", arXiv, 2024 ( ). [ ][ ]
Paper : "MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Multi-Modal Detection

Paper : "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 ( ). [ ][ ]
Paper : "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 ( ). [ ][ ][ ]
Paper : "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 ( ). [ ]
Paper : "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 ( ). [ ][ ]
Paper : "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 ( ). [ ]
Paper : "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 ( ). [ ][ ]
Paper : "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 ( ). [ ]
Paper : "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 ( ). [ ]
Paper : "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 ( ). [ ]
Paper : "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 ( ). [ ]
Paper : "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 ( ). [ ][ ]
Paper : "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", ICLR, 2023 ( ). [ ][ ]
Paper : "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation", CVPR, 2023 ( ). [ ][ ]
Paper : "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", CVPR, 2023 ( ). [ ]
Paper : "OmniLabel: A Challenging Benchmark for Language-Based Object Detection", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Multi-Modal Classifiers for Open-Vocabulary Object Detection", ICML, 2023 ( ). [ ][ ][ ]
Paper : "CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Contextual Object Detection with Multimodal Large Language Models", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / HOI Detection

Paper : "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 ( ). [ ][ ]
Paper : "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 ( ). [ ][ ]
Paper : "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 ( ). [ ]
Paper : "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 ( ). [ ]
Paper : "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 ( ). [ ][ ]
Paper : "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 ( ). [ ]
Paper : "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 ( ). [ ][ ]
Paper : "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 ( ). [ ]
Paper : "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 ( ). [ ]
Paper : "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection", CVPR, 2022 ( ). [ ][ ]
Paper : "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 ( ). [ ][ ]
Paper : "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 ( ). [ ]
Paper : "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 ( ). [ ]
Paper : "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 ( ). [ ][ ]
Paper : "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 ( ). [ ]
Paper : "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models", CVPR, 2023 ( ). [ ][ ]
Paper : "ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework", CVPR, 2023 ( ). [ ]
Paper : "Category Query Learning for Human-Object Interaction Classification", CVPR, 2023 ( ). [ ][ ]
Paper : "Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection", ICCV, 2023 ( ). [ ]
Paper : "Exploring Predicate Visual Context in Detecting of Human-Object Interactions", ICCV, 2023 ( ). [ ][ ]
Paper : "Agglomerative Transformer for Human-Object Interaction Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "RLIPv2: Fast Scaling of Relational Language-Image Pre-training", ICCV, 2023 ( ). [ ][ ]
Paper : "EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding", ICCV, 2023 ( ). [ ][ ]
Paper : "Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Neural-Logic Human-Object Interaction Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels", arXiv, 2023 ( ). [ ]
Paper : "Disentangled Pre-training for Human-Object Interaction Detection", CVPR, 2024 ( ). [ ][ ]
Paper : "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision", arXiv, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Salient Object Detection

Paper : "Visual Saliency Transformer", ICCV, 2021 ( ). [ ]
Paper : "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 ( ). [ ]
Paper : "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 ( ). [ ][ ]
Paper : "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 ( ). [ ]
Paper : "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 ( ). [ ]
Paper : "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 ( ). [ ]
Paper : "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 ( ). [ ]
Paper : "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 ( ). [ ]
Paper : "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper : "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper : "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 ( ). [ ][ ]
Paper : "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 ( ). [ ]
Paper : "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection", ACMMM, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / X-supervised:

Paper : "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 ( ). [ ][ ]
Paper : "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 ( ). [ ]
Paper : "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 ( ). [ ][ ]
Paper : "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 ( ). [ ][ ][ ]
Paper : "Semi-DETR: Semi-Supervised Object Detection With Detection Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Object Discovery from Motion-Guided Tokens", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Cut and Learn for Unsupervised Object Detection and Instance Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames", ICML, 2023 ( ). [ ]
Paper : "MOST: Multiple Object localization with Self-supervised Transformers for object discovery", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Generative Prompt Model for Weakly Supervised Object Localization", ICCV, 2023 ( ). [ ][ ]
Paper : "Spatial-Aware Token for Weakly Supervised Object Localization", ICCV, 2023 ( ). [ ][ ]
Paper : "ALWOD: Active Learning for Weakly-Supervised Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "HASSOD: Hierarchical Adaptive Self-Supervised Object Detection", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers", arXiv, 2023 ( ). [ ]
Paper : "R-MAE: Regions Meet Masked Autoencoders", arXiv, 2023 ( ). [ ]
Paper : "SimDETR: Simplifying self-supervised pretraining for DETR", arXiv, 2023 ( ). [ ]
Paper : "Unsupervised Universal Image Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers", CVPR, 2024 ( ). [ ][ ]
Paper : "Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection", CVPR, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / X-Shot Object Detection:

Paper : "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 ( ). [ ]
Paper : "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 ( ). [ ][ ]
Paper : "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 ( ). [ ]
Paper : "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 ( ). [ ]
Paper : "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 ( ). [ ]
Paper : "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper : "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 ( ). [ ]
Paper : "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 ( ). [ ]
Paper : "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", ICCV, 2023 ( ). [ ]
Paper : "Meta-ZSDETR: Zero-shot DETR with Meta-learning", ICCV, 2023 ( ). [ ]
Paper : "Revisiting Few-Shot Object Detection with Vision-Language Models", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Open-World/Vocabulary:

Paper : "OW-DETR: Open-world Detection Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 ( ). [ ][ ]
Paper : "RegionCLIP: Region-based Language-Image Pretraining", CVPR, 2022 ( ). [ ][ ]
Paper : "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 ( ). [ ]
Paper : "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 ( ). [ ]
Paper : "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 ( ). [ ][ ][ ]
Paper : "P³OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 ( ). [ ]
Paper : "Aligning Bag of Regions for Open-Vocabulary Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "CapDet: Unifying Dense Captioning and Open-World Detection Pretraining", CVPR, 2023 ( ). [ ]
Paper : "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching", CVPR, 2023 ( ). [ ][ ]
Paper : "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment", CVPR, 2023 ( ). [ ]
Paper : "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers", CVPR, 2023 ( ). [ ]
Paper : "CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "Learning to Detect and Segment for Open Vocabulary Object Detection", CVPR, 2023 ( ). [ ]
Paper : "Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "Open-vocabulary Attribute Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "OvarNet: Towards Open-vocabulary Object Attribute Recognition", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Annealing-Based Label-Transfer Learning for Open World Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "PROB: Probabilistic Objectness for Open World Object Detection", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Random Boxes Are Open-world Object Detectors", ICCV, 2023 ( ). [ ][ ]
Paper : "Cascade-DETR: Delving into High-Quality Universal Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment", ICCV, 2023 ( ). [ ][ ]
Paper : "V3Det: Vast Vocabulary Visual Detection Dataset", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection", NeurIPS, 2023 ( ). [ ][ ]
Paper : "DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Scaling Open-Vocabulary Object Detection", NeurIPS, 2023 ( ). [ ]
Paper : "Multi-modal Queried Object Detection in the Wild", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", arXiv, 2023 ( ). [ ]
Paper : "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning", arXiv, 2023 ( ). [ ]
Paper : "Three ways to improve feature alignment for open vocabulary detection", arXiv, 2023 ( ). [ ]
Paper : "Open-Vocabulary Object Detection using Pseudo Caption Labels", arXiv, 2023 ( ). [ ]
Paper : "Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ]
Paper : "LOWA: Localize Objects in the Wild with Attributes", arXiv, 2023 ( ). [ ]
Paper : "Open-Vocabulary Object Detection via Scene Graph Discovery", arXiv, 2023 ( ). [ ]
Paper : "Improving Pseudo Labels for Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ]
Paper : "Detect Every Thing with Few Examples", arXiv, 2023 ( ). [ ][ ]
Paper : "CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction", arXiv, 2023 ( ). [ ][ ]
Paper : "DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ][ ]
Paper : "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection", arXiv, 2023 ( ). [ ]
Paper : "Recognize Any Regions", arXiv, 2023 ( ). [ ][ ]
Paper : "Language-conditioned Detection Transformer", arXiv, 2023 ( ). [ ][ ]
Paper : "Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection", arXiv, 2023 ( ). [ ]
Paper : "Open World Object Detection in the Era of Foundation Models", arXiv, 2023 ( ). [ ][ ]
Paper : "LP-OVOD: Open-Vocabulary Object Detection by Linear Probing", WACV, 2024 ( ). [ ]
Paper : "ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection", WACV, 2024 ( ). [ ]
Paper : "Weakly Supervised Open-Vocabulary Object Detection", AAAI, 2024 ( ). [ ][ ]
Paper : "CLIM: Contrastive Language-Image Mosaic for Region Representation", AAAI, 2024 ( ). [ ][ ]
Paper : "Semi-supervised Open-World Object Detection", AAAI, 2024 ( ). [ ][ ]
Paper : "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors", ICLR, 2024 ( ). [ ]
Paper : "Generative Region-Language Pretraining for Open-Ended Object Detection", CVPR, 2024 ( ). [ ][ ]
Paper : "DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection", CVPR, 2024 ( ). [ ]
Paper : "Retrieval-Augmented Open-Vocabulary Object Detection", CVPR, 2024 ( ). [ ][ ]
Paper : "SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection", CVPR, 2024 ( ). [ ]
Paper : "An Open and Comprehensive Pipeline for Unified Object Grounding and Detection", arXiv, 2024 ( ). [ ][ ]
Paper : "YOLO-World: Real-Time Open-Vocabulary Object Detection", arXiv, 2024 ( ). [ ][ ]
Paper : "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Pedestrian Detection:

Paper : "DETR for Crowd Pedestrian Detection", arXiv, 2020 ( ). [ ][ ]
Paper : "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 ( ). [ ]
Paper : "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 ( ). [ ][ ]
Paper : "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision", CVPR, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Lane Detection:

Paper : "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 ( ). [ ][ ]
Paper : "Line Segment Detection Using Transformers without Edges", CVPR, 2021 ( ). [ ][ ]
Paper : "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 ( ). [ ]
Paper : "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 ( ). [ ]
Paper : "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 ( ). [ ][ ]
Paper : "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 ( ). [ ]
Paper : "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 ( ). [ ][ ]
Paper : "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 ( ). [ ]
Paper : "LATR: 3D Lane Detection from Monocular Images with Transformer", ICCV, 2023 ( ). [ ][ ]
Paper : "End-to-End Lane detection with One-to-Several Transformer", arXiv, 2023 ( ). [ ][ ]
Paper : "Lane2Seq: Towards Unified Lane Detection via Sequence Generation", CVPR, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Object Localization:

Paper : "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 ( ). [ ]
Paper : "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 ( ). [ ]
Paper : "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 ( ). [ ][ ]
Paper : "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 ( ). [ ][ ]
Paper : "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 ( ). [ ]
Paper : "CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation", ICML, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Relation Detection:

Paper : "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 ( ). [ ]
Paper : "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 ( ). [ ]
Paper : "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 ( ). [ ]
Paper : "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 ( ). [ ][ ]
Paper : "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 ( ). [ ]
Paper : "Unified Visual Relationship Detection with Vision and Language Models", ICCV, 2023 ( ). [ ][ ]
Paper : "Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models", NeurIPS, 2023 ( ). [ ]
Paper : "Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Anomaly Detection:

Paper : "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 ( ). [ ]
Paper : "Inpainting Transformer for Anomaly Detection", arXiv, 2021 ( ). [ ]
Paper : "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 ( ). [ ]
Paper : "WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation", CVPR, 2023 ( ). [ ]
Paper : "Multimodal Industrial Anomaly Detection via Hybrid Fusion", CVPR, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Cross-Domain:

Paper : "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 ( ). [ ]
Paper : "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 ( ). [ ]
Paper : "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 ( ). [ ]
Paper : "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 ( ). [ ]
Paper : "DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection", CVPR, 2023 ( ). [ ]
Paper : "DA-DETR: Domain Adaptive Detection Transformer with Information Fusion", CVPR, 2023 ( ). [ ]
Paper : "CLIP the Gap: A Single Domain Generalization Approach for Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Co-Salient Object Detection:

Paper : "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Oriented Object Detection:

Paper : "Oriented Object Detection with Transformer", arXiv, 2021 ( ). [ ]
Paper : "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 ( ). [ ]
Paper : "ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer", arXiv, 2023 ( ). [ ][ ]
Paper : "RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Multiview Detection:

Paper : "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Polygon Detection:

Paper : "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Drone-view:

Paper : "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 ( ). [ ]
Paper : "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Infrared:

Paper : "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 ( ). [ ]
Paper : "MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Text Detection:

Paper : "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 ( ). [ ][ ]
Paper : "Text Spotting Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 ( ). [ ]
Paper : "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 ( ). [ ]
Paper : "End-to-End Video Text Spotting with Transformer", arXiv, 2022 ( ). [ ][ ]
Paper : "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 ( ). [ ]
Paper : "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 ( ). [ ][ ]
Paper : "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 ( ). [ ]
Paper : "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 ( ). [ ]
Paper : "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI, 2023 ( ). [ ][ ]
Paper : "Turning a CLIP Model into a Scene Text Detector", CVPR, 2023 ( ). [ ][ ]
Paper : "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting", CVPR, 2023 ( ). [ ][ ]
Paper : "ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer", ICCV, 2023 ( ). [ ][ ]
Paper : "PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer", ACMMM, 2023 ( ). [ ]
Paper : "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting", arXiv, 2023 ( ). [ ][ ]
Paper : "Turning a CLIP Model into a Scene Text Spotter", arXiv, 2023 ( ). [ ][ ]
Paper : "SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis", CVPR, 2024 ( ). [ ]
Paper : "SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Change Detection:

Paper : "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 ( ). [ ][ ]
Paper : "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Edge Detection:

Paper : "EDTER: Edge Detection with Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Person Search:

Paper : "Cascade Transformers for End-to-End Person Search", CVPR, 2022 ( ). [ ][ ]
Paper : "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Manipulation Detection:

Paper : "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Mirror Detection:

Paper : "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Shadow Detection:

Paper : "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", CVPR, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Keypoint Detection:

Paper : "From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Continual Learning:

Paper : "Continual Detection Transformer for Incremental Object Detection", CVPR, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Visual Query Detection/Localization:

Paper : "Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization", CVPR, 2023 ( ). [ ][ ]
Paper : "Single-Stage Visual Query Localization in Egocentric Videos", NeurIPS, 2023 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Task-Driven Object Detection:

Paper : "CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection", ICCV, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Detection / Other Detection Tasks / Diffusion:

Paper : "DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Text-image Alignment for Diffusion-based Perception", arXiv, 2023 ( ). [ ][ ]
Paper : "InstaGen: Enhancing Object Detection by Training on Synthetic Dataset", arXiv, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Semantic Segmentation

Paper : "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 ( ). [ ][ ][ ]
Paper : "TrSeg: Transformer for semantic segmentation", PRL, 2021 ( ). [ ][ ]
Paper : "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 ( ). [ ][ ]
Paper : "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 ( ). [ ][ ]
Paper : "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 ( ). [ ][ ]
Paper : "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 ( ). [ ]
Paper : "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper : "Per-Pixel Classification is Not All You Need for Semantic Segmentation", NeurIPS, 2021 ( ). [ ][ ]
Paper : "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 ( ). [ ]
Paper : "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 ( ). [ ]
Paper : "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 ( ). [ ][ ]
Paper : "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 ( ). [ ]
Paper : "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 ( ). [ ]
Paper : "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 ( ). [ ]
Paper : "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 ( ). [ ][ ]
Paper : "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 ( ). [ ]
Paper : "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 ( ). [ ][ ]
Paper : "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 ( ). [ ][ ]
Paper : "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 ( ). [ ]
Paper : "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 ( ). [ ][ ]
Paper : "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 ( ). [ ]
Paper : "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 ( ). [ ]
Paper : "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 ( ). [ ][ ][ ]
Paper : "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 ( ). [ ][ ]
Paper : "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 ( ). [ ][ ][ ]
Paper : "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 ( ). [ ]
Paper : "Probabilistic Prompt Learning for Dense Prediction", CVPR, 2023 ( ). [ ]
Paper : "AutoFocusFormer: Image Segmentation off the Grid", CVPR, 2023 ( ). [ ]
Paper : "Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Transformer Scale Gate for Semantic Segmentation", CVPR, 2023 ( ). [ ]
Paper : "Dynamic Focus-aware Positional Queries for Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper : "Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper : "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models", NeurIPS, 2023 ( ). [ ][ ]
Paper : "AiluRus: A Scalable ViT Framework for Dense Prediction", NeurIPS, 2023 ( ). [ ][ ]
Paper : "SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "Dynamic Token-Pass Transformers for Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Category Feature Transformer for Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Superpixel Transformers for Efficient Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation", AAAI, 2024 ( ). [ ][ ]
Paper : "Region-Based Representations Revisited", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Depth Estimation

Paper : "Vision Transformers for Dense Prediction", ICCV, 2021 ( ). [ ][ ]
Paper : "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 ( ). [ ][ ]
Paper : "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 ( ). [ ][ ]
Paper : "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAPP, 2022 ( ). [ ]
Paper : "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 ( ). [ ]
Paper : "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 ( ). [ ]
Paper : "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 ( ). [ ]
Paper : "Depth Estimation with Simplified Transformer", CVPRW, 2022 ( ). [ ]
Paper : "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 ( ). [ ][ ]
Paper : "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 ( ). [ ][ ]
Paper : "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 ( ). [ ]
Paper : "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 ( ). [ ]
Paper : "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 ( ). [ ][ ]
Paper : "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 ( ). [ ][ ]
Paper : "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 ( ). [ ]
Paper : "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 ( ). [ ]
Paper : "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 ( ). [ ]
Paper : "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 ( ). [ ][ ]
Paper : "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 ( ). [ ]
Paper : "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 ( ). [ ]
Paper : "Lightweight Monocular Depth Estimation via Token-Sharing Transformer", ICRA, 2023 ( ). [ ]
Paper : "CompletionFormer: Depth Completion with Convolutions and Vision Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", CVPR, 2023 ( ). [ ][ ]
Paper : "EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation", ICCV, 2023 ( ). [ ]
Paper : "Towards Zero-Shot Scale-Aware Monocular Depth Estimation", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Win-Win: Training High-Resolution Vision Transformers from Two Windows", arXiv, 2023 ( ). [ ]
Paper : "Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation", WACV, 2024 ( ). [ ]
Paper : "DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions", CVPR, 2024 ( ). [ ]
Paper : "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data", arXiv, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Object Segmentation

Paper : "SOTR: Segmenting Objects with Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 ( ). [ ][ ]
Paper : "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 ( ). [ ][ ]
Paper : "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 ( ). [ ][ ]
Paper : "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 ( ). [ ]
Paper : "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Any-X/Every-X:

Paper : "Segment Anything", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Segment Everything Everywhere All at Once", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Segment Anything in High Quality", NeurIPS, 2023 ( ). [ ][ ]
Paper : "An Empirical Study on the Robustness of the Segment Anything Model (SAM)", arXiv, 2023 ( ). [ ]
Paper : "A Comprehensive Survey on Segment Anything Model for Vision and Beyond", arXiv, 2023 ( ). [ ]
Paper : "SAD: Segment Any RGBD", arXiv, 2023 ( ). [ ][ ]
Paper : "A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering", arXiv, 2023 ( ). [ ]
Paper : "Robustness of SAM: Segment Anything Under Corruptions and Beyond", arXiv, 2023 ( ). [ ]
Paper : "Fast Segment Anything", arXiv, 2023 ( ). [ ][ ]
Paper : "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications", arXiv, 2023 ( ). [ ][ ]
Paper : "Semantic-SAM: Segment and Recognize Anything at Any Granularity", arXiv, 2023 ( ). [ ][ ]
Paper : "Follow Anything: Open-set detection, tracking, and following in real-time", arXiv, 2023 ( ). [ ]
Paper : "Visual In-Context Prompting", arXiv, 2023 ( ). [ ][ ]
Paper : "Stable Segment Anything Model", arXiv, 2023 ( ). [ ][ ]
Paper : "EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything", arXiv, 2023 ( ). [ ]
Paper : "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "RepViT-SAM: Towards Real-Time Segmenting Anything", arXiv, 2023 ( ). [ ][ ]
Paper : "0.1% Data Makes Segment Anything Slim", arXiv, 2023 ( ). [ ][ ]
Paper : "Interfacing Foundation Models' Embeddings", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "SqueezeSAM: User-friendly mobile interactive segmentation", arXiv, 2023 ( ). [ ]
Paper : "Tokenize Anything via Prompting", arXiv, 2023 ( ). [ ][ ]
Paper : "MobileSAMv2: Faster Segment Anything to Everything", arXiv, 2023 ( ). [ ][ ]
Paper : "TinySAM: Pushing the Envelope for Efficient Segment Anything Model", arXiv, 2023 ( ). [ ][ ]
Paper : "Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model", ICLR, 2024 ( ). [ ][ ]
Paper : "Personalize Segment Anything Model with One Shot", ICLR, 2024 ( ). [ ][ ]
Paper : "VRP-SAM: SAM with Visual Reference Prompt", CVPR, 2024 ( ). [ ]
Paper : "Unsegment Anything by Simulating Deformation", CVPR, 2024 ( ). [ ][ ]
Paper : "ASAM: Boosting Segment Anything Model with Adversarial Tuning", CVPR, 2024 ( ). [ ][ ][ ]
Paper : "PTQ4SAM: Post-Training Quantization for Segment Anything", CVPR, 2024 ( ). [ ][ ]
Paper : "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model", arXiv, 2024 ( ). [ ]
Paper : "Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "Learning to Prompt Segment Anything Models", arXiv, 2024 ( ). [ ]
Paper : "RAP-SAM: Towards Real-Time All-Purpose Segment Anything", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper : "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks", arXiv, 2024 ( ). [ ][ ]
Paper : "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss", arXiv, 2024 ( ). [ ][ ]
Paper : "DeiSAM: Segment Anything with Deictic Prompting", arXiv, 2024 ( ). [ ]
Paper : "CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM", arXiv, 2024 ( ). [ ][ ]
Paper : "Part-aware Personalized Segment Anything Model for Patient-Specific Segmentation", arXiv, 2024 ( ). [ ]
Paper : "Practical Region-level Attack against Segment Anything Models", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Vision-Language:

Paper : "Language-driven Semantic Segmentation", ICLR, 2022 ( ). [ ][ ]
Paper : "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "Image Segmentation Using Text and Image Prompts", CVPR, 2022 ( ). [ ][ ]
Paper : "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "Extract Free Dense Labels from CLIP", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency", ICLR, 2023 ( ). [ ][ ]
Paper : "LMSeg: Language-guided Multi-dataset Segmentation", ICLR, 2023 ( ). [ ]
Paper : "VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations", ICRA, 2023 ( ). [ ][ ]
Paper : "Generalized Decoding for Pixel, Image, and Language", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "IFSeg: Image-free Semantic Segmentation via Vision-Language Model", CVPR, 2023 ( ). [ ][ ]
Paper : "Delving into Shape-aware Zero-shot Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ]
Paper : "Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Learning Mask-aware CLIP Representations for Zero-Shot Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ][ ]
Paper : "ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts", arXiv, 2023 ( ). [ ]
Paper : "SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery", arXiv, 2023 ( ). [ ]
Paper : "Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "[CLS] Token is All You Need for Zero-Shot Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding", arXiv, 2023 ( ). [ ]
Paper : "Grounding Everything: Emerging Localization Properties in Vision-Language Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation", AAAI, 2024 ( ). [ ][ ]
Paper : "Annotation Free Semantic Segmentation with Vision Foundation Models", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Open-World/Vocabulary:

Paper : "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 ( ). [ ]
Paper : "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 ( ). [ ][ ]
Paper : "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels", ECCV, 2022 ( ). [ ]
Paper : "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 ( ). [ ][ ]
Paper : "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", CVPR, 2023 ( ). [ ][ ]
Paper : "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations", CVPR, 2023 ( ). [ ][ ]
Paper : "FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation", CVPR, 2023 ( ). [ ]
Paper : "Side Adapter Network for Open-Vocabulary Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", CVPR, 2023 ( ). [ ]
Paper : "Open-Vocabulary Universal Image Segmentation with MaskCLIP", ICML, 2023 ( ). [ ][ ]
Paper : "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", ICML, 2023 ( ). [ ][ ]
Paper : "Exploring Transformers for Open-world Instance Segmentation", ICCV, 2023 ( ). [ ]
Paper : "Open-vocabulary Object Segmentation with Diffusion Models", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning", ICCV, 2023 ( ). [ ][ ]
Paper : "Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "A Simple Framework for Open-Vocabulary Segmentation and Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "Open-vocabulary Panoptic Segmentation with Embedding Modulation", ICCV, 2023 ( ). [ ]
Paper : "Global Knowledge Calibration for Fast Open-Vocabulary Segmentation", ICCV, 2023 ( ). [ ]
Paper : "Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only", ICCV, 2023 ( ). [ ]
Paper : "MasQCLIP for Open-Vocabulary Universal Image Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Going Denser with Open-Vocabulary Part Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network", ICCV, 2023 ( ). [ ][ ]
Paper : "MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper : "OV-PARTS: Towards Open-Vocabulary Part Segmentation", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ]
Paper : "Hierarchical Open-vocabulary Universal Image Segmentation", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation", NeurIPS, 2023 ( ). [ ]
Paper : "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP", NeurIPS, 2023 ( ). [ ][ ]
Paper : "A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Diffusion Models for Zero-Shot Open-Vocabulary Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Unified Open-Vocabulary Dense Visual Prediction", arXiv, 2023 ( ). [ ]
Paper : "CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free", arXiv, 2023 ( ). [ ]
Paper : "Rethinking Evaluation Metrics of Open-Vocabulary Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Towards Open-Ended Visual Recognition with Large Language Model", arXiv, 2023 ( ). [ ][ ]
Paper : "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models", arXiv, 2023 ( ). [ ]
Paper : "SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference", arXiv, 2023 ( ). [ ]
Paper : "Towards Granularity-adjusted Pixel-level Semantic Annotation", arXiv, 2023 ( ). [ ]
Paper : "Boosting Segment Anything Model Towards Open-Vocabulary Learning", arXiv, 2023 ( ). [ ][ ]
Paper : "Open-Vocabulary Segmentation with Semantic-Assisted Calibration", arXiv, 2023 ( ). [ ][ ]
Paper : "Self-Guided Open-Vocabulary Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "OpenSD: Unified Open-Vocabulary Segmentation and Detection", arXiv, 2023 ( ). [ ]
Paper : "CLIP-DINOiser: Teaching CLIP a few DINO tricks", arXiv, 2023 ( ). [ ][ ]
Paper : "TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation", CVPR, 2024 ( ). [ ]
Paper : "Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation", CVPR, 2024 ( ). [ ][ ]
Paper : "Exploring Simple Open-Vocabulary Semantic Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper : "PosSAM: Panoptic Open-vocabulary Segment Anything", arXiv, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / LLM-based:

Paper : "LISA: Reasoning Segmentation via Large Language Model", arXiv, 2023 ( ). [ ][ ]
Paper : "PixelLM: Pixel Reasoning with Large Multimodal Model", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Pixel Aligned Language Models", arXiv, 2023 ( ). [ ][ ]
Paper : "GSVA: Generalized Segmentation via Multimodal Large Language Models", arXiv, 2023 ( ). [ ]
Paper : "An Improved Baseline for Reasoning Segmentation with Large Language Model", arXiv, 2023 ( ). [ ]
Paper : "GROUNDHOG: Grounding Large Language Models to Holistic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model", arXiv, 2024 ( ). [ ][ ]
Paper : "Empowering Segmentation Ability to Multi-modal Large Language Models", arXiv, 2024 ( ). [ ]
Paper : "LaSagnA: Language-based Segmentation Assistant for Complex Queries", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Universal Segmentation:

Paper : "K-Net: Towards Unified Image Segmentation", NeurIPS, 2021 ( ). [ ][ ]
Paper : "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "MP-Former: Mask-Piloted Transformer for Image Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "OneFormer: One Transformer to Rule Universal Image Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Universal Instance Perception as Object Discovery and Retrieval", CVPR, 2023 ( ). [ ][ ]
Paper : "CLUSTSEG: Clustering for Universal Segmentation", ICML, 2023 ( ). [ ]
Paper : "DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model", NeurIPS, 2023 ( ). [ ]
Paper : "DFormer: Diffusion-guided Transformer for Universal Image Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task", arXiv, 2023 ( ). [ ]
Paper : "Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation", arXiv, 2023 ( ). [ ]
Paper : "SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "PolyMaX: General Dense Prediction with Mask Transformer", WACV, 2024 ( ). [ ][ ]
Paper : "PEM: Prototype-based Efficient MaskFormer for Image Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "OMG-Seg: Is One Model Good Enough For All Segmentation?", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision", arXiv, 2024 ( ). [ ][ ]
Paper : "Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Multi-Modal:

Paper : "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 ( ). [ ]
Paper : "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "Delivering Arbitrary-Modal Semantic Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Panoptic Segmentation:

Paper : "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 ( ). [ ][ ]
Paper : "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 ( ). [ ]
Paper : "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 ( ). [ ]
Paper : "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 ( ). [ ]
Paper : "Panoptic SegFormer", CVPR, 2022 ( ). [ ][ ]
Paper : "k-means Mask Transformer", ECCV, 2022 ( ). [ ][ ]
Paper : "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper : "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", CVPR, 2023 ( ). [ ]
Paper : "You Only Segment Once: Towards Real-Time Panoptic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "A Generalist Framework for Panoptic Segmentation of Images and Videos", ICCV, 2023 ( ). [ ][ ]
Paper : "Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning", ICCV, 2023 ( ). [ ][ ]
Paper : "ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning", CVPR, 2024 ( ). [ ][ ]
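Most entries above follow the mask-transformer formulation introduced by MaX-DeepLab and refined by the k-means/clustering variants: N learned queries each predict a class distribution and a mask, and a semantic map is obtained by combining the two per pixel. A minimal NumPy sketch of that semantic inference step (shapes and names are illustrative, not taken from any specific paper):

```python
import numpy as np

def mask_transformer_inference(class_logits, mask_logits):
    """Combine per-query class and mask predictions into a segmentation map.

    class_logits: (N, C+1) -- last column is the 'no object' class
    mask_logits:  (N, H, W) -- per-query mask logits
    Returns: (H, W) integer map of the predicted class per pixel.
    """
    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    cls_prob = softmax(class_logits)[:, :-1]        # (N, C), drop 'no object'
    mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))  # (N, H, W) sigmoid
    # score[c, h, w] = sum over queries of P(class c | query) * P(pixel in mask)
    scores = np.einsum('nc,nhw->chw', cls_prob, mask_prob)
    return scores.argmax(axis=0)                    # (H, W)
```

Full panoptic inference adds query filtering and instance-ID assignment on top of this, but the class-probability-times-mask-probability combination is the shared core.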

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Instance Segmentation:

Paper : "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 ( ). [ ][ ]
Paper : "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 ( ). [ ]
Paper : "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 ( ). [ ]
Paper : "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Vision Transformers Are Good Mask Auto-Labelers", CVPR, 2023 ( ). [ ][ ]
Paper : "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt", CVPR, 2023 ( ). [ ]
Paper : "X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion", ICML, 2023 ( ). [ ][ ]
Paper : "DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Mask Frozen-DETR: High Quality Instance Segmentation with One GPU", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Optical Flow:

Paper : "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 ( ). [ ][ ]
Paper : "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 ( ). [ ][ ]
Paper : "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 ( ). [ ][ ]
Paper : "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 ( ). [ ][ ]
Paper : "TransFlow: Transformer as Flow Learner", CVPR, 2023 ( ). [ ]
Paper : "FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation", CVPR, 2023 ( ). [ ]
Paper : "FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Panoramic Semantic Segmentation:

Paper : "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation", IJCAI, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / X-Shot:

Paper : "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 ( ). [ ]
Paper : "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 ( ). [ ]
Paper : "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 ( ). [ ]
Paper : "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 ( ). [ ]
Paper : "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 ( ). [ ]
Paper : "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 ( ). [ ]
Paper : "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 ( ). [ ][ ]
Paper : "SegGPT: Segmenting Everything In Context", ICCV, 2023 ( ). [ ][ ]
Paper : "Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Multi-Modal Prototypes for Open-Set Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Few-Shot Panoptic Segmentation With Foundation Models", arXiv, 2023 ( ). [ ][ ]
Paper : "Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach", CVPR, 2024 ( ). [ ]
Paper : "Explore In-Context Segmentation via Latent Diffusion Models", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild", arXiv, 2024 ( ). [ ]
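A recurring baseline underlying many of the few-shot entries above is prototype matching: masked average pooling over support features yields a class prototype, and cosine similarity between the prototype and query features scores each query location. A hedged NumPy sketch (feature shapes and the threshold are illustrative choices):

```python
import numpy as np

def prototype_few_shot_seg(support_feat, support_mask, query_feat, thresh=0.5):
    """support_feat: (H, W, D) features of the support image
    support_mask:  (H, W) binary foreground mask
    query_feat:    (H, W, D) features of the query image
    Returns: (H, W) binary prediction for the query."""
    # Masked average pooling -> one D-dim foreground prototype
    w = support_mask[..., None]                                 # (H, W, 1)
    proto = (support_feat * w).sum(axis=(0, 1)) / max(w.sum(), 1e-6)
    # Cosine similarity between each query feature and the prototype
    q = query_feat / (np.linalg.norm(query_feat, axis=-1, keepdims=True) + 1e-6)
    p = proto / (np.linalg.norm(proto) + 1e-6)
    sim = (q * p).sum(axis=-1)                                  # (H, W) in [-1, 1]
    return (sim > thresh).astype(np.uint8)
```

The transformer-based methods in this list largely replace the single pooled prototype with cross-attention between support and query tokens, but the matching intuition is the same.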

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / X-Supervised:

Paper : "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Cross Language Image Matching for Weakly Supervised Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 ( ). [ ]
Paper : "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 ( ). [ ][ ][ ]
Paper : "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper : "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper : "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 ( ). [ ]
Paper : "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "Token Contrast for Weakly-Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "DPF: Learning Dense Prediction Fields with Weak Supervision", CVPR, 2023 ( ). [ ][ ]
Paper : "SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation", CVPR, 2023 ( ). [ ]
Paper : "AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation", CVPR, 2023 ( ). [ ]
Paper : "Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization", CVPR, 2023 ( ). [ ]
Paper : "A Simple Framework for Text-Supervised Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport", ICCV, 2023 ( ). [ ][ ]
Paper : "BoxSnake: Polygonal Instance Segmentation with Box Supervision", ICCV, 2023 ( ). [ ]
Paper : "Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation", ACMMM, 2023 ( ). [ ][ ]
Paper : "Bridging Semantic Gaps for Language-Supervised Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Label-efficient Segmentation via Affinity Propagation", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "PaintSeg: Training-free Segmentation via Painting", NeurIPS, 2023 ( ). [ ]
Paper : "SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Towards Universal Vision-language Omni-supervised Segmentation", arXiv, 2023 ( ). [ ]
Paper : "MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems", arXiv, 2023 ( ). [ ]
Paper : "Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "MIMIC: Masked Image Modeling with Image Correspondences", arXiv, 2023 ( ). [ ][ ]
Paper : "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Guided Distillation for Semi-Supervised Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper : "MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding", arXiv, 2023 ( ). [ ]
Paper : "Emergence of Segmentation with Minimalistic White-Box Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models", arXiv, 2023 ( ). [ ]
Paper : "Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Foundation Model Assisted Weakly Supervised Semantic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance", arXiv, 2023 ( ). [ ][ ]
Paper : "Progressive Uncertain Feature Self-reinforcement for Weakly Supervised Semantic Segmentation", AAAI, 2024 ( ). [ ][ ]
Paper : "FeatUp: A Model-Agnostic Framework for Features at Any Resolution", ICLR, 2024 ( ). [ ]
Paper : "The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models", ICLR, 2024 ( ). [ ][ ]
Paper : "Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation", arXiv, 2024 ( ). [ ]
Paper : "WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition", arXiv, 2024 ( ). [ ][ ]
Paper : "Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper : "CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation", arXiv, 2024 ( ). [ ]
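Several weakly supervised entries above (e.g. the multi-class token line of work) derive coarse localization maps directly from the attention of class tokens to patch tokens. A simplified NumPy sketch of the idea, assuming C class tokens are prepended to an HxW patch grid (the layout and normalization are illustrative assumptions, not any paper's exact recipe):

```python
import numpy as np

def class_token_localization(attn, num_classes, grid_hw):
    """attn: (T, T) attention matrix averaged over heads, where the first
    `num_classes` rows/cols are class tokens and the rest are patch tokens.
    Returns (C, H, W) localization maps, min-max normalized per class."""
    h, w = grid_hw
    # Attention from each class token to every patch token
    c2p = attn[:num_classes, num_classes:]          # (C, H*W)
    maps = c2p.reshape(num_classes, h, w)
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / (hi - lo + 1e-6)
```

In practice these raw maps are refined (e.g. with patch-to-patch affinity or CRF post-processing) before being used as pseudo-labels.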

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Cross-Domain:

Paper : "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration", CVPR, 2023 ( ). [ ]
Paper : "MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation", CVPR, 2023 ( ). [ ][ ]
Paper : "CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "Prompting Diffusion Representations for Cross-Domain Semantic Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Continual Learning:

Paper : "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class", CVPR, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Crack Detection:

Paper : "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Camouflaged/Concealed Object:

Paper : "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 ( ). [ ][ ]
Paper : "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 ( ). [ ][ ]
Paper : "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping", NeurIPS, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Background Separation:

Paper : "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Scene Understanding:

Paper : "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 ( ). [ ]
Paper : "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 ( ). [ ][ ]
Paper : "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / 3D Segmentation:

Paper : "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 ( ). [ ]
Paper : "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 ( ). [ ][ ]
Paper : "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 ( ). [ ]
Paper : "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 ( ). [ ]
Paper : "VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion", CVPR, 2023 ( ). [ ][ ]
Paper : "GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds", CVPR, 2023 ( ). [ ][ ]
Paper : "RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving", CVPR, 2023 ( ). [ ][ ]
Paper : "Heat Diffusion based Multi-scale and Geometric Structure-aware Transformer for Mesh Segmentation", CVPR, 2023 ( ). [ ]
Paper : "MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving", CVPR, 2023 ( ). [ ][ ]
Paper : "See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data", ICCV, 2023 ( ). [ ]
Paper : "SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation", ICCV, 2023 ( ). [ ]
Paper : "Mask-Attention-Free Transformer for 3D Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase", ICCV, 2023 ( ). [ ][ ]
Paper : "2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision", ICCV, 2023 ( ). [ ]
Paper : "CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion", ICCV, 2023 ( ). [ ]
Paper : "Efficient 3D Semantic Segmentation with Superpoint Transformer", ICCV, 2023 ( ). [ ][ ]
Paper : "SATR: Zero-Shot Semantic Segmentation of 3D Shapes", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "3D Indoor Instance Segmentation in an Open-World", NeurIPS, 2023 ( ). [ ]
Paper : "Segment Anything in 3D with NeRFs", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Position-Guided Point Cloud Panoptic Segmentation Transformer", arXiv, 2023 ( ). [ ][ ]
Paper : "UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes", arXiv, 2023 ( ). [ ][ ]
Paper : "Towards Label-free Scene Understanding by Vision Foundation Models", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Dynamic Clustering Transformer Network for Point Cloud Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Symphonize 3D Semantic Scene Completion with Contextual Instance Queries", arXiv, 2023 ( ). [ ][ ]
Paper : "Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks", arXiv, 2023 ( ). [ ][ ]
Paper : "When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision", arXiv, 2023 ( ). [ ]
Paper : "SAM-guided Unsupervised Domain Adaptation for 3D Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper : "OneFormer3D: One Transformer for Unified Point Cloud Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Segment Any 3D Gaussians", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "SANeRF-HQ: Segment Anything for NeRF in High Quality", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "SAM-guided Graph Cut for 3D Instance Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "SAI3D: Segment Any Instance in 3D Scenes", arXiv, 2023 ( ). [ ]
Paper : "Rethinking Few-shot 3D Point Cloud Semantic Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception", CVPR, 2024 ( ). [ ][ ]
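Many of the voxel-based 3D entries above first quantize an unordered point cloud onto a sparse grid before attention is applied over the occupied voxels. A minimal sketch of that preprocessing step (the voxel size and the count-based pooling are illustrative choices):

```python
import numpy as np

def voxelize(points, voxel_size=0.1):
    """points: (N, 3) xyz coordinates.
    Returns (coords, counts): unique integer voxel coordinates (M, 3)
    and the number of points falling in each voxel (M,)."""
    idx = np.floor(points / voxel_size).astype(np.int64)  # (N, 3) voxel indices
    coords, counts = np.unique(idx, axis=0, return_counts=True)
    return coords, counts
```

Real pipelines pool point features (not just counts) per voxel and keep the sparse coordinate list so attention only runs over occupied cells.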

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Multi-Task:

Paper : "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 ( ). [ ][ ]
Paper : "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 ( ). [ ]
Paper : "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 ( ). [ ]
Paper : "DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction", AAAI, 2023 ( ). [ ][ ]
Paper : "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 ( ). [ ][ ]
Paper : "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token", ICCV, 2023 ( ). [ ][ ]
Paper : "InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding", arXiv, 2023 ( ). [ ]
Paper : "Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction", arXiv, 2023 ( ). [ ][ ]
Paper : "Sub-token ViT Embedding via Stochastic Resonance Transformers", arXiv, 2023 ( ). [ ]
Paper : "Multi-Task Dense Prediction via Mixture of Low-Rank Experts", CVPR, 2024 ( ). [ ]
Paper : "ODIN: A Single Model for 2D and 3D Perception", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Forecasting:

Paper : "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / LiDAR:

Paper : "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 ( ). [ ][ ][ ]
Paper : "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 ( ). [ ]
Paper : "Lidar Panoptic Segmentation and Tracking without Bells and Whistles", IROS, 2023 ( ). [ ][ ]
Paper : "4D-Former: Multimodal 4D Panoptic Segmentation", CoRL, 2023 ( ). [ ][ ]
Paper : "MASK4D: Mask Transformer for 4D Panoptic Segmentation", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Co-Segmentation:

Paper : "ReCo: Retrieve and Co-segment for Zero-shot Transfer", NeurIPS, 2022 ( ). [ ][ ][ ]
Paper : "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 ( ). [ ][ ][ ]
Paper : "LCCo: Lending CLIP to Co-Segmentation", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Top-Down Semantic Segmentation:

Paper : "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Surface Normal:

Paper : "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Applications:

Paper : "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Diffusion:

Paper : "Unleashing Text-to-Image Diffusion Models for Visual Perception", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process", NeurIPS, 2023 ( ). [ ][ ]
Paper : "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion", arXiv, 2023 ( ). [ ]
Paper : "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter", arXiv, 2023 ( ). [ ]
Paper : "From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models", arXiv, 2023 ( ). [ ]
Paper : "A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Low-Level Structure Segmentation:

Paper : "Explicit Visual Prompting for Low-Level Structure Segmentations", CVPR, 2023 ( ). [ ][ ]
Paper : "Explicit Visual Prompting for Universal Foreground Segmentations", arXiv, 2023 ( ). [ ][ ]
Paper : "EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Zero-Guidance Segmentation:

Paper : "Zero-guidance Segmentation Using Zero Segment Labels", arXiv, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Part Segmentation:

Paper : "Towards Open-World Segmentation of Parts", CVPR, 2023 ( ). [ ][ ]
Paper : "PartDistillation: Learning Parts from Instance Segmentation", CVPR, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Entity Segmentation:

Paper : "AIMS: All-Inclusive Multi-Level Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "SOHES: Self-supervised Open-world Hierarchical Entity Segmentation", ICLR, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Evaluation:

Paper : "Robustness Analysis on Foundational Segmentation Models", arXiv, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Interactive Segmentation:

Paper : "InterFormer: Real-time Interactive Image Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "SimpleClick: Interactive Image Segmentation with Simple Vision Transformers", ICCV, 2023 ( ). [ ][ ]
Paper : "Interactive Image Segmentation with Cross-Modality Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "MFP: Making Full Use of Probability Maps for Interactive Image Segmentation", CVPR, 2024 ( ). [ ][ ]
Paper : "GraCo: Granularity-Controllable Interactive Segmentation", CVPR, 2024 ( ). [ ][ ]
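A common input encoding shared by the interactive-segmentation entries above (e.g. the SimpleClick line) turns user clicks into spatial maps concatenated with the image: each positive or negative click becomes a disk (or distance map) in its own channel. A hedged NumPy sketch with an illustrative disk radius:

```python
import numpy as np

def encode_clicks(hw, pos_clicks, neg_clicks, radius=3):
    """hw: (H, W); pos_clicks/neg_clicks: lists of (row, col) user clicks.
    Returns a (2, H, W) float map: channel 0 = positive, 1 = negative."""
    h, w = hw
    out = np.zeros((2, h, w), dtype=np.float32)
    yy, xx = np.mgrid[:h, :w]
    for ch, clicks in enumerate((pos_clicks, neg_clicks)):
        for (r, c) in clicks:
            # Rasterize a filled disk of the given radius around the click
            out[ch][(yy - r) ** 2 + (xx - c) ** 2 <= radius ** 2] = 1.0
    return out
```

The resulting two-channel map is typically concatenated with the RGB input (and often the previous mask) before being fed to the segmentation backbone.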

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Amodal Segmentation:

Paper : "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 ( ). [ ][ ]
Paper : "Coarse-to-Fine Amodal Segmentation with Shape Prior", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation", ICCV, 2023 ( ). [ ][ ]
Paper : "Amodal Ground Truth and Completion in the Wild", arXiv, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / Anomaly Segmentation:

Paper : "Unmasking Anomalies in Road-Scene Segmentation", ICCV, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Segmentation / Other Segmentation Tasks / In-Context Segmentation:

Paper : "SegIC: Unleashing the Emergent Correspondence for In-Context Segmentation", arXiv, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / RGB mainly:

Paper : "Video Action Transformer Network", CVPR, 2019 ( ). [ ][ ]
Paper : "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 ( ). [ ]
Paper : "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 ( ). [ ][ ]
Paper : "Multiscale Vision Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "VidTr: Video Transformer Without Convolutions", ICCV, 2021 ( ). [ ][ ]
Paper : "ViViT: A Video Vision Transformer", ICCV, 2021 ( ). [ ][ ]
Paper : "Video Transformer Network", ICCVW, 2021 ( ). [ ][ ]
Paper : "Token Shift Transformer for Video Classification", ACMMM, 2021 ( ). [ ][ ]
Paper : "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper : "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 ( ). [ ][ ]
Paper : "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 ( ). [ ]
Paper : "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper : "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 ( ). [ ][ ]
Paper : "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 ( ). [ ]
Paper : "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 ( ). [ ]
Paper : "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 ( ). [ ]
Paper : "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 ( ). [ ][ ]
Paper : "Video Swin Transformer", CVPR, 2022 ( ). [ ][ ]
Paper : "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 ( ). [ ][ ]
Paper : "Deformable Video Transformer", CVPR, 2022 ( ). [ ]
Paper : "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 ( ). [ ]
Paper : "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 ( ). [ ][ ]
Paper : "Recurring the Transformer for Video Action Recognition", CVPR, 2022 ( ). [ ]
Paper : "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 ( ). [ ][ ]
Paper : "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 ( ). [ ][ ]
Paper : "Multiview Transformers for Video Recognition", CVPR, 2022 ( ). [ ][ ]
Paper : "Object-Region Video Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 ( ). [ ][ ]
Paper : "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 ( ). [ ][ ]
Paper : "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 ( ). [ ][ ]
Paper : "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 ( ). [ ][ ]
Paper : "Turbo Training with Token Dropout", BMVC, 2022 ( ). [ ]
Paper : "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 ( ). [ ][ ]
Paper : "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 ( ). [ ]
Paper : "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 ( ). [ ][ ]
Paper : "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 ( ). [ ]
Paper : "Efficient Attention-free Video Shift Transformers", arXiv, 2022 ( ). [ ]
Paper : "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 ( ). [ ]
Paper : "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 ( ). [ ]
Paper : "Linear Video Transformer with Feature Fixation", arXiv, 2022 ( ). [ ]
Paper : "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 ( ). [ ]
Paper : "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 ( ). [ ]
Paper : "Dual-path Adaptation from Image to Video Transformers", CVPR, 2023 ( ). [ ][ ]
Paper : "Streaming Video Model", CVPR, 2023 ( ). [ ][ ]
Paper : "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", CVPR, 2023 ( ). [ ]
Paper : "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", CVPR, 2023 ( ). [ ][ ]
Paper : "How can objects help action recognition?", CVPR, 2023 ( ). [ ]
Paper : "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 ( ). [ ][ ]
Paper : "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "What Can Simple Arithmetic Operations Do for Temporal Modeling?", ICCV, 2023 ( ). [ ][ ]
Paper : "Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation", ICCV, 2023 ( ). [ ]
Paper : "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model", ICCV, 2023 ( ). [ ][ ]
Paper : "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition", ICCV, 2023 ( ). [ ][ ]
Paper : "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition", ICCV, 2023 ( ). [ ][ ]
Paper : "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", ICCV, 2023 ( ). [ ][ ]
Paper : "CAST: Cross-Attention in Space and Time for Video Action Recognition", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Learning Human Action Recognition Representations Without Real Humans", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ]
Paper : "SVT: Supertoken Video Transformer for Efficient Video Understanding", arXiv, 2023 ( ). [ ]
Paper : "Prompt Learning for Action Recognition", arXiv, 2023 ( ). [ ]
Paper : "Optimizing ViViT Training: Time and Memory Reduction for Action Recognition", arXiv, 2023 ( ). [ ]
Paper : "Temporally-Adaptive Models for Efficient Video Understanding", arXiv, 2023 ( ). [ ][ ]
Paper : "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video", arXiv, 2023 ( ). [ ]
Paper : "Multi-entity Video Transformers for Fine-Grained Video Representation Learning", arXiv, 2023 ( ). [ ][ ]
Paper : "GeoDeformer: Geometric Deformable Transformer for Action Recognition", arXiv, 2023 ( ). [ ]
Paper : "Early Action Recognition with Action Prototypes", arXiv, 2023 ( ). [ ]
Paper : "Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition", ICLR, 2024 ( ). [ ][ ]
Paper : "Learning Correlation Structures for Vision Transformers", CVPR, 2024 ( ). [ ]
Paper : "VideoMamba: State Space Model for Efficient Video Understanding", arXiv, 2024 ( ). [ ][ ]
Paper : "Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Depth:

Paper : "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Pose/Skeleton:

Paper : "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 ( ). [ ]
Paper : "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 ( ). [ ][ ]
Paper : "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 ( ). [ ]
Paper : "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 ( ). [ ]
Paper : "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 ( ). [ ][ ]
Paper : "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 ( ). [ ]
Paper : "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 ( ). [ ]
Paper : "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 ( ). [ ][ ]
Paper : "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 ( ). [ ][ ]
Paper : "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 ( ). [ ]
Paper : "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 ( ). [ ]
Paper : "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 ( ). [ ]
Paper : "STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper : "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training", ICCV, 2023 ( ). [ ][ ]
Paper : "Masked Motion Predictors are Strong 3D Action Representation Learners", ICCV, 2023 ( ). [ ][ ]
Paper : "LAC - Latent Action Composition for Skeleton-based Action Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "SkeleTR: Towards Skeleton-based Action Recognition in the Wild", ICCV, 2023 ( ). [ ]
Paper : "Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning", ACMMM, 2023 ( ). [ ][ ]
Paper : "Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers", arXiv, 2023 ( ). [ ][ ]
Paper : "On the Utility of 3D Hand Poses for Action Recognition", arXiv, 2024 ( ). [ ][ ][ ]
Paper : "SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition", arXiv, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Multi-modal:

Paper : "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 ( ). [ ]
Paper : "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 ( ). [ ]
Paper : "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 ( ). [ ][ ]
Paper : "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 ( ). [ ]
Paper : "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 ( ). [ ]
Paper : "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 ( ). [ ][ ]
Paper : "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 ( ). [ ]
Paper : "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 ( ). [ ]
Paper : "3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition", CVPR, 2023 ( ). [ ]
Paper : "On Uni-Modal Feature Learning in Supervised Multi-Modal Learning", ICML, 2023 ( ). [ ]
Paper : "Multimodal Distillation for Egocentric Action Recognition", ICCV, 2023 ( ). [ ]
Paper : "MotionBERT: Unified Pretraining for Human Motion Analysis", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "TIM: A Time Interval Machine for Audio-Visual Action Recognition", CVPR, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Recognition / Group Activity:

Paper : "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 ( ). [ ]
Paper : "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 ( ). [ ]
Paper : "Learning Group Activity Features Through Person Attribute Prediction", CVPR, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Detection/Localization

Paper : "OadTR: Online Action Detection with Transformers", ICCV, 2021 ( ). [ ][ ]
Paper : "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 ( ). [ ][ ]
Paper : "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 ( ). [ ][ ]
Paper : "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper : "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 ( ). [ ]
Paper : "Temporal Action Proposal Generation with Transformers", arXiv, 2021 ( ). [ ]
Paper : "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 ( ). [ ][ ]
Paper : "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 ( ). [ ][ ]
Paper : "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 ( ). [ ][ ]
Paper : "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 ( ). [ ]
Paper : "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 ( ). [ ]
Paper : "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 ( ). [ ][ ]
Paper : "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 ( ). [ ][ ]
Paper : "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 ( ). [ ]
Paper : "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 ( ). [ ][ ]
Paper : "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 ( ). [ ]
Paper : "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 ( ). [ ][ ]
Paper : "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 ( ). [ ]
Paper : "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 ( ). [ ][ ]
Paper : "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 ( ). [ ][ ]
Paper : "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 ( ). [ ][ ]
Paper : "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 ( ). [ ]
Paper : "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 ( ). [ ]
Paper : "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 ( ). [ ]
Paper : "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 ( ). [ ]
Paper : "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 ( ). [ ]
Paper : "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 ( ). [ ][ ]
Paper : "On the Benefits of 3D Pose and Tracking for Human Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper : "Efficient Movie Scene Detection using State-Space Transformers", CVPR, 2023 ( ). [ ]
Paper : "Token Turing Machines", CVPR, 2023 ( ). [ ][ ]
Paper : "Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection", CVPR, 2023 ( ). [ ]
Paper : "Self-Feedback DETR for Temporal Action Detection", ICCV, 2023 ( ). [ ]
Paper : "UnLoc: A Unified Framework for Video Localization Tasks", ICCV, 2023 ( ). [ ][ ]
Paper : "Efficient Video Action Detection with Token Dropout and Context Refinement", ICCV, 2023 ( ). [ ][ ]
Paper : "MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction", ACL, 2023 ( ). [ ][ ]
Paper : "End-to-End Spatio-Temporal Action Localisation with Video Transformers", arXiv, 2023 ( ). [ ]
Paper : "DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion", arXiv, 2023 ( ). [ ][ ]
Paper : "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection", arXiv, 2023 ( ). [ ]
Paper : "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection", arXiv, 2023 ( ). [ ]
Paper : "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos", arXiv, 2023 ( ). [ ]
Paper : "Towards More Practical Group Activity Detection: A New Benchmark and Model", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization", arXiv, 2023 ( ). [ ]
Paper : "A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection", TPAMI, 2024 ( ). [ ]
Paper : "Open-Vocabulary Spatio-Temporal Action Detection", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Action Prediction/Anticipation

Paper : "Anticipative Video Transformer", ICCV, 2021 ( ). [ ][ ][ ]
Paper : "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 ( ). [ ]
Paper : "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 ( ). [ ][ ]
Paper : "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 ( ). [ ]
Paper : "Future Transformer for Long-term Action Anticipation", CVPR, 2022 ( ). [ ]
Paper : "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 ( ). [ ][ ]
Paper : "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 ( ). [ ]
Paper : "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 ( ). [ ]
Paper : "Video Prediction by Efficient Transformers", IVC, 2022 ( ). [ ][ ]
Paper : "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 ( ). [ ][ ]
Paper : "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 ( ). [ ]
Paper : "Latency Matters: Real-Time Action Forecasting Transformer", CVPR, 2023 ( ). [ ]
Paper : "AdamsFormer for Spatial Action Localization in the Future", CVPR, 2023 ( ). [ ]
Paper : "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Memory-and-Anticipation Transformer for Online Action Understanding", ICCV, 2023 ( ). [ ][ ]
Paper : "SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM", ICCV, 2023 ( ). [ ][ ]
Paper : "Multiscale Video Pretraining for Long-Term Activity Forecasting", arXiv, 2023 ( ). [ ]
Paper : "DiffAnt: Diffusion Models for Action Anticipation", arXiv, 2023 ( ). [ ]
Paper : "LALM: Long-Term Action Anticipation with Language Models", arXiv, 2023 ( ). [ ]
Paper : "Learning from One Continuous Video Stream", arXiv, 2023 ( ). [ ]
Paper : "Object-centric Video Representation for Long-term Action Anticipation", WACV, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Video Object Segmentation

Paper : "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 ( ). [ ]
Paper : "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 ( ). [ ][ ]
Paper : "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 ( ). [ ][ ]
Paper : "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 ( ). [ ][ ][ ]
Paper : "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 ( ). [ ]
Paper : "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 ( ). [ ]
Paper : "Differentiable Soft-Masked Attention", CVPRW, 2022 ( ). [ ]
Paper : "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 ( ). [ ]
Paper : "Decoupling Features in Hierarchical Propagation for Video Object Segmentation", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "Boosting Video Object Segmentation via Space-time Correspondence Learning", CVPR, 2023 ( ). [ ]
Paper : "Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "Scalable Video Object Segmentation with Simplified Framework", ICCV, 2023 ( ). [ ]
Paper : "Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "LVOS: A Benchmark for Long-term Video Object Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation", arXiv, 2023 ( ). [ ]
Paper : "PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Putting the Object Back into Video Object Segmentation", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "M3T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking", arXiv, 2023 ( ). [ ]
Paper : "Appearance-based Refinement for Object-Centric Motion Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Depth-aware Test-Time Training for Zero-shot Video Object Segmentation", CVPR, 2024 ( ). [ ][ ][ ]
Paper : "Event-assisted Low-Light Video Object Segmentation", CVPR, 2024 ( ). [ ]
Paper : "Point-VOS: Pointing Up Video Object Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper : "Efficient Video Object Segmentation via Modulated Cross-Attention Memory", arXiv, 2024 ( ). [ ][ ]
Paper : "Spatial-Temporal Multi-level Association for Video Object Segmentation", arXiv, 2024 ( ). [ ]
Paper : "Moving Object Segmentation: All You Need Is SAM (and Flow)", arXiv, 2024 ( ). [ ][ ]
Paper : "LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation", arXiv, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Video Instance Segmentation

Paper : "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 ( ). [ ][ ]
Paper : "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 ( ). [ ][ ]
Paper : "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 ( ). [ ][ ]
Paper : "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 ( ). [ ]
Paper : "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper : "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 ( ). [ ][ ]
Paper : "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 ( ). [ ][ ]
Paper : "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 ( ). [ ]
Paper : "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 ( ). [ ][ ]
Paper : "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 ( ). [ ][ ]
Paper : "Mask-Free Video Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos", CVPR, 2023 ( ). [ ][ ]
Paper : "A Generalized Framework for Video Instance Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "CTVIS: Consistent Training for Online Video Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "TCOVIS: Temporally Consistent Online Video Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "DVIS: Decoupled Video Instance Segmentation Framework", ICCV, 2023 ( ). [ ][ ]
Paper : "TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "BoxVIS: Video Instance Segmentation with Box Annotations", arXiv, 2023 ( ). [ ][ ]
Paper : "Video Instance Segmentation in an Open-World", arXiv, 2023 ( ). [ ][ ]
Paper : "GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "RefineVIS: Video Instance Segmentation with Temporal Attention Refinement", arXiv, 2023 ( ). [ ]
Paper : "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation", arXiv, 2023 ( ). [ ][ ]
Paper : "NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper : "VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement", arXiv, 2023 ( ). [ ][ ]
Paper : "OW-VISCap: Open-World Video Instance Segmentation and Captioning", arXiv, 2024 ( ). [ ][ ]
Paper : "What is Point Supervision Worth in Video Instance Segmentation?", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Action Segmentation

Paper : "ASFormer: Transformer for Action Segmentation", BMVC, 2021 ( ). [ ][ ]
Paper : "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 ( ). [ ][ ]
Paper : "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 ( ). [ ][ ]
Paper : "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 ( ). [ ][ ]
Paper : "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 ( ). [ ]
Paper : "Enhancing Transformer Backbone for Egocentric Video Action Segmentation", CVPRW, 2023 ( ). [ ][ ]
Paper : "How Much Temporal Long-Term Context is Needed for Action Segmentation?", ICCV, 2023 ( ). [ ][ ]
Paper : "Diffusion Action Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "Temporal Segment Transformer for Action Segmentation", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video X Segmentation:

Paper : "Video Semantic Segmentation via Sparse Temporal Transformer", MM, 2021 ( ). [ ]
Paper : "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 ( ). [ ]
Paper : "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation", CVPR, 2022 ( ). [ ][ ]
Paper : "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper : "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 ( ). [ ][ ]
Paper : "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 ( ). [ ]
Paper : "Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "TarViS: A Unified Approach for Target-based Video Segmentation", CVPR, 2023 ( ). [ ][ ]
Paper : "MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation", ICCV, 2023 ( ). [ ]
Paper : "Tracking Anything with Decoupled Video Segmentation", ICCV, 2023 ( ). [ ][ ][ ]
Paper : "Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation", BMVC, 2023 ( ). [ ][ ]
Paper : "Mask Propagation for Efficient Video Semantic Segmentation", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Segment Anything Meets Point Tracking", arXiv, 2023 ( ). [ ][ ]
Paper : "Test-Time Training on Video Streams", arXiv, 2023 ( ). [ ][ ]
Paper : "UniVS: Unified and Universal Video Segmentation with Prompts as Queries", CVPR, 2024 ( ). [ ][ ][ ]
Paper : "DVIS++: Improved Decoupled Framework for Universal Video Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper : "SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising", arXiv, 2024 ( ). [ ][ ]
Paper : "OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Object Detection:

Paper : "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 ( ). [ ][ ]
Paper : "MODETR: Moving Object Detection with Transformers", arXiv, 2021 ( ). [ ]
Paper : "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 ( ). [ ]
Paper : "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 ( ). [ ]
Paper : "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 ( ). [ ][ ]
Paper : "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 ( ). [ ]
Paper : "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 ( ). [ ]
Paper : "Identity-Consistent Aggregation for Video Object Detection", ICCV, 2023 ( ). [ ][ ]
Paper : "Unsupervised Open-Vocabulary Object Localization in Videos", ICCV, 2023 ( ). [ ]
Paper : "Context Enhanced Transformer for Single Image Object Detection", AAAI, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Dense Video Tasks (Detection + Segmentation):

Paper : "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 ( ). [ ][ ]
Paper : "Feature Aggregated Queries for Transformer-Based Video Object Detectors", CVPR, 2023 ( ). [ ][ ]
Paper : "Video OWL-ViT: Temporally-consistent open-world localization in video", ICCV, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Retrieval:

Paper : "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Hashing:

Paper : "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video-Language:

Paper : "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 ( ). [ ][ ]
Paper : "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 ( ). [ ][ ][ ]
Paper : "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 ( ). [ ][ ]
Paper : "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 ( ). [ ][ ]
Paper : "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 ( ). [ ][ ]
Paper : "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 ( ). [ ]
Paper : "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 ( ). [ ]
Paper : "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 ( ). [ ][ ][ ]
Paper : "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 ( ). [ ]
Paper : "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 ( ). [ ][ ]
Paper : "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 ( ). [ ][ ]
Paper : "Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection", WACV, 2023 ( ). [ ]
Paper : "Revisiting Classifier: Transferring Vision-Language Models for Video Recognition", AAAI, 2023 ( ). [ ][ ]
Paper : "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 ( ). [ ][ ][ ]
Paper : "Fine-tuned CLIP Models are Efficient Video Learners", CVPR, 2023 ( ). [ ][ ]
Paper : "Learning Video Representations from Large Language Models", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Text-Visual Prompting for Efficient 2D Temporal Video Grounding", CVPR, 2023 ( ). [ ]
Paper : "Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting", CVPR, 2023 ( ). [ ][ ]
Paper : "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring", CVPR, 2023 ( ). [ ][ ]
Paper : "Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization", CVPR, 2023 ( ). [ ]
Paper : "Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models", CVPR, 2023 ( ). [ ][ ]
Paper : "HierVL: Learning Hierarchical Video-Language Embeddings", CVPR, 2023 ( ). [ ][ ]
Paper : "Test of Time: Instilling Video-Language Models with a Sense of Time", CVPR, 2023 ( ). [ ][ ][ ]
Paper : "Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization", ICML, 2023 ( ). [ ][ ]
Paper : "Implicit Temporal Modeling with Learnable Alignment for Video Recognition", ICCV, 2023 ( ). [ ][ ]
Paper : "Towards Open-Vocabulary Video Instance Segmentation", ICCV, 2023 ( ). [ ][ ]
Paper : "Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning", ICCV, 2023 ( ). [ ][ ]
Paper : "Generative Action Description Prompts for Skeleton-based Action Recognition", ICCV, 2023 ( ). [ ][ ]
Paper : "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", ICCV, 2023 ( ). [ ][ ]
Paper : "Language as the Medium: Multimodal Video Classification through text only", ICCVW, 2023 ( ). [ ]
Paper : "Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning", ACMMM, 2023 ( ). [ ]
Paper : "Orthogonal Temporal Interpolation for Zero-Shot Video Recognition", ACMMM, 2023 ( ). [ ][ ]
Paper : "Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Opening the Vocabulary of Egocentric Actions", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "CLIP-guided Prototype Modulating for Few-shot Action Recognition", arXiv, 2023 ( ). [ ][ ]
Paper : "Multi-modal Prompting for Low-Shot Temporal Action Localization", arXiv, 2023 ( ). [ ]
Paper : "VicTR: Video-conditioned Text Representations for Activity Recognition", arXiv, 2023 ( ). [ ]
Paper : "OpenVIS: Open-vocabulary Video Instance Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning", arXiv, 2023 ( ). [ ]
Paper : "Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features", arXiv, 2023 ( ). [ ]
Paper : "MSQNet: Actor-agnostic Action Recognition with Multi-modal Query", arXiv, 2023 ( ). [ ][ ]
Paper : "Training a Large Video Model on a Single Machine in a Day", arXiv, 2023 ( ). [ ][ ]
Paper : "Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data", arXiv, 2023 ( ). [ ][ ]
Paper : "Videoprompter: an ensemble of foundational models for zero-shot video understanding", arXiv, 2023 ( ). [ ]
Paper : "MM-VID: Advancing Video Understanding with GPT-4V(vision)", arXiv, 2023 ( ). [ ][ ]
Paper : "Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding", arXiv, 2023 ( ). [ ]
Paper : "Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning", arXiv, 2023 ( ). [ ][ ]
Paper : "Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition", arXiv, 2023 ( ). [ ]
Paper : "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning", arXiv, 2023 ( ). [ ][ ]
Paper : "Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains", arXiv, 2023 ( ). [ ][ ]
Paper : "OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition", arXiv, 2023 ( ). [ ][ ][ ]
Paper : "Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition", arXiv, 2023 ( ). [ ]
Paper : "EZ-CLIP: Efficient Zeroshot Video Action Recognition", arXiv, 2023 ( ). [ ][ ]
Paper : "M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition", AAAI, 2024 ( ). [ ]
Paper : "FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition", ICLR, 2024 ( ). [ ][ ][ ]
Paper : "Language Model Guided Interpretable Video Action Reasoning", CVPR, 2024 ( ). [ ][ ]
Paper : "Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation", arXiv, 2024 ( ). [ ][ ]
Paper : "ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition", arXiv, 2024 ( ). [ ]
Paper : "Zero Shot Open-ended Video Inference", arXiv, 2024 ( ). [ ]
Paper : "Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition", arXiv, 2024 ( ). [ ][ ]
Paper : "CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / X-supervised Learning:

Paper : "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 ( ). [ ]
Paper : "Self-supervised Video Transformer", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 ( ). [ ]
Paper : "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 ( ). [ ][ ]
Paper : "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 ( ). [ ]
Paper : "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", NeurIPS, 2022 ( ). [ ][ ]
Paper : "Masked Autoencoders As Spatiotemporal Learners", NeurIPS, 2022 ( ). [ ][ ]
Paper : "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 ( ). [ ]
Paper : "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 ( ). [ ][ ][ ]
Paper : "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos", CVPR, 2023 ( ). [ ][ ]
Paper : "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking", CVPR, 2023 ( ). [ ][ ]
Paper : "SVFormer: Semi-supervised Video Transformer for Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper : "OmniMAE: Single Model Masked Pretraining on Images and Videos", CVPR, 2023 ( ). [ ][ ]
Paper : "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", CVPR, 2023 ( ). [ ][ ]
Paper : "Masked Motion Encoding for Self-Supervised Video Representation Learning", CVPR, 2023 ( ). [ ][ ]
Paper : "MGMAE: Motion Guided Masking for Video Masked Autoencoding", ICCV, 2023 ( ). [ ]
Paper : "Motion-Guided Masking for Spatiotemporal Representation Learning", ICCV, 2023 ( ). [ ]
Paper : "Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations", ICCV, 2023 ( ). [ ][ ]
Paper : "Language-based Action Concept Spaces Improve Video Self-Supervised Learning", NeurIPS, 2023 ( ). [ ]
Paper : "Self-supervised video pretraining yields human-aligned visual representations", NeurIPS, 2023 ( ). [ ]
Paper : "Siamese Masked Autoencoders", NeurIPS, 2023 ( ). [ ][ ]
Paper : "Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders", arXiv, 2023 ( ). [ ]
Paper : "Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation", arXiv, 2023 ( ). [ ]
Paper : "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video", arXiv, 2023 ( ). [ ]
Paper : "Asymmetric Masked Distillation for Pre-Training Small Foundation Models", arXiv, 2023 ( ). [ ]
Paper : "Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation", arXiv, 2023 ( ). [ ]
Paper : "No More Shortcuts: Realizing the Potential of Temporal Self-Supervision", AAAI, 2024 ( ). [ ][ ]
Paper : "VideoMAC: Video Masked Autoencoders Meet ConvNets", CVPR, 2024 ( ). [ ]
Paper : "Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention", arXiv, 2024 ( ). [ ]
Paper : "MV2MAE: Multi-View Video Masked Autoencoders", arXiv, 2024 ( ). [ ][ ]
Paper : "Revisiting Feature Prediction for Learning Visual Representations from Video", arXiv, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Transfer Learning/Adaptation:

Paper : "Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling", FG, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / X-shot:

Paper : "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 ( ). [ ]
Paper : "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 ( ). [ ]
Paper : "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 ( ). [ ]
Paper : "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition", CVPR, 2023 ( ). [ ][ ]
Paper : "Multimodal Adaptation of CLIP for Few-Shot Action Recognition", arXiv, 2023 ( ). [ ]
Paper : "On the Importance of Spatial Relations for Few-shot Action Recognition", arXiv, 2023 ( ). [ ]
Paper : "Few-shot Action Recognition with Captioning Foundation Models", arXiv, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Multi-Task:

Paper : "A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives", CVPR, 2024 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Anomaly Detection:

Paper : "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 ( ). [ ]
Paper : "ADTR: Anomaly Detection Transformer with Feature Reconstruction", ICONIP, 2022 ( ). [ ]
Paper : "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 ( ). [ ][ ]
Paper : "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 ( ). [ ]
Paper : "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", ICIP, 2023 ( ). [ ]
Paper : "Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features", CVPR, 2023 ( ). [ ]
Paper : "Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection", CVPR, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Relation Detection:

Paper : "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 ( ). [ ][ ]
Paper : "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 ( ). [ ][ ]
Paper : "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 ( ). [ ][ ]
Paper : "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Saliency Prediction:

Paper : "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 ( ). [ ]
Paper : "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 ( ). [ ][ ]
Paper : "Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection", CVPR, 2023 ( ). [ ][ ]
Paper : "CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective", CVPR, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Inpainting Detection:

Paper : "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Driver Activity:

Paper : "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 ( ). [ ]
Paper : "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 ( ). [ ]
Paper : "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Alignment:

Paper : "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 ( ). [ ]
Paper : "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Action Counting:

Paper : "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 ( ). [ ][ ][ ]
Paper : "PoseRAC: Pose Saliency Transformer for Repetitive Action Counting", arXiv, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Action Quality Assessment:

Paper : "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 ( ). [ ]
Paper : "Action Quality Assessment using Transformers", arXiv, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Human Interaction:

Paper : "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Cross-Domain:

Paper : "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 ( ). [ ][ ]
Paper : "AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation", CVPR, 2023 ( ). [ ][ ]
Paper : "The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation", ICCV, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Multi-Camera Editing:

Paper : "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Instructional/Procedural Video:

Paper : "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations", CVPR, 2023 ( ). [ ]
Paper : "Procedure-Aware Pretraining for Instructional Video Understanding", CVPR, 2023 ( ). [ ][ ]
Paper : "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos", CVPR, 2023 ( ). [ ]
Paper : "Event-Guided Procedure Planning from Instructional Videos with Text Supervision", ICCV, 2023 ( ). [ ]
Paper : "Pretrained Language Models as Visual Planners for Human Assistance", ICCV, 2023 ( ). [ ]
Paper : "Learning to Ground Instructional Articles in Videos through Narrations", ICCV, 2023 ( ). [ ][ ]
Paper : "PREGO: online mistake detection in PRocedural EGOcentric videos", CVPR, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Continual Learning:

Paper : "PIVOT: Prompting for Video Continual Learning", CVPR, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / 3D:

Paper : "Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos", ICCV, 2023 ( ). [ ][ ]
Paper : "EPIC Fields: Marrying 3D Geometry and Video Understanding", NeurIPS, 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Audio-Video:

Paper : "Audio-Visual Glance Network for Efficient Video Recognition", ICCV, 2023 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Event Camera:

Paper : "EventTransAct: A video transformer-based framework for Event-camera based action recognition", IROS, 2023 ( ). [ ][ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Long Video:

Paper : "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding", NeurIPS, 2023 ( ). [ ][ ][ ]
Paper : "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding", arXiv, 2023 ( ). [ ]
Paper : "Text-Conditioned Resampler For Long Form Video Understanding", arXiv, 2023 ( ). [ ]
Paper : "Memory Consolidation Enables Long-Context Video Understanding", arXiv, 2024 ( ). [ ]
Paper : "VideoAgent: Long-form Video Understanding with Large Language Model as Agent", arXiv, 2024 ( ). [ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Video Story:

Paper : "Video Timeline Modeling For News Story Understanding", NeurIPS (Datasets and Benchmarks), 2023 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / Video (High-level) / Other Video Tasks / Analysis:

Paper : "Understanding Video Transformers via Universal Concept Discovery", arXiv, 2024 ( ). [ ][ ]

Ultimate-Awesome-Transformer-Attention / References / Online Resources:

Papers with Code
Transformer tutorial (Lucas Beyer)
CS25: Transformers United (Course @ Stanford)
The Annotated Transformer (Blog)
3D Vision with Transformers (GitHub)
Networks Beyond Attention (GitHub)
Practical Introduction to Transformers (GitHub)
Awesome Transformer Architecture Search (GitHub)
Transformer-in-Vision (GitHub)
Awesome Visual-Transformer (GitHub)
Awesome Transformer for Vision Resources List (GitHub)
Transformer-in-Computer-Vision (GitHub)
Transformer Tutorial (ICASSP 2022)
