Awesome-Visual-Transformer

CV transformer papers

Collects and curates papers on transformer-based computer vision research

A collection of papers on transformers for vision: Awesome Transformer with Computer Vision (CV).

Topics: detr, transformer, transformer-awesome, transformer-cv, transformer-with-cv, visual-transformer

Awesome Visual-Transformer / Papers / Transformer original paper

Attention is All You Need (NIPS 2017)

Awesome Visual-Transformer / Papers / Technical blog

Link [English Blog] Transformers in Vision [ ]
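Link [Chinese Blog] A 30,000-character long read for an easy introduction to vision transformers [ ]
Link [Chinese Blog] Vision Transformer explained in great detail (principle analysis + code walkthrough) [ ]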

Awesome Visual-Transformer / Papers / Survey

paper Multimodal learning with transformers: A survey (IEEE TPAMI) [ ] - 2023.05.11
paper A Survey of Visual Transformers [ ] - 2021.11.30
paper Transformers in Vision: A Survey [ ] - 2021.02.22
paper A Survey on Visual Transformer [ ] - 2021.01.30
paper A Survey of Transformers [ ] - 2020.06.09

Awesome Visual-Transformer / Papers / arXiv papers

paper Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [ ]
paper Focused Decoding Enables 3D Anatomical Detection by Transformers [ ] [ ]
paper TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [ ] [ ]
paper Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [ ] [ ]
paper BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [ ] [ ]
paper RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
paper Improved Multiscale Vision Transformers for Classification and Detection [ ] [ ]
paper DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection [ ] [ ]
paper Three things everyone should know about Vision Transformers [ ]
paper DeiT III: Revenge of the ViT [ ]
paper DaViT: Dual Attention Vision Transformers [ ] [ ]
paper Collaborative Transformers for Grounded Situation Recognition [ ] [ ]
paper Grounded Situation Recognition with Transformers [ ] [ ]
paper MaxViT: Multi-Axis Vision Transformer
paper V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
paper Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder [ ] [ ]
paper Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [ ] [ ]
paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [ ] [ ]
paper PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [ ]
paper ResViT: Residual vision transformers for multi-modal medical image synthesis [ ]
paper Combining EfficientNet and Vision Transformers for Video Deepfake Detection [ ] [ ]
paper Discrete Representations Strengthen Vision Transformer Robustness [ ]
paper StyleSwin: Transformer-based GAN for High-resolution Image Generation [ ] [ ]
paper Sliced Recursive Transformer [ ] [ ]
paper Dynamic Token Normalization Improves Vision Transformer [ ]
paper TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [ ] [ ]
paper Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [ ]
paper Object-Region Video Transformers [ ] [ ]
paper Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [ ] [ ]
paper NViT: Vision Transformer Compression and Parameter Redistribution [ ]
paper 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning [ ]
paper Adversarial Token Attacks on Vision Transformers [ ]
paper Contextual Transformer Networks for Visual Recognition [ ] [ ]
paper TranSalNet: Visual saliency prediction using transformers [ ]
paper MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [ ]
paper A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [ ]
paper 3D-Transformer: Molecular Representation with Transformer in 3D Space [ ]
paper CCTrans: Simplifying and Improving Crowd Counting with Transformer [ ]
paper UFO-ViT: High Performance Linear Vision Transformer without Softmax [ ]
paper Sparse Spatial Transformers for Few-Shot Learning [ ]
paper Vision Transformer Hashing for Image Retrieval [ ]
paper OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification [ ]
paper Pix2seq: A Language Modeling Framework for Object Detection [ ]
paper CoAtNet: Marrying Convolution and Attention for All Data Sizes [ ]
paper LOTR: Face Landmark Localization Using Localization Transformer [ ]
paper Transformer-Unet: Raw Image Processing with Unet [ ]
paper GraFormer: Graph Convolution Transformer for 3D Pose Estimation [ ]
paper CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [ ]
paper PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds [ ] [ ]
paper Anchor DETR: Query Design for Transformer-Based Detector [ ] [ ]
paper DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [ ] [ ]
paper Efficient Transformer for Single Image Super-Resolution [ ]
paper MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation [ ] [ ]
paper SwinIR: Image Restoration Using Swin Transformer [ ] [ ]
paper Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance [ ]
paper Do Vision Transformers See Like Convolutional Neural Networks? [ ]
paper Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [ ]
paper Light Field Image Super-Resolution with Transformers [ ] [ ]
paper Focal Self-attention for Local-Global Interactions in Vision Transformers [ ] [ ]
paper Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers [ ] [ ]
paper Mobile-Former: Bridging MobileNet and Transformer [ ]
paper TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [ ]
paper PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [ ]
paper Boosting Few-shot Semantic Segmentation with Transformers [ ] [ ]
paper Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [ ]
paper Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [ ]
paper Styleformer: Transformer based Generative Adversarial Networks with Style Vector [ ] [ ]
paper CMT: Convolutional Neural Networks Meet Vision Transformers [ ]
paper TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [ ]
paper TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation [ ]
paper ViTGAN: Training GANs with Vision Transformers [ ]
paper What Makes for Hierarchical Vision Transformer? [ ]
paper Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World [ ]
paper Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [ ]
paper TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [ ]
paper Escaping the Big Data Paradigm with Compact Transformers [ ]
paper How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [ ]
paper Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [ ]
paper XCiT: Cross-Covariance Image Transformers [ ] [ ]
paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [ ] [ ]
paper Video Swin Transformer [ ] [ ]
paper VOLO: Vision Outlooker for Visual Recognition [ ] [ ]
paper Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [ ]
paper End-to-end Temporal Action Detection with Transformer [ ] [ ]
paper Efficient Self-supervised Vision Transformers for Representation Learning [ ]
paper Space-time Mixing Attention for Video Transformer [ ]
paper Transformed CNNs: recasting pre-trained convolutional layers with self-attention [ ]
paper CAT: Cross Attention in Vision Transformer [ ]
paper Scaling Vision Transformers [ ]
paper DETReg: Unsupervised Pretraining with Region Priors for Object Detection [ ] [ ]
paper Chasing Sparsity in Vision Transformers: An End-to-End Exploration [ ]
paper MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [ ]
paper Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [ ]
paper On Improving Adversarial Transferability of Vision Transformers [ ]
paper Fully Transformer Networks for Semantic Image Segmentation [ ]
paper Visual Transformer for Task-aware Active Learning [ ] [ ]
paper Efficient Training of Visual Transformers with Small-Size Datasets [ ]
paper Reveal of Vision Transformers Robustness against Adversarial Attacks [ ]
paper Person Re-Identification with a Locally Aware Transformer [ ]
paper Refiner: Refining Self-attention for Vision Transformers [ ]
paper ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [ ]
paper Video Instance Segmentation using Inter-Frame Communication Transformers [ ]
paper Transformer in Convolutional Neural Networks [ ] [ ]
paper Uformer: A General U-Shaped Transformer for Image Restoration [ ] [ ]
paper Patch Slimming for Efficient Vision Transformers [ ]
paper RegionViT: Regional-to-Local Attention for Vision Transformers [ ]
paper Associating Objects with Transformers for Video Object Segmentation [ ] [ ]
paper Few-Shot Segmentation via Cycle-Consistent Transformer [ ]
paper Glance-and-Gaze Vision Transformer [ ] [ ]
paper Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [ ]
paper DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [ ] [ ]
paper When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [ ] [ ]
paper Unsupervised Out-of-Domain Detection via Pre-trained Transformers [ ]
paper TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [ ]
paper TransVOS: Video Object Segmentation with Transformers [ ]
paper KVT: k-NN Attention for Boosting Vision Transformers [ ]
paper MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [ ] [ ]
paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [ ] [ ]
paper SDNet: multi-branch for single image deraining using swin [ ] [ ]
paper Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [ ]
paper Gaze Estimation using Transformer [ ] [ ]
paper Transformer-Based Deep Image Matching for Generalizable Person Re-identification [ ]
paper Less is More: Pay Less Attention in Vision Transformers [ ]
paper FoveaTer: Foveated Transformer for Image Classification [ ]
paper Transformer-Based Source-Free Domain Adaptation [ ] [ ]
paper An Attention Free Transformer [ ]
paper PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [ ]
paper ResT: An Efficient Transformer for Visual Recognition [ ] [ ]
paper CogView: Mastering Text-to-Image Generation via Transformers [ ]
paper Aggregating Nested Transformers [ ]
paper Temporal Action Proposal Generation with Transformers [ ]
paper Boosting Crowd Counting with Transformers [ ]
paper COTR: Convolution in Transformer Network for End to End Polyp Detection [ ]
paper End-to-End Video Object Detection with Spatial-Temporal Transformers [ ] [ ]
paper Intriguing Properties of Vision Transformers [ ] [ ]
paper Combining Transformer Generators with Convolutional Discriminators [ ]
paper Rethinking the Design Principles of Robust Vision Transformer [ ]
paper Vision Transformers are Robust Learners [ ] [ ]
paper Manipulation Detection in Satellite Images Using Vision Transformer [ ]
paper Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [ ] [ ]
paper Self-Supervised Learning with Swin Transformers [ ] [ ]
paper SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [ ]
paper RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [ ]
paper Visual Grounding with Transformers [ ]
paper Visual Composite Set Detection Using Part-and-Sum Transformers [ ]
paper TrTr: Visual Tracking with Transformer [ ] [ ]
paper MOTR: End-to-End Multiple-Object Tracking with TRansformer [ ] [ ]
paper Attention for Image Registration (AiR): an unsupervised Transformer approach [ ]
paper TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [ ]
paper ISTR: End-to-End Instance Segmentation with Transformers [ ] [ ]
paper CAT: Cross-Attention Transformer for One-Shot Object Detection [ ]
paper CoSformer: Detecting Co-Salient Object with Transformers [ ]
paper End-to-End Attention-based Image Captioning [ ]
paper Pyramid Medical Transformer for Medical Image Segmentation [ ]
paper HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [ ]
paper GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [ ]
paper Emerging Properties in Self-Supervised Vision Transformers [ ]
paper Inpainting Transformer for Anomaly Detection [ ]
paper Twins: Revisiting Spatial Attention Design in Vision Transformers [ ] [ ]
paper Point Cloud Learning with Transformer [ ]
paper Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [ ]
paper ConTNet: Why not use convolution and transformer at the same time? [ ] [ ]
paper Dual Transformer for Point Cloud Analysis [ ]
paper Improve Vision Transformers Training by Suppressing Over-smoothing [ ] [ ]
paper Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [ ]
paper M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [ ] [ ]
paper Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [ ]
paper Learning to Cluster Faces via Transformer [ ]
paper Multiscale Vision Transformers [ ] [ ]
paper VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [ ]
paper So-ViT: Mind Visual Tokens for Vision Transformer [ ] [ ]
paper Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [ ] [ ]
paper TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [ ]
paper VideoGPT: Video Generation using VQ-VAE and Transformers [ ]
paper M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [ ]
paper Transformer Transforms Salient Object Detection and Camouflaged Object Detection [ ]
paper TransCrowd: Weakly-Supervised Crowd Counting with Transformer [ ] [ ]
paper Visual Transformer Pruning [ ]
paper Self-supervised Video Retrieval Transformer Network [ ]
paper Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [ ]
paper TransGAN: Two Transformers Can Make One Strong GAN [ ] [ ]
paper Geometry-Free View Synthesis: Transformers and no 3D Priors [ ] [ ]
paper Co-Scale Conv-Attentional Image Transformers [ ] [ ]
paper LocalViT: Bringing Locality to Vision Transformers [ ] [ ]
paper Cloth Interactive Transformer for Virtual Try-On [ ] [ ]
paper Handwriting Transformers [ ]
paper SiT: Self-supervised vIsion Transformer [ ] [ ]
paper On the Robustness of Vision Transformers to Adversarial Examples [ ]
paper An Empirical Study of Training Self-Supervised Visual Transformers [ ]
paper A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [ ]
paper Aggregated Contextual Transformations for High-Resolution Image Inpainting [ ] [ ]
paper Deepfake Detection Scheme Based on Vision Transformer and Distillation [ ]
paper Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [ ]
paper TubeR: Tube-Transformer for Action Detection [ ]
paper AAformer: Auto-Aligned Transformer for Person Re-Identification [ ]
paper TFill: Image Completion via a Transformer-Based Architecture [ ]
paper Group-Free 3D Object Detection via Transformers [ ] [ ]
paper Spatial-Temporal Graph Transformer for Multiple Object Tracking [ ]
paper Going deeper with Image Transformers [ ]
paper Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [ ]
paper DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [ ]
paper Robust Facial Expression Recognition with Convolutional Visual Transformers [ ]
paper Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [ ]
paper Spatiotemporal Transformer for Video-based Person Re-identification [ ]
paper TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [ ] [ ]
paper CvT: Introducing Convolutions to Vision Transformers [ ] [ ]
paper TFPose: Direct Human Pose Estimation with Transformers [ ]
paper TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [ ]
paper Face Transformer for Recognition [ ]
paper On the Adversarial Robustness of Visual Transformers [ ]
paper Understanding Robustness of Transformers for Image Classification [ ]
paper Lifting Transformer for 3D Human Pose Estimation in Video [ ]
paper Global Self-Attention Networks for Image Recognition [ ]
paper High-Fidelity Pluralistic Image Completion with Transformers [ ] [ ]
paper Vision Transformers for Dense Prediction [ ] [ ]
paper TransFG: A Transformer Architecture for Fine-grained Recognition [ ]
paper Is Space-Time Attention All You Need for Video Understanding? [ ]
paper Multi-view 3D Reconstruction with Transformer [ ]
paper Can Vision Transformers Learn without Natural Images? [ ] [ ]
paper End-to-End Trainable Multi-Instance Pose Estimation with Transformers [ ]
paper Instance-level Image Retrieval using Reranking Transformers [ ] [ ]
paper BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [ ] [ ]
paper Incorporating Convolution Designs into Visual Transformers [ ]
paper DeepViT: Towards Deeper Vision Transformer [ ]
paper Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [ ]
paper 3D Human Pose Estimation with Spatial and Temporal Transformers [ ] [ ]
paper UNETR: Transformers for 3D Medical Image Segmentation [ ]
paper Scalable Visual Transformers with Hierarchical Pooling [ ]
paper ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [ ]
paper TransMed: Transformers Advance Multi-modal Medical Image Classification [ ]
paper U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [ ]
paper SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [ ] [ ]
paper TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [ ] [ ]
paper SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [ ]
paper Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [ ] [ ]
paper Do We Really Need Explicit Position Encodings for Vision Transformers? [ ] [ ]
paper Deepfake Video Detection Using Convolutional Vision Transformer [ ]
paper Training Vision Transformers for Image Retrieval [ ]
paper Video Transformer Network [ ]
paper Bottleneck Transformers for Visual Recognition [ ]
paper CPTR: Full Transformer Network for Image Captioning [ ]
paper Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [ ] [ ]
paper Segmenting Transparent Object in the Wild with Transformer [ ] [ ]
paper Investigating the Vision Transformer Model for Image Retrieval Tasks [ ]
paper Trear: Transformer-based RGB-D Egocentric Action Recognition [ ]
paper VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [ ]
paper TrackFormer: Multi-Object Tracking with Transformers [ ]
paper Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [ ]
paper Transformer for Image Quality Assessment [ ] [ ]
paper TransTrack: Multiple-Object Tracking with Transformer [ ] [ ]
paper Training data-efficient image transformers & distillation through attention [ ] [ ]
paper 3D Object Detection with Pointformer [ ]
paper Toward Transformer-Based Object Detection [ ]
paper Taming Transformers for High-Resolution Image Synthesis [ ] [ ]
paper SceneFormer: Indoor Scene Generation with Transformers [ ]
paper PCT: Point Cloud Transformer [ ]
paper DETR for Pedestrian Detection [ ]
paper General Multi-label Image Classification with Transformers [ ]

Awesome Visual-Transformer / Papers / 2022

paper P2T: Pyramid Pooling Transformer for Scene Understanding [ ]
paper Expanding Language-Image Pretrained Models for General Video Recognition [ ] [ ]
paper TinyViT: Fast Pretraining Distillation for Small Vision Transformers [ ] [ ]
paper Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [ ] [ ]
paper AiATrack: Attention in Attention for Transformer Visual Tracking [ ] [ ]
paper Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [ ] [ ]
paper Towards Grand Unification of Object Tracking [ ] [ ]
paper Tracking Objects as Pixel-wise Distributions [ ] [ ]
paper Masked Autoencoders Are Scalable Vision Learners [ ]
paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [ ] [ ]
paper Fast Point Transformer [ ]
paper EDTER: Edge Detection With Transformer [ ] [ ]
paper Bridged Transformer for Vision and Point Cloud 3D Object Detection [ ]
paper MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution [ ]
paper HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening [ ] [ ]
paper Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation [ ]
paper MPViT: Multi-Path Vision Transformer for Dense Prediction [ ]
paper A-ViT: Adaptive Tokens for Efficient Vision Transformer [ ]
paper TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [ ] [ ]
paper Continual Learning With Lifelong Vision Transformer [ ]
paper Swin Transformer V2: Scaling Up Capacity and Resolution [ ]
paper Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds [ ] [ ]
paper Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [ ]
paper Human-Object Interaction Detection via Disentangled Transformer [ ]
paper LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [ ]
paper Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [ ]
paper Vision Transformer With Deformable Attention [ ]
paper DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [ ]
paper Restormer: Efficient Transformer for High-Resolution Image Restoration [ ] [ ]
paper Accelerating DETR Convergence via Semantic-Aligned Matching [ ] [ ]
paper BEVT: BERT Pretraining of Video Transformers [ ] [ ]
paper Mobile-Former: Bridging MobileNet and Transformer [ ]
paper Spatio-temporal Relation Modeling for Few-shot Action Recognition [ ] [ ]
paper MiniViT: Compressing Vision Transformers with Weight Multiplexing [ ] [ ]
paper Collaborative Transformers for Grounded Situation Recognition [ ] [ ]
paper Beyond Fixation: Dynamic Window Visual Transformer [ ] [ ]
paper Multimodal Token Fusion for Vision Transformers [ ]
paper Convolutional Neural Networks Meet Vision Transformers [ ]
paper Fine-tuning Image Transformers using Learnable Memory [ ]
paper Attend to Mix for Vision Transformers [ ] [ ]
paper Nominate Synergistic Context in Vision Transformer for Visual Recognition [ ] [ ]
paper Shunted Self-Attention via Multi-Scale Token Aggregation [ ] [ ]
paper Towards Robust Vision Transformer [ ]
paper Lite Vision Transformer with Enhanced Self-Attention [ ]
paper StyTr2: Image Style Transfer with Transformers [ ] [ ]
paper Image-Adaptive Hint Generation via Vision Transformer for Outpainting [ ] [ ]

Awesome Visual-Transformer / Papers / 2021

paper ProTo: Program-Guided Transformer for Program-Guided Tasks [ ] [ ]
paper Augmented Shortcuts for Vision Transformers [ ] [ ]
paper You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [ ] [ ]
paper Semantic Correspondence with Transformers [ ] [ ]
paper QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [ ] [ ]
paper Dual-stream Network for Visual Recognition [ ] [ ]
paper Container: Context Aggregation Network [ ] [ ]
paper Transformer in Transformer [ ] [ ]
paper T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [ ]
paper Long Short-Term Transformer for Online Action Detection [ ]
paper TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [ ]
paper TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification [ ]
paper TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [ ]
paper Associating Objects with Transformers for Video Object Segmentation [ ]
paper Test-Time Personalization with a Transformer for Human Pose Estimation [ ]
paper Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning [ ]
paper Dynamic Grained Encoder for Vision Transformers [ ]
paper HRFormer: High-Resolution Vision Transformer for Dense Prediction [ ]
paper Searching the Search Space of Vision Transformer [ ]
paper Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [ ]
paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [ ]
paper Do Vision Transformers See Like Convolutional Neural Networks? [ ]
paper Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [ ]
paper Glance-and-Gaze Vision Transformer [ ]
paper MST: Masked Self-Supervised Transformer for Visual Representation [ ]
paper DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [ ]
paper TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [ ]
paper Augmented Shortcuts for Vision Transformers [ ]
paper Improved Transformer for High-Resolution GANs [ ]
paper All Tokens Matter: Token Labeling for Training Better Vision Transformers [ ]
paper XCiT: Cross-Covariance Image Transformers [ ]
paper Efficient Training of Visual Transformers with Small Datasets [ ]
paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [ ] [ ]
paper High-Fidelity Pluralistic Image Completion with Transformers [ ] [ ]
paper PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers [ ] [ ]
paper Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [ ] [ ]
paper Rethinking Transformer-based Set Prediction for Object Detection [ ]
paper Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [ ]
paper 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [ ]
paper Training Vision Transformers from Scratch on ImageNet [ ] [ ]
paper THUNDR: Transformer-Based 3D Human Reconstruction With Markers [ ]
paper Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [ ]
paper Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [ ] [ ]
paper Spatial-Temporal Transformer for Dynamic Scene Graph Generation [ ]
paper GLiT: Neural Architecture Search for Global and Local Image Transformer [ ]
paper TRAR: Routing the Attention Spans in Transformer for Visual Question Answering [ ]
paper UniT: Multimodal Multitask Learning With a Unified Transformer [ ] [ ]
paper Stochastic Transformer Networks With Linear Competing Units: Application To End-to-End SL Translation [ ]
paper Transformer-Based Dual Relation Graph for Multi-Label Image Recognition [ ]
paper LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [ ]
paper Improving 3D Object Detection With Channel-Wise Transformer [ ]
paper A Latent Transformer for Disentangled Face Editing in Images and Videos [ ] [ ]
paper GroupFormer: Group Activity Recognition With Clustered Spatial-Temporal Transformer [ ]
paper Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [ ]
paper WB-DETR: Transformer-Based Detector Without Backbone [ ]
paper The Animation Transformer: Visual Correspondence via Segment Matching [ ]
paper Relaxed Transformer Decoders for Direct Action Proposal Generation [ ]
paper Pyramid Point Cloud Transformer for Large-Scale Place Recognition [ ] [ ]
paper Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images [ ]
paper Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection [ ]
paper Image Harmonization With Transformer [ ] [ ]
paper COTR: Correspondence Transformer for Matching Across Images [ ]
paper MUSIQ: Multi-Scale Image Quality Transformer [ ]
paper Episodic Transformer for Vision-and-Language Navigation [ ]
paper Action-Conditioned 3D Human Motion Synthesis With Transformer VAE [ ]
paper CrackFormer: Transformer Network for Fine-Grained Crack Detection [ ]
paper HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval [ ]
paper Event-Based Video Reconstruction Using Transformer [ ]
paper STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding [ ]
paper HiFT: Hierarchical Feature Transformer for Aerial Tracking [ ] [ ]
paper DocFormer: End-to-End Transformer for Document Understanding [ ]
paper LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [ ] [ ]
paper SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [ ]
paper VidTr: Video Transformer Without Convolutions [ ]
paper Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [ ]
paper Segmenter: Transformer for Semantic Segmentation [ ] [ ]
paper Visformer: The Vision-friendly Transformer [ ] [ ]
paper PnP-DETR: Towards Efficient Visual Analysis with Transformers [ ] [ ]
paper Voxel Transformer for 3D Object Detection [ ] [ ]
paper TransVG: End-to-End Visual Grounding with Transformers [ ]
paper An End-to-End Transformer Model for 3D Object Detection [ ] [ ]
paper Eformer: Edge Enhancement based Transformer for Medical Image Denoising [ ]
paper TransFER: Learning Relation-aware Facial Expression Representations with Transformers [ ]
paper Oriented Object Detection with Transformer [ ]
paper ViViT: A Video Vision Transformer [ ]
paper Learning Spatio-Temporal Transformer for Visual Tracking [ ] [ ]
paper Improving 3D Object Detection with Channel-wise Transformer [ ]
paper Visual Saliency Transformer [ ]
paper Rethinking Spatial Dimensions of Vision Transformers [ ] [ ]
paper CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [ ] [ ]
paper Point Transformer [ ]
paper TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [ ] [ ]
paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision [ ]
paper Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction [ ] [ ]
paper Conditional DETR for Fast Training Convergence [ ] [ ]
paper PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [ ] [ ]
paper SOTR: Segmenting Objects with Transformers [ ] [ ]
paper SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer [ ] [ ]
paper TransPose: Keypoint Localization via Transformer [ ] [ ]
paper TransReID: Transformer-based Object Re-Identification [ ] [ ]
paper Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer [ ] [ ]
paper Anticipative Video Transformer [ ] [ ]
paper Rethinking and Improving Relative Position Encoding for Vision Transformer [ ] [ ]
paper Vision Transformer with Progressive Sampling [ ] [ ]
paper Fast Convergence of DETR with Spatially Modulated Co-Attention [ ] [ ]
paper AutoFormer: Searching Transformers for Visual Recognition [ ] [ ]
paper Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer [ ]
paper HOTR: End-to-End Human-Object Interaction Detection with Transformers ( ) [ ]
paper End-to-End Human Pose and Mesh Reconstruction with Transformers [ ]
paper Line Segment Detection Using Transformers without Edges [ ]
paper Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [ ] [ ]
paper Pose Recognition with Cascade Transformers [ ]
paper Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [ ]
paper LoFTR: Detector-Free Local Feature Matching with Transformers [ ] [ ]
paper Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [ ]
paper Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [ ] [ ]
paper Transformer Tracking [ ] [ ]
paper Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (oral) [ ]
paper End-to-End Video Instance Segmentation with Transformers [ ]
paper Transformer Interpretability Beyond Attention Visualization [ ] [ ]
paper Pre-Trained Image Processing Transformer [ ]
paper UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [ ]
paper Perceptual Image Quality Assessment with Transformers ( ) [ ]
paper High-Resolution Complex Scene Synthesis with Transformers ( ) [ ]
paper Collaborative Transformers for Grounded Situation Recognition [ ] [ ]
paper Generative Video Transformer: Can Objects be the Words? [ ]
paper Generative Adversarial Transformers [ ] [ ]
paper NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation [ ]
paper VTNet: Visual Transformer Network for Object Goal Navigation [ ]
paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ ] [ ]
paper Deformable DETR: Deformable Transformers for End-to-End Object Detection [ ] [ ]
paper Modeling Long-Range Interactions Without Attention [ ] [ ]
paper Video Transformer for Deepfake Detection with Incremental Learning [ ]
paper HAT: Hierarchical Aggregation Transformers for Person Re-identification [ ]
paper Token Shift Transformer for Video Classification [ ] [ ]
paper DPT: Deformable Patch-based Transformer for Visual Recognition [ ] [ ]
paper UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [ ] [ ]
paper Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [ ] [ ]
paper Multi-Compound Transformer for Accurate Biomedical Image Segmentation [ ] [ ]
paper Progressively Normalized Self-Attention Network for Video Polyp Segmentation [ ] [ ]
paper A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation [ ]
paper End-to-End Object Detection with Adaptive Clustering Transformer [ ]
paper Grounded Situation Recognition with Transformers [ ] [ ]
paper TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [ ] [ ]
paper VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization ( ) [ ]
paper DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [ ]
paper Medical Image Segmentation using Squeeze-and-Expansion Transformers [ ]
paper You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module ( ) [ ] [ ]
paper PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds [ ] [ ]
paper End-to-end Lane Shape Prediction with Transformers [ ] [ ]
paper Vision Transformer for Fast and Efficient Scene Text Recognition [ ]

Awesome Visual-Transformer / Papers / 2020

paper End-to-End Object Detection with Transformers ( ) [ ] [ ]
paper Feature Pyramid Transformer ( ) [ ] [ ] [ ]

Awesome Visual-Transformer / Papers / Other resource

Awesome-Transformer-Attention 4,651 4 months ago [ ]
