This paper proposes DaViT (Dual Attention Vision Transformer), a vision transformer architecture that combines spatial and channel self-attention to capture global context efficiently. Spatial window attention models fine-grained local interactions among spatial locations, while channel attention provides a global receptive field: each channel token aggregates information from all spatial positions, so attending across channels mixes image-level features. The two mechanisms complement each other, and both scale linearly with spatial resolution, making the architecture practical for high-resolution inputs. DaViT achieves state-of-the-art performance on image classification, object detection, and semantic segmentation.
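The interplay of the two attention types can be sketched in NumPy. This is an illustrative simplification, not the official DaViT implementation: it uses a single head, omits the query/key/value projections, residual connections, and normalization, and the function names are invented for this example. It shows the key structural point: spatial attention computes a (window × window) map inside each local window, whereas channel attention computes a (C × C) map in which every channel token already summarizes all N spatial positions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_spatial_attention(x, window=4):
    """Local self-attention within non-overlapping spatial windows.

    x: (N, C) spatial tokens; N is assumed divisible by `window`.
    Cost is O(N * window * C): linear in the number of tokens N.
    """
    n, c = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]              # (window, C) local tokens
        attn = softmax(w @ w.T / np.sqrt(c))     # (window, window) local map
        out[start:start + window] = attn @ w
    return out

def channel_attention(x):
    """Global attention over channel tokens.

    Transposing x makes each of the C rows a "channel token" of length N,
    i.e. a view over every spatial position, so the (C, C) attention map
    mixes globally pooled information. Cost is O(N * C^2): linear in N.
    """
    n, c = x.shape
    t = x.T                                      # (C, N) channel tokens
    attn = softmax(t @ t.T / np.sqrt(c))         # (C, C) global map
    return (attn @ t).T                          # back to (N, C)

# One "dual attention" step: local spatial mixing, then global channel mixing.
x = np.random.default_rng(0).normal(size=(16, 8))   # 16 tokens, 8 channels
y = channel_attention(window_spatial_attention(x))
print(y.shape)
```

Because neither map is (N × N), doubling the number of spatial tokens roughly doubles the cost of both stages, which is the linear-complexity property claimed above.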