This paper proposes DaViT (Dual Attention Vision Transformer), a vision transformer architecture that combines spatial and channel self-attention to capture global context efficiently. Spatial window attention models fine-grained local interactions among spatial locations, while channel attention provides a global receptive field: each channel token aggregates information from all spatial positions, so attending across channels mixes image-level features. The two mechanisms complement each other, and both scale linearly with spatial resolution, making the architecture practical for high-resolution inputs. DaViT achieves state-of-the-art performance on image classification, object detection, and semantic segmentation.
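The interplay of the two attention types can be sketched in NumPy. This is an illustrative simplification, not the official DaViT implementation: it uses a single head, omits the query/key/value projections, residual connections, and normalization, and the function names are invented for this example. It shows the key structural point: spatial attention computes a (window × window) map inside each local window, whereas channel attention computes a (C × C) map in which every channel token already summarizes all N spatial positions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_spatial_attention(x, window=4):
    """Local self-attention within non-overlapping spatial windows.

    x: (N, C) spatial tokens; N is assumed divisible by `window`.
    Cost is O(N * window * C): linear in the number of tokens N.
    """
    n, c = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]              # (window, C) local tokens
        attn = softmax(w @ w.T / np.sqrt(c))     # (window, window) local map
        out[start:start + window] = attn @ w
    return out

def channel_attention(x):
    """Global attention over channel tokens.

    Transposing x makes each of the C rows a "channel token" of length N,
    i.e. a view over every spatial position, so the (C, C) attention map
    mixes globally pooled information. Cost is O(N * C^2): linear in N.
    """
    n, c = x.shape
    t = x.T                                      # (C, N) channel tokens
    attn = softmax(t @ t.T / np.sqrt(c))         # (C, C) global map
    return (attn @ t).T                          # back to (N, C)

# One "dual attention" step: local spatial mixing, then global channel mixing.
x = np.random.default_rng(0).normal(size=(16, 8))   # 16 tokens, 8 channels
y = channel_attention(window_spatial_attention(x))
print(y.shape)
```

Because neither map is (N × N), doubling the number of spatial tokens roughly doubles the cost of both stages, which is the linear-complexity property claimed above.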