15. Attention
• Attention is computed from three components: "Query", "Key", and "Value".
• With the queries Q ∈ ℝ^(n×d_q), keys K ∈ ℝ^(n_v×d_q), and values V ∈ ℝ^(n_v×d_v), it can be written as

  Att(Q, K, V) = softmax(QK^T / √d_q) V

  (Dot-Product Attention).
• This formula can be read as "compute the similarity between each query vector and the key vectors, then apply the normalized weights to the value vectors to retrieve information"; the output is a weighted sum of the rows of V.
[9] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017. Figure adapted from this work, with modifications.
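The dot-product attention formula above can be sketched in a few lines of NumPy. This is an illustrative toy, not code from the paper; the shapes follow the definitions on this slide (n queries, n_v key/value pairs).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Dot-product attention: softmax(Q K^T / sqrt(d_q)) V."""
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)     # (n, n_v): similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row is a distribution over the n_v values
    return weights @ V                  # weighted sum of the rows of V

rng = np.random.default_rng(0)
n, n_v, d_q, d_v = 4, 6, 8, 5
Q = rng.standard_normal((n, d_q))
K = rng.standard_normal((n_v, d_q))
V = rng.standard_normal((n_v, d_v))
out = attention(Q, K, V)
print(out.shape)  # (4, 5): one weighted sum of V's rows per query
```

Note that each output row is a convex combination of the rows of V, since the softmax weights are non-negative and sum to 1.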
16. Multi-head Attention
• Just as CNN feature maps have multiple channels, Attention can run several parallel channels (called heads) to increase expressive power.
[9] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017. Figure adapted from this work.
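The multi-head idea can be sketched as follows: project the input into h smaller subspaces, attend in each, then concatenate and project back. This is a minimal self-attention illustration with random (untrained) projection matrices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single-head dot-product attention, as on the previous slide.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V

def multi_head_self_attention(X, h, d_model):
    """Run h attention heads in parallel on X, concatenate, and project."""
    d_head = d_model // h
    heads = []
    for _ in range(h):
        # Per-head projections (learned in practice; random here for illustration).
        W_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))  # (n, d_head)
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    # Concatenate the h heads along the feature axis and mix them.
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)

X = rng.standard_normal((10, 64))  # 10 tokens, model dimension 64
out = multi_head_self_attention(X, h=8, d_model=64)
print(out.shape)  # (10, 64)
```

Each head sees only a d_model/h-dimensional projection, so the total cost is comparable to a single full-width head while allowing the heads to attend to different patterns.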
37. Conclusion
• This work proposes an HPE method based on a simple ViT and argues that it can serve as a baseline for future research.
• Despite using no complex modules, combining several techniques achieves state-of-the-art accuracy on the MS COCO dataset.
• The authors note that there is room for improvement, e.g. in the decoder architecture.
• Beyond Human Pose Estimation, applications such as Animal Pose Estimation and Face Keypoint Detection are also promising.
38. References
1. Xu, Yufei, et al. "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation." arXiv preprint arXiv:2204.12484 (2022).
2. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An
image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations,
2020.
3. K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.
4. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in
context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
5. Cao, Zhe, et al. “OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields.” IEEE transactions on pattern analysis
and machine intelligence 43.1 (2019).
6. Dang, Qi, et al. "Deep learning based 2D human pose estimation: A survey." Tsinghua Science and Technology 24.6 (2019): 663-676.
7. Senior, Andrew W., et al. "Improved protein structure prediction using potentials from deep learning." Nature 577.7792 (2020): 706-710.
8. https://www.slideshare.net/DeepLearningJP2016/dltransformer-vit-perceiver-frozen-pretrained-transformer-etc
9. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.
10. Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE international conference on computer vision. 2015.
39. References (continued)
11. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
12. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
13. J. Wu, H. Zheng, B. Zhao, Y. Li, B. Yan, R. Liang, W. Wang, S. Zhou, G. Lin, Y. Fu, et al. Ai challenger: A large-scale dataset for going
deeper in image understanding. arXiv preprint arXiv:1711.06475, 2017.
14. Y. Xu, Q. Zhang, J. Zhang, and D. Tao. Vitae: Vision transformer advanced by exploring intrinsic inductive bias. Advances in Neural
Information Processing Systems, 34, 2021.
15. L. Cai, Z. Zhang, Y. Zhu, L. Zhang, M. Li, and X. Xue. Bigdetection: A large-scale benchmark for improved object detector pre-training.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4777–4787, 2022.