Token Shift Transformer for Video Classification
Hao Zhang, Yanbin Hao, Chong-Wah Ngo, ACMMM2021
橋口凌大 (Tamaki Lab, Nagoya Institute of Technology)
English paper reading, 2021/12/16
Overview
■ Proposes the Token Shift module (TokShift)
• Integrates spatio-temporal features via a shift operation
■ The shift operation adds zero parameters and zero FLOPs
• Extends a 2D model into a 3D (video) model
■ Investigates recognition accuracy in comparison with existing 3D methods
• Kinetics-400 [Carreira & Zisserman, CVPR2017]
• UCF101 [Soomro+, 2012]
• EGTEA Gaze+ [Li+, ECCV2018]
Related work
■ Extending 2D models into 3D models
• X3D [Feichtenhofer, CVPR2020]
■ Action recognition with 2D models
• TSM [Lin+, ICCV2019]
■ This paper extends Vision Transformer (ViT) [Dosovitskiy+, ICLR2021] with a TSM-like shift approach
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
Facebook AI Research (FAIR)
Abstract
This paper presents X3D, a family of efficient video net-
works that progressively expand a tiny 2D image classifi-
cation architecture along multiple network axes, in space,
time, width and depth. Inspired by feature selection methods
in machine learning, a simple stepwise network expansion
approach is employed that expands a single axis in each step,
such that good accuracy to complexity trade-off is achieved.
To expand X3D to a specific target complexity, we perform
progressive forward expansion followed by backward con-
traction. X3D achieves state-of-the-art performance while
requiring 4.8× and 5.5× fewer multiply-adds and parame-
ters for similar accuracy as previous work. Our most surpris-
ing finding is that networks with high spatiotemporal resolu-
tion can perform well, while being extremely light in terms of
network width and parameters. We report competitive accu-
racy at unprecedented efficiency on video classification and
detection benchmarks. Code will be available at: https:
//github.com/facebookresearch/SlowFast.
1. Introduction
Neural networks for video recognition have been largely
driven by expanding 2D image architectures [29,47,64,71]
into spacetime. Naturally, these expansions often happen
along the temporal axis, involving extending the network
inputs, features, and/or filter kernels into spacetime (e.g.
[7, 13, 17, 42, 56, 75]); other design decisions—including
depth (number of layers), width (number of channels), and
[Figure 1 graphic: input frames (T × C × H,W) flow through stages res2–res4 to a prediction head, with per-axis expansion factors marked; see the caption below.]
Figure 1. X3D networks progressively expand a 2D network across
the following axes: Temporal duration t, frame rate τ, spatial
resolution s, width w, bottleneck width b, and depth d.
This paper focuses on the low-computation regime in
terms of computation/accuracy trade-off for video recogni-
tion. We base our design upon the “mobile-regime" mod-
els [31,32,61] developed for image recognition. Our core
idea is that while expanding a small model along the tem-
poral axis can increase accuracy, the computation/accuracy
trade-off may not always be best compared with expanding
other axes, especially in the low-computation regime where
accuracy can increase quickly along different axes.
In this paper, we progressively “expand" a tiny base 2D
image architecture into a spatiotemporal one by expanding
multiple possible axes shown in Fig. 1. The candidate axes
are temporal duration t, frame rate τ, spatial resolution
s, network width w, bottleneck width b, and depth d.
The resulting architecture is referred as X3D (Expand 3D)
for expanding from the 2D space into 3D spacetime domain.
The 2D base architecture is driven by the MobileNet [31,
32,61] core concept of channel-wise separable convolutions,
arXiv:2004.04730v1 [cs.CV] 9 Apr 2020
[Feichtenhofer, CVPR2020]
[Lin+, ICCV2019]
TSM: Temporal Shift Module for Efficient Video Understanding
Ji Lin
MIT
jilin@mit.edu
Chuang Gan
MIT-IBM Watson AI Lab
ganchuang@csail.mit.edu
Song Han
MIT
songhan@mit.edu
Abstract
The explosive growth in video streaming gives rise to
challenges on performing video understanding at high accu-
racy and low computation cost. Conventional 2D CNNs
are computationally cheap but cannot capture temporal
relationships; 3D CNN based methods can achieve good
performance but are computationally intensive, making it
expensive to deploy. In this paper, we propose a generic
and effective Temporal Shift Module (TSM) that enjoys both
high efficiency and high performance. Specifically, it can
achieve the performance of 3D CNN but maintain 2D CNN’s
complexity. TSM shifts part of the channels along the tempo-
ral dimension; thus facilitate information exchanged among
neighboring frames. It can be inserted into 2D CNNs to
achieve temporal modeling at zero computation and zero
[Figure 1 graphic: (a) the original tensor (channel C × temporal T) without shift; (b) offline temporal shift (bi-direction), shifting channels along T with zero padding and truncation; (c) online temporal shift (uni-direction) across frames t = 0, 1, 2, 3, ….]
Figure 1. Temporal Shift Module (TSM) performs efficient tem-
poral modeling by moving the feature map along the temporal
dimension. It is computationally free on top of a 2D convolution,
but achieves strong temporal modeling ability. TSM efficiently
supports both offline and online video recognition. Bi-directional
TSM mingles both past and future frames with the current frame,
which is suitable for high-throughput offline video recognition.
Uni-directional TSM mingles only the past frame with the current
frame, which is suitable for low-latency online video recognition.
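To make the bi-directional shift described in the caption above concrete, here is a minimal PyTorch-style sketch of a TSM-like temporal channel shift. It is illustrative only, not the authors' code; the (N, T, C, H, W) tensor layout and the 1/8-per-direction split are assumptions.

```python
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Bi-directional temporal shift of a (N, T, C, H, W) feature tensor.

    One C/fold_div slice of channels moves one step toward the past, another
    slice one step toward the future; vacated positions are zero-padded and
    overflowing frames are truncated, as sketched in Figure 1(b) above.
    """
    n, t, c, h, w = x.size()
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                      # shift backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]      # shift forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]                 # remaining channels unchanged
    return out

# Example: 2 clips, 8 frames, 64 channels, 56x56 feature maps
features = torch.randn(2, 8, 64, 56, 56)
print(temporal_shift(features).shape)  # torch.Size([2, 8, 64, 56, 56])
```

Because the shift is pure indexing and copying, it adds no parameters and essentially no FLOPs, which is the "zero computation, zero parameters" property the abstract refers to.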
Extending Vision Transformer
■ Goal: extend ViT into a video recognition model
■ Extending attention along the temporal axis
• Computational cost
• The cost explodes as the number of frames grows
■ Extending the Transformer to 3D is not practical
• Even the 2D Transformer (ViT) is already computationally heavy
■ This work performs action recognition with a ViT base
[Figure residue comparing space-time self-attention schemes: (a) full space-time attention, O(T²S²); (b) spatial-only attention, O(TS²); (c) TimeSformer / ViViT-style factorized attention, O(T²S + TS²). A quick numeric check follows.]
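As a sanity check on these orders of complexity, the snippet below counts query-key pairs for each attention variant, assuming a ViT-B/16-style tokenization (224×224 frames, 14×14 = 196 patch tokens) and T = 8 frames; the concrete numbers are illustrative, not from the paper.

```python
# Rough count of query-key pairs scored by each attention variant
# (constant factors, heads and the [Class] token are ignored).
T = 8          # frames per clip (assumed)
S = 14 * 14    # spatial tokens per 224x224 frame with 16x16 patches (assumed)

full_spacetime = (T * S) ** 2              # O(T^2 S^2): every token attends to every token
spatial_only = T * S ** 2                  # O(T S^2): attention only within each frame
factorized = T * S ** 2 + S * T ** 2       # O(T S^2 + S T^2): spatial + temporal attention

print(full_spacetime, spatial_only, factorized)  # 2458624 307328 319872
```

At this setting, full space-time attention scores exactly T = 8 times more pairs than spatial-only attention, and the gap grows quadratically with the number of frames.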
Method
■ Spatio-temporal information is obtained by inserting a shift module into each encoder
■ The module is placed before the residual branch
■ Only the [Class] token is shifted along the temporal axis
• A pure shift operation, so computation and parameters stay those of ViT (a minimal sketch follows below)
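A minimal sketch of the token shift idea. This is not the authors' implementation; the (B, T, N+1, D) layout, the [Class] token sitting at index 0, and the 1/4-per-direction split are assumptions based on the description above and on Table 4 later in the paper.

```python
import torch

def token_shift(x: torch.Tensor, shift_div: int = 4) -> torch.Tensor:
    """Shift only the [Class] token channels across adjacent frames.

    x: (B, T, N + 1, D) token tensor; index 0 along dim 2 is assumed to be the
    [Class] token. One D/shift_div slice of its channels moves one frame
    backward, another slice one frame forward; patch tokens are untouched, so
    no parameters and no FLOPs are added.
    """
    out = x.clone()
    cls = x[:, :, 0, :]                                   # (B, T, D) class tokens
    fold = cls.size(-1) // shift_div
    shifted = torch.zeros_like(cls)
    shifted[:, :-1, :fold] = cls[:, 1:, :fold]                       # backward in time
    shifted[:, 1:, fold:2 * fold] = cls[:, :-1, fold:2 * fold]       # forward in time
    shifted[:, :, 2 * fold:] = cls[:, :, 2 * fold:]                  # rest unchanged
    out[:, :, 0, :] = shifted
    return out

# Example: batch 2, 8 frames, 196 patch tokens + 1 class token, dim 768
tokens = torch.randn(2, 8, 197, 768)
print(token_shift(tokens).shape)  # torch.Size([2, 8, 197, 768])
```

Only indexing and copying are involved, so plugging this module into a ViT encoder leaves the parameter count and FLOPs unchanged.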
[Figure residue from the TokShift paper: encoder-integration panels (a) ViT (video), (b) TokShift, (c) TokShift-A, and shift-type panels (a) Video Tensor, (b) Temporal Shift, (c) Patch Shift, (d) Token Shift, interleaved with truncated fragments of the paper's title block (Yanbin Hao, University of Science and Technology of China, Hefei, China; Chong-Wah Ngo, Singapore Management University, Singapore).]
Experimental setup
■ Training (an illustrative augmentation sketch follows the lists below)
• Resize the short side to [224, 330]
• Random brightness
• Saturation
• Gamma
• Hue
• Horizontal flip
• Random crop to 224 × 224 (or 256, 384)
• 1 clip of 8/16 frames
• Frame sampling step 32
■ Inference
• Multi-view test
• Uniformly sample 10 clips
• Resize the short side to 224 (256, 384)
• Crop to 224 × 224 (256, 384)
• Center crop
• Right crop
• Left crop
■ Datasets
• Kinetics-400
• UCF101
• EGTEA Gaze+
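The training augmentations listed above could be sketched per frame with torchvision as below. This is a rough illustration, not the paper's pipeline: the jitter ranges, the gamma interval and the ToTensor placement are assumptions, and in a real video pipeline the same random parameters would be shared across all frames of a clip.

```python
import random
import torchvision.transforms as tv
import torchvision.transforms.functional as TF

# Per-frame training augmentation, roughly matching the list above.
# Jitter ranges and the gamma interval are placeholders, not the paper's values.
train_transform = tv.Compose([
    tv.Lambda(lambda img: TF.resize(img, random.randint(224, 330))),        # short side in [224, 330]
    tv.ColorJitter(brightness=0.2, saturation=0.2, hue=0.05),               # brightness / saturation / hue
    tv.Lambda(lambda img: TF.adjust_gamma(img, random.uniform(0.8, 1.2))),  # gamma
    tv.RandomHorizontalFlip(),
    tv.RandomCrop(224),                                                     # or 256 / 384
    tv.ToTensor(),
])
```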
Experimental results: Kinetics-400
■ Effect of the shift
• Comparison while varying the sampling step
• Small step
• Too little information as a video
• Large step
• Frames are too far apart, so temporal information is lost
330] and maintain aspect ratio), random brightness, saturation, gamma, hue, and horizontal flip. We train the model for 18 epochs with batch-size 21 per GPU. The base lr is set to 0.1 and is decayed by 0.1 at epochs [10, 15] during training. The Base-16 ViT [9] (contains 12 encoders) is adopted as the backbone and the shifted proportion is set to a/D = c/D = 1/4. All experiments are run on 2/8× V100 GPUs.
Inference adopts the same testing strategies as [12]. Specifically, we uniformly sample 10 clips from a testing video. Each clip is resized to short-side = 224 (256/384), then cropped into three 224×224 sub-clips (from the “left”, “center”, “right” parts). A video prediction is obtained by averaging scores of 30 sub-clips.
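A hedged sketch of this 30-view testing protocol (10 uniformly sampled clips × 3 spatial crops, scores averaged); the model and the clip/crop extraction are assumed to exist elsewhere.

```python
import torch

def multi_view_predict(model, views):
    """Average logits over 10 clips x 3 crops = 30 views of one video.

    `model` and the clip/crop extraction are assumed to exist elsewhere;
    `views` is an iterable of 30 tensors shaped (T, C, H, W).
    """
    model.eval()
    with torch.no_grad():
        logits = [model(v.unsqueeze(0)) for v in views]   # each (1, num_classes)
    return torch.cat(logits, dim=0).mean(dim=0)           # (num_classes,)
```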
4.3 Ablation Study
Our ablation studies the impacts of the shift-types, integrations,
hyperparameters of video transformers on the Kinetics-400 dataset.
Non Shift vs TokShift. We compare the Non Shift and TokShift
under various sampling steps. Specifically, we fix clip size (𝑇) to 8
frames, but change sampling steps (𝑆) to cover variable temporal
duration.
Xfmr              Acc1 at T × S
                  8 × 8           8 × 16          8 × 32          8 × 64
ViT (video) [9]   76.17           76.32           76.02           75.73
TokShift          76.90 (0.73↑)   76.81 (0.49↑)   77.28 (1.26↑)   76.60 (0.87↑)
Table 1: Non Shift vs TokShift under different sampling strategies (“T/S” refers to frames/sampling-step; Accuracy-1).
■ Where to embed the shift module in the encoder
(a) ViT (video) (b) TokShift (c) TokShift-A (d) TokShift-B (e) TokShift-C
Figure 3: Integrations of the TokShift module into the ViT encoder: the TokShift can be plugged in at multiple positions of a vision encoder: “prior residual” (TokShift), “prior layer-norm” (TokShift-A), “prior MSA/FFN” (TokShift-B), and “post MSA/FFN” (TokShift-C). Notably, an earlier position affects more of the subsequent modules inside the residual block.
$\mathrm{MSA}(z^{(t)}_{l-1}) = \mathrm{Concat}\big(\mathrm{head}^{(t)}_1, \mathrm{head}^{(t)}_2, \ldots, \mathrm{head}^{(t)}_M\big)$   (5)

$\mathrm{head}^{(t)}_i = \mathrm{Softmax}\!\Big(\tfrac{Q^{(t)} (K^{(t)})^{\top}}{\sqrt{D}}\Big) V^{(t)}$   (6)

$Q^{(t)}, K^{(t)}, V^{(t)} = z^{(t)}_{l-1} \times W_q, W_k, W_v$, with $W_q, W_k, W_v \in \mathbb{R}^{D \times D}$   (7)

Given a token tensor $c_l \in \mathbb{R}^{T \times D}$ representing the global sequence of a clip, the TokShift works as follows: $c_l$ is firstly split into 3 groups along the channel dimension (Equation (12)).

$c_l = [s_a, s_b, s_c]$   (12)
$s_a \in \mathbb{R}^{T \times a},\; s_b \in \mathbb{R}^{T \times b},\; s_c \in \mathbb{R}^{T \times c}$, with $a + b + c = D$
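Equations (5)–(7) apply ordinary multi-head self-attention independently to each frame's token sequence. The sketch below illustrates this per-frame view with torch.nn.MultiheadAttention; it is illustrative only, omits layer norm, the FFN and residual connections of the full encoder block, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Per-frame multi-head self-attention as in Eqs. (5)-(7): each frame's token
# sequence z^(t) is processed independently of the other frames.
B, T, N, D, M = 2, 8, 197, 768, 12     # batch, frames, tokens (196 patches + [Class]), dim, heads
z = torch.randn(B, T, N, D)

msa = nn.MultiheadAttention(embed_dim=D, num_heads=M, batch_first=True)
z_flat = z.reshape(B * T, N, D)              # treat every frame as its own sequence
attn_out, _ = msa(z_flat, z_flat, z_flat)    # Softmax(Q K^T / sqrt(d)) V within each frame
print(attn_out.reshape(B, T, N, D).shape)    # torch.Size([2, 8, 197, 768])
```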
“prior MSA/FFN” and “post MSA/FFN”. We experimentally verify that placing TokShift at a front position (i.e., “prior residual”) brings the best effect. This finding differs from TSM on CNNs, where the optimal position is “post residual”.
4 EXPERIMENTS
We conduct extensive experiments on three standard benchmarks
for video classification and adopt Top-1/5 accuracy (%) as evalua-
tion metrics. We also report the model parameters and GFLOPs to
quantify computations.
4.1 Datasets
We adopt one large-scale (Kinetics-400) and two small-scale (EGTEA-
Gaze+ & UCF-101) datasets as evaluation benchmarks. Below presents
their brief descriptions.
Kinetics-400 [4] serves as a standard large-scale benchmark for
video classification. Its clips are truncated into 10 seconds duration
with humans’ annotations. The training/validation set separately
contains ∼ 246k/20k clips covering 400 video categories.
EGTEA Gaze+ [19] is a First-Person-View video dataset cov-
ering 106 fine-grained daily action categories. We select split-1
to evaluate our model, and this split contains 8,299/2,022 train-
ing/validating clips, with an average clip length of 3.1 seconds.
UCF-101 [26] contains 101 action categories, such as “human-human/object” interactions and sports. We also select split-1 for performance reporting. This split contains 9,537/3,783 training/validating clips with an average duration of 5.8 seconds.
4.2 Implementations
We implement a paradigm of shift-transformers, including the ViT
(video), TemporalShift/PatchShift/TokShift-xfmr, on top of 2D ViT
As in Tab. 1, the TokShift-xfmr outperforms ViT (video) under
all settings. We observe that the performance of TokShift varies
with different temporal steps. A smaller step makes TokShift not
effective in learning temporal evolution. On the other hand, a large
step of 64 frames results in a highly discontinuous frame sequence,
affecting learning effectiveness. A temporal step of 32 frames ap-
pears to be a suitable choice.
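The sampling-step trade-off can be illustrated with a rough coverage calculation: with T = 8 frames and the ~10-second Kinetics clips mentioned in Section 4.1, the temporal span grows linearly with the step. The frame rate used below is an assumption (it is not stated here), so the numbers are only indicative.

```python
# Temporal coverage of an 8-frame clip under different sampling steps.
# The 30 fps frame rate is an assumption, not taken from the paper.
FPS = 30
T = 8
for step in (8, 16, 32, 64):
    print(f"step {step:2d}: spans ~{T * step / FPS:.1f}s of a ~10s Kinetics clip")
# step  8: ~2.1s   step 16: ~4.3s   step 32: ~8.5s   step 64: ~17.1s (longer than the clip)
```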
Integration. We study the impact of implanting the TokShift at different positions of an encoder. As mentioned in Section 3.2, an earlier position affects more of the later modules inside the residual block.
Model             Words Shifted   GFLOPs × Views   Acc1 (%)   Acc5 (%)
ViT (video) [9]   None            134.7 × 30       76.02      92.52
TokShift-A        Token           134.7 × 30       76.85      93.10
TokShift-B        Token           134.7 × 30       77.21      92.81
TokShift-C        Token           134.7 × 30       77.00      92.92
TokShift          Token           134.7 × 30       77.28      92.91
Table 2: Comparisons of the TokShift residing at different positions (GFLOPs, Accuracy-1/5).
Experimental results: Kinetics-400
■ Performance comparison across shift variants
Shift Type        Words Shifted     GFLOPs × Views   Acc1 (%)   Acc5 (%)
ViT (video) [9]   None              134.7 × 30       76.02      92.52
TemporalShift     Token + Patches   134.7 × 30       72.88      91.24
PatchShift        Patches           134.7 × 30       73.08      91.17
TokShift          Token             134.7 × 30       77.28      92.91
Table 3: Comparisons of shifting different visual “Words” (GFLOPs, Accuracy-1/5).
Model                  Backbone      Res (H × W)   # Words T · (N + 1)   Acc1 (%)
TokShift (HR)          Base-16       384 × 384     8 · (24² + 1)         78.14
TokShift-Large (HR)    Large-16      384 × 384     8 · (24² + 1)         79.83
TokShift-Large (HR)    Large-16      384 × 384     12 · (24² + 1)        80.40
TokShift (MR)          Base-16       256 × 256     8 · (16² + 1)         77.68
TokShift-Hybrid (MR)   R50+Base-16   256 × 256     8 · (16² + 1)         77.55
TokShift-Hybrid (MR)   R50+Base-16   256 × 256     16 · (16² + 1)        78.34
Table 6: Comparisons of the TokShift on different transformer backbones (Accuracy-1).
[Front matter of the TokShift paper, shown as screenshot residue across several slides; cleaned to the recoverable content below.]
Token Shift Transformer for Video Classification. Hao Zhang; Yanbin Hao (University of Science and Technology of China, Hefei, China); Chong-Wah Ngo (Singapore Management University, Singapore). MM ’21, October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475272
Keywords: Transformer; Shift; Self-attention. Code: …VideoNetworks/TokShift-Transformer.
Figure 1: Types of Shift for video transformer. A video embedding contains two types of “words”: N [Patch] + 1 [Class] Token. Hence, the Shift can be applied on: (a) Neither, (b) Both, (c) Patch, and (d) [Class] Token along the temporal axis. “!” indicates padding zeros.
1 INTRODUCTION
“If a picture is worth a thousand words, is a video worth a million?” [40]. This quote basically predicts that image and video can be potentially interpreted as linguistical sentences, except that videos contain richer information than images. The recent progress of extending linguistical-style, convolutional-free transformers [9] on visual content understanding successfully verifies the quote’s prior
■ Performance comparison by shift proportion
• Up to 1/4, increasing the shifted proportion improved performance
same spatial grid across times incurs visual misalignment when rigidly dividing a moving object. We verify this hypothesis by measuring the mean cosine similarity between the patch and its temporal neighbor features on the Kinetics-400 val set with the Temporal/Patch Shift transformer and observe distance scores of 0.577/0.570. As for the [Class] token part, since it reflects global frame-level content without alignment concerns, the distance between neighboring temporal tokens is 0.95. Consequently, for the TokShift or CNN-Shift, the drawback is bypassed by global alignment or sliding windows.
Proportion of shifted channels. This hyperparameter controls the ratio of dynamic/static information in a video feature. We evaluate a/D, c/D in the range [1/4, 1/8, 1/16] and present results in Table 4. We experimentally find that 1/4 is an optimal value, indicating that half the channels are shifted (1/4 back + 1/4 forth) while the rest remain unchanged.
Model      Channels Shifted (a/D + c/D)   Acc1 (%)   Acc5 (%)
TokShift   1/4 + 1/4                      77.28      92.91
TokShift   1/8 + 1/8                      77.18      92.95
TokShift   1/16 + 1/16                    76.93      92.82
Table 4: Comparisons of the TokShift regarding proportions of shifted channels (Accuracy-1/5).
Experimental results: Kinetics-400
■ Effect of the number of frames and the resolution
• More frames improve performance
• Higher resolution improves performance
• Both increase the amount of information
Word count is an essential factor in transformer learning since
more words indicate more details. As for videos, the number of
visual words is positively correlated with two factors: spatial reso-
lutions and temporal frames. We study the impacts of word count
on the TokShift-xfmr. Table 5 lists the detailed performances un-
der different word counts. Here, MR/HR represents middle/high
resolutions.
Model           Res (H × W)   Words/Frame (N + 1)   # Frames (T)   # Words T · (N + 1)   Acc1 (%)
TokShift        224 × 224     14² + 1               6              1,182                 76.72
TokShift        224 × 224     14² + 1               8              1,576                 77.28
TokShift        224 × 224     14² + 1               10             1,970                 77.56
TokShift        224 × 224     14² + 1               16             3,152                 78.18
TokShift (MR)   256 × 256     16² + 1               8              2,056                 77.68
TokShift (HR)   384 × 384     24² + 1               8              4,616                 78.14
Table 5: Comparisons of the TokShift according to visual word counts (Accuracy-1).
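The “# Words” column follows directly from the tokenization: with 16×16 patches, a frame of side H yields (H/16)² patch tokens plus one [Class] token, so a clip contributes T · ((H/16)² + 1) words. A quick illustrative check of Table 5:

```python
# Reproduce the "# Words" column of Table 5 for a 16x16 patch size.
def word_count(side: int, frames: int, patch: int = 16) -> int:
    tokens_per_frame = (side // patch) ** 2 + 1   # N patch tokens + 1 [Class] token
    return frames * tokens_per_frame

for side, frames in [(224, 6), (224, 8), (224, 10), (224, 16), (256, 8), (384, 8)]:
    print(side, frames, word_count(side, frames))  # 1182, 1576, 1970, 3152, 2056, 4616
```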
4.4 Comparison with the State-of-the-Art
We compare our TokShift-xfmrs with current SOTAs on Kinetics-
400 datasets in Table 7. Since there is no prior work that purely
utilizes transformer, we mainly select 3D-CNNs, such as I3D, Non-
Local, TSM, SlowFast, X3D as SOTAs.
Compared to 3D-CNNs, transformers exhibit prominent perfor-
mances. Particularly, the spatial-only transformer, i.e., ViT (video),
achieves comparable or better performance (76.02) than strong 3D-
CNNs, such as I3D (71.1), S3D-G (74.7) and TSM (76.3). Moreover,
a comparison of our most slimmed TokShift-xfmr (77.28) with TSM (76.3) indicates that shift brings larger gains on transformers than on CNNs. Thirdly, the TokShift-xfmr (MR) achieves comparable performance (77.68) to Non-Local R101 networks (77.7) under the same spatial resolution (256) but using fewer frames (8 vs 128). This is reasonable since both contain attention. Finally, we
compare TokShift-xfmr with the current best 3D-CNNs: SlowFast
and X3D. The two 3D-CNNs are particularly optimized by a dual-
paths or efficient network parameters searching mechanism. Since
both SlowFast & X3D introduce large structural changes over cor-
responding 2D nets, their authors prefer to train on videos with
sufficient epochs (256), rather than initialize from ImageNet pre-
trained weights. We show that our TokShift-Large-xfmr (80.40) per-
forms better than SlowFast (79.80) and is comparable with X3D-
XXL (80.4), where models are under their best settings. Notably,
we only use 12 frames, whereas SlowFast/X3D-XXL uses 32/16+
frames.
In all, a pure transformer can perform comparable or better than
In Table 3, the TokShift exhibits the best top-1 accuracy. Addi-
Experimental results: Kinetics-400
■ Comparison with previous methods
• ViT alone already matches 2D + shift methods
• TokenShift outperforms existing 3D methods
Model  Backbone  Pretrain  Inference Res (H × W)  # Frames/Clip (T)  GFLOPs×Views  Params  Accuracy-1 (%)  Accuracy-5 (%)
I3D [4] from [11] InceptionV1 ImageNet 224 × 224 250 108× NA 12M 71.1 90.3
Two-Stream I3D [4] from [11] InceptionV1 ImageNet 224 × 224 500 216× NA 25M 75.7 92.0
S3D-G [39] InceptionV1 ImageNet 224 × 224 250 71.3×NA 11.5M 74.7 93.4
Two-Stream S3D-G [39] InceptionV1 ImageNet 224 × 224 500 142.6×NA 11.5M 77.2 93.0
Non-Local R50 [33] from [11] ResNet50 ImageNet 256 × 256 128 282 × 30 35.3M 76.5 92.6
Non-Local R101[33] from [11] ResNet101 ImageNet 256 × 256 128 359 × 30 54.3M 77.7 93.3
TSM [20] ResNet50 ImageNet 256 × 256 8 33 × 10 24.3M 74.1 91.2
TSM [20] ResNet50 ImageNet 256 × 256 16 65 × 10 24.3M 74.7 -
TSM [20] ResNext101 ImageNet 256 × 256 8 NA×10 - 76.3 -
SlowFast 4 × 16 [12] ResNet50 None 256 × 256 32 36.1 × 30 34.4M 75.6 92.1
SlowFast 8 × 8 [12] ResNet50 None 256 × 256 32 65.7 × 30 - 77.0 92.6
SlowFast 8 × 8 [12] ResNet101 None 256 × 256 32 106 × 30 53.7M 77.9 93.2
SlowFast 8 × 8 [12] ResNet101+NL None 256 × 256 32 116 × 30 59.9M 78.7 93.5
SlowFast 16 × 8 [12] ResNet101+NL None 256 × 256 32 234 × 30 59.9M 79.8 93.9
X3D-M [11] X2D [11] None 256 × 256 16 6.2 × 30 3.8M 76.0 92.3
X3D-L [11] X2D [11] None 356 × 356 16 24.8 × 30 6.1M 77.5 92.9
X3D-XL[11] X2D [11] None 356 × 356 16 48.4 × 30 11M 79.1 93.9
X3D-XXL [11] X2D [11] None NA NA 194.1 × 30 20.3M 80.4 94.6
ViT (Video)[9] Base-16 ImageNet-21k 224 × 224 8 134.7 × 30 85.9M 76.02 92.52
TokShift Base-16 ImageNet-21k 224 × 224 8 134.7 × 30 85.9M 77.28 92.91
TokShift (MR) Base-16 ImageNet-21k 256 × 256 8 175.8 × 30 85.9M 77.68 93.55
TokShift (HR) Base-16 ImageNet-21k 384 × 384 8 394.7 × 30 85.9M 78.14 93.91
TokShift Base-16 ImageNet-21k 224 × 224 16 269.5 × 30 85.9M 78.18 93.78
TokShift-Large (HR) Large-16 ImageNet-21k 384 × 384 8 1397.6 × 30 303.4M 79.83 94.39
TokShift-Large (HR) Large-16 ImageNet-21k 384 × 384 12 2096.4 × 30 303.4M 80.40 94.45
Table 7: Comparisons with state-of-the-arts on Kinetics-400 Val.
Experimental results: small-scale datasets
■ Effect of layer normalization (freezing)
• Varies with the pre-training modality (video or image)
• Outperforms existing methods on both datasets
Model                  Modality      Pretrain       Res (H × W)   # Frames (T)   EGAZ+ Acc1 (%)
TSM [20]               RGB           Kinetics-400   224 × 224     8              63.45
SAP [34]               RGB           Kinetics-400   256 × 256     64             64.10
ViT (Video) [9]        RGB           ImageNet-21k   224 × 224     8              62.59
TokShift               RGB           None           224 × 224     8              28.90
TokShift               RGB           ImageNet-21k   224 × 224     8              62.85
TokShift*              RGB           ImageNet-21k   224 × 224     8              59.24
TokShift               RGB           Kinetics-400   224 × 224     8              63.69
TokShift*              RGB           Kinetics-400   224 × 224     8              64.82
TokShift               OptFlow       Kinetics-400   224 × 224     8              48.81
TokShift*-En           OptFlow+RGB   Kinetics-400   224 × 224     8              65.08
TokShift* (HR)         RGB           Kinetics-400   384 × 384     8              65.77
TokShift-Large* (HR)   RGB           Kinetics-400   384 × 384     8              66.56
Table 8: Impacts of various pretrained weights on the EGTEA-Gaze+ Split-1 dataset (“*” means freeze layer-norm).
[Figure 4: Visualization of attention on sample clips from the Kinetics-400 dataset, e.g. panel (e) “Playing ice hockey”; the even rows present the corresponding attention maps.]
Model                  Pretrain       Res (H × W)   # Frames (T)   UCF101 Acc1 (%)
I3D [4]                Kinetics-400   224 × 224     250            84.50
P3D [21]               Kinetics-400   224 × 224     16             84.20
Two-Stream I3D [4]     Kinetics-400   224 × 224     500            93.40
TSM [20]               Kinetics-400   256 × 256     8              95.90
ViT (Video) [9]        ImageNet-21k   256 × 256     8              91.46
TokShift               None           256 × 256     8              91.60
TokShift               ImageNet-21k   256 × 256     8              91.65
TokShift*              Kinetics-400   256 × 256     8              95.35
TokShift* (HR)         Kinetics-400   384 × 384     8              96.14
TokShift-Large* (HR)   Kinetics-400   384 × 384     8              96.80
Table 9: Impacts of various pretrained weights on the UCF-101 Split-1 dataset (“*” means freeze layer-norm).
EGTEA Gaze+ UCF101
Summary
■ Action recognition by shifting intermediate Transformer features
• Proposes the TokShift module
• Shift operation only
• Zero parameters
• Zero extra computation
• Shifts only the [Class] token
• Exchanges global spatio-temporal information

La actualidad más candente

【DL輪読会】Perceiver io a general architecture for structured inputs &amp; outputs
【DL輪読会】Perceiver io  a general architecture for structured inputs &amp; outputs 【DL輪読会】Perceiver io  a general architecture for structured inputs &amp; outputs
【DL輪読会】Perceiver io a general architecture for structured inputs &amp; outputs Deep Learning JP
 
[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...
[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...
[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...Deep Learning JP
 
【DL輪読会】Variable Bitrate Neural Fields
【DL輪読会】Variable Bitrate Neural Fields【DL輪読会】Variable Bitrate Neural Fields
【DL輪読会】Variable Bitrate Neural FieldsDeep Learning JP
 
Devil is in the Edges: Learning Semantic Boundaries from Noisy Annotations
Devil is in the Edges: Learning Semantic Boundaries from Noisy AnnotationsDevil is in the Edges: Learning Semantic Boundaries from Noisy Annotations
Devil is in the Edges: Learning Semantic Boundaries from Noisy AnnotationsKazuyuki Miyazawa
 
【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...
【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...
【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...Deep Learning JP
 
帰納バイアスが成立する条件
帰納バイアスが成立する条件帰納バイアスが成立する条件
帰納バイアスが成立する条件Shinobu KINJO
 
【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Density
【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Density【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Density
【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Densitycvpaper. challenge
 
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual FeaturesARISE analytics
 
【メタサーベイ】Video Transformer
 【メタサーベイ】Video Transformer 【メタサーベイ】Video Transformer
【メタサーベイ】Video Transformercvpaper. challenge
 
三次元点群を取り扱うニューラルネットワークのサーベイ
三次元点群を取り扱うニューラルネットワークのサーベイ三次元点群を取り扱うニューラルネットワークのサーベイ
三次元点群を取り扱うニューラルネットワークのサーベイNaoya Chiba
 
SuperGlue; Learning Feature Matching with Graph Neural Networks (CVPR'20)
SuperGlue;Learning Feature Matching with Graph Neural Networks (CVPR'20)SuperGlue;Learning Feature Matching with Graph Neural Networks (CVPR'20)
SuperGlue; Learning Feature Matching with Graph Neural Networks (CVPR'20)Yusuke Uchida
 
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​SSII
 
DNNの曖昧性に関する研究動向
DNNの曖昧性に関する研究動向DNNの曖昧性に関する研究動向
DNNの曖昧性に関する研究動向Naoki Matsunaga
 
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...Deep Learning JP
 
CNN-SLAMざっくり
CNN-SLAMざっくりCNN-SLAMざっくり
CNN-SLAMざっくりEndoYuuki
 
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...Deep Learning JP
 
【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language ModelsDeep Learning JP
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデルMasahiro Suzuki
 
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)Deep Learning JP
 

La actualidad más candente (20)

【DL輪読会】Perceiver io a general architecture for structured inputs &amp; outputs
【DL輪読会】Perceiver io  a general architecture for structured inputs &amp; outputs 【DL輪読会】Perceiver io  a general architecture for structured inputs &amp; outputs
【DL輪読会】Perceiver io a general architecture for structured inputs &amp; outputs
 
[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...
[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...
[DL輪読会]A Higher-Dimensional Representation for Topologically Varying Neural R...
 
【DL輪読会】Variable Bitrate Neural Fields
【DL輪読会】Variable Bitrate Neural Fields【DL輪読会】Variable Bitrate Neural Fields
【DL輪読会】Variable Bitrate Neural Fields
 
Devil is in the Edges: Learning Semantic Boundaries from Noisy Annotations
Devil is in the Edges: Learning Semantic Boundaries from Noisy AnnotationsDevil is in the Edges: Learning Semantic Boundaries from Noisy Annotations
Devil is in the Edges: Learning Semantic Boundaries from Noisy Annotations
 
【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...
【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...
【DL輪読会】DiffRF: Rendering-guided 3D Radiance Field Diffusion [N. Muller+ CVPR2...
 
帰納バイアスが成立する条件
帰納バイアスが成立する条件帰納バイアスが成立する条件
帰納バイアスが成立する条件
 
【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Density
【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Density【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Density
【ECCV 2022】NeDDF: Reciprocally Constrained Field for Distance and Density
 
Lucas kanade法について
Lucas kanade法についてLucas kanade法について
Lucas kanade法について
 
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
 
【メタサーベイ】Video Transformer
 【メタサーベイ】Video Transformer 【メタサーベイ】Video Transformer
【メタサーベイ】Video Transformer
 
三次元点群を取り扱うニューラルネットワークのサーベイ
三次元点群を取り扱うニューラルネットワークのサーベイ三次元点群を取り扱うニューラルネットワークのサーベイ
三次元点群を取り扱うニューラルネットワークのサーベイ
 
SuperGlue; Learning Feature Matching with Graph Neural Networks (CVPR'20)
SuperGlue;Learning Feature Matching with Graph Neural Networks (CVPR'20)SuperGlue;Learning Feature Matching with Graph Neural Networks (CVPR'20)
SuperGlue; Learning Feature Matching with Graph Neural Networks (CVPR'20)
 
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
 
DNNの曖昧性に関する研究動向
DNNの曖昧性に関する研究動向DNNの曖昧性に関する研究動向
DNNの曖昧性に関する研究動向
 
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
 
CNN-SLAMざっくり
CNN-SLAMざっくりCNN-SLAMざっくり
CNN-SLAMざっくり
 
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
 
【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデル
 
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
 

Similar a 文献紹介:Token Shift Transformer for Video Classification

文献紹介:TSM: Temporal Shift Module for Efficient Video Understanding
文献紹介:TSM: Temporal Shift Module for Efficient Video Understanding文献紹介:TSM: Temporal Shift Module for Efficient Video Understanding
文献紹介:TSM: Temporal Shift Module for Efficient Video UnderstandingToru Tamaki
 
文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action Recognition文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action RecognitionToru Tamaki
 
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...Toru Tamaki
 
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video InpaintingToru Tamaki
 
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...Toru Tamaki
 
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...Toru Tamaki
 
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at ScaleToru Tamaki
 
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition InferenceToru Tamaki
 
文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video Recognition文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video RecognitionToru Tamaki
 
文献紹介:Video Transformer Network
文献紹介:Video Transformer Network文献紹介:Video Transformer Network
文献紹介:Video Transformer NetworkToru Tamaki
 
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...Toru Tamaki
 
第11回 配信講義 計算科学技術特論B(2022)
第11回 配信講義 計算科学技術特論B(2022)第11回 配信講義 計算科学技術特論B(2022)
第11回 配信講義 計算科学技術特論B(2022)RCCSRENKEI
 
文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...
文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...
文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...Toru Tamaki
 
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video ClassificationToru Tamaki
 
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action RecognitionToru Tamaki
 
文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics
文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics
文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation MetricsToru Tamaki
 
200730material fujita
200730material fujita200730material fujita
200730material fujitaRCCSRENKEI
 
200702material hirokawa
200702material hirokawa200702material hirokawa
200702material hirokawaRCCSRENKEI
 
STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強
STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強
STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強Kiyoshi Ogawa
 

Similar a 文献紹介:Token Shift Transformer for Video Classification (20)

文献紹介:TSM: Temporal Shift Module for Efficient Video Understanding
文献紹介:TSM: Temporal Shift Module for Efficient Video Understanding文献紹介:TSM: Temporal Shift Module for Efficient Video Understanding
文献紹介:TSM: Temporal Shift Module for Efficient Video Understanding
 
文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action Recognition文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action Recognition
 
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
 
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
 
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
 
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
 
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
 
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
 
文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video Recognition文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video Recognition
 
文献紹介:Video Transformer Network
文献紹介:Video Transformer Network文献紹介:Video Transformer Network
文献紹介:Video Transformer Network
 
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
 
第11回 配信講義 計算科学技術特論B(2022)
第11回 配信講義 計算科学技術特論B(2022)第11回 配信講義 計算科学技術特論B(2022)
第11回 配信講義 計算科学技術特論B(2022)
 
文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...
文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...
文献紹介:Rethinking Data Augmentation for Image Super-resolution: A Comprehensive...
 
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
 
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
 
文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics
文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics
文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics
 
200730material fujita
200730material fujita200730material fujita
200730material fujita
 
200702material hirokawa
200702material hirokawa200702material hirokawa
200702material hirokawa
 
画像処理の高性能計算
画像処理の高性能計算画像処理の高性能計算
画像処理の高性能計算
 
STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強
STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強
STARC RTL設計スタイルガイドによるVerilog HDL並列記述の補強
 

Más de Toru Tamaki

論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...Toru Tamaki
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...Toru Tamaki
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNetToru Tamaki
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A surveyToru Tamaki
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex ScenesToru Tamaki
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...Toru Tamaki
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video SegmentationToru Tamaki
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New HopeToru Tamaki
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...Toru Tamaki
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt TuningToru Tamaki
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in MoviesToru Tamaki
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICAToru Tamaki
 
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context RefinementToru Tamaki
 
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...Toru Tamaki
 
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...Toru Tamaki
 
Paper introduction: Token Shift Transformer for Video Classification

  • 3. Related work
    - Expanding 2D models into 3D models: X3D [Feichtenhofer, CVPR2020]
    - Action recognition with 2D models: TSM [Lin+, ICCV2019]
    - This work extends the Vision Transformer (ViT) [Dosovitskiy+, ICLR2021] with a TSM-style shift
    [Slide figures: first pages of "X3D: Expanding Architectures for Efficient Video Recognition" (Feichtenhofer, CVPR2020) and "TSM: Temporal Shift Module for Efficient Video Understanding" (Lin+, ICCV2019)]
  • 4. Extending the Vision Transformer
    - Goal: extend ViT into a video recognition model
    - Extending attention along the temporal axis is costly: the computation explodes as the number of frames grows
    - Extending the Transformer to 3D is therefore not practical; even the 2D Transformer (ViT) is already computationally heavy
    - Instead, perform action recognition on top of the ViT backbone (see the cost sketch below)
    [Slide figure: space-time self-attention variants, (a) full space-time attention O(T²S²), (b) spatial-only attention O(TS²), (c) TimeSformer/ViViT-style factorized attention O(T²S + TS²)]
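To make the cost argument concrete, here is a back-of-envelope sketch (mine, not from the slides) that counts query-key pairs per attention layer; the spatial token count below assumes a ViT-Base/16-style 14×14 patch grid.

    # Back-of-envelope count of query-key pairs per attention layer.
    # Full space-time attention compares all T*S tokens with each other: (T*S)^2 = T^2 * S^2.
    # Spatial-only attention runs attention independently per frame: T * S^2.

    def attention_pairs(T: int, S: int) -> dict:
        """T: number of frames, S: number of spatial tokens per frame."""
        return {
            "full_space_time": (T * S) ** 2,   # O(T^2 S^2)
            "spatial_only": T * S ** 2,        # O(T S^2)
        }

    if __name__ == "__main__":
        S = 14 * 14  # e.g. 224x224 input with 16x16 patches -> 196 patch tokens per frame
        for T in (8, 16, 32):
            pairs = attention_pairs(T, S)
            ratio = pairs["full_space_time"] / pairs["spatial_only"]
            print(f"T={T:2d}: full={pairs['full_space_time']:,}  "
                  f"spatial-only={pairs['spatial_only']:,}  (x{ratio:.0f})")

The gap between the two grows linearly with T, which is why the slide argues that naive space-time attention is impractical for long clips.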
  • 5. Method
    - Spatio-temporal information is obtained by inserting a shift module into each encoder
    - The module is placed before the residual branch ("prior residual")
    - Only the [Class] token is shifted along the temporal axis
      • Pure shift operation, so FLOPs and parameters stay those of the 2D ViT (a minimal sketch follows below)
    [Slide figures: shift variants on a video embedding, (a) video tensor (no shift), (b) temporal shift (token + patches), (c) patch shift, (d) [Class]-token shift, and the placement of TokShift inside the ViT encoder]
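Below is a minimal PyTorch sketch of the zero-parameter [Class]-token shift, written from the slide's description rather than from the authors' code; it assumes a token tensor of shape (B, T, N+1, D) with the [Class] token at index 0 and uses the 1/4 back + 1/4 forth split reported as best later in the deck.

    import torch

    def tok_shift(x: torch.Tensor, fold_div: int = 4) -> torch.Tensor:
        """Zero-parameter temporal shift of the [Class] token only.

        x: token tensor of shape (B, T, N+1, D); index 0 along dim 2 is the
        [Class] token, the remaining N entries are patch tokens.
        fold_div=4 shifts 1/4 of the channels forward and 1/4 backward in
        time, leaving half of the channels (and all patch tokens) untouched.
        """
        B, T, L, D = x.shape
        fold = D // fold_div

        cls = x[:, :, 0]                       # (B, T, D), the [Class] tokens
        shifted = torch.zeros_like(cls)
        shifted[:, 1:, :fold] = cls[:, :-1, :fold]                   # shift forward in time (zero-pad t=0)
        shifted[:, :-1, fold:2 * fold] = cls[:, 1:, fold:2 * fold]   # shift backward (zero-pad t=T-1)
        shifted[:, :, 2 * fold:] = cls[:, :, 2 * fold:]              # remaining channels stay in place

        out = x.clone()
        out[:, :, 0] = shifted                 # patch tokens are left unchanged
        return out

Since it only copies channels between neighbouring frames (with zero padding at the clip boundaries), it adds no parameters and essentially no FLOPs on top of the 2D ViT.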
  • 6. Experimental settings
    - Training
      • Resize the short side into [224, 330]
      • Random brightness, saturation, gamma, hue, horizontal flip
      • Random crop to 224×224 (or 256, 384)
      • 1 clip of 8/16 frames, sampling step 32
    - Inference
      • Multi-view test (see the sketch below)
      • Uniformly sample 10 clips per video
      • Resize the short side to 224 (256/384), crop to 224×224 (256, 384)
      • Center, right and left crops
    - Datasets
      • Kinetics-400
      • UCF-101
      • EGTEA Gaze+
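A sketch of the 30-view inference protocol (10 uniformly sampled clips, each evaluated with left/center/right crops and averaged); `model` and the exact tensor layout are illustrative assumptions, not the authors' evaluation script.

    import torch

    @torch.no_grad()
    def multiview_predict(model, video: torch.Tensor, num_clips: int = 10,
                          clip_len: int = 8, crop: int = 224) -> torch.Tensor:
        """Average class scores over 10 temporal clips x 3 spatial crops (30 views).

        video: tensor of shape (T_total, C, H, W), already resized so that the
        short side H equals `crop` (or 256/384 in the higher-resolution settings).
        The assumed model input is a batch of clips of shape (1, clip_len, C, crop, crop).
        """
        T_total, C, H, W = video.shape
        starts = torch.linspace(0, max(T_total - clip_len, 0), num_clips).long()

        scores = []
        for s in starts:
            s = int(s)
            clip = video[s:s + clip_len]                      # (clip_len, C, H, W)
            for x0 in (0, (W - crop) // 2, W - crop):         # left / center / right crop
                view = clip[:, :, :crop, x0:x0 + crop]
                logits = model(view.unsqueeze(0))             # (1, num_classes)
                scores.append(logits.softmax(dim=-1))
        return torch.stack(scores).mean(dim=0)                # averaged video-level prediction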
  • 7. Results: Kinetics-400
    - Effect of the shift under different sampling steps (clip length T fixed to 8 frames)
      • Small step: the clip spans too little of the video, so there is little temporal information
      • Large step: frames are too far apart and temporal continuity is lost
      • A sampling step of 32 frames is a suitable choice
    Table 1 (Non-Shift vs TokShift, Accuracy-1, frames × sampling step):
                   8×8      8×16     8×32     8×64
      ViT (video)  76.17    76.32    76.02    75.73
      TokShift     76.90    76.81    77.28    76.60
    - Where to embed the shift module inside the encoder (all variants: 134.7 GFLOPs × 30 views; see the placement sketch below)
      • "Prior residual" works best, unlike TSM on CNNs, where "post residual" is optimal
    Table 2 (integration position, Accuracy-1/5):
      ViT (video)  76.02 / 92.52
      TokShift-A   76.85 / 93.10   (prior layer-norm)
      TokShift-B   77.21 / 92.81   (prior MSA/FFN)
      TokShift-C   77.00 / 92.92   (post MSA/FFN)
      TokShift     77.28 / 92.91   (prior residual)
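For illustration, a minimal encoder block with the shift in the best-performing "prior residual" position could look as follows; it takes the `tok_shift` sketch above as `shift_fn`, runs spatial-only attention per frame, and is a sketch under those assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class TokShiftEncoderBlock(nn.Module):
        """Minimal encoder block with the token shift in the 'prior residual' position."""

        def __init__(self, shift_fn, dim: int = 768, heads: int = 12, mlp_ratio: int = 4):
            super().__init__()
            self.shift_fn = shift_fn            # e.g. the tok_shift function sketched above
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, T, N+1, D); shift before the residual split ("prior residual")
            x = self.shift_fn(x)
            B, T, L, D = x.shape
            tokens = x.reshape(B * T, L, D)     # spatial-only attention, frame by frame
            h = self.norm1(tokens)
            tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
            tokens = tokens + self.mlp(self.norm2(tokens))
            return tokens.reshape(B, T, L, D)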
  • 8. Results: Kinetics-400
    - Comparison of shift variants (what gets shifted; all at 134.7 GFLOPs × 30 views)
      • Shifting patch tokens hurts: rigidly shifting the same spatial grid across time misaligns moving objects
      • Shifting only the [Class] token, which carries global frame-level content, avoids this and performs best
    Table 3 (shifted "words", Accuracy-1/5):
      ViT (video)    none              76.02 / 92.52
      TemporalShift  token + patches   72.88 / 91.24
      PatchShift     patches           73.08 / 91.17
      TokShift       [Class] token     77.28 / 92.91
    - Comparison of the shifted-channel proportion
      • Accuracy improved as the shifted proportion grew, up to 1/4 back + 1/4 forth, i.e. half of the channels shifted (see the small calculation below)
    Table 4 (shifted proportion a/D + c/D, Accuracy-1/5):
      1/4 + 1/4     77.28 / 92.91
      1/8 + 1/8     77.18 / 92.95
      1/16 + 1/16   76.93 / 92.82
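As a small worked example of what these proportions mean in channels, assuming ViT-Base's embedding width D = 768 (not stated on this slide):

    # Channels of the [Class] token moved per direction, for an assumed D = 768.
    D = 768
    for denom in (4, 8, 16):
        per_direction = D // denom
        total = 2 * per_direction          # shifted backward + forward
        print(f"1/{denom} + 1/{denom}: {per_direction} channels each way, "
              f"{total}/{D} shifted, {D - total} untouched")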
  • 9. Results: Kinetics-400
    - Effect of the number of frames and the spatial resolution
      • More frames improve accuracy
      • Higher resolution improves accuracy
      • Both increase the amount of information, i.e. the visual word count T·(N+1) (see the helper below)
    Table 5 (word count vs Accuracy-1, Base-16 backbone):
      224×224, 14²+1 words/frame,  6 frames  (1,182 words)   76.72
      224×224, 14²+1 words/frame,  8 frames  (1,576 words)   77.28
      224×224, 14²+1 words/frame, 10 frames  (1,970 words)   77.56
      224×224, 14²+1 words/frame, 16 frames  (3,152 words)   78.18
      256×256 (MR), 16²+1 words/frame, 8 frames (2,056 words)  77.68
      384×384 (HR), 24²+1 words/frame, 8 frames (4,616 words)  78.14
    - Larger backbones help further (Table 6): TokShift-Large (HR, 12 frames) reaches 80.40, TokShift-Hybrid (R50+Base-16, MR, 16 frames) reaches 78.34
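A small helper reproducing the word counts T·(N+1) in Table 5, assuming a 16-pixel patch size as in ViT-Base/16:

    def word_count(res: int, frames: int, patch: int = 16) -> int:
        """Visual 'words' per clip: T * (N + 1), where N is the number of patch
        tokens per frame and +1 is the [Class] token."""
        n = (res // patch) ** 2
        return frames * (n + 1)

    if __name__ == "__main__":
        for res, t in [(224, 8), (224, 16), (256, 8), (384, 8)]:
            print(f"{res}x{res}, T={t}: {word_count(res, t):,} words")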
  • 10. Results: Kinetics-400
    - Comparison with previous methods
      • A plain ViT (spatial-only) alone already matches 2D+shift CNNs such as TSM
      • TokenShift outperforms existing 3D methods
    Table 7 (selected rows; pretrain / res / frames / GFLOPs×views / params / Acc1 / Acc5):
      I3D (InceptionV1)        ImageNet      224  250  108×NA     12M     71.1   90.3
      TSM (ResNext101)         ImageNet      256    8  NA×10      -       76.3   -
      Non-Local R101           ImageNet      256  128  359×30     54.3M   77.7   93.3
      SlowFast 16×8 (R101+NL)  none          256   32  234×30     59.9M   79.8   93.9
      X3D-XL                   none          356   16  48.4×30    11M     79.1   93.9
      X3D-XXL                  none          NA    NA  194.1×30   20.3M   80.4   94.6
      ViT (Video), Base-16     ImageNet-21k  224    8  134.7×30   85.9M   76.02  92.52
      TokShift, Base-16        ImageNet-21k  224    8  134.7×30   85.9M   77.28  92.91
      TokShift (HR), Base-16   ImageNet-21k  384    8  394.7×30   85.9M   78.14  93.91
      TokShift, Base-16        ImageNet-21k  224   16  269.5×30   85.9M   78.18  93.78
      TokShift-Large (HR)      ImageNet-21k  384   12  2096.4×30  303.4M  80.40  94.45
  • 11. Results: Kinetics-400
    - Comparison with previous methods (this slide repeats the content of slide 10)
  • 12. Results: small-scale datasets
    - Effect of (freezing) Layer Normalization, marked "*" in the tables (see the sketch below)
      • The effect differs depending on the pretraining modality (video or image)
    - TokShift outperforms existing methods on both datasets
    Table 8 (EGTEA Gaze+ split-1, Accuracy-1; "*" = freeze layer-norm):
      TSM (RGB, Kinetics-400 pretrain)             63.45
      SAP (RGB, Kinetics-400, 64 frames)           64.10
      ViT (Video) (RGB, ImageNet-21k)              62.59
      TokShift (RGB, ImageNet-21k)                 62.85
      TokShift* (RGB, Kinetics-400)                64.82
      TokShift*-En (OptFlow+RGB, Kinetics-400)     65.08
      TokShift* (HR, RGB, Kinetics-400)            65.77
      TokShift-Large* (HR, RGB, Kinetics-400)      66.56
    Table 9 (UCF-101 split-1, Accuracy-1; "*" = freeze layer-norm):
      I3D (Kinetics-400)                  84.50
      Two-Stream I3D (Kinetics-400)       93.40
      TSM (Kinetics-400)                  95.90
      ViT (Video) (ImageNet-21k)          91.46
      TokShift (ImageNet-21k)             91.65
      TokShift* (Kinetics-400)            95.35
      TokShift* (HR, Kinetics-400)        96.14
      TokShift-Large* (HR, Kinetics-400)  96.80
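The "*" rows freeze layer normalization during fine-tuning; a minimal PyTorch sketch of how such freezing could be done (an assumption, not the authors' code):

    import torch.nn as nn

    def freeze_layer_norm(model: nn.Module) -> None:
        """Freeze the affine parameters of every LayerNorm so that fine-tuning
        on a small dataset (e.g. EGTEA Gaze+ or UCF-101) does not update them."""
        for module in model.modules():
            if isinstance(module, nn.LayerNorm):
                for p in module.parameters():
                    p.requires_grad_(False)

Only the parameters that still require gradients would then be passed to the optimizer.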
  • 13. Summary
    - Action recognition by shifting the Transformer's intermediate features
      • Proposal of the TokenShift module
      • Pure shift operation
      • Zero additional parameters
      • Zero additional computation
      • Only the [CLASS] token is shifted
      • Exchanges global spatio-temporal information