Survey Paper: Research Trends in Path Prediction Using Deep Learning
Hiroaki Minoura, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi
PRMU Technical Meeting
October 9-10, 2020
Machine Perception & Robotics Group, Chubu University
Path Prediction
2
A technology for predicting the future path of a target
Applications
Autonomous driving
・Accident prevention
・Self-driving control
・Navigation
Robots
Categories of path prediction
3
Bayesian-based / Deep Learning-based / Planning-based
・Bayesian-based (e.g., Kalman Filter): sequentially estimates predictions by updating the future internal state from noise-perturbed observed states
・Deep Learning-based (e.g., LSTM, CNN): learns future behavior from the target's past trajectory; this is the scope of this survey
・Planning-based (e.g., IRL, RRT*): optimizes reward values from start to goal
Elements required for Deep Learning-based path prediction
4
View point
・First-person view
・In-vehicle camera view
・Bird's-eye view
Model
・Long Short-Term Memory
・Convolutional Neural Network
・Gated Recurrent Unit
・Temporal Convolutional Network
!"
!#
!$
!%!&
!'
t=t input layert=t-1 hidden layer
Input Gate
t=t output layert=t+1 hidden layer
Memory Cell
Forget Gate
Output Gate
Context
・Target class
・Inter-target interaction
・Static environment information
Elements required for Deep Learning-based path prediction
5
View point
・First-person view
・In-vehicle camera view
・Bird's-eye view
Context
・Target class
・Inter-target interaction
・Static environment information
Model
・Long Short-Term Memory
・Convolutional Neural Network
・Gated Recurrent Unit
・Temporal Convolutional Network
What is interaction in path prediction?
6
Predicting paths in which moving targets avoid colliding with each other
● Interaction information is computed from the distances and directions between moving targets
Trends and classification of Deep Learning-based prediction methods
7
(Timeline of methods from 2016 to 2020, divided into interaction-aware methods and others)
Social-LSTM
[A. Alahi+, CVPR, 2016]
DESIRE
[N. Lee+, CVPR, 2017]
Conv.Social-Pooling
[N. Deo+, CVPRW, 2018]
SoPhie
[A. Sadeghian+, CVPR, 2019]
Social-BiGAT
[V. Kosaraju+, NeurIPS, 2019]
Social-STGCNN
[A. Mohamedl+, CVPR, 2020]
Social-GAN
[A. Gupta+, CVPR, 2018]
Next
[J. Liang+, CVPR, 2019]
STGAT
[Y. Huang+, ICCV, 2019]
Trajectron
[B. Ivanovic+, ICCV, 2019]
Social-Attention
[A. Vemula+, ICRA, 2018]
Multi-Agent Tensor Fusion
[T. Zhao+, CVPR, 2019]
MX-LSTM
[I. Hasan+, CVPR, 2018]
CIDNN
[Y. Xu+, CVPR, 2018]
SR-LSTM
[P. Zhang+, CVPR, 2019]
Group-LSTM
[N. Bisagno+, CVPR, 2018]
Reciprocal Network
[S. Hao+, CVPR, 2020]
PECNet
[K. Mangalam+, ECCV, 2020]
RSBG
[J. SUN+, CVPR, 2020]
STAR
[C. Yu+, ECCV, 2020]
Behavior CNN
[S. Yi+, ECCV, 2016]
Future localization in first-person videos
[T. Yagi+, CVPR, 2018]
Fast and Furious
[W. Luo+, CVPR, 2018]
OPPU
[A. Bhattacharyya+, CVPR, 2018]
Object Attributes and Semantic Segmentation
[H. Minoura+, VISAPP, 2019]
Rule of the Road
[J. Hong+, CVPR, 2019]
Multiverse
[J. Liang+, CVPR, 2020]
Trajectron++
[T. Salzmann+, ECCV, 2020]
Pooling models / Attention models
In recent years, Attention models and methods that predict multiple (multimodal) paths have become mainstream
Path prediction models using interaction
8
Pooling models / Attention models
・Pooling model: by pooling the positional information of the prediction target and other targets together, collision-avoiding path prediction becomes possible
・Attention model: by computing attention over other targets, it is possible to visually capture whom the prediction target attends to and to what degree
Purpose of this survey
9
A survey of trends in Deep Learning-based path prediction methods
● Summarize the characteristics of the path prediction methods in each category
- With interaction
• Pooling models
• Attention models
- Without interaction (Other)
● Also introduce datasets and evaluation metrics for quantitative evaluation
● Discuss the accuracy and prediction results of each model using representative models
Purpose of this survey
10
A survey of trends in Deep Learning-based path prediction methods
● Summarize the characteristics of the path prediction methods in each category
- With interaction
• Pooling models
• Attention models
- Without interaction (Other)
● Also introduce datasets and evaluation metrics for quantitative evaluation
● Discuss the accuracy and prediction results of each model using representative models
Social LSTM [A. Alahi+, CVPR, 2016]
11
Simultaneously predicts the movement paths of multiple pedestrians
● Proposes the Social-Pooling layer (S-Pooling) to avoid collisions between pedestrians (a minimal sketch follows below)
- Takes as input the positions and hidden-state outputs of other targets around the prediction target
- The spatial relationships between pedestrians are retained in the LSTM's internal state at the next time step
- Enables collision-avoiding path prediction
A. Alahi, et al., “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” CVPR, 2016.
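The snippet below is a minimal sketch (not the authors' implementation) of how an S-Pooling-style grid can be built: neighbors' LSTM hidden states are summed into the cells of a grid centered on the prediction target. The grid size, neighborhood size, and all names are illustrative assumptions.

```python
# Hedged sketch of a Social-Pooling-style grid (in the spirit of Social LSTM).
import torch

def social_pooling(positions, hidden, grid_size=4, neighborhood=2.0):
    """positions: (N, 2) current xy of N pedestrians, hidden: (N, D) LSTM hidden states.
    Returns (N, grid_size * grid_size * D) pooled vectors."""
    n, d = hidden.shape
    cell = neighborhood / grid_size
    pooled = torch.zeros(n, grid_size, grid_size, d)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rel = positions[j] - positions[i]                 # neighbor j relative to target i
            gx = int((rel[0] + neighborhood / 2) // cell)
            gy = int((rel[1] + neighborhood / 2) // cell)
            if 0 <= gx < grid_size and 0 <= gy < grid_size:
                pooled[i, gx, gy] += hidden[j]                # sum neighbors falling in the cell
    return pooled.view(n, -1)

# Example: 3 pedestrians, hidden size 8 -> torch.Size([3, 128])
print(social_pooling(torch.randn(3, 2), torch.randn(3, 8)).shape)
```

The flattened pooled vector would then be concatenated with the target's own embedding as input to its LSTM at the next time step.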
DESIRE [N. Lee+, CVPR, 2017]
12
Considers surrounding environmental information in addition to inter-target interaction
● Achieves path prediction that avoids obstacle regions such as intersections and road edges
● Encoding with a CVAE makes it possible to predict multiple paths (a minimal sampling sketch follows below)
The Ranking & Refinement Module ranks the predicted paths
● Aims to improve prediction accuracy by iteratively refining the paths
[Figure: DESIRE overview — multiple plausible samples are generated by a CVAE-based RNN encoder-decoder (Sample Generation Module), then scored and regressed by the Ranking & Refinement Module with iterative feedback. Qualitative KITTI results compare Top-1/Top-10 predictions with and without interaction against the ground truth.]
N. Lee, et al., “DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents,” CVPR, 2017.
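Below is a hedged sketch of the CVAE-style multi-sample generation idea behind the Sample Generation Module: at test time a latent code is drawn from the prior and decoded into a candidate future, repeated K times. Layer sizes, module names, and the simple context-repeating decoder are assumptions for illustration; the actual model also conditions on scene features and refines samples iteratively.

```python
# Hedged sketch of CVAE-style multi-sample trajectory generation.
import torch
import torch.nn as nn

class CVAESampler(nn.Module):
    def __init__(self, hid=64, z_dim=16, horizon=12):
        super().__init__()
        self.enc_x = nn.GRU(2, hid, batch_first=True)      # encode observed track X
        self.dec = nn.GRU(hid + z_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, 2)
        self.horizon, self.z_dim = horizon, z_dim

    def forward(self, past, k=5):
        """past: (B, T_obs, 2). Returns k sampled futures, each (B, horizon, 2)."""
        _, h = self.enc_x(past)                             # h: (1, B, hid)
        samples = []
        for _ in range(k):
            z = torch.randn(past.size(0), self.z_dim)       # test-time prior sample
            step = torch.cat([h[0], z], dim=-1).unsqueeze(1)
            seq = step.repeat(1, self.horizon, 1)           # feed the same context at every step
            y, _ = self.dec(seq)
            samples.append(self.out(y))
        return samples

preds = CVAESampler()(torch.randn(4, 8, 2), k=3)
print(len(preds), preds[0].shape)   # 3 torch.Size([4, 12, 2])
```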
Convolutional Social-Pooling [N. Deo+, CVPRW, 2018]
13
A prediction method that considers interactions between neighboring vehicles on a highway
● Proposes Convolutional Social Pooling, which gives the interaction information a spatial meaning (a minimal sketch follows below)
- Trajectory features obtained from an LSTM Encoder are placed into a fixed-size Social Tensor
- A CNN computes interaction features from the tensor
- These are concatenated with the predicted vehicle's features, and an LSTM Decoder predicts the path
[Figure: an LSTM encoder with shared weights encodes vehicle track histories, convolutional social pooling layers learn the spatial interdependencies of the tracks, and a maneuver-based decoder outputs a multi-modal predictive distribution over future motion.]
N. Deo, et al., “Convolutional Social Pooling for Vehicle Trajectory Prediction,” CVPRW, 2018.
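A minimal sketch of the Social Tensor plus CNN step described above, under assumed grid dimensions (13×3 cells) and feature sizes; it is not the authors' code.

```python
# Hedged sketch of Convolutional Social Pooling: encoder states of neighboring
# vehicles are placed on a fixed spatial grid and convolved.
import torch
import torch.nn as nn

def build_social_tensor(grid_cells, feats, rows=13, cols=3, dim=32):
    """grid_cells: list of (row, col) occupied cells, feats: (N, dim) encoder states."""
    tensor = torch.zeros(dim, rows, cols)
    for (r, c), f in zip(grid_cells, feats):
        tensor[:, r, c] = f                     # place each neighbor's feature in its cell
    return tensor.unsqueeze(0)                  # (1, dim, rows, cols)

social_conv = nn.Sequential(                    # interaction features from the social tensor
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveMaxPool2d((3, 1)), nn.Flatten(),
)

st = build_social_tensor([(0, 1), (6, 0)], torch.randn(2, 32))
print(social_conv(st).shape)                    # torch.Size([1, 48])
```

The resulting interaction vector would be concatenated with the target vehicle's encoder state before the LSTM decoder.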
MX-LSTM [I. Hasan+, CVPR, 2018]
14
A path prediction method that exploits pedestrian gaze information
● Only other targets inside the view frustum centered on the head are pooled
- The targets to pool are selected from the prediction target's head orientation and the distances to other targets
● The trajectory, head orientation, and interaction information are fed into an LSTM
- Achieves path prediction that avoids collisions with other targets inside the view frustum
- By changing the gaze information arbitrarily, paths heading in arbitrary directions can be predicted
I. Hasan, et al., “MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses,” CVPR, 2018.
Group-LSTM [N. Bisagno+, ECCVW, 2018]
15
A path prediction method that considers group-level interaction
● Pedestrians with similar motion tendencies are treated as a group
● Only individuals outside the group the prediction target belongs to are pooled
- Predicts paths that avoid collisions with other groups
N. Bisagno, et al., “Group LSTM: Group Trajectory Prediction in Crowded Scenarios,” ECCVW, 2018.
Social-GAN [A. Gupta+, CVPR, 2018]
16
A method that predicts multiple paths using a GAN (a minimal sketch of the adversarial setup follows below)
● Generator: samples multiple predicted paths
- A Pooling Module outputs interaction information from the LSTM Encoder features
- Each output is concatenated with a noise vector, and an LSTM Decoder outputs multiple future predicted paths
● Discriminator: distinguishes predicted paths from real paths
- Adversarial training is expected to make the generator produce predicted paths that pass for real ones
A. Gupta, et al., “Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks,” CVPR, 2018.
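A hedged sketch of the generator/discriminator setup described above, with the Pooling Module omitted for brevity; sizes and names are illustrative assumptions.

```python
# Hedged sketch of an adversarial trajectory predictor: the generator decodes
# futures from encoded history plus noise, the discriminator scores real vs. fake.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, hid=32, noise=8, horizon=12):
        super().__init__()
        self.enc = nn.LSTM(2, hid, batch_first=True)
        self.dec = nn.LSTM(hid + noise, hid, batch_first=True)
        self.out = nn.Linear(hid, 2)
        self.horizon, self.noise = horizon, noise

    def forward(self, past):
        _, (h, _) = self.enc(past)
        z = torch.randn(past.size(0), self.noise)                 # noise vector
        ctx = torch.cat([h[0], z], -1).unsqueeze(1).repeat(1, self.horizon, 1)
        y, _ = self.dec(ctx)
        return self.out(y)                                        # (B, horizon, 2)

class Discriminator(nn.Module):
    def __init__(self, hid=32):
        super().__init__()
        self.enc = nn.LSTM(2, hid, batch_first=True)
        self.cls = nn.Linear(hid, 1)

    def forward(self, traj):                                      # full trajectory (obs + future)
        _, (h, _) = self.enc(traj)
        return torch.sigmoid(self.cls(h[0]))                      # probability the path is real

past = torch.randn(4, 8, 2)
fake = Generator()(past)
score = Discriminator()(torch.cat([past, fake], dim=1))
print(fake.shape, score.shape)                                    # (4, 12, 2) (4, 1)
```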
Multi-Agent Tensor Fusion [T. Zhao+, CVPR, 2019]
17
A prediction method that considers interactions among different kinds of moving targets such as pedestrians and cars
● Jointly models scene context in addition to interaction
- Can predict paths that avoid collisions with both dynamic and static objects
● Multi-Agent Tensor Fusion
- A CNN extracts scene context information
- LSTM outputs are placed into a spatial grid according to each moving target's position
- The context information and the spatial grid are concatenated channel-wise and fused by a CNN
- An LSTM Decoder predicts paths from the fused features
(Qualitative results on the Stanford Drone dataset: input, ground truth, and predictions)
T. Zhao, et al., “Multi-Agent Tensor Fusion for Contextual Trajectory Prediction,” CVPR, 2019.
Reciprocal Network [S. Hao+, CVPR, 2020]
18
A path prediction method based on reciprocal learning, coupling two networks
● Forward Prediction Network: the usual trajectory prediction direction (observation → prediction)
● Backward Prediction Network: the reverse of the usual direction (prediction → observation)
A model based on the concept of adversarial attacks is built on top of the reciprocal constraint
● The input trajectory is modified iteratively
● By forcing consistency with the model's output, a model based on a new concept called the reciprocal attack is developed
S. Hao, et al., “Reciprocal Learning Networks for Human Trajectory Prediction,” CVPR, 2020.
PECNet [K. Mangalam+, ECCV, 2020]
19
Predicted Endpoint Conditioned Network (PECNet)
● A path prediction method whose training emphasizes the predicted final point (the endpoint); a minimal sketch follows below
- Dlatent predicts the endpoint, which is concatenated with the Past Encoding output (concat encoding)
- Parameter features inside Social Pooling are obtained from the concatenated features
- A pedestrian × pedestrian Social Mask computes the interaction between pedestrians
- Pfuture predicts the path from the concat encoding and the interaction information
K. Mangalam, et al., “It is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction,” ECCV, 2020.
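A minimal sketch of endpoint conditioning: a head predicts the endpoint from the encoded past, and the path decoder is conditioned on both. The module names endpoint_head and path_head are hypothetical, and the endpoint VAE and social pooling of the real model are omitted.

```python
# Hedged sketch of endpoint-conditioned path prediction.
import torch
import torch.nn as nn

class EndpointConditioned(nn.Module):
    def __init__(self, hid=64, obs_len=8, horizon=12):
        super().__init__()
        self.past_enc = nn.Sequential(nn.Linear(obs_len * 2, hid), nn.ReLU())
        self.endpoint_head = nn.Linear(hid, 2)               # predict the final destination
        self.path_head = nn.Linear(hid + 2, horizon * 2)     # decode the path given the endpoint
        self.horizon = horizon

    def forward(self, past):                                  # past: (B, obs_len, 2)
        h = self.past_enc(past.flatten(1))
        endpoint = self.endpoint_head(h)
        path = self.path_head(torch.cat([h, endpoint], -1))
        return endpoint, path.view(-1, self.horizon, 2)

ep, path = EndpointConditioned()(torch.randn(4, 8, 2))
print(ep.shape, path.shape)       # torch.Size([4, 2]) torch.Size([4, 12, 2])
```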
Purpose of this survey
20
A survey of trends in Deep Learning-based path prediction methods
● Summarize the characteristics of the path prediction methods in each category
- With interaction
• Pooling models
• Attention models
- Without interaction (Other)
● Also introduce datasets and evaluation metrics for quantitative evaluation
● Discuss the accuracy and prediction results of each model using representative models
Social-Attention [A. Vemula+, ICRA, 2018]
21
A path prediction method that extends a graph structure along the spatio-temporal direction
● Node: the target's positional information
● Edge: spatial information between targets, and each target's own information propagated along the temporal direction
● Attention is computed from Nodes and Edges to derive which targets to attend to
- Enables path prediction that avoids the attended targets
- Computing attention also provides a visual explanation
A. Vemula, et al., “Social Attention: Modeling Attention in Human Crowds,” ICRA, 2018.
CIDNN [Y. Xu+, CVPR, 2018]
22
Estimates the degree of risk posed by each moving target's behavior with attention and weights the behavior features accordingly
● The Motion Encoder Module encodes each target's motion
● The Location Encoder Module encodes each target's location
- The inner product between the prediction target and every other target is computed, and a Softmax weights the other targets' features (a minimal sketch follows below)
● The outputs of the two modules are combined to predict the path from the next time step onward
Y. Xu, et al., “Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction,” CVPR, 2018.
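A minimal sketch of the inner-product weighting described above: location embeddings give softmax attention weights that re-weight each agent's motion features. Shapes and names are illustrative assumptions.

```python
# Hedged sketch of CIDNN-style crowd-interaction weighting.
import torch
import torch.nn.functional as F

def crowd_interaction(loc_emb, motion_feat, target_idx):
    """loc_emb: (N, D) location embeddings, motion_feat: (N, D) motion features."""
    scores = loc_emb @ loc_emb[target_idx]                 # inner product with the target: (N,)
    weights = F.softmax(scores, dim=0)                     # attention over all agents
    return (weights.unsqueeze(1) * motion_feat).sum(0)     # weighted sum of motion features

ctx = crowd_interaction(torch.randn(5, 16), torch.randn(5, 16), target_idx=0)
print(ctx.shape)                                           # torch.Size([16])
```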
SR-LSTM [P. Zhang+, CVPR, 2019]
23
Updates the prediction target's future predicted path using the interaction information at the current time step
● Two mechanisms inside the States Refinement module achieve high-accuracy path prediction
- Pedestrian-aware attention (PA), which prevents collisions with other targets
- Motion gate (MG), by which the prediction target selects its path from the other targets' motions
● The MG selects a path based on the motion of targets that are likely to collide
● The PA focuses on the other targets near the prediction target
P. Zhang, et al., “SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction,” CVPR, 2019.
Next [J. Liang+, CVPR, 2019]
24
Proposes a model that jointly predicts future paths and activities
● Person Behavior Module: encodes the pedestrian's appearance and skeleton information
● Person Interaction Module: encodes the surrounding static environment and object information such as cars
● Visual Feature Tensor Q: encodes the above two features together with the past trajectory information
● Trajectory Generator: predicts future paths
● Activity Prediction: predicts the activity at the final predicted time step
J. Liang, et al., “Peeking into the Future: Predicting Future Person Activities and Locations in Videos,” CVPR, 2019.
SoPhie [A. Sadeghian+, CVPR, 2019]
25
A prediction method that considers static environment information in addition to pedestrian-pedestrian interaction
● Physical Attention: estimates attention over the static environment
● Social Attention: estimates attention over dynamic objects
Future paths are predicted from the two attentions and the LSTM Encoder outputs
A. Sadeghian, et al., “SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints,” CVPR, 2019.
STGAT [Y. Huang+, ICCV, 2019]
26
A prediction method that propagates interaction along the temporal direction
Applies a Graph Attention Network (GAT) to model interaction (a single-head sketch follows below)
● GAT: attention-based graph convolutional networks that incorporate a graph structure
- The importance of the relations to every other target in the scene is learned with an attention mechanism
● Propagating the GAT features along the temporal direction captures spatio-temporal interaction
- Information about targets that might collide can be derived from their past paths
Y. Huang, et al., “STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction,” ICCV, 2019.
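A minimal single-head graph-attention layer of the kind applied at each time step, before a second LSTM propagates the aggregated features through time; this is a simplified sketch, not the paper's implementation.

```python
# Hedged sketch of a single-head graph attention layer over a fully connected scene graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h):                            # h: (N, dim) node features
        n = h.size(0)
        wh = self.w(h)
        pairs = torch.cat([wh.unsqueeze(1).expand(n, n, -1),
                           wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        att = F.softmax(F.leaky_relu(self.a(pairs)).squeeze(-1), dim=1)  # (N, N) weights
        return att @ wh                              # attention-weighted neighbor aggregation

print(GraphAttention()(torch.randn(6, 32)).shape)    # torch.Size([6, 32])
```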
Trajectron [B. Ivanovic+, ICCV, 2019]
27
Efficiently models multiple targets with a dynamic graph structure
● NHE (Node History Encoder): feeds the node features at observed time steps into an LSTM
● NFE (Node Future Encoder): applies a BiLSTM to encode the node's ground-truth future trajectory during training
● EE (Edge Encoder): computes attention over all targets within a given range
- Obtains the most important edge information
- The edge information changes at every time step
● A Decoder predicts paths from these features
- An internal CVAE predicts multimodal paths
- A Gaussian Mixture Model refines the predicted paths
Trajectron++ [T. Salzmann+, ECCV, 2020], which additionally incorporates environment information, has also been proposed
T. Salzmann, et al., “Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data,” ECCV, 2020.
B. Ivanovic, et al., “The Trajectron: Probabilistic Multi-Agent Trajectory Modeling with Dynamic Spatiotemporal Graphs,” ICCV, 2019.
Social-BiGAT [V. Kosaraju+, NeurIPS, 2019]
28
Simply adding a noise vector yields predicted paths with excessively high variance
● Existing work does not truly learn a multimodal distribution
Learns a latent representation between the predicted path and the noise vector
● The predicted path generated from a noise vector is fed back into an LSTM Encoder
● It is mapped so as to match the original noise vector
● This makes it possible to generate truly multimodal paths
V. Kosaraju, et al., “Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks,” NeurIPS, 2019.
Social-STGCNN [A. Mohamed+, CVPR, 2020]
29
Models the scene with a spatial-temporal graph
● A Graph Convolutional Network (GCN) extracts interaction features
- Interaction information is derived from an adjacency matrix (a minimal sketch follows below)
● A Temporal Convolutional Network (TCN) outputs the predictive distribution from the GCN features
- Whereas an LSTM outputs predicted paths step by step, a TCN outputs them in parallel
- This greatly improves inference speed
A. Mohamed, et al., “Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction,” CVPR, 2020.
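A sketch of one graph-convolution step with a distance-based adjacency matrix, as a rough stand-in for the spatial part of the ST-GCNN block; the kernel, normalization, and temporal stacking of the real model are simplified assumptions.

```python
# Hedged sketch of a GCN step over a distance-based adjacency matrix.
import torch
import torch.nn as nn

def distance_adjacency(pos, eps=1e-6):
    """pos: (N, 2). A_ij = 1 / ||p_i - p_j||, zero diagonal, row-normalized."""
    dist = torch.cdist(pos, pos)
    adj = 1.0 / (dist + eps)
    adj.fill_diagonal_(0.0)
    return adj / adj.sum(dim=1, keepdim=True)

class GCNLayer(nn.Module):
    def __init__(self, in_dim=2, out_dim=16):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                        # x: (N, in_dim)
        return torch.relu(adj @ self.lin(x))          # aggregate neighbors via adjacency

pos = torch.randn(5, 2)
print(GCNLayer()(pos, distance_adjacency(pos)).shape)  # torch.Size([5, 16])
```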
RSBG [J. Sun+, CVPR, 2020]
30
Models group-based interaction by examining the relations between pedestrians
● Groups: pedestrians heading to the same destination, distant pedestrians moving in the same direction, etc.
● To identify groups, humans annotate the group information
The Relational Social Representation computes the group interaction
● It is concatenated with the surrounding static environment information and the past trajectories to predict paths
J. Sun, et al., “Recursive Social Behavior Graph for Trajectory Prediction,” CVPR, 2020.
STAR [C. Yu+, ECCV, 2020]
31
LSTMs used for path prediction have two problems
● LSTMs struggle to model complex temporal dependencies
● Attention-based prediction methods cannot fully model interaction
Extends the Transformer to spatio-temporal attention and applies it to the path prediction task (a minimal sketch follows below)
● A Temporal Transformer encodes the trajectory features
● A Spatial Transformer extracts interaction independently at each time step
● Using the two Transformers greatly improves prediction accuracy over LSTM-based prediction methods
C. Yu, et al., “Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction,” ECCV, 2020.
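A sketch of interleaving a temporal Transformer (per pedestrian, over time) with a spatial Transformer (per time step, over pedestrians), using PyTorch's built-in encoder layers; all dimensions and the single-pass layout are illustrative assumptions.

```python
# Hedged sketch of temporal + spatial Transformer encoding over trajectories.
import torch
import torch.nn as nn

d = 32
embed = nn.Linear(2, d)
temporal = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
spatial = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)

traj = torch.randn(5, 8, 2)                 # (N pedestrians, T observed steps, xy)
h = embed(traj)                             # (N, T, d)
h = temporal(h)                             # attention over time, each pedestrian independently
h = spatial(h.transpose(0, 1))              # (T, N, d): attention over pedestrians per time step
print(h.shape)                              # torch.Size([8, 5, 32])
```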
Purpose of this survey
32
A survey of trends in Deep Learning-based path prediction methods
● Summarize the characteristics of the path prediction methods in each category
- With interaction
• Pooling models
• Attention models
- Without interaction (Other)
● Also introduce datasets and evaluation metrics for quantitative evaluation
● Discuss the accuracy and prediction results of each model using representative models
Behavior-CNN [S. Yi+, ECCV, 2016]
33
A prediction method using a CNN (a minimal encoding sketch follows below)
● Past trajectory information is encoded into a sparse voxel representation
● Convolution and max-pooling are applied several times, and deconvolution outputs the predicted paths
● A learnable location bias map combines scene-specific information with the latent feature maps (added channel-wise in the original paper)
- This accounts for pedestrian behavior that changes with the specific scene
S. Yi, et al., “Pedestrian Behavior Understanding and Prediction with Deep Neural Networks,” ECCV, 2016.
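A sketch of encoding past paths into a displacement volume and passing it through a small encoder-decoder CNN; the grid resolution, channel counts, and the omission of the location bias map are simplifying assumptions.

```python
# Hedged sketch of a displacement-volume encoding plus a small conv/deconv network.
import torch
import torch.nn as nn

def encode_displacement(tracks, H=64, W=64, M=4):
    """tracks: list of (M+1, 2) pixel paths. Returns a (1, 2M, H, W) volume whose
    2M channels at each pedestrian's current cell hold its M past displacements."""
    vol = torch.zeros(1, 2 * M, H, W)
    for tr in tracks:
        x, y = int(tr[-1, 0]), int(tr[-1, 1])             # current location
        disp = (tr[1:] - tr[:-1]).flatten()               # M past displacement vectors
        vol[0, :, y, x] = disp
    return vol

net = nn.Sequential(
    nn.Conv2d(8, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 8, 2, stride=2),               # back to a 2M-channel prediction volume
)

track = torch.tensor([[10., 12.], [11., 13.], [12., 14.], [13., 15.], [14., 16.]])
print(net(encode_displacement([track])).shape)            # torch.Size([1, 8, 64, 64])
```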
Future localization in first-person videos [T. Yagi+, CVPR, 2018]
34
Location prediction for pedestrians encountered in first-person view
● Uses cues specific to the first-person view for location prediction
1. Ego-motion, which affects the position of the encountered pedestrian
2. The scale of the encountered pedestrian
3. The pose of the encountered pedestrian
● A multi-stream model using these three cues predicts future locations
T. Yagi, et al., “Future Person Localization in First-Person Videos,” CVPR, 2018.
OPPU [A. Bhattacharyya+, CVPR, 2018]
35
A method that predicts the future location of pedestrians seen in on-board camera footage (a two-stream sketch follows below)
● Input: the pedestrian's bounding boxes, the ego-vehicle's motion, and the on-board camera images
● Output: the pedestrian's future bounding boxes
A. Bhattacharyya, et al., “Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty,” CVPR, 2018.
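A sketch of a two-stream recurrent predictor in this spirit: one GRU encodes past bounding boxes, another encodes ego-motion, and the fused features regress future boxes. The sizes, the odometry format, and the deterministic output head are assumptions (the paper predicts distributions over future boxes).

```python
# Hedged sketch of a two-stream (bounding box + ego-motion) box predictor.
import torch
import torch.nn as nn

class TwoStreamPredictor(nn.Module):
    def __init__(self, hid=32, horizon=10):
        super().__init__()
        self.box_rnn = nn.GRU(4, hid, batch_first=True)   # (x, y, w, h) history
        self.ego_rnn = nn.GRU(2, hid, batch_first=True)   # assumed ego-motion features
        self.head = nn.Linear(2 * hid, horizon * 4)
        self.horizon = horizon

    def forward(self, boxes, ego):
        _, hb = self.box_rnn(boxes)
        _, he = self.ego_rnn(ego)
        fused = torch.cat([hb[0], he[0]], dim=-1)
        return self.head(fused).view(-1, self.horizon, 4)  # future bounding boxes

pred = TwoStreamPredictor()(torch.randn(2, 8, 4), torch.randn(2, 8, 2))
print(pred.shape)   # torch.Size([2, 10, 4])
```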
Fast and Furious [W. Luo+, CVPR, 2018]
36
Proposes a model that jointly performs vehicle detection, tracking, and path prediction
● Uses 3D point cloud data as input
- A sparse feature representation in 3D space
- Keeps the network's computational cost low
- The three tasks can be computed simultaneously in real time
W. Luo, et al., “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net,” CVPR, 2018.
Object Attributes and Semantic Environment [H. Minoura+, VISAPP, 2019]
37
Treats different kinds of moving objects as attributes and predicts attribute-specific paths
● Considers the latent characteristics of different moving objects
- Pedestrians: walk on sidewalks and roadways
- Cars: drive on roadways
● Because attributes are represented as one-hot vectors, there is no need to build a separate model per attribute
- Keeps the computational cost down
● Scene labels are used to predict the characteristic paths of each attribute
H. Minoura, et al., “Path predictions using object attributes and semantic environment,” VISAPP, 2019.
Rules of the Road [J. Hong+, CVPR, 2019]
38
A path prediction method for vehicles on ordinary roads
● Proposes a CNN + GRU model that accounts for different scene contexts (a minimal sketch follows below)
- The positions of the predicted vehicle and other vehicles, the context, and road information are concatenated along the channel dimension
- The concatenated tensor is encoded by a CNN at each time step
- The encoded features are propagated through time by a GRU to predict future paths
(Figure: example scene with the predicted vehicle's position, other vehicles' positions, context, and road information as input channels)
J. Hong, et al., “Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic Interactions,” CVPR, 2019.
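A sketch of rasterized per-frame encoding with a CNN followed by a GRU over time, as described above; the channel layout (target, others, context, road) and all sizes are illustrative assumptions.

```python
# Hedged sketch of per-frame CNN encoding of rasterized scene channels, with a GRU over time.
import torch
import torch.nn as nn

frame_cnn = nn.Sequential(
    nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(),   # 4 channels: target, others, context, road
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
temporal_gru = nn.GRU(32, 64, batch_first=True)
traj_head = nn.Linear(64, 12 * 2)                           # 12 future (x, y) steps

frames = torch.randn(2, 5, 4, 64, 64)                       # (batch, T_obs, channels, H, W)
feats = torch.stack([frame_cnn(frames[:, t]) for t in range(frames.size(1))], dim=1)
_, h = temporal_gru(feats)
print(traj_head(h[0]).view(2, 12, 2).shape)                  # torch.Size([2, 12, 2])
```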
Multiverse [J. Liang+, CVPR, 2020]
39
Proposes Multiverse, which predicts multiple plausible paths
● Past semantic labels and a fixed grid are fed into the History Encoder (HE)
- A convolutional recurrent neural network encodes spatio-temporal features
- Using semantic labels as input makes the model more robust to domain shift
● The HE output and the last observed semantic label are fed into the Coarse Location Decoder (CLD)
- A GAT weights each grid cell to produce a weight map
● The Fine Location Decoder (FLD) stores a distance vector in each grid cell
- As in the CLD, a GAT weights each cell
● Multiple paths are predicted from the FLD and the CLD
J. Liang, et al., “The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction,” CVPR, 2020.
Purpose of this survey
40
A survey of trends in Deep Learning-based path prediction methods
● Summarize the characteristics of the path prediction methods in each category
- With interaction
• Pooling models
• Attention models
- Without interaction (Other)
● Also introduce datasets and evaluation metrics for quantitative evaluation
● Discuss the accuracy and prediction results of each model using representative models
The most commonly used datasets in path prediction
41
ETH Dataset
・A dataset of pedestrians filmed in urban areas
・Number of samples: 786
・Number of scenes: 2
・Target types: pedestrian
UCY Dataset
・A dataset of pedestrians filmed in urban areas
・Number of samples: 750
・Number of scenes: 3
・Target types: pedestrian
Stanford Drone Dataset
・A dataset filmed on the Stanford University campus
・Number of samples: 10,300
・Number of scenes: 8
・Target types: pedestrian, car, cyclist, bus, skater, cart
R. Alexandre, et al., “Learning Social Etiquette: Human Trajectory Understanding in Crowded Scenes,” ECCV, 2016.
A. Lerner, et al., “Crowds by Example,” CGF, 2007.
S. Pellegrini, et al., “You’ll Never Walk Alone: Modeling Social Behavior for Multi-target Tracking,” ICCV, 2009.
Bird's-eye-view datasets
42
Filmed with fixed cameras or drones
● Because abundant data can be collected, these datasets are very large-scale
Argoverse Dataset
inD Dataset
ysis. The red trajectories are single-future method predictions and the yellow-orange heatmaps are
ctions. The yellow trajectories are observations and the green ones are ground truth multi-future
etails.
Method Single-Future Multi-Future
Our full model 18.51 / 35.84 166.1 / 329.5
The Forking Paths Dataset
Figure 7: Example output of the motion prediction solution
supplied as part of the software development kit. A convo-
lution neural network takes rasterised scenes around nearby
vehicles as input, and predicts their future motion.
ity and multi-threading to make it suitable for distributed
machine learning.
Customisable scene visualisation and rasterisation.
We provide several functions to visualise and rasterise
Lyft Level 5 Dataset
• サンプル数:300K
• シーン数:113
• 対象種類
• 追加情報
- car
- 車線情報,地図データ,センサー情報
• サンプル数:13K
• シーン数:4
• 対象種類
- pedestrian, car, cyclist
• サンプル数:3B
• シーン数:170,000
• 対象種類
• 追加情報
- pedestrian, car, cyclist
- 航空情報

- セマンティックラベル
• サンプル数:0.7K
• シーン数:7
• 対象種類
• 追加情報
- pedestrian
- 複数経路情報

- セマンティックラベル
一般道を撮影したデータセット 交差点を撮影したデータセット
一般道を撮影したデータセット シミュレータで作成されたデータセット
J. Houston, et al., “One Thousand and One Hours: Self-driving Motion Prediction Dataset,” CoRR, 2020.
J. Bock, et al., “The inD Dataset: A Drone Dataset of Naturalistic Road User Trajectories at German Intersections,” CoRR, 2019.
M.F. Chang, et al., “Argoverse: 3D Tracking and Forecasting with Rich Maps,” CVPR, 2019.
J. Liang, et al., “The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction,” CVPR, 2020.
自動車前方の移動対象の経路予測を目的
車載カメラ視点のデータセット
43
Apolloscape Dataset:一般道を撮影したデータセット
• サンプル数:81K
• シーン数:100,000
• 対象種類:pedestrian, car, cyclist
PIE Dataset:一般道を撮影したデータセット
• サンプル数:1.8K
• シーン数:53
• 対象種類:pedestrian
• 追加情報:車両情報,インフラストラクチャ
TITAN Dataset:一般道を撮影したデータセット
• サンプル数:645K
• シーン数:700
• 対象種類:pedestrian, car, cyclist
• 追加情報:行動ラベル,歩行者の年齢
Y. Ma, et al., “TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents,” AAAI, 2019.
A. Rasouli, et al., “PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction, ” ICCV, 2019.
S. Malla, et al., “TITAN: Future Forecast using Action Priors,” CVPR, 2020.
前方の歩行者の経路予測を目的
●
被験者にウェアラブルカメラを装着
1人称視点のデータセット
44
First-Person Locomotion Dataset:歩道を撮影したデータセット
• サンプル数:5K
• シーン数:87
• 対象種類:pedestrian
• 追加情報:姿勢情報,エゴモーション
T. Yagi, et al., “Future Person Localization in First-Person Videos,” CVPR, 2018.
予測された矩形領域と真の矩形領域の中心座標で評価
●
車載カメラ映像における経路予測で利用
●
矩形領域の重なり率からF値で評価もできる
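参考として,矩形領域の中心座標から画素単位のMSEを計算する最小スケッチを以下に示す(矩形を (x1, y1, x2, y2) 形式と仮定したこちらで用意した例であり,特定データセットの公式実装ではない).
```python
# バウンディングボックス中心座標による MSE(画素単位)の最小スケッチ
import numpy as np

def center_mse(pred_boxes, gt_boxes):
    # pred_boxes, gt_boxes: (T_pred, 4) = (x1, y1, x2, y2) 形式を仮定
    pred_c = (pred_boxes[:, :2] + pred_boxes[:, 2:]) / 2.0   # 予測矩形の中心
    gt_c = (gt_boxes[:, :2] + gt_boxes[:, 2:]) / 2.0         # 真値矩形の中心
    return float(((pred_c - gt_c) ** 2).sum(axis=1).mean())  # 中心座標の二乗誤差の平均
```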
評価指標
45
Displacement Error / Mean Square Error / Negative log-likelihood / Collision rate
(a) Average Displacement Error (ADE) (b) Final Displacement Error (FDE)
真値と予測値とのユークリッド距離誤差
●
Average Displacement Error (ADE):予測時刻間の平均誤差
●
Final Displacement Error (FDE):予測最終時刻の誤差
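ADE/FDEの定義を確認するための最小スケッチを示す(座標系列の形状などはこちらで仮定したもの).
```python
# ADE / FDE の計算例:真値・予測値とも (T_pred, 2) の座標系列を仮定
import numpy as np

def ade_fde(pred, gt):
    dist = np.linalg.norm(pred - gt, axis=-1)  # 各予測時刻のユークリッド距離 (T_pred,)
    return dist.mean(), dist[-1]               # ADE: 時刻平均, FDE: 最終時刻の誤差

# 使用例:4.8 秒 (12 フレーム) の予測を評価
gt = np.cumsum(np.full((12, 2), 0.4), axis=0)
pred = gt + np.random.normal(scale=0.1, size=gt.shape)
ade, fde = ade_fde(pred, gt)
```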
[図: 各予測時刻でカーネル密度推定を行い,真値の対数尤度を時刻平均する確率的評価の模式図]
推定した分布の元での真値の対数尤度の期待値
●
ADEとFDEで複数経路を評価するのはマルチモーダル性を無視
●
Negative log-likelihoodで複数経路の予測の評価指標として利用
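複数サンプルから推定した分布の下で真値のNegative log-likelihoodを求める処理の最小スケッチを示す(各時刻でカーネル密度推定を行う方式を仮定した例であり,特定手法の公式実装ではない).
```python
# 各予測時刻でカーネル密度推定を行い,真値の負の対数尤度を時刻平均する最小スケッチ
import numpy as np
from scipy.stats import gaussian_kde

def mean_nll(samples, gt):
    # samples: (K, T_pred, 2) 生成モデルからの K 本の予測経路, gt: (T_pred, 2) 真値
    nll = []
    for t in range(gt.shape[0]):
        kde = gaussian_kde(samples[:, t, :].T)  # その時刻の予測分布を 2 次元 KDE で推定
        nll.append(-kde.logpdf(gt[t])[0])       # 真値の負の対数尤度
    return float(np.mean(nll))
```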
従来の評価指標:真値と予測値のL2 normに基づく2つのDisplacement Errorで評価
提案する評価指標:非線形経路のDisplacement Errorと,2つの物体との衝突率で評価
Displacement Errorは全サンプルに対し平均を求める
●
インタラクション情報がどの予測経路に効果的か評価できない
予測値が各物体と衝突したか否かの衝突率で評価
●
動的物体:映像中の他対象
●
静的物体:建物や木などの障害物
動的物体 静的物体
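衝突率の計算イメージとして,動的物体(他対象との距離がしきい値未満)と静的物体(障害物マップ上のセル)に対する判定の最小スケッチを示す(しきい値やマップ形式はこちらで仮定したもの).
```python
# 予測経路が動的物体・静的物体と衝突したか判定する最小スケッチ
import numpy as np

def collides_dynamic(pred, others, threshold=0.1):
    # pred: (T, 2) 予測経路, others: (N, T, 2) 同時刻の他対象の経路
    dist = np.linalg.norm(others - pred[None], axis=-1)  # (N, T)
    return bool((dist < threshold).any())

def collides_static(pred, occupancy, origin, resolution):
    # occupancy: (H, W) の障害物マップ(1 = 障害物)を仮定
    idx = np.floor((pred - np.asarray(origin)) / resolution).astype(int)  # (T, 2) = (x, y)
    idx = np.clip(idx, 0, np.array(occupancy.shape)[::-1] - 1)
    return bool(occupancy[idx[:, 1], idx[:, 0]].any())

def collision_rate(flags):
    # flags: 各サンプルの衝突有無 (bool のリスト) → 衝突率 [%]
    return 100.0 * float(np.mean(flags))
```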
本サーベイの目的
46
Deep Learningを用いた経路予測手法の動向調査
●
各カテゴリに属する経路予測手法毎の特徴をまとめる
- インタラクションあり
• Poolingモデル
• Attentionモデル
- インタラクションなし (Other)
●
定量的評価のためのデータセット,評価指標も紹介
●
代表的モデルを使用して,各モデルの精度と予測結果について議論
代表的なモデルを用いて,精度検証を行う
評価実験
47
モデル名 インタラクション Deep Learning 環境 データセット
LSTM - ✔ - ETH/UCY, SDD
RED - ✔ - ETH/UCY, SDD
ConstVel - - - ETH/UCY, SDD
Social-LSTM ✔ ✔ - ETH/UCY, SDD
Social-GAN ✔ ✔ - ETH/UCY, SDD
STGAT ✔ ✔ - ETH/UCY, SDD
Trajectron ✔ ✔ - ETH/UCY, SDD
Env-LSTM - ✔ ✔ SDD
Social-STGCNN ✔ ✔ - ETH/UCY
PECNet ✔ ✔ - ETH/UCY
精度比較を行うモデル
Attentionモデル
Poolingモデル
データセット
●
ETH/UCY
●
SDD
- 歩行者限定
エポック数:300
バッチサイズ:64
最適化手法:Adam
●
学習率:0.001
観測時刻:3.2秒
予測時刻:4.8秒
評価指標
●
Displacement Error,Collision rate
●
複数の予測経路をサンプリングする手法では,サンプリングした中で最良のものを使用
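この「サンプリングした中で最良のものを使用」という評価(Best-of-K)は,例えば以下のように書ける(最小スケッチ,配列の形状は仮定).
```python
# K 本の予測経路のうち真値に最も近いものを選んで ADE / FDE を計算する最小スケッチ
import numpy as np

def best_of_k(pred_samples, gt):
    # pred_samples: (K, T_pred, 2), gt: (T_pred, 2)
    dist = np.linalg.norm(pred_samples - gt[None], axis=-1)  # (K, T_pred)
    best = dist.mean(axis=1).argmin()                        # ADE が最小のサンプル
    return dist[best].mean(), dist[best, -1]                 # そのサンプルの ADE と FDE
```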
実験条件
48
ETH/UCYにおけるDisplacement Error
49
Scene
Method C-ETH ETH HOTEL UCY ZARA01 ZARA02 AVG
LSTM 0.56 / 1.15 0.91 / 1.57 0.29 / 0.56 0.84 / 1.56 1.33 / 2.69 0.77 / 1.50 0.78 / 1.51
RED 0.58 / 1.22 0.70 / 1.46 0.17 / 0.33 0.61 / 1.32 0.45 / 0.99 0.36 / 0.79 0.48 / 1.02
ConstVel 0.57 / 1.26 0.57 / 1.26 0.19 / 0.33 0.67 / 1.43 0.49 / 1.07 0.50 / 1.09 0.50 / 1.07
Social-LSTM 0.90 / 1.70 1.30 / 2.55 0.47 / 0.98 1.00 / 2.01 0.92 / 1.69 0.78 / 1.61 0.90 / 1.76
Social-GAN 0.49 / 1.05 0.53 / 1.03 0.35 / 0.77 0.79 / 1.63 0.47 / 1.01 0.44 / 0.94 0.51 / 1.07
PECNet 0.68 / 1.12 0.71 / 1.14 0.14 / 0.21 0.63 / 1.19 0.47 / 0.83 0.36 / 0.67 0.50 / 0.86
STGAT 0.48 / 1.05 0.51 / 1.01 0.19 / 0.31 0.61 / 1.33 0.47 / 1.00 0.39 / 0.78 0.46 / 0.91
Trajectron 0.52 / 1.14 0.56 / 1.18 0.26 / 0.51 0.63 / 1.37 0.50 / 1.05 0.39 / 0.84 0.48 / 1.02
Social-STGCNN 0.68 / 1.27 0.83 / 1.35 0.22 / 0.34 0.84 / 1.46 0.61 / 1.12 0.54 / 0.93 0.62 / 1.08
Single Model / 20 Outputs
Single Model:インタラクションを考慮しないREDが最も誤差を低減
20 Outputs:ADEでSTGAT,FDEでPECNetが最も誤差を低減
ADE / FDE [m]
ETH/UCYにおけるDisplacement Error
50
Scene
Method C-ETH ETH HOTEL UCY ZARA01 ZARA02 AVG
LSTM 0.56 / 1.15 0.91 / 1.57 0.29 / 0.56 0.84 / 1.56 1.33 / 2.69 0.77 / 1.50 0.78 / 1.51
RED 0.58 / 1.22 0.70 / 1.46 0.17 / 0.33 0.61 / 1.32 0.45 / 0.99 0.36 / 0.79 0.48 / 1.02
ConstVel 0.57 / 1.26 0.57 / 1.26 0.19 / 0.33 0.67 / 1.43 0.49 / 1.07 0.50 / 1.09 0.50 / 1.07
Social-LSTM 0.90 / 1.70 1.30 / 2.55 0.47 / 0.98 1.00 / 2.01 0.92 / 1.69 0.78 / 1.61 0.90 / 1.76
Social-GAN 0.49 / 1.05 0.53 / 1.03 0.35 / 0.77 0.79 / 1.63 0.47 / 1.01 0.44 / 0.94 0.51 / 1.07
PECNet 0.68 / 1.12 0.71 / 1.14 0.14 / 0.21 0.63 / 1.19 0.47 / 0.83 0.36 / 0.67 0.50 / 0.86
STGAT 0.48 / 1.05 0.51 / 1.01 0.19 / 0.31 0.61 / 1.33 0.47 / 1.00 0.39 / 0.78 0.46 / 0.91
Trajectron 0.52 / 1.14 0.56 / 1.18 0.26 / 0.51 0.63 / 1.37 0.50 / 1.05 0.39 / 0.84 0.48 / 1.02
Social-STGCNN 0.68 / 1.27 0.83 / 1.35 0.22 / 0.34 0.84 / 1.46 0.61 / 1.12 0.54 / 0.93 0.62 / 1.08
Single Model / 20 Outputs
ADE / FDE [m]
Poolingモデルと比較してAttentionモデルによる経路予測手法が有効
LSTM RED ConstVel Social-LSTM Social-GAN PECNet STGAT Trajectron Social-STGCNN
動的物体 0.42 0.78 0.55 0.89 0.99 0.71 1.10 0.54 1.63
静的物体 0.08 0.07 0.09 0.16 0.08 0.12 0.08 0.13 0.16
ETH/UCYにおけるCollision rateと予測結果例
51
[図: 各モデルの予測結果例(LSTM,RED,ConstVel,Social-LSTM,Social-GAN,PECNet,STGAT,Trajectron,Social-STGCNN).凡例:入力値,真値,予測値]
Collision rate [%]
LSTM RED ConstVel Social-LSTM Social-GAN PECNet STGAT Trajectron Social-STGCNN
動的物体 0.42 0.78 0.55 0.89 0.99 0.71 1.10 0.54 1.63
静的物体 0.08 0.07 0.09 0.16 0.08 0.12 0.08 0.13 0.16
ETH/UCYにおけるCollision rateと予測結果例
52
Collision rate [%]
[図: 各モデルの予測結果例(LSTM,RED,ConstVel,Social-LSTM,Social-GAN,PECNet,STGAT,Trajectron,Social-STGCNN).凡例:入力値,真値,予測値]
予測誤差が低い手法 ≠ 衝突率が低い手法
- 真値と類似しない場合に予測誤差は増加するが,衝突率は減少することが起こり得る
動的物体に関するCollision rateで衝突していないと判定された経路
動的物体に関するCollision rateで衝突したと判定された経路
SDDにおけるDisplacement Error
53
Scene
Method bookstore coupa deathCircle gates hyang little nexus quad AVG
LSTM 7.00 / 14.8 8.44 / 17.5 7.52 / 15.9 5.78 / 11.9 8.78 / 18.4 10.8 / 23.1 6.61 / 13.1 16.1 / 30.2 8.88 / 18.1
RED 7.91 / 17.1 9.51 / 20.4 8.22 / 17.8 5.72 / 12.0 9.14 / 19.5 11.8 / 25.8 6.24 / 12.7 4.81 / 10.9 7.92 / 17.0
ConstVel 6.63 / 12.9 8.17 / 16.2 7.29 / 14.0 5.76 / 10.9 9.21 / 18.1 10.9 / 22.1 7.14 / 13.7 5.31 / 8.89 7.56 / 14.6
Social-LSTM 33.6 / 74.0 34.8 / 76.0 33.4 / 74.7 35.6 / 83.2 35.4 / 75.9 36.7 / 77.5 32.3 / 71.3 32.4 / 71.3 34.3 / 75.5
Env-LSTM 13.5 / 30.1 17.2 / 36.7 15.3 / 32.2 17.3 / 36.5 12.6 / 27.9 14.2 / 31.1 10.8 / 24.0 8.05 / 19.0 13.8 / 30.0
Social-GAN 18.4 / 36.7 19.5 / 39.1 18.6 / 37.2 18.6 / 37.3 20.1 / 40.6 20.0 / 40.8 18.1 / 36.3 13.2 / 26.2 18.3 / 36.8
STGAT 7.58 / 14.6 9.00 / 17.4 7.57 / 14.4 6.33 / 11.7 9.17 / 17.9 10.9 / 21.8 7.37 / 14.0 4.83 / 7.95 7.85 / 15.0
Trajectron 6.18 / 13.2 7.24 / 15.5 6.43 / 13.6 6.29 / 13.0 7.72 / 16.5 9.38 / 20.8 6.55 / 13.4 6.80 / 15.1 7.07 / 15.1
Single Model / 20 Outputs
ADE / FDE [pixel]
Single Model:Deep Learningを使用しないConstVelが最も誤差を低減
20 Outputs:ADEでTrajectron,FDEでSTGATが最も誤差を低減
撮影箇所に影響される
●
ETH/UCYは低所で撮影
- 人の経路がsensitiveになる
●
SDDは高所で撮影
- 人の経路がinsensitiveになる
- 人の動きが線形になり,線形予測するConstVelの予測誤差が低下
SDDでConstVelの予測誤差が低い要因は何か
54
[図: ETH/UCY(低所から撮影)とSDD(高所から撮影)のシーン例]
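上で触れた線形予測(ConstVel)は,観測最終の速度を保って外挿するだけのベースラインであり,例えば次のように書ける(最小スケッチ).
```python
# 等速直線運動 (ConstVel) による線形予測の最小スケッチ
import numpy as np

def const_vel_predict(obs, t_pred=12):
    # obs: (T_obs, 2) 観測軌跡.最終フレームの速度を維持して t_pred ステップ外挿する
    vel = obs[-1] - obs[-2]
    steps = np.arange(1, t_pred + 1)[:, None]  # (T_pred, 1)
    return obs[-1] + steps * vel               # (T_pred, 2)
```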
SDDにおけるCollision rateと予測結果例
55
Collision rate [%]
[図: 各モデルの予測結果例(LSTM,RED,ConstVel,Social-LSTM,Social-GAN,STGAT,Trajectron,Env-LSTM).凡例:入力値,真値,予測値]
LSTM RED ConstVel Social-LSTM Social-GAN STGAT Trajectron Env-LSTM
動的物体 10.71 11.27 10.98 15.12 10.91 10.97 12.80 12.12
静的物体 2.82 2.86 2.40 20.33 6.41 2.17 1.71 1.58
SDDにおけるCollision rateと予測結果例
56
[図: 各モデルの予測結果例(LSTM,RED,ConstVel,Social-LSTM,Social-GAN,STGAT,Trajectron,Env-LSTM).凡例:入力値,真値,予測値]
LSTM RED ConstVel Social-LSTM Social-GAN STGAT Trajectron Env-LSTM
動的物体 10.71 11.27 10.98 15.12 10.91 10.97 12.80 12.12
静的物体 2.82 2.86 2.40 20.33 6.41 2.17 1.71 1.58
環境情報を導入することで,障害物との接触を避ける経路予測が可能
Collision rate [%]
Deep Learningの発展によるデータセットの大規模化
経路予測の評価指標の再考
●
予測精度が良い ≠ 最も良い予測手法
●
コミュニティ全体で考え直す必要がある
複数経路を予測するアプローチの増加
●
Multiverseを筆頭に複数経路を重視する経路予測手法が増加する?
今後の経路予測は?
57
2016
interaction
multimodal paths
other
2020
Social-LSTM
[A. Alahi+, CVPR, 2016]
DESIRE
[N. Lee+, CVPR, 2017]
Conv.Social-Pooling
[N. Deo+, CVPRW, 2018]
SoPhie
[A. Sadeghian+, CVPR, 2019]
Social-BiGAT
[V. Kosaraju+, NeurIPS, 2019]
Social-STGCNN
[A. Mohamedl+, CVPR, 2020]
Social-GAN
[A. Gupta+, CVPR, 2018]
Next
[J. Liang+, CVPR, 2019]
STGAT
[Y. Huang+, ICCV, 2019]
Trajectron
[B. Ivanovic+, ICCV, 2019]
Social-Attention
[A. Vemula+, ICRA, 2018]
Multi-Agent Tensor Fusion
[T. Zhao+, CVPR, 2019]
MX-LSTM
[I. Hasan+, CVPR, 2018]
CIDNN
[Y. Xu+, CVPR, 2018]
SR-LSTM
[P. Zhang+, CVPR, 2019]
Group-LSTM
[N. Bisagno+, CVPR, 2018]
Reciprocal Network
[S. Hao+, CVPR, 2020]
PECNet
[K. Mangalam+, ECCV, 2020]
RSBG
[J. SUN+, CVPR, 2020]
STAR
[C. Yu+, ECCV, 2020]
Behavior CNN
[S. Yi+, ECCV, 2016]
Future localization in first-person videos
[T. Yagi+, CVPR, 2018]
Fast and Furious
[W. Luo+, CVPR, 2018]
OPPU
[A. Bhattacharyya+, CVPR, 2018]
Object Attributes and Semantic Segmentation
[H. Minoura+, VISAPP, 2019]
Rule of the Road
[J. Hong+, CVPR, 2019] Multiverse
[J. Liang+, CVPR, 2020]
Trajectron++
[T. Salzmann+, ECCV, 2020]
Multimodal paths + (interaction)
・・・
Deep Learningを用いた経路予測手法の動向調査
●
各カテゴリに属した経路予測手法の特徴を調査
- インタラクションあり
• Poolingモデル
• Attentionモデル
- インタラクションなし (Other)
●
定量的評価のためのデータセット,評価指標を紹介
- Deep Learningの発展により大規模なデータセットが増加
●
代表的モデルを使用して,各モデルの精度と予測結果について議論
- AttentionモデルはPoolingモデルより予測誤差が低い
- 最も衝突率が低いモデル ≠ 予測誤差が低いモデル
- SensitiveなデータセットでDeep Learningによる予測手法は効果的
まとめ
58
Deep Learningを用いた経路予測の研究動向

  • 1. サーベイ論文:Deep Learningを用いた経路予測の研究動向 箕浦 大晃 平川 翼 山下 隆義 藤吉 弘亘 PRMU研究会 October 9-10, 2020 中部大学 機械知覚&ロボティクスグループ
  • 3. 経路予測のカテゴリ 3 Bayesian-based Deep Learning-based Planning-based 内部状態 観測状態 Update Update ex. model : Kalman Filter Past Future Prediction model Input Output ex. model : LSTM,CNN ex. Model : IRL, RRT* Start Goal X O X 観測状態にノイズを付与した値から 未来の内部状態を更新し予測値を逐次推定 予測対象の過去の軌跡から未来の行動を学習 スタートからゴールまでの報酬値を最適化 本サーベイの対象
  • 4. Deep Learningによる経路予測の必要な要素 4 一人称視点 車載カメラ視点 鳥瞰視点 View point and the yellow-orange heatmaps are n ones are ground truth multi-future Single-Future Multi-Future 18.51 / 35.84 166.1 / 329.5 28.68 / 49.87 184.5 / 363.2 PIE JAAD Method MSE CMSE CFMSE MSE CMSE CFM 0.5s 1s 1.5s 1.5s 1.5s 0.5s 1s 1.5s 1.5s 1.5 Linear 123 477 1365 950 3983 223 857 2303 1565 611 LSTM 172 330 911 837 3352 289 569 1558 1473 576 B-LSTM[5] 101 296 855 811 3259 159 539 1535 1447 561 PIEtraj 58 200 636 596 2477 110 399 1248 1183 478 Table 3: Location (bounding box) prediction errors over varying future time steps. MSE in pixels is ca predicted time steps, CMSE and CFMSE are the MSEs calculated over the center of the bounding box predicted sequence and only the last time step respectively. MSE Method 0.5s 1s 1.5s Linear 0.87 2.28 4.27 LSTM 1.50 1.91 3.00 PIEspeed 0.63 1.44 2.65 Long Short-Term Memory Convolutional Neural Network Gated Recurrent Unit Temporal Convolutional Network Model !" !# !$ !%!& !' t=t input layert=t-1 hidden layer Input Gate t=t output layert=t+1 hidden layer Memory Cell Forget Gate Output Gate 対象クラス 対象間のインタラクション 静的環境情報 Context
  • 5. Deep Learningによる経路予測の必要な要素 5 一人称視点 車載カメラ視点 鳥瞰視点 View point 対象クラス 対象間のインタラクション 静的環境情報 Context Long Short-Term Memory Convolutional Neural Network Gated Recurrent Unit Temporal Convolutional Network Model and the yellow-orange heatmaps are n ones are ground truth multi-future Single-Future Multi-Future 18.51 / 35.84 166.1 / 329.5 28.68 / 49.87 184.5 / 363.2 PIE JAAD Method MSE CMSE CFMSE MSE CMSE CFM 0.5s 1s 1.5s 1.5s 1.5s 0.5s 1s 1.5s 1.5s 1.5 Linear 123 477 1365 950 3983 223 857 2303 1565 611 LSTM 172 330 911 837 3352 289 569 1558 1473 576 B-LSTM[5] 101 296 855 811 3259 159 539 1535 1447 561 PIEtraj 58 200 636 596 2477 110 399 1248 1183 478 Table 3: Location (bounding box) prediction errors over varying future time steps. MSE in pixels is ca predicted time steps, CMSE and CFMSE are the MSEs calculated over the center of the bounding box predicted sequence and only the last time step respectively. MSE Method 0.5s 1s 1.5s Linear 0.87 2.28 4.27 LSTM 1.50 1.91 3.00 PIEspeed 0.63 1.44 2.65 !" !# !$ !%!& !' t=t input layert=t-1 hidden layer Input Gate t=t output layert=t+1 hidden layer Memory Cell Forget Gate Output Gate
  • 7. Deep Learningを用いた予測手法の傾向と分類 7 2016 interaction other 2020 Social-LSTM [A. Alahi+, CVPR, 2016] DESIRE [N. Lee+, CVPR, 2017] Conv.Social-Pooling [N. Deo+, CVPRW, 2018] SoPhie [A. Sadeghian+, CVPR, 2019] Social-BiGAT [V. Kosaraju+, NeurIPS, 2019] Social-STGCNN [A. Mohamedl+, CVPR, 2020] Social-GAN [A. Gupta+, CVPR, 2018] Next [J. Liang+, CVPR, 2019] STGAT [Y. Huang+, ICCV, 2019] Trajectron [B. Ivanovic+, ICCV, 2019] Social-Attention [A. Vemula+, ICRA, 2018] Multi-Agent Tensor Fusion [T. Zhao+, CVPR, 2019] MX-LSTM [I. Hasan+, CVPR, 2018] CIDNN [Y. Xu+, CVPR, 2018] SR-LSTM [P. Zhang+, CVPR, 2019] Group-LSTM [N. Bisagno+, CVPR, 2018] Reciprocal Network [S. Hao+, CVPR, 2020] PECNet [K. Mangalam+, ECCV, 2020] RSBG [J. SUN+, CVPR, 2020] STAR [C. Yu+, ECCV, 2020] Behavior CNN [S. Yi+, ECCV, 2016] Future localization in first-person videos [T. Yagi+, CVPR, 2018] Fast and Furious [W. Luo+, CVPR, 2018] OPPU [A. Bhattacharyya+, CVPR, 2018] Object Attributes and Semantic Segmentation [H. Minoura+, VISAPP, 2019] Rule of the Road [J. Hong+, CVPR, 2019] Multiverse [J. Liang+, CVPR, 2020] Trajectron++ [T. Salzmann+, ECCV, 2020] Attentionモデル Poolingモデル 近年ではAttentionモデルと複数経路を予測する手法が主流 multimodal paths
  • 9. ● 各カテゴリに属する経路予測手法毎の特徴をまとめる - インタラクションあり • Poolingモデル • Attentionモデル - インタラクションなし (Other) ● 定量的評価のためのデータセット,評価指標も紹介 ● 代表的モデルを使用して,各モデルの精度と予測結果について議論 本サーベイの目的 9 Deep Learningを用いた経路予測手法の動向調査
  • 10. 本サーベイの目的 10 Deep Learningを用いた経路予測手法の動向調査 ● 各カテゴリに属する経路予測手法毎の特徴をまとめる - インタラクションあり • Poolingモデル • Attentionモデル - インタラクションなし (Other) ● 定量的評価のためのデータセット,評価指標も紹介 ● 代表的モデルを使用して,各モデルの精度と予測結果について議論
  • 11. 複数の歩行者の移動経路を同時に予測 ● 歩行者同士の衝突を避けるためにSocial-Pooling layer (S-Pooling)を提案 - 予測対象周辺の他対象の位置と中間層出力を入力 - 次時刻のLSTMの内部状態に歩行者同士の空間的関係が保持 - 衝突を避ける経路予測が可能 Social LSTM [A. Alahi+, CVPR, 2016] 11 Linear! Social-LSTM! GT! SF [73]! Linear! Social-LSTM! GT! SF [73]! Linear! Social-LSTM! GT! SF [73]! Linear! Social-LSTM! GT! SF [73]! Linear! Social-LSTM! GT! SF [73]! Linear! GT! SF [73]!Linear! Social-LSTM! GT! SF [73]! Linear! Social-LSTM! GT! SF [73]! Linear! Social-LSTM! GT! SF [73]! Linear! Social-LSTM! GT! SF [73]! A. Alahi, et al., “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” CVPR, 2016.
  • 12. 対象間のインタラクションに加え周囲の環境情報を考慮 ● 交差点や道沿い端などの障害物領域を避ける経路予測を実現 ● CVAEでエンコードすることで複数の経路を予測可能 Ranking & Refinement Moduleで予測経路にランキング付け ● 経路を反復的に改善することで予測精度向上を図る DESIRE [N. Lee+, CVPR, 2017] 12 Input KLD Loss fc + soft max r1 rtr2 fc Y Sample Generation Module Ranking & Re nement Module RNN Encoder1 GRU GRU GRU RNN Encoder2 GRU GRU GRU RNN Decoder1 GRU GRU GRU RNN Decoder2 GRU GRU GRU CVAE fc fc z X Y Regression Scoring fc fc fc Y Y Recon Loss CNN SCF SCF SCF Feature Pooling (I) Iterative Feedback concat mask addition Figure 2. The overview of proposed prediction framework DESIRE. First, DESIRE generates multiple plausible prediction samples ˆY via a CVAE-based RNN encoder-decoder (Sample Generation Module). Then the following module assigns a reward to the prediction samples at each time-step sequentially as IOC frameworks and learns displacements vector ∆ ˆY to regress the prediction hypotheses (Ranking DESIRE-S Top1 DESIRE-S Top10 DESIRE-SI Top1 DESIRE-SI Top10 Linear RNN ED RNN ED-SI X Y Method Linear RNN ED RNN ED-SI CVAE 1 CVAE 10% DESIRE-S-IT DESIRE-S-IT DESIRE-S-IT DESIRE-S-IT DESIRE-SI-I DESIRE-SI-I DESIRE-SI-I DESIRE-SI-I Linear RNN ED RNN ED-SI CVAE 1 CVAE 10% DESIRE-S-IT Linear RNN ED RNN ED-SI X Y X Y (a) GT Figure 6. KITTI resul DESIRE-S Top1 DESIRE-S Top10 DESIRE-SI Top1 DESIRE-SI Top10 Linear RNN ED RNN ED-SI X Y 真値 予測値 インタラクションあり Top1 Top10 インタラクションなし Top1 Top10 N. Lee, et al., “DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents,” CVPR, 2017.
  • 13. 高速道路上で隣接する自動車同士のインタラクションを考慮した予測手法 ● インタラクション情報に空間的意味合いを持たせるConvolution Social Poolingを提案 - LSTM Encoderで得た軌跡特徴量を固定サイズのSocial Tensorに格納 - CNNでインタラクションの特徴量を求める - 予測車の特徴量と連結し,LSTM Decoderで経路を予測 Convolutional Social-Pooling [N. Deo+, CVPRW, 2018] 13 Figure 3. Proposed Model: The encoder is an LSTM with shared weights that learns vehicle dynamics based on track histories. The convolutional social pooling layers learn the spatial interdependencies of of the tracks. Finally, the maneuver based decoder outputs a multi-modal predictive distribution for the future motion of the vehicle being predicted Convolutional Social Pooling for Vehicle Trajectory Prediction Nachiket Deo Mohan M. Trivedi University of California, San Diego La Jolla, 92093 ndeo@ucsd.edu mtrivedi@ucsd.edu Abstract Forecasting the motion of surrounding vehicles is a crit- ical ability for an autonomous vehicle deployed in complex traffic. Motion of all vehicles in a scene is governed by the traffic context, i.e., the motion and relative spatial config- uration of neighboring vehicles. In this paper we propose an LSTM encoder-decoder model that uses convolutional social pooling as an improvement to social pooling lay- ers for robustly learning interdependencies in vehicle mo- tion. Additionally, our model outputs a multi-modal predic- tive distribution over future trajectories based on maneuver classes. We evaluate our model using the publicly available NGSIM US-101 and I-80 datasets. Our results show im- provement over the state of the art in terms of RMS values of prediction error and negative log-likelihoods of true fu- Figure 1. Imagine the blue vehicle is an autonomous vehicle in the traffic scenario shown. Our proposed model allows it to make multi-modal predictions of future motion of it’s surrounding ve- hicles, along with prediction uncertainty shown here for the red v1[cs.CV]15May2018 N. Deo, et al., “Convolutional Social Pooling for Vehicle Trajectory Prediction,” CVPRW, 2018.
  • 14. 歩行者の視線情報を活用した経路予測手法 ● 頭部を中心とした視野角内の他対象のみPooling処理 - 予測対象の頭部方向,他対象との距離値からPooling処理する対象を選択 ● 軌跡,頭部方向,インタラクション情報をLSTMへ入力 - 視野角内にいる他対象との衝突を避ける経路予測を実現 - 視線情報を任意に変更することで,任意方向に向かった経路予測が可能 MX-LSTM [I. Hasan+, CVPR, 2018] 14 3. Our approach In this section we present the MX-LSTM, capable of jointly forecasting positions and head orientations of an in- dividual thanks to the presence of two information streams: Tracklets and vislets. 3.1. Tracklets and vislets Given a subject i, a tracklet (see Fig. 1a) ) is formed by consecutive (x, y) positions on the ground plane, {x (i) t }t=1,...,T , x (i) t = (x, y) ∈ R2 , while a vislet is formed by anchor points {a (i) t }t=1,...,T , with a (i) t = (ax, ay) ∈ R2 indicating a reference point at a fixed distance r from the corresponding x (i) t , towards which the face is oriented1 . In b) d a) )(i ta )(i tx )(i t r )( 1 i tx c) 1tx ta tx t t t (i) (i) e (x,i) t = φ x (i) t , Wx e (a,i) t = φ a (i) t , Wa where the embedding function φ consists in a jection through the embedding weigths Wx and D-dimensional vector, multiplied by a RELU n where D is the dimension of the hidden space. 3.2. VFOA social pooling The social pooling introduced in [3] is an eff to let the LSTM capture how people move in scene avoiding collisions. This work considers a interest area around the single pedestrian, in wh den states of the the neighbors are considered those which are behind the pedestrian. In our ca prove this module using the vislet information ing which individuals to consider, by building a tum of attention (VFOA), that is a triangle origin x (i) t , aligned with a (i) t , and with an aperture given gle γ and a depth d; these parameters have been cross-validation on the training partition of the T dataset (see Sec. 5). Our view-frustum social pooling is a No × N sor, in which the space around the pedestrian is d Figure 3. Qualitative results: a) MX-LSTM b) Ablation qualitative study on Individual MX-LSTM (better in color). I. Hasan, et al., “MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses,” CVPR, 2018.
  • 15. グループに関するインタラクションを考慮した経路予測手法 ● 運動傾向が類似する歩行者同士をグループとみなす ● 予測対象が属するグループ以外の個人の情報をPooling - 異なるグループとの衝突を避ける経路を予測 Group-LSTM [N. Bisagno+, ECCVW, 2018] 15 Group LSTM 7 Fig. 3. Representation of the Social hidden-state tensor Hi t . The black dot represents the pedes- trian of interest pedi. Other pedestrians pedj (∀j = i) are shown in different color codes, namely green for pedestrians belonging to the same set, and red for pedestrians belonging to a different Group LSTM 9 ing to the studies in interpersonal distances [15, 10], socially correlated people tend to stay closer in their personal space and walk together in crowded environments as compared to pacing with unknown pedestrians. Pooling only unrelated pedestrians will focus more on macroscopic inter-group interactions rather than intra-group dynamics, thus allowing the LSTM network to improve the trajectory prediction performance. Collision avoidance influences the future motion of pedestrians in a similar manner if two pedestrians are walking together as in a group. In Tables 2, 3 and Fig. 4, we display some demos of predicted trajectories which highlight how our Group-LSTM is able to predict pedestrian trajectories with better precision, showing how the prediction is improved when we pool in the social tensor of each pedestrian only pedestrians not belonging to his group. In Table 2, we show how the prediction of two pedestrians walking together in the crowd improves when they are not pooled in each other’s pooling layer. When the two pedestrians are pooled together, the network applies on them the typical repulsion force to avoid colliding with each other. Since they are in the same group, they allow the other pedestrian to stay closer in they personal space. In Fig. 4 we display the sequences of two groups walking toward each other. In Table 3, we show how the prediction for the two groups is improved with respect to the Social LSTM. While both prediction are not very accurate, our Group LSTM perform better because it is able to forecast how pedestrian belonging to the same group will stay together when navigating the environment. Name Scene Our Group-LSTM Social-LSTM ETH Univ Frame 2425 Table 2. ETH dataset: the prediction is improved when pooling in the social tensor of each pedes- trian only pedestrians not belonging to his group. The green dots represent the ground truth tra- jectories; the blue crosses represent the predicted paths. 5 Conclusion In this work, we tackle the problem of pedestrian trajectory prediction in crowded scenes. We propose a novel approach, which combines the coherent filtering algorithm with the LSTM networks. The coherent filtering is used to identify pedestrians walking together in a crowd, while the LSTM network is used to predict the future trajectories 10 Niccol´o Bisagno, Bo Zhang and Nicola Conci (a) (b) (c) (d) Fig. 4. Sequences taken from the UCY dataset. It displays an interaction example between two groups, which will be further analyzed in Table 3. Name Scene Our Group-LSTM Social-LSTM UCY Univ Frame 1025 Table 3. We display how the prediction is improved for two groups walking in opposite direc- tions. The green dots represent the ground truth trajectories, while the blue crosses represent the predicted paths. N. Bisagno, et al., “Group LSTM: Group Trajectory Prediction in Crowded Scenarios,” ECCVW, 2018.
• 16. Social-GAN [A. Gupta+, CVPR, 2018]
A method that predicts multiple trajectories using a GAN.
● Generator: samples multiple predicted trajectories (a decoding sketch follows below)
 - A Pooling Module computes interaction features from the LSTM encoder features
 - Each pooled feature is concatenated with a noise vector, and an LSTM decoder outputs multiple future trajectories
● Discriminator: distinguishes predicted trajectories from real ones
 - Adversarial training pushes the generator to produce trajectories realistic enough to fool the discriminator
A. Gupta, et al., "Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks," CVPR, 2018.
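A minimal sketch of the sampling step (not the official Social-GAN implementation): the encoded history of a pedestrian is concatenated with a fresh noise vector for each of K samples and decoded step by step. The layer sizes and the omission of the pooling module are simplifications made for illustration.

```python
import torch
import torch.nn as nn

class TrajectoryGenerator(nn.Module):
    """Minimal GAN-style decoder: concatenate encoded history with noise, decode K samples."""
    def __init__(self, feat_dim=32, noise_dim=8, hidden_dim=64, pred_len=12):
        super().__init__()
        self.noise_dim = noise_dim
        self.pred_len = pred_len
        self.init_h = nn.Linear(feat_dim + noise_dim, hidden_dim)   # noise-conditioned init state
        self.cell = nn.LSTMCell(2, hidden_dim)                      # input: previous (x, y) step
        self.out = nn.Linear(hidden_dim, 2)                         # output: next (x, y) step

    def forward(self, encoded, last_pos, k_samples=20):
        trajs = []
        for _ in range(k_samples):
            z = torch.randn(encoded.size(0), self.noise_dim)        # fresh noise per sample
            h = torch.tanh(self.init_h(torch.cat([encoded, z], dim=1)))
            c = torch.zeros_like(h)
            pos, step, points = last_pos, torch.zeros_like(last_pos), []
            for _ in range(self.pred_len):
                h, c = self.cell(step, (h, c))
                step = self.out(h)                                  # predicted displacement
                pos = pos + step
                points.append(pos)
            trajs.append(torch.stack(points, dim=1))                # (B, pred_len, 2)
        return torch.stack(trajs, dim=0)                            # (K, B, pred_len, 2)

gen = TrajectoryGenerator()
samples = gen(torch.randn(4, 32), torch.zeros(4, 2))                # encoder feature + last position
print(samples.shape)                                                # torch.Size([20, 4, 12, 2])
```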
• 17. Multi-Agent Tensor Fusion [T. Zhao+, CVPR, 2019]
A prediction method that considers interactions between heterogeneous agents such as pedestrians and cars.
● Jointly models scene context in addition to agent interactions
 - Can predict trajectories that avoid collisions with both dynamic and static objects
● Multi-Agent Tensor Fusion
 - A CNN extracts scene context features
 - Each agent's LSTM encoding is placed into a spatial grid according to its position (see the sketch below)
 - The scene context and the spatial agent grid are concatenated channel-wise and fused with a CNN
 - An LSTM decoder predicts trajectories from the fused features
T. Zhao, et al., "Multi-Agent Tensor Fusion for Contextual Trajectory Prediction," CVPR, 2019.
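The core tensor-fusion step can be sketched as follows (a simplified illustration, not the authors' code): per-agent encodings are scattered onto a grid aligned with the scene feature map and the two are concatenated along the channel axis. The grid resolution and world extent are assumed parameters.

```python
import numpy as np

def build_agent_tensor(agent_states, positions, scene_map, grid_hw, extent):
    """Scatter per-agent encodings onto a spatial grid and fuse with a scene feature map.

    agent_states : (N, C) per-agent LSTM encodings
    positions    : (N, 2) agent (x, y) positions in world coordinates
    scene_map    : (Cs, H, W) CNN feature map of the static scene
    grid_hw      : (H, W) grid resolution, assumed to match scene_map
    extent       : (x_min, y_min, x_max, y_max) world extent covered by the grid
    """
    n, c = agent_states.shape
    h, w = grid_hw
    x_min, y_min, x_max, y_max = extent
    agent_grid = np.zeros((c, h, w), dtype=np.float32)

    for state, (x, y) in zip(agent_states, positions):
        col = int((x - x_min) / (x_max - x_min) * (w - 1))
        row = int((y - y_min) / (y_max - y_min) * (h - 1))
        if 0 <= row < h and 0 <= col < w:
            agent_grid[:, row, col] += state          # agents sharing a cell are summed

    # channel-wise concatenation; a fusion CNN would then be applied to this tensor
    return np.concatenate([scene_map, agent_grid], axis=0)

fused = build_agent_tensor(np.random.rand(3, 16), np.random.rand(3, 2) * 10,
                           np.random.rand(8, 32, 32), (32, 32), (0, 0, 10, 10))
print(fused.shape)  # (24, 32, 32)
```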
• 18. Reciprocal Network [S. Hao+, CVPR, 2020]
A trajectory prediction method based on reciprocal learning with two coupled networks.
● Forward Prediction Network: the usual prediction direction (observation → prediction)
● Backward Prediction Network: the reverse direction (prediction → observation)
● A reciprocal constraint between the two networks is enforced with a procedure borrowed from adversarial attacks
 - The input trajectory is perturbed iteratively
 - Forcing consistency with the model's output yields what the authors call a new concept, the "reciprocal attack"
S. Hao, et al., "Reciprocal Learning Networks for Human Trajectory Prediction," CVPR, 2020.
• 19. PECNet [K. Mangalam+, ECCV, 2020]
Predicted Endpoint Conditioned Network (PECNet): a prediction method that puts emphasis on the predicted final destination (endpoint).
● The endpoint is predicted by the latent endpoint decoder (D_latent) and concatenated with the output of the past encoder (concat encoding)
 - The concatenated features are passed through the social pooling layers
 - A pedestrian × pedestrian social mask determines which pedestrians interact with each other
 - The future trajectory is predicted by P_future from the concat encoding and the interaction features (a simplified sketch follows below)
K. Mangalam, et al., "It is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction," ECCV, 2020.
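To make the endpoint-conditioning idea concrete, here is a heavily simplified sketch (it omits the CVAE training objective and the social pooling, so it is not PECNet itself): an endpoint is decoded from the past encoding and a latent sample, then the remaining waypoints are predicted conditioned on that endpoint. The module names mirror the slide; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class EndpointConditionedPredictor(nn.Module):
    """Sample an endpoint first, then fill in the remaining waypoints conditioned on it."""
    def __init__(self, obs_len=8, pred_len=12, hid=64, zdim=16):
        super().__init__()
        self.pred_len, self.zdim = pred_len, zdim
        self.e_past = nn.Sequential(nn.Linear(obs_len * 2, hid), nn.ReLU())      # past encoder
        self.d_latent = nn.Sequential(nn.Linear(hid + zdim, hid), nn.ReLU(),
                                      nn.Linear(hid, 2))                         # latent -> endpoint
        self.p_future = nn.Sequential(nn.Linear(hid + 2, hid), nn.ReLU(),
                                      nn.Linear(hid, (pred_len - 1) * 2))        # remaining waypoints

    def forward(self, past):                              # past: (B, obs_len, 2)
        feat = self.e_past(past.flatten(1))
        z = torch.randn(past.size(0), self.zdim)          # at test time, sampled from the prior
        endpoint = self.d_latent(torch.cat([feat, z], dim=1))
        rest = self.p_future(torch.cat([feat, endpoint], dim=1)).view(-1, self.pred_len - 1, 2)
        return torch.cat([rest, endpoint.unsqueeze(1)], dim=1)      # (B, pred_len, 2)

model = EndpointConditionedPredictor()
print(model(torch.randn(5, 8, 2)).shape)                  # torch.Size([5, 12, 2])
```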
• 20. Goal of this survey
Survey of trends in trajectory prediction methods based on deep learning.
● Summarize the characteristics of the prediction methods in each category
 - With interaction: pooling models, attention models
 - Without interaction (Other)
● Introduce the datasets and evaluation metrics used for quantitative evaluation
● Using representative models, discuss the accuracy and prediction results of each model
• 21. Social-Attention [A. Vemula+, ICRA, 2018]
A prediction method that extends a graph structure in the spatio-temporal direction.
● Node: the position of each agent
● Edge: spatial relations between agents, and each agent's own information propagated along the temporal direction
● Attention computed from nodes and edges identifies the agents to attend to
 - Enables prediction of trajectories that avoid the attended agents
 - The attention weights also provide a visual explanation of the prediction
A. Vemula, et al., "Social Attention: Modeling Attention in Human Crowds," ICRA, 2018.
• 22. CIDNN [Y. Xu+, CVPR, 2018]
Estimates, via attention, how much each agent's motion matters to the target and weights the motion features accordingly.
● A Motion Encoder Module encodes each agent's motion
● A Location Encoder Module encodes each agent's location
 - Inner products between the target and every other agent are computed, and a softmax weights the other agents' features (see the sketch below)
● The two modules are combined to predict the trajectory at the following time steps
Y. Xu, et al., "Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction," CVPR, 2018.
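The location-based weighting reduces to a dot-product softmax over the other agents. The sketch below shows that step only (a simplified illustration; the feature dimensions are assumptions):

```python
import numpy as np

def crowd_interaction(target_loc_feat, other_loc_feats, other_motion_feats):
    """Weight other agents' motion features by location-based dot-product attention.

    target_loc_feat    : (D,)  location embedding of the pedestrian being predicted
    other_loc_feats    : (N, D) location embeddings of the other pedestrians
    other_motion_feats : (N, M) motion (LSTM) features of the other pedestrians
    """
    scores = other_loc_feats @ target_loc_feat          # inner product per agent
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                    # softmax over the other agents
    return weights @ other_motion_feats                  # (M,) weighted crowd feature

np.random.seed(0)
ctx = crowd_interaction(np.random.rand(8), np.random.rand(5, 8), np.random.rand(5, 16))
print(ctx.shape)  # (16,)
```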
• 23. SR-LSTM [P. Zhang+, CVPR, 2019]
Refines the target's predicted future trajectory using the interaction information at the current time step.
● Two mechanisms inside the states refinement module enable accurate prediction
 - Pedestrian-aware attention (PA), which helps avoid collisions with other agents
 - Motion gate (MG), with which the target selects its path based on the other agents' motions
● MG selects the path from the motion of agents that are likely to cause a collision
● PA focuses on the neighbours close to the target pedestrian
P. Zhang, et al., "SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction," CVPR, 2019.
• 24. Next [J. Liang+, CVPR, 2019]
Proposes a model that predicts future trajectories and activities simultaneously.
● Person Behavior Module: encodes the pedestrian's appearance and skeleton (pose) information
● Person Interaction Module: encodes the surrounding static scene and objects such as cars
● Visual Feature Tensor Q: encodes the two features above together with the past trajectory
● Trajectory Generator: predicts the future trajectory
● Activity Prediction: predicts the activity at the final prediction time step
J. Liang, et al., "Peeking into the Future: Predicting Future Person Activities and Locations in Videos," CVPR, 2019.
• 25. SoPhie [A. Sadeghian+, CVPR, 2019]
A prediction method that considers static scene information in addition to pedestrian-pedestrian interactions.
● Physical Attention: estimates attention over the static scene
● Social Attention: estimates attention over the dynamic agents
● Future trajectories are predicted from both attentions and the LSTM encoder outputs through an LSTM-based GAN module
A. Sadeghian, et al., "SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints," CVPR, 2019.
• 26. STGAT [Y. Huang+, ICCV, 2019]
A prediction method that propagates interactions along the temporal direction.
● A Graph Attention Network (GAT) is applied to model the interactions
 - GAT: attention-based graph convolution over a graph whose nodes are the pedestrians in the scene
 - The attention mechanism learns the importance of the relations with every other agent in the scene
● The GAT features are propagated along time (by a second LSTM, G-LSTM), so spatio-temporal interactions are captured
 - Information about agents that may cause a collision can be derived from the past trajectories
Y. Huang, et al., "STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction," ICCV, 2019.
• 27. Trajectron [B. Ivanovic+, ICCV, 2019]
Models multiple agents efficiently with a dynamic graph structure.
● NHE (Node History Encoder): feeds the node features over the observed time steps into an LSTM
● NFE (Node Future Encoder): a BiLSTM encodes the node's ground-truth future trajectory during training
● EE (Edge Encoder): computes attention over all agents within a given range
 - Retrieves the edge information with the highest importance
 - The edge information changes at every time step
● A decoder predicts trajectories from these features
 - An internal CVAE produces multimodal trajectories
 - A Gaussian mixture model refines the predicted trajectories
Trajectron++ [T. Salzmann+, ECCV, 2020], which additionally incorporates environment information, has also been proposed.
T. Salzmann, et al., "Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data," ECCV, 2020.
B. Ivanovic, et al., "The Trajectron: Probabilistic Multi-Agent Trajectory Modeling with Dynamic Spatiotemporal Graphs," ICCV, 2019.
• 28. Social-BiGAT [V. Kosaraju+, NeurIPS, 2019]
Simply injecting a noise vector tends to produce predictions with high variance.
● Existing methods therefore do not learn a truly multimodal distribution
A latent representation linking predicted trajectories and noise vectors is learned.
● A trajectory generated from a noise vector is fed back into an LSTM encoder
● The encoding is mapped back so that it matches the original noise vector
● This makes it possible to generate genuinely multimodal trajectories
V. Kosaraju, et al., "Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks," NeurIPS, 2019.
• 29. Social-STGCNN [A. Mohamed+, CVPR, 2020]
Models the scene with a spatio-temporal graph.
● A Graph Convolutional Network (GCN) extracts interaction features
 - Interaction information is obtained from a weighted adjacency matrix (a sketch of its construction follows below)
● A Temporal Convolutional Network (TCN) outputs the predictive distribution from the GCN features
 - An LSTM emits the predicted path step by step, whereas the TCN outputs all future steps in parallel
 - Inference speed is therefore greatly improved
A. Mohamed, et al., "Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction," CVPR, 2020.
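A hedged sketch of the graph construction for one frame: pairwise distances are turned into a kernel-weighted adjacency matrix, self-loops are added, and the matrix is symmetrically normalised before a graph convolution. The inverse-distance kernel matches the spirit of the method, but treat the exact formula as an assumption.

```python
import numpy as np

def social_adjacency(positions, eps=1e-9):
    """Build a kernel-weighted, symmetrically normalised adjacency matrix for one frame.

    positions : (N, 2) pedestrian positions at a single time step
    Returns A_hat = D^{-1/2} (A + I) D^{-1/2}, the form used by graph convolutions.
    """
    n = positions.shape[0]
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    a = np.where(dist > 0, 1.0 / (dist + eps), 0.0)    # inverse-distance kernel, no self-loops yet
    a = a + np.eye(n)                                   # add self connections
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return (a * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

# one graph-convolution layer over this graph: H' = A_hat @ H @ W
pos = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
h = np.random.rand(3, 8)                                # node features
w = np.random.rand(8, 8)                                # learnable weights
h_next = social_adjacency(pos) @ h @ w
print(h_next.shape)  # (3, 8)
```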
• 30. RSBG [J. Sun+, CVPR, 2020]
Models group-based interactions by investigating the relations between pedestrians.
● Group: pedestrians heading to the same destination, pedestrians far apart but moving in the same direction, etc.
● Humans annotate group membership to supervise the group recognition
● A Relational Social Representation (a recursive social behavior graph processed by a GCN) captures the group interactions
● It is concatenated with the surrounding static scene information and the past trajectories to predict future paths
J. Sun, et al., "Recursive Social Behavior Graph for Trajectory Prediction," CVPR, 2020.
• 31. STAR [C. Yu+, ECCV, 2020]
The LSTMs used in trajectory prediction have two problems:
● LSTMs have difficulty modeling complex temporal dependencies
● Attention-based prediction methods still do not model interactions completely
The Transformer is extended to spatio-temporal attention and applied to the trajectory prediction task.
● A Temporal Transformer encodes the trajectory features
● A Spatial Transformer extracts the interactions independently at each time step
● Using the two Transformers substantially improves prediction accuracy over LSTM-based methods
C. Yu, et al., "Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction," ECCV, 2020.
• 32. Goal of this survey (outline recap, as on slide 20)
• 33. Behavior-CNN [S. Yi+, ECCV, 2016]
A CNN-based prediction method.
● Past trajectories are encoded into a sparse displacement volume (see the sketch below)
● Several convolution and max-pooling layers followed by deconvolution layers output the predicted trajectories
● A learnable location bias map is combined channel-wise with the pooled feature maps
 - This lets the network account for pedestrian behaviour that changes with the specific scene
S. Yi, et al., "Pedestrian Behavior Understanding and Prediction with Deep Neural Networks," ECCV, 2016.
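The displacement-volume encoding can be sketched as follows. This is only the rough idea: the paper's actual encoding normalises and offsets the displacements, which is omitted here, and the grid parameters are assumptions.

```python
import numpy as np

def encode_displacement_volume(trajs, grid_hw, extent):
    """Encode past walking paths into a sparse displacement volume of shape (H, W, 2M).

    trajs   : (N, M, 2) past positions of N pedestrians over M observed frames
    grid_hw : (H, W) spatial resolution of the volume
    extent  : (x_min, y_min, x_max, y_max) area covered by the grid
    """
    n, m, _ = trajs.shape
    h, w = grid_hw
    x_min, y_min, x_max, y_max = extent
    volume = np.zeros((h, w, 2 * m), dtype=np.float32)

    for traj in trajs:
        cur = traj[-1]                                           # the current position picks the cell
        col = int((cur[0] - x_min) / (x_max - x_min) * (w - 1))
        row = int((cur[1] - y_min) / (y_max - y_min) * (h - 1))
        if 0 <= row < h and 0 <= col < w:
            volume[row, col] = (traj - cur).reshape(-1)          # past displacements w.r.t. "now"
    return volume

vol = encode_displacement_volume(np.random.rand(5, 4, 2) * 20, (64, 64), (0, 0, 20, 20))
print(vol.shape)  # (64, 64, 8): almost all cells stay zero, hence "sparse"
```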
• 34. Future localization in first-person videos [T. Yagi+, CVPR, 2018]
Future localization of pedestrians facing the camera wearer in first-person videos.
● Exploits cues that are specific to the first-person view:
 1. The camera wearer's ego-motion, which affects the observed position of the facing pedestrian
 2. The scale of the facing pedestrian
 3. The pose of the facing pedestrian
● A multi-stream model that uses these three cues predicts the pedestrian's future locations
T. Yagi, et al., "Future Person Localization in First-Person Videos," CVPR, 2018.
• 35. OPPU [A. Bhattacharyya+, CVPR, 2018]
A method that predicts the future locations of pedestrians seen in on-board (vehicle-mounted) camera video.
● Input: the pedestrian's bounding box, the ego-vehicle's motion, and the on-board camera images
● Output: the pedestrian's future bounding boxes
A. Bhattacharyya, et al., "Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty," CVPR, 2018.
• 36. Fast and Furious [W. Luo+, CVPR, 2018]
Proposes a model that jointly performs vehicle detection, tracking, and motion forecasting.
● Uses 3D point cloud (LiDAR) data as input
 - The feature representation is sparse in 3D space (bird's-eye-view representation)
 - This keeps the network's computational cost low
 - All three tasks can be computed simultaneously in real time
W. Luo, et al., "Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net," CVPR, 2018.
• 37. Object Attributes and Semantic Environment [H. Minoura+, VISAPP, 2019]
Treats the different types of moving objects as attributes and predicts attribute-specific trajectories.
● Considers the latent characteristics that each type of moving object has
 - Pedestrians: walk on sidewalks and roads
 - Cars: drive on roads
● Because the attribute is represented as a one-hot vector, there is no need to build a separate model per attribute (see the sketch below)
 - This keeps the computational cost low
● Semantic scene labels are used so that the characteristic trajectory of each attribute can be predicted
H. Minoura, et al., "Path Predictions Using Object Attributes and Semantic Environment," VISAPP, 2019.
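A minimal sketch of the input construction under this idea (an illustration, not the paper's exact pipeline): the attribute one-hot vector and a local crop of the semantic label map are simply concatenated to the position, so one shared prediction network can serve all attributes. The attribute set and patch size are assumptions.

```python
import numpy as np

ATTRIBUTES = ["pedestrian", "car", "cyclist"]   # assumed attribute set

def build_step_input(xy, attribute, semantic_patch):
    """Concatenate position, one-hot attribute, and a local semantic patch into one input vector.

    xy             : (2,) current position
    attribute      : one of ATTRIBUTES
    semantic_patch : (H, W) crop of the semantic label map around the agent
    """
    one_hot = np.zeros(len(ATTRIBUTES), dtype=np.float32)
    one_hot[ATTRIBUTES.index(attribute)] = 1.0
    return np.concatenate([xy, one_hot, semantic_patch.reshape(-1)])

x = build_step_input(np.array([3.2, 7.5]), "car", np.random.randint(0, 5, (8, 8)))
print(x.shape)  # (69,) = 2 + 3 + 64, fed to a single shared prediction network
```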
• 38. Rules of the Road [J. Hong+, CVPR, 2019]
A trajectory prediction method for vehicles on public roads.
● Proposes a CNN + GRU model that accounts for different kinds of scene context
 - The target vehicle's position, the other vehicles' positions, the context, and the road information are rasterised and concatenated channel-wise
 - The concatenated tensor at each time step is encoded by a CNN
 - The encoded features are propagated in time by a GRU to predict the future trajectory
J. Hong, et al., "Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic Interactions," CVPR, 2019.
• 39. Multiverse [J. Liang+, CVPR, 2020]
Proposes Multiverse, which predicts multiple plausible trajectories.
● The past semantic-segmentation frames and a fixed grid are fed into the History Encoder (HE)
 - A convolutional recurrent neural network encodes the spatio-temporal features
 - Using semantic labels as input makes the model robust to domain shift
● The HE output and the semantic labels of the last observed frame are fed into the Coarse Location Decoder (CLD)
 - A GAT weights each grid cell and produces a weight (heat) map
● The Fine Location Decoder (FLD) stores an offset vector in each grid cell
 - Like the CLD, a GAT weights each cell
● Multiple trajectories are predicted from the FLD and the CLD
J. Liang, et al., "The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction," CVPR, 2020.
• 40. Goal of this survey (outline recap, as on slide 20)
• 41. The datasets most frequently used for trajectory prediction
● ETH Dataset: pedestrians recorded in an urban area
 - Samples: 786, Scenes: 2, Classes: pedestrian
● UCY Dataset: pedestrians recorded in an urban area
 - Samples: 750, Scenes: 3, Classes: pedestrian
● Stanford Drone Dataset (SDD): recorded on the Stanford University campus
 - Samples: 10,300, Scenes: 8, Classes: pedestrian, car, cyclist, bus, skater, cart
S. Pellegrini, et al., "You'll Never Walk Alone: Modeling Social Behavior for Multi-target Tracking," ICCV, 2009.
A. Lerner, et al., "Crowds by Example," CGF, 2007.
A. Robicquet, et al., "Learning Social Etiquette: Human Trajectory Understanding in Crowded Scenes," ECCV, 2016.
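On ETH/UCY the standard protocol observes 8 frames and predicts the following 12 frames (3.2 s and 4.8 s at 2.5 fps). A small preprocessing sketch under that assumption:

```python
import numpy as np

def make_windows(track, obs_len=8, pred_len=12):
    """Split one pedestrian track into (observation, ground-truth future) pairs.

    track : (T, 2) sequence of world coordinates for a single pedestrian
    Uses the common ETH/UCY protocol of 8 observed and 12 predicted frames.
    """
    samples = []
    total = obs_len + pred_len
    for start in range(0, len(track) - total + 1):
        window = track[start:start + total]
        samples.append((window[:obs_len], window[obs_len:]))
    return samples

track = np.cumsum(np.random.randn(30, 2) * 0.1, axis=0)    # a toy 30-frame trajectory
pairs = make_windows(track)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)     # 11 (8, 2) (12, 2)
```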
• 42. Bird's-eye-view datasets
Recorded with fixed cameras or drones; because abundant data can be collected this way, these datasets are very large-scale.
● Argoverse Dataset: recorded on public roads
 - Samples: 300K, Scenes: 113, Classes: car, Extra: lane information, map data, sensor data
● inD Dataset: recorded at intersections with a drone
 - Samples: 13K, Scenes: 4, Classes: pedestrian, car, cyclist
● Lyft Level 5 Dataset: recorded on public roads
 - Samples: 3B, Scenes: 170,000, Classes: pedestrian, car, cyclist, Extra: aerial imagery, semantic labels
● The Forking Paths Dataset: generated with a simulator
 - Samples: 0.7K, Scenes: 7, Classes: pedestrian, Extra: multiple future paths, semantic labels
M.F. Chang, et al., "Argoverse: 3D Tracking and Forecasting with Rich Maps," CVPR, 2019.
J. Bock, et al., "The inD Dataset: A Drone Dataset of Naturalistic Road User Trajectories at German Intersections," CoRR, 2019.
J. Houston, et al., "One Thousand and One Hours: Self-driving Motion Prediction Dataset," CoRR, 2020.
J. Liang, et al., "The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction," CVPR, 2020.
• 43. On-board (vehicle-mounted) camera datasets
Aimed at predicting the trajectories of moving objects in front of the vehicle.
● Apolloscape Dataset: recorded on public roads
 - Samples: 81K, Scenes: 100,000, Classes: pedestrian, car, cyclist
● PIE Dataset: recorded on public roads
 - Samples: 1.8K, Scenes: 53, Classes: pedestrian, Extra: ego-vehicle information, infrastructure annotations
● TITAN Dataset: recorded on public roads
 - Samples: 645K, Scenes: 700, Classes: pedestrian, car, cyclist, Extra: action labels, pedestrian age
Y. Ma, et al., "TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents," AAAI, 2019.
A. Rasouli, et al., "PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction," ICCV, 2019.
S. Malla, et al., "TITAN: Future Forecast using Action Priors," CVPR, 2020.
• 44. First-person-view dataset
Aimed at predicting the trajectories of pedestrians in front of the camera wearer; subjects wear a wearable camera.
● First-Person Locomotion Dataset: recorded on sidewalks
 - Samples: 5K, Scenes: 87, Classes: pedestrian, Extra: pose information, ego-motion
T. Yagi, et al., "Future Person Localization in First-Person Videos," CVPR, 2018.
• 45. Evaluation metrics
● Displacement Error: Euclidean distance between the ground-truth and predicted positions
 - Average Displacement Error (ADE): error averaged over all predicted time steps
 - Final Displacement Error (FDE): error at the final predicted time step
● Negative log-likelihood (NLL): log-likelihood of the ground truth under the estimated predictive distribution
 - Evaluating multiple predicted paths with ADE/FDE alone ignores their multimodality
 - NLL is therefore used as a metric for multi-path (multimodal) prediction
● Mean Square Error (MSE): evaluated on the centre coordinates of the predicted and ground-truth bounding boxes
 - Used for trajectory prediction in on-board camera video
 - An F-measure based on the bounding-box overlap can also be used
● Collision rate: the rate at which the predicted trajectory collides with other objects
 - Displacement errors are averaged over all samples, so they cannot show for which predictions the interaction information actually helps
 - Dynamic objects: the other agents in the video; static objects: obstacles such as buildings and trees
(A computation sketch of ADE/FDE, best-of-K evaluation, and the collision rate follows below.)
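The common metrics are straightforward to compute. The sketch below shows ADE/FDE for a single trajectory, the best-of-K variant used for models that output 20 samples, and a simple collision rate against dynamic objects; the 0.1 m collision threshold is an assumed value, and papers differ on this choice.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement error for one predicted trajectory.

    pred, gt : (T_pred, 2) predicted and ground-truth coordinates
    """
    dists = np.linalg.norm(pred - gt, axis=1)
    return dists.mean(), dists[-1]

def best_of_k(preds, gt):
    """Best-of-K evaluation used for multi-output models (e.g. 20 sampled trajectories)."""
    errors = [ade_fde(p, gt) for p in preds]
    return min(e[0] for e in errors), min(e[1] for e in errors)

def dynamic_collision_rate(pred, others, threshold=0.1):
    """Fraction of predicted time steps closer than `threshold` to any other agent.

    pred   : (T_pred, 2) predicted trajectory of the target
    others : (N, T_pred, 2) ground-truth trajectories of the other agents
    """
    dists = np.linalg.norm(others - pred[None], axis=-1)      # (N, T_pred)
    return float((dists.min(axis=0) < threshold).mean())

gt = np.stack([np.linspace(0, 11, 12), np.zeros(12)], axis=1)
pred = gt + 0.2
print(ade_fde(pred, gt))                                      # ADE and FDE both ~0.283
print(best_of_k(np.stack([pred, gt]), gt))                    # (0.0, 0.0): the perfect sample wins
print(dynamic_collision_rate(pred, gt[None] + np.array([0.05, 0.05])))  # 0.0
```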
• 46. Goal of this survey (outline recap, as on slide 20)
• 47. Evaluation experiments: accuracy is compared using the following representative models
Model          | Interaction | Deep Learning | Environment | Dataset
LSTM           | -           | ✔             | -           | ETH/UCY, SDD
RED            | -           | ✔             | -           | ETH/UCY, SDD
ConstVel       | -           | -             | -           | ETH/UCY, SDD
Social-LSTM    | ✔           | ✔             | -           | ETH/UCY, SDD
Social-GAN     | ✔           | ✔             | -           | ETH/UCY, SDD
STGAT          | ✔           | ✔             | -           | ETH/UCY, SDD
Trajectron     | ✔           | ✔             | -           | ETH/UCY, SDD
Env-LSTM       | -           | ✔             | ✔           | SDD
Social-STGCNN  | ✔           | ✔             | -           | ETH/UCY
PECNet         | ✔           | ✔             | -           | ETH/UCY
The compared models cover pooling models, attention models, and methods without interaction.
• 49. Displacement errors on ETH/UCY (ADE / FDE [m]; lower is better)
Method         | C-ETH       | ETH         | HOTEL       | UCY         | ZARA01      | ZARA02      | AVG
LSTM           | 0.56 / 1.15 | 0.91 / 1.57 | 0.29 / 0.56 | 0.84 / 1.56 | 1.33 / 2.69 | 0.77 / 1.50 | 0.78 / 1.51
RED            | 0.58 / 1.22 | 0.70 / 1.46 | 0.17 / 0.33 | 0.61 / 1.32 | 0.45 / 0.99 | 0.36 / 0.79 | 0.48 / 1.02
ConstVel       | 0.57 / 1.26 | 0.57 / 1.26 | 0.19 / 0.33 | 0.67 / 1.43 | 0.49 / 1.07 | 0.50 / 1.09 | 0.50 / 1.07
Social-LSTM    | 0.90 / 1.70 | 1.30 / 2.55 | 0.47 / 0.98 | 1.00 / 2.01 | 0.92 / 1.69 | 0.78 / 1.61 | 0.90 / 1.76
Social-GAN     | 0.49 / 1.05 | 0.53 / 1.03 | 0.35 / 0.77 | 0.79 / 1.63 | 0.47 / 1.01 | 0.44 / 0.94 | 0.51 / 1.07
PECNet         | 0.68 / 1.12 | 0.71 / 1.14 | 0.14 / 0.21 | 0.63 / 1.19 | 0.47 / 0.83 | 0.36 / 0.67 | 0.50 / 0.86
STGAT          | 0.48 / 1.05 | 0.51 / 1.01 | 0.19 / 0.31 | 0.61 / 1.33 | 0.47 / 1.00 | 0.39 / 0.78 | 0.46 / 0.91
Trajectron     | 0.52 / 1.14 | 0.56 / 1.18 | 0.26 / 0.51 | 0.63 / 1.37 | 0.50 / 1.05 | 0.39 / 0.84 | 0.48 / 1.02
Social-STGCNN  | 0.68 / 1.27 | 0.83 / 1.35 | 0.22 / 0.34 | 0.84 / 1.46 | 0.61 / 1.12 | 0.54 / 0.93 | 0.62 / 1.08
The rows are grouped into a "Single Model" block (deterministic, single-output evaluation) and a "20 Outputs" block (best of 20 sampled trajectories).
● Single Model: RED, which does not consider interactions, achieves the lowest error
● 20 Outputs: STGAT achieves the lowest ADE and PECNet the lowest FDE
• 50. Displacement errors on ETH/UCY (continued; same table as slide 49)
● Prediction methods based on attention models are more effective than pooling models
• 51. Collision rate on ETH/UCY and qualitative prediction examples
Collision rate [%] (lower is better):
Object          | LSTM | RED  | ConstVel | Social-LSTM | Social-GAN | PECNet | STGAT | Trajectron | Social-STGCNN
Dynamic objects | 0.42 | 0.78 | 0.55     | 0.89        | 0.99       | 0.71   | 1.10  | 0.54       | 1.63
Static objects  | 0.08 | 0.07 | 0.09     | 0.16        | 0.08       | 0.12   | 0.08  | 0.13       | 0.16
(Slide figure: qualitative examples per model showing the input, ground-truth, and predicted trajectories.)
• 52. Collision rate on ETH/UCY (continued; same table as slide 51)
● A method with low prediction error is not necessarily a method with a low collision rate
 - A prediction that deviates from the ground truth increases the displacement error, yet its collision rate can decrease
(Slide figure: example trajectories judged collision-free and judged colliding with respect to the dynamic objects.)
• 53. Displacement Error on SDD
ADE / FDE [pixel]

[Single Model]
Method       bookstore    coupa        deathCircle  gates        hyang        little       nexus        quad         AVG
LSTM         7.00 / 14.8  8.44 / 17.5  7.52 / 15.9  5.78 / 11.9  8.78 / 18.4  10.8 / 23.1  6.61 / 13.1  16.1 / 30.2  8.88 / 18.1
RED          7.91 / 17.1  9.51 / 20.4  8.22 / 17.8  5.72 / 12.0  9.14 / 19.5  11.8 / 25.8  6.24 / 12.7  4.81 / 10.9  7.92 / 17.0
ConstVel     6.63 / 12.9  8.17 / 16.2  7.29 / 14.0  5.76 / 10.9  9.21 / 18.1  10.9 / 22.1  7.14 / 13.7  5.31 / 8.89  7.56 / 14.6
Social-LSTM  33.6 / 74.0  34.8 / 76.0  33.4 / 74.7  35.6 / 83.2  35.4 / 75.9  36.7 / 77.5  32.3 / 71.3  32.4 / 71.3  34.3 / 75.5
Env-LSTM     13.5 / 30.1  17.2 / 36.7  15.3 / 32.2  17.3 / 36.5  12.6 / 27.9  14.2 / 31.1  10.8 / 24.0  8.05 / 19.0  13.8 / 30.0

[20 Outputs]
Social-GAN   18.4 / 36.7  19.5 / 39.1  18.6 / 37.2  18.6 / 37.3  20.1 / 40.6  20.0 / 40.8  18.1 / 36.3  13.2 / 26.2  18.3 / 36.8
STGAT        7.58 / 14.6  9.00 / 17.4  7.57 / 14.4  6.33 / 11.7  9.17 / 17.9  10.9 / 21.8  7.37 / 14.0  4.83 / 7.95  7.85 / 15.0
Trajectron   6.18 / 13.2  7.24 / 15.5  6.43 / 13.6  6.29 / 13.0  7.72 / 16.5  9.38 / 20.8  6.55 / 13.4  6.80 / 15.1  7.07 / 15.1

Single Model: ConstVel, which does not use Deep Learning, reduces the error the most.
20 Outputs: Trajectron reduces ADE the most and STGAT reduces FDE the most.
• 54. Why is ConstVel's prediction error low on SDD?
The results are influenced by where the scenes were captured.
● ETH/UCY is captured from a low position
  - pedestrian trajectories appear sensitive (fine-grained, non-linear motion is visible)
● SDD is captured from a high position
  - pedestrian trajectories appear insensitive
  - motion becomes nearly linear, so the prediction error of ConstVel, which predicts linearly, drops
[Example frames: ETH/UCY vs. SDD]
• 55. Collision rate and example predictions on SDD
Collision rate [%]

Model        Dynamic objects  Static objects
LSTM         10.71            2.82
RED          11.27            2.86
ConstVel     10.98            2.40
Social-LSTM  15.12            20.33
Social-GAN   10.91            6.41
STGAT        10.97            2.17
Trajectron   12.80            1.71
Env-LSTM     12.12            1.58

[Figure: example predicted trajectories for LSTM, RED, ConstVel, Social-LSTM, Social-GAN, STGAT, Trajectron and Env-LSTM; legend: input, ground truth, prediction]
• 56. Collision rate and example predictions on SDD (cont.)
Collision rate [%] (same table as slide 55)
Introducing environmental (scene) information makes it possible to predict trajectories that avoid contact with obstacles.
[Figure: example predicted trajectories per model; legend: input, ground truth, prediction]
• 57. What comes next for trajectory prediction?
Larger-scale datasets driven by the progress of Deep Learning.
Rethinking the evaluation metrics for trajectory prediction:
● a method with good prediction accuracy is not necessarily the best prediction method
● the community as a whole needs to reconsider how predictions are evaluated
More approaches that predict multiple paths:
● will methods that emphasize multiple future paths, led by Multiverse, continue to increase?
[Timeline figure, 2016-2020, grouping methods into interaction, multimodal paths and other: Social-LSTM [A. Alahi+, CVPR, 2016], DESIRE [N. Lee+, CVPR, 2017], Conv.Social-Pooling [N. Deo+, CVPRW, 2018], SoPhie [A. Sadeghian+, CVPR, 2019], Social-BiGAT [V. Kosaraju+, NeurIPS, 2019], Social-STGCNN [A. Mohamedl+, CVPR, 2020], Social-GAN [A. Gupta+, CVPR, 2018], Next [J. Liang+, CVPR, 2019], STGAT [Y. Huang+, ICCV, 2019], Trajectron [B. Ivanovic+, ICCV, 2019], Social-Attention [A. Vemula+, ICRA, 2018], Multi-Agent Tensor Fusion [T. Zhao+, CVPR, 2019], MX-LSTM [I. Hasan+, CVPR, 2018], CIDNN [Y. Xu+, CVPR, 2018], SR-LSTM [P. Zhang+, CVPR, 2019], Group-LSTM [N. Bisagno+, CVPR, 2018], Reciprocal Network [S. Hao+, CVPR, 2020], PECNet [K. Mangalam+, ECCV, 2020], RSBG [J. SUN+, CVPR, 2020], STAR [C. Yu+, ECCV, 2020], Behavior CNN [S. Yi+, ECCV, 2016], Future localization in first-person videos [T. Yagi+, CVPR, 2018], Fast and Furious [W. Luo+, CVPR, 2018], OPPU [A. Bhattacharyya+, CVPR, 2018], Object Attributes and Semantic Segmentation [H. Minoura+, VISAPP, 2019], Rule of the Road [J. Hong+, CVPR, 2019], Multiverse [J. Liang+, CVPR, 2020], Trajectron++ [T. Salzmann+, ECCV, 2020]; multimodal paths + (interaction) ...]
• 58. Summary
A survey of the research trends in trajectory prediction methods based on Deep Learning
● Surveyed the characteristics of the prediction methods in each category
  - With interaction: Pooling models, Attention models
  - Without interaction (Other)
● Introduced the datasets and evaluation metrics used for quantitative evaluation
  - Large-scale datasets are increasing with the development of Deep Learning
● Used representative models to discuss the accuracy and prediction results of each model
  - Attention models achieve lower prediction error than Pooling models
  - The model with the lowest collision rate is not necessarily the model with the lowest prediction error
  - Deep Learning-based prediction methods are effective on sensitive datasets such as ETH/UCY