Pedestrian behavior/intention modeling for autonomous driving V

Yu Huang, Chief Scientist, Global AI Technical Officer, Autonomous Driving
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and
Abnormal Event Detection (17.2.18)
• Group LSTM: Group Trajectory Prediction in Crowded Scenarios (ECCV2018 workshop)
• Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention
Networks (7.17)
• The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic
Spatiotemporal Graphs (8.23)
• Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM (8.23)
• STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction (ICCV19)
• Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for
Predicting Pedestrian Motion Over Long Time Horizons (ICCV19)
• GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction (3.26)
• Recursive Social Behavior Graph for Trajectory Prediction (4.22)
Soft + Hardwired Attention: An LSTM Framework for Human
Trajectory Prediction and Abnormal Event Detection
• As humans we possess an intuitive ability for navigation which we master through years
of practice; however existing approaches to model this trait for diverse tasks including
monitoring pedestrian flow and detecting abnormal events have been limited by using a
variety of hand-crafted features.
• Recent research in the area of deep learning has demonstrated the power of learning
features directly from the data, and related research in recurrent neural networks has
shown exemplary results in sequence-to-sequence problems such as neural machine
translation and neural image caption generation.
• Motivated by these approaches, the authors propose a method to predict the future motion
of a pedestrian given a short history of their own and their neighbours' past behaviour.
• The novelty of the method is the combined attention model which utilises both “soft
attention” as well as “hard-wired” attention in order to map the trajectory information
from the local neighbourhood to the future positions of the pedestrian of interest.
• The paper shows how a simple approximation of attention weights (i.e. hard-wired) can be
merged with soft attention weights to make the model applicable to challenging
real-world scenarios with hundreds of neighbours.
Soft + Hardwired Attention: An LSTM Framework for Human
Trajectory Prediction and Abnormal Event Detection
A scene (on the left): The trajectory of the pedestrian of interest is shown in green; there are two neighbours
(shown in purple) to the left, one in front and none on the right. Neighbourhood encoding scheme (on the right):
Trajectory information is encoded with LSTM encoders. A soft attention context vector is used to embed the
trajectory information from the pedestrian of interest, and a hardwired attention context vector is used for
neighbouring trajectories. The soft attention context vector is generated by a soft attention function. The merged
context vector is then used to predict the future trajectory for the pedestrian of interest (shown in red).
Soft + Hardwired Attention: An LSTM Framework for Human
Trajectory Prediction and Abnormal Event Detection
The Soft + Hardwired Attention model utilises the trajectory information from both the pedestrian of
interest and the neighbouring trajectories. The trajectory information from the pedestrian of interest is
embedded with the soft attention context vector, while neighbouring trajectories are embedded with the aid
of a hardwired attention context vector. The soft attention context vector is generated by a soft
attention function. The merged context vector is then used to predict the future state.
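The two context vectors described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the bilinear scoring matrix `W`, the inverse-distance form of the hardwired weights, and the concatenation used to merge the two contexts are all assumptions made for this sketch.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def soft_context(own_hidden, decoder_state, W):
    """Soft attention over the encoder states of the pedestrian of interest.

    own_hidden: (T, d) encoder hidden states, decoder_state: (d,),
    W: (d, d) scoring matrix (hypothetical bilinear score).
    """
    scores = own_hidden @ (W @ decoder_state)   # (T,)
    alpha = softmax(scores)                     # learned-style weights, sum to 1
    return alpha @ own_hidden, alpha

def hardwired_context(neigh_hidden, neigh_dist, eps=1e-6):
    """Hardwired attention over neighbours: fixed weights, nothing to learn.

    neigh_hidden: (N, T, d) neighbour encoder states,
    neigh_dist: (N, T) distance of each neighbour to the pedestrian of interest.
    Assumption: weight is the inverse of the distance.
    """
    w = 1.0 / (neigh_dist + eps)
    return np.einsum("nt,ntd->d", w, neigh_hidden), w

def merged_context(c_soft, c_hard):
    # Assumption: merge by concatenation; a learned layer could be used instead.
    return np.concatenate([c_soft, c_hard])
```

Because the hardwired weights are computed directly from distances, their cost grows only linearly with the number of neighbours, which is what makes the scheme usable in scenes with hundreds of neighbours.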
Group LSTM: Group Trajectory Prediction in Crowded Scenarios
• The analysis of crowded scenes is one of the most challenging scenarios in visual
surveillance, and a variety of factors need to be taken into account, such as the structure
of the environments, and the presence of mutual occlusions and obstacles.
• Traditional prediction methods (such as RNN, LSTM, VAE, etc.) focus on anticipating
an individual's future path based on the precise motion history of a pedestrian.
• However, since tracking algorithms are generally not reliable in highly dense scenes,
these methods are not easily applicable in real environments.
• Nevertheless, it is very common that people (friends, couples, family members, etc.)
tend to exhibit coherent motion patterns.
• Motivated by this phenomenon, the authors propose an approach to predict future
trajectories in crowded scenes at the group level.
• First, by exploiting motion coherency, trajectories that have similar motion trends are
clustered.
• In this way, pedestrians within the same group can be well segmented.
• Then, an improved social-LSTM is adopted for future path prediction.
Group LSTM: Group Trajectory Prediction in Crowded Scenarios
Representation of the Social hidden-state tensor. The black dot represents the pedestrian of interest. Other
pedestrians are shown in different color codes, namely green for pedestrians belonging to the same set, and
red for pedestrians belonging to a different set. The neighborhood of the pedestrian of interest is described by
N0 × N0 cells, which preserves the spatial information by pooling spatially adjacent neighbors. Pedestrians
belonging to the same set are not used for the final computation of the pooling layer.
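A minimal sketch of such a pooling step is given below. It sums the hidden states of neighbours into a grid centred on the pedestrian of interest and skips members of the same group, as the caption describes. The function name, the `n_cells` and `cell_size` parameters, and the sum-pooling choice are assumptions for illustration.

```python
import numpy as np

def social_pooling(idx, positions, hidden, groups, n_cells=4, cell_size=1.0):
    """Build the social hidden-state tensor for pedestrian `idx`.

    positions: (P, 2) current positions, hidden: (P, d) LSTM hidden states,
    groups: length-P group labels from the clustering step.
    Returns an (n_cells, n_cells, d) tensor.
    """
    d = hidden.shape[1]
    H = np.zeros((n_cells, n_cells, d))
    half = n_cells * cell_size / 2.0
    for j in range(len(positions)):
        # Skip the pedestrian itself and everyone in the same group.
        if j == idx or groups[j] == groups[idx]:
            continue
        dx, dy = positions[j] - positions[idx]
        # Ignore pedestrians outside the neighbourhood window.
        if not (-half <= dx < half and -half <= dy < half):
            continue
        cx = int((dx + half) // cell_size)
        cy = int((dy + half) // cell_size)
        H[cx, cy] += hidden[j]      # sum-pool neighbours sharing a cell
    return H
```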
Group LSTM: Group Trajectory Prediction in Crowded Scenarios
The figure represents the chain structure of the LSTM network between two consecutive
time steps. At each time step, the inputs of the LSTM cell are the previous position and
the Social pooling tensor Ht. The output of the LSTM cell is the current position.
Social-BiGAT: Multimodal Trajectory Forecasting using
Bicycle-GAN and Graph Attention Networks
• Predicting the future trajectories of multiple interacting agents in a scene has become an
increasingly important problem for many different applications ranging from control of
autonomous vehicles and social robots to security and surveillance.
• This problem is compounded by the presence of social interactions between humans and
their physical interactions with the scene.
• While the existing literature has explored some of these cues, it has mainly ignored the
multimodal nature of each human’s future trajectory.
• Social-BiGAT is a graph-based generative adversarial network that generates realistic,
multimodal trajectory predictions by better modelling the social interactions of
pedestrians in a scene.
• It is based on a graph attention network (GAT) that learns reliable feature representations
that encode the social interactions between humans in the scene, and a recurrent
encoder-decoder architecture that is trained adversarially to predict, based on the
features, the humans’ paths.
• It captures the multimodal nature of the prediction by forming a reversible transformation
between each scene and its latent noise vector, as in Bicycle-GAN.
Social-BiGAT: Multimodal Trajectory Forecasting using
Bicycle-GAN and Graph Attention Networks
Architecture for the Social-BiGAT model. The
model consists of a single generator, two
discriminators (one at local pedestrian scale,
and one at global scene scale), and a latent
encoder that learns noise from scenes. The
model makes use of a graph attention network
(GAT) and self-attention on an image to
consider the social and physical features of a
scene.
Social-BiGAT: Multimodal Trajectory Forecasting using
Bicycle-GAN and Graph Attention Networks
Training process for the Social-BiGAT model. Teach the generator and discriminators using traditional
adversarial learning techniques, with an additional L2 loss on generated samples to encourage
consistency. Further train the latent encoder by ensuring it can recreate noise passed into the generator,
and by making sure it mirrors a normal distribution.
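The non-adversarial terms in this training process can be sketched as three small loss functions. This is a hedged illustration: the function names, the L1 norm for the noise reconstruction, and the diagonal-Gaussian parameterisation of the latent encoder (`mu`, `log_var`) are assumptions, and the relative weighting of the terms is not given in the slides.

```python
import numpy as np

def l2_loss(pred, target):
    """Squared L2 distance between generated and ground-truth positions,
    averaged over all points. pred, target: (..., 2)."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

def latent_reconstruction_loss(z, z_rec):
    """How well the latent encoder recovers the noise fed to the generator.
    Assumption: mean absolute error."""
    return float(np.mean(np.abs(z - z_rec)))

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch.
    Pushes the encoder output to mirror a standard normal distribution."""
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
    return float(np.mean(kl))
```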
The Trajectron: Probabilistic Multi-Agent Trajectory
Modeling With Dynamic Spatiotemporal Graphs
• Developing safe human-robot interaction systems is a necessary step towards the
widespread integration of autonomous agents in society.
• A key component of such systems is the ability to reason about the many potential
futures (e.g. trajectories) of other agents in the scene.
• Trajectron is a graph-structured model that predicts many potential future trajectories of
multiple agents simultaneously in both highly dynamic and multimodal scenarios (i.e.
where the number of agents in the scene is time-varying and there are many possible
highly distinct futures for each agent).
• It combines tools from recurrent sequence modeling and variational deep generative
modeling to produce a distribution of future trajectories for each agent in a scene.
• The model is evaluated on several datasets, obtaining state-of-the-art results on
standard trajectory prediction metrics, and a new metric is introduced for
comparing models that output distributions.
The Trajectron: Probabilistic Multi-Agent Trajectory
Modeling With Dynamic Spatiotemporal Graphs
Top: An example graph with four nodes. a is the
modeled node and is of type T3. It has three
neighbors: b of type T1, c of type T2, and d of type
T1. Here, c is about to connect with a. Bottom: The
corresponding architecture for node a.
Overall, the Trajectron employs a hybrid edge
combination scheme combining aspects of Social
Attention and the Structural-RNN.
The Trajectron: Probabilistic Multi-Agent Trajectory
Modeling With Dynamic Spatiotemporal Graphs
• Trajectron combines elements of variational deep generative models (in particular, CVAEs),
recurrent sequence models (LSTMs), and dynamic spatiotemporal graphical structures to produce
high-quality multimodal trajectories that model and predict the future behaviors of multiple humans.
• Trajectron actually models a human’s velocity, which is then numerically integrated to produce
spatial trajectories.
• Build a graph G = (V, E) representing the scene, with nodes representing agents and edges based
on agents’ spatial proximity.
• A Node History Encoder (NHE) encodes a node’s state history.
• Edge Encoders (EEs) incorporate influence from neighboring nodes.
• With the previous outputs in hand, form a concatenated representation which then
parameterizes the recognition and prior distributions in the CVAE framework.
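Two of the steps above, building the proximity graph and turning predicted velocities into positions, can be sketched as follows. The `radius` threshold and the forward-Euler integration with a fixed step `dt` are assumptions; the slides only state that edges depend on spatial proximity and that velocities are numerically integrated.

```python
import numpy as np

def build_edges(positions, radius):
    """Undirected edges between agents closer than `radius`.

    positions: (N, 2). Returns a set of (i, j) pairs with i < j.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    n = len(positions)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i, j] <= radius}

def integrate_velocity(p0, velocities, dt):
    """Forward-Euler integration of a velocity sequence.

    p0: (2,) last observed position, velocities: (T, 2).
    Returns the (T, 2) positions at the end of each step.
    """
    return p0 + np.cumsum(velocities * dt, axis=0)
```

Since the graph is rebuilt from current positions at every time step, agents entering or leaving the radius add or remove edges, which is how a time-varying number of agents is accommodated in this sketch.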
Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
• The paper presents a trajectory prediction system that incorporates scene
information (Scene-LSTM) as well as individual pedestrian movement
(Pedestrian-LSTM), trained simultaneously within static crowded scenes.
• Superimpose a two-level grid structure (grid cells and subgrids) on the scene to
encode spatial granularity plus common human movements.
• The Scene-LSTM captures the commonly traveled paths, which can be used to
improve the accuracy of human trajectory prediction in local areas
(i.e. grid cells).
• Further design scene data filters, consisting of a hard filter and a soft filter, to
select the relevant scene information in a local region when necessary and
combine it with Pedestrian-LSTM for forecasting a pedestrian’s future locations.
• The experimental results on several publicly available datasets demonstrate that
it produces more accurate predicted trajectories in different scene contexts.
Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
Scene-LSTM learns common human movements on a two-level
grid structure. The common human movement is filtered and
used in combination with individual movement (Pedestrian-
LSTM) to predict a pedestrian’s future locations.
Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
The system consists of three main modules: Pedestrian Movement (PM), Scene Data (SD) and Scene Data Filter
(SDF). PM models the individual movement of pedestrians. SD encodes common human movements in each
grid cell. SDF selects relevant scene data to update the Pedestrian-LSTM, which is used to predict the future
locations. ⊗ denotes elementwise multiplication. ⊕ denotes vector addition. h_i and h_s are the hidden states of
Pedestrian-LSTM and Scene-LSTM, respectively.
Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
Illustrations of the hard filter, which determines whether the scene data should be applied in predicting the
future locations of a pedestrian. (a) the frame image is first divided into n × n grid cells (n = 4 in this example)
to capture all human movements in each grid cell; (b) & (c) only non-linear grid cells are selected for further
processing at the subgrid level; the scene data is not applied for pedestrians in the linear grid cell; (d) a non-
linear grid cell is further divided into m × m subgrids (m = 4) and each trajectory is parsed into subgrid paths;
(e) the common subgrids, occupied by common subgrid paths; (f) at prediction time, the decision of use/not
use scene data depends on the current location of each pedestrian. If the pedestrian’s current location is in the
common subgrids, the scene data is used (red pedestrian); otherwise, it is not used (green pedestrian).
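The prediction-time decision in step (f) can be sketched as a lookup on the two-level grid. The function names and the `common_subgrids` dictionary (mapping each non-linear grid cell to its set of common subgrids) are hypothetical; how cells are classified as linear or non-linear and how common subgrid paths are mined is not covered here and is taken as given input.

```python
def locate(x, y, width, height, n=4, m=4):
    """Map a pixel location to its (grid cell, subgrid) indices.

    The frame is split into n x n grid cells, each split into m x m subgrids.
    """
    cell_w, cell_h = width / n, height / n
    gx = min(int(x // cell_w), n - 1)      # clamp points on the far border
    gy = min(int(y // cell_h), n - 1)
    sub_w, sub_h = cell_w / m, cell_h / m
    sx = min(int((x - gx * cell_w) // sub_w), m - 1)
    sy = min(int((y - gy * cell_h) // sub_h), m - 1)
    return (gx, gy), (sx, sy)

def use_scene_data(x, y, width, height, common_subgrids, n=4, m=4):
    """Hard filter: apply scene data only if the pedestrian currently stands
    in a common subgrid of a non-linear grid cell.

    common_subgrids: dict {grid cell: set of subgrids}; linear cells are absent.
    """
    cell, sub = locate(x, y, width, height, n, m)
    return sub in common_subgrids.get(cell, set())
```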
Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
Illustrations of the soft filter. The relevant information of scene data (i.e. Scene-LSTM) is selected using
each pedestrian's walking behavior. The filtered grid-cell memory of each pedestrian is then used in
combination with pedestrian movements (Pedestrian-LSTM) to predict the future trajectories.
STGAT: Modeling Spatial-Temporal Interactions for
Human Trajectory Prediction
• Human trajectory prediction is challenging and critical in various applications (e.g., autonomous
vehicles and social robots).
• Because of the continuity and foresight of the pedestrian movements, the moving pedestrians in
crowded spaces will consider both spatial and temporal interactions to avoid future collisions.
• However, most of the existing methods ignore the temporal correlations of interactions with
other pedestrians involved in a scene.
• The Spatial-Temporal Graph Attention network (STGAT) is based on a sequence-to-sequence
architecture and predicts the future trajectories of pedestrians.
• Besides the spatial interactions captured by the graph attention mechanism at each time-step,
an extra LSTM is adopted to encode the temporal correlations of interactions.
• Tested on two publicly available crowd datasets (ETH and UCY), it produces more “socially”
plausible trajectories for pedestrians.
STGAT: Modeling Spatial-Temporal Interactions for
Human Trajectory Prediction
The architecture of the STGAT model. The framework is based on seq2seq model and consists of 3 parts: Encoder,
Intermediate State and Decoder. The Encoder module includes three components: 2 types of LSTMs and Graph
Attention Network (GAT). The Intermediate State encapsulates the spatial and temporal information of all
observed trajectories. The Decoder module generates the future trajectories based on Intermediate State.
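The spatial part of the Encoder can be illustrated with a single-head graph attention layer applied to the pedestrians' hidden states at one time step. This is a generic GAT layer sketched in NumPy, assuming a fully connected scene graph with self-loops; the parameter names `W` and `a` are illustrative, and the LSTMs that run before and after it in STGAT are omitted.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def graph_attention(h, W, a):
    """Single-head graph attention over all pedestrians at one time step.

    h: (N, F) per-pedestrian features, W: (F, Fp) shared linear map,
    a: (2 * Fp,) attention vector.
    Returns the (N, Fp) aggregated features and the (N, N) attention matrix.
    """
    z = h @ W                                   # (N, Fp)
    Fp = z.shape[1]
    src = z @ a[:Fp]                            # contribution of node i
    dst = z @ a[Fp:]                            # contribution of node j
    # e[i, j] = LeakyReLU(a^T [z_i || z_j])
    e = leaky_relu(src[:, None] + dst[None, :])
    e = e - e.max(axis=1, keepdims=True)        # stabilise the softmax
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    return alpha @ z, alpha
```

Applying this layer independently at every observed time step gives a sequence of interaction-aware features per pedestrian, which a second LSTM can then encode over time.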
Neighbourhood Context Embeddings in Deep Inverse Reinforcement
Learning for Predicting Pedestrian Motion Over Long Time Horizons
• Despite the fact that Deep Inverse Reinforcement Learning (D-IRL) based modelling
paradigms offer flexibility and robustness when anticipating human behaviour across
long time horizons, compared to their supervised learning counterparts, no existing
state-of-the-art D-IRL methods consider path planning in situations where there are
multiple moving pedestrians in the environment.
• To address this, the authors propose a recurrent neural network based method for
embedding pedestrian dynamics in a D-IRL setting where there are multiple moving agents.
• The motion of the pedestrian of interest, as well as the motion of other pedestrians
in the neighbourhood, is captured through Long Short-Term Memory networks.
• The neighbourhood dynamics are encoded into a feature map, preserving the spatial
integrity of the observed trajectories.
• Utilising the maximum-entropy based non-linear inverse reinforcement learning
framework, these features are mapped to a reward map.
• The results show the importance of capturing the dynamic evolution of the environment
using the embedding scheme.
Neighbourhood Context Embeddings in Deep Inverse Reinforcement
Learning for Predicting Pedestrian Motion Over Long Time Horizons
The architecture used to embed the
neighbourhood context: The trajectory of the
pedestrian of interest is shown in blue, with
three neighbours shown in green. Heading
directions are indicated with circles. The
trajectories are encoded using LSTMs, where soft
attention is utilised to embed the information
from the pedestrian of interest and hard-wired
attention is used for the neighbours. Next a
feature map is generated to embed this
information spatially, based on the cartesian
points of each trajectory.
Neighbourhood Context Embeddings in Deep Inverse Reinforcement
Learning for Predicting Pedestrian Motion Over Long Time Horizons
The architecture of the four layer fully convolution network used to
map the feature map G to the reward map R. The first three layers
contain 32, 1 × 1 convolution kernels with a ReLU activation, and
the final layer contains 1, 1 × 1 convolution kernel.
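Since every kernel is 1 × 1, each layer acts as the same small linear map applied independently at every spatial location. The sketch below implements the four-layer mapping from G to R in NumPy under that reading; the random initialisation and the helper names are assumptions for illustration.

```python
import numpy as np

def conv1x1(x, w, b):
    """1 x 1 convolution: a per-location linear map over channels.
    x: (H, W, C_in), w: (C_in, C_out), b: (C_out,)."""
    return x @ w + b

def init_params(c_in, rng, hidden=32):
    """Three hidden layers with 32 kernels each, then one output kernel."""
    sizes = [c_in, hidden, hidden, hidden, 1]
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def reward_map(G, params):
    """Map a feature map G (H, W, C) to a reward map R (H, W)."""
    x = G
    for w, b in params[:-1]:
        x = np.maximum(conv1x1(x, w, b), 0.0)   # ReLU on the first three layers
    w, b = params[-1]
    return conv1x1(x, w, b)[..., 0]             # final layer is linear
```

One consequence of this design is that the reward at a location depends only on the feature vector at that location, so any spatial context has to be carried by the feature map itself.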
The learned reward map covers all
the areas of the environment,
encapsulating structural factors such
as buildings and pathways that
influence pedestrian behaviour.
GraphTCN: Spatio-Temporal Interaction Modeling
for Human Trajectory Prediction
• Trajectory prediction is a fundamental and challenging task to forecast the future path of the
agents in autonomous applications with multi-agent interaction, where the agents need to
predict the future movements of their neighbors to avoid collisions.
• To respond timely and precisely to the environment, high efficiency and accuracy are required in
the prediction.
• Conventional approaches, e.g., LSTM-based models, take considerable computation costs in the
prediction, especially for the long sequence prediction.
• To support more efficient and accurate trajectory predictions, the authors propose GraphTCN,
a CNN-based spatial-temporal graph framework that captures the spatial and temporal
interactions in an input-aware manner.
• The spatial interaction between agents at each time step is captured with an edge graph
attention network (EGAT), and the temporal interaction across time steps is modeled with a
modified gated convolutional network.
• In contrast to conventional models, both the spatial and temporal modeling in GraphTCN are
computed within each local time window.
• Therefore, GraphTCN can be executed in parallel for much higher efficiency, and meanwhile with
accuracy comparable to best-performing approaches.
GraphTCN: Spatio-Temporal Interaction Modeling
for Human Trajectory Prediction
The overview of GraphTCN, where EGAT captures the
spatial interaction between agents for each time step
and based on the spatial and historical trajectory
embedding, TCN further captures the temporal
interaction across time steps. The decoder module
then produces multiple socially acceptable
trajectories for all the agents simultaneously.
GraphTCN: Spatio-Temporal Interaction Modeling
for Human Trajectory Prediction
TCN with a stack of 3 causal convolution layers of kernel size 3. In each
layer, the left padding is adopted based on the kernel size. The input
contains the spatial information captured by preceding modules. The
output of TCN is collected by concatenating all the outputs across time.
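A stack of causal convolutions with left padding can be sketched as below. The left padding of `K - 1` zeros is what keeps each output from seeing future time steps. A plain `tanh` activation stands in for the modified gated convolution, which is an assumption made to keep the sketch short.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution with left zero padding.

    x: (T, C_in), w: (K, C_in, C_out). Output at time t depends only on
    inputs at times t-K+1 .. t.
    """
    K = w.shape[0]
    x_pad = np.concatenate([np.zeros((K - 1, x.shape[1])), x], axis=0)
    return np.stack([
        np.einsum("kc,kco->o", x_pad[t:t + K], w)
        for t in range(x.shape[0])
    ])

def tcn(x, weights):
    """Stack of causal convolution layers; returns one output per time step.
    Assumption: tanh replaces the gated activation of the original model."""
    for w in weights:
        x = np.tanh(causal_conv1d(x, w))
    return x
```

With 3 layers of kernel size 3 the receptive field is 7 time steps, and because no layer has a recurrent dependency, all time steps can be computed in parallel.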
Recursive Social Behavior Graph for Trajectory
Prediction
• Social interaction is an important topic in trajectory prediction to generate plausible
paths.
• Force-based models use distance to compute force, and they fail when the
interaction is complicated.
• For pooling methods, the distance between two persons at a single timestep is used as a
criterion to calculate the strength of the relationship.
• Attention methods face the same problem, since Euclidean distance is used in these
methods to guide the attention mechanism.
• The paper presents the insight of a group-based social interaction model to explore
relationships among pedestrians.
• Social representations are extracted recursively, supervised by group-based annotations,
and formulated into a social behavior graph, called the Recursive Social Behavior Graph.
• The recursive mechanism greatly expands the representation power.
• Graph CNN is used to propagate social interaction information in such a graph.
Recursive Social Behavior Graph for Trajectory
Prediction
Overview. For individual representation, BiLSTMs are used to encode historical trajectory features, and CNNs are
used to encode human context features. For relational social representation, the RSBG is first generated recursively
and a GCN is then used to propagate social features. At the decoding stage, social features are concatenated with
individual features, which are finally decoded by an LSTM-based decoder.
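The propagation step can be illustrated with one graph convolution layer over a relation matrix. This sketch uses the common symmetric-normalised propagation rule with added self-loops, which is an assumption: the slides do not specify the exact GCN variant used with the RSBG, and the names `A`, `X` and `W` are illustrative.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    A: (N, N) non-negative relation strengths between pedestrians,
    X: (N, F) per-pedestrian features, W: (F, Fp) weights.
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)
```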
Pedestrian behavior/intention modeling for autonomous driving V

  • 1. Pedestrian Behavior/Intention Modeling for Autonomous Driving V Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection (17.2.18) • Group LSTM: Group Trajectory Prediction in Crowded Scenarios (ECCV2018 workshop) • Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks (7.17) • The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs (8.23) • Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM (8.23) • STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction (ICCV19) • Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons (ICCV19) • GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction (3.26) • Recursive Social Behavior Graph for Trajectory Prediction (4.22)
  • 3. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection • As humans we possess an intuitive ability for navigation which we master through years of practice; however existing approaches to model this trait for diverse tasks including monitoring pedestrian flow and detecting abnormal events have been limited by using a variety of hand-crafted features. • Recent research in the area of deep- learning has demonstrated the power of learning features directly from the data; and related research in recurrent neural networks has shown exemplary results in sequence- to-sequence problems such as neural machine translation and neural image caption generation. • Motivated by these approaches, a method to predict the future motion of a pedestrian given a short history of their, and their neighbours, past behaviour. • The novelty of the method is the combined attention model which utilises both “soft attention” as well as “hard-wired” attention in order to map the trajectory information from the local neighbourhood to the future positions of the pedestrian of interest. • How a simple approximation of attention weights (i.e. hard-wired) can be merged together with soft attention weights in order to make our model applicable for challenging real world scenarios with hundreds of neighbours.
  • 4. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection A scene (on the left): The trajectory of the pedestrian of interest is shown in green, and has two neighbours (shown in purple) to the left, one in front and none on right. Neighbourhood encoding scheme (on the right): Trajectory information is encoded with LSTM encoders. A soft attention context vector is used to embed the trajectory information from the pedestrian of interest, and a hardwired attention context vector is used for neighbouring trajectories. In order to generate soft attention vector, use a soft attention function. The merged context vector is then used to predict the future trajectory for the pedestrian of interest (shown in red).
  • 5. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection The Soft + Hardwired Attention model. utilise the trajectory information from both the pedestrian of interest and the neighbouring trajectories. embed the trajectory information from the pedestrian of interest with the soft attention context vector, while neighbouring trajectories are embedded with the aid of a hardwired attention context vector. In order to generate soft attention context vector, use a soft attention function. Then the merged context vector, is used to predict the future state
  • 6. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection
  • 7. Group LSTM: Group Trajectory Prediction in Crowded Scenarios • The analysis of crowded scenes is one of the most challenging scenarios in visual surveillance, and a variety of factors need to be taken into account, such as the structure of the environments, and the presence of mutual occlusions and obstacles. • Traditional prediction methods (such as RNN, LSTM, VAE, etc.) focus on anticipating individual’s future path based on the precise motion history of a pedestrian. • However, since tracking algorithms are generally not reliable in highly dense scenes, these methods are not easily applicable in real environments. • Nevertheless, it is very common that people (friends, couples, family members, etc.) tend to exhibit coherent motion patterns. • Motivated by this phenomenon, an approach to predict future trajectories in crowded scenes, at the group level. • First, by exploiting the motion coherency, cluster trajectories that have similar motion trends. • In this way, pedestrians within the same group can be well segmented. • Then, an improved social-LSTM is adopted for future path prediction.
  • 8. Group LSTM: Group Trajectory Prediction in Crowded Scenarios i Representation of the Social hidden-state tensor. The black dot represents the pedestrian of interest. Other pedestrians are shown in different color codes, namely green for pedestrians belonging to the same set, and red for pedestrians belonging to a different set. The neighborhood of pedestrian of interest is described by N0 × N0 cells, which preserves the spatial information by pooling spatially adjacent neighbors. Pedestrians belonging to the same set are not used for the final computation of the pooling layer.
  • 9. Group LSTM: Group Trajectory Prediction in Crowded Scenarios The figure represents the chain structure of the LSTM network between two consecutive time steps. At each time step, the inputs of the LSTM cell are the previous position and the Social pooling tensor Ht. The output of the LSTM cell is the current position.
  • 10. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
  • 11. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks • Predicting the future trajectories of multiple interacting agents in a scene has become an increasingly important problem for many different applications ranging from control of autonomous vehicles and social robots to security and surveillance. • This problem is compounded by the presence of social interactions between humans and their physical interactions with the scene. • While the existing literature has explored some of these cues, they mainly ignored the multimodal nature of each human’s future trajectory. • Social-BiGAT, a graph-based generative adversarial network that generates realistic, multimodal trajectory predictions by better modelling the social interactions of pedestrians in a scene. • Based on a graph attention network (GAT) that learns reliable feature representations that encode the social interactions between humans in the scene, and a recurrent encoder-decoder architecture that is trained adversarially to predict, based on the features, the humans’ paths. • The multimodal nature of the prediction by forming a reversible transformation between each scene and its latent noise vector, as in Bicycle-GAN.
  • 12. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks Architecture for the Social-BiGAT model. The model consists of a single generator, two discriminators (one at local pedestrian scale, and one at global scene scale), and a latent encoder that learns noise from scenes. The model makes use of a graph attention network (GAT) and self-attention on an image to consider the social and physical features of a scene.
  • 13. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks Training process for the Social-BiGAT model. Teach the generator and discriminators using traditional adversarial learning techniques, with an additional L2 loss on generated samples to encourage consistency. Further train the latent encoder by ensuring it can recreate noise passed into the generator, and by making sure it mirrors a normal distribution.
  • 14. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks
  • 15. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs • Developing safe human-robot interaction systems is a necessary step towards the widespread integration of autonomous agents in society. • A key component of such systems is the ability to reason about the many potential futures (e.g. trajectories) of other agents in the scene. • Trajectron is a graph-structured model that predicts many potential future trajectories of multiple agents simultaneously in both highly dynamic and multimodal scenarios (i.e. where the number of agents in the scene is time-varying and there are many possible highly distinct futures for each agent). • It combines tools from recurrent sequence modeling and variational deep generative modeling to produce a distribution of future trajectories for each agent in a scene. • The model is tested on several datasets, obtaining state-of-the-art results on standard trajectory prediction metrics; a new metric is also introduced for comparing models that output distributions.
  • 16. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs Top: An example graph with four nodes. a is the modeled node and is of type T3. It has three neighbors: b of type T1, c of type T2, and d of type T1. Here, c is about to connect with a. Bottom: The corresponding architecture for node a. Overall, the Trajectron employs a hybrid edge combination scheme combining aspects of Social Attention and the Structural-RNN.
  • 17. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs • Trajectron combines elements of variational deep generative models (in particular, CVAEs), recurrent sequence models (LSTMs), and dynamic spatiotemporal graphical structures to produce high-quality multimodal trajectories that model and predict the future behaviors of multiple humans. • Trajectron actually models a human’s velocity, which is then numerically integrated to produce spatial trajectories. • A graph G = (V, E) represents the scene, with nodes representing agents and edges based on the agents’ spatial proximity. • A Node History Encoder (NHE) encodes a node’s state history. • Edge Encoders (EEs) incorporate influence from neighboring nodes. • With these outputs in hand, a concatenated representation is formed, which then parameterizes the recognition and prior distributions in the CVAE framework.
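Two of the steps above, building the proximity graph and integrating predicted velocities into positions, can be sketched as follows. This is a rough illustration under stated assumptions: the `radius` threshold, the forward-Euler scheme, and both function names are hypothetical and are not taken from the Trajectron implementation.

```python
import math

def build_proximity_graph(positions, radius):
    """Return undirected edges (i, j) between agents at most `radius` apart.

    positions: dict mapping agent id -> (x, y).
    """
    ids = sorted(positions)
    edges = set()
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            i, j = ids[a], ids[b]
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            if math.hypot(dx, dy) <= radius:
                edges.add((i, j))
    return edges

def integrate_velocities(start, velocities, dt):
    """Forward-Euler integration of (vx, vy) samples into a list of (x, y) points."""
    x, y = start
    path = []
    for vx, vy in velocities:
        x += vx * dt
        y += vy * dt
        path.append((x, y))
    return path
```

Because agents enter and leave the scene, the graph would be rebuilt at every time step, which is what makes the structure dynamic.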
  • 18. The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs
  • 19. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM • A trajectory prediction system that incorporates scene information (Scene-LSTM) as well as individual pedestrian movement (Pedestrian-LSTM), trained simultaneously within static crowded scenes. • A two-level grid structure (grid cells and subgrids) is superimposed on the scene to encode spatial granularity plus common human movements. • The Scene-LSTM captures the commonly traveled paths, which can be used to improve the accuracy of human trajectory prediction in local areas (i.e. grid cells). • Scene data filters, consisting of a hard filter and a soft filter, select the relevant scene information in a local region when necessary and combine it with the Pedestrian-LSTM to forecast a pedestrian’s future locations. • Experimental results on several publicly available datasets show that the system produces more accurate predicted trajectories in different scene contexts.
  • 20. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM Scene-LSTM learns common human movements on a two-level grid structure. The common human movement is filtered and used in combination with individual movement (Pedestrian-LSTM) to predict a pedestrian’s future locations.
  • 21. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM The system consists of three main modules: Pedestrian Movement (PM), Scene Data (SD) and Scene Data Filter (SDF). PM models the individual movement of pedestrians. SD encodes common human movements in each grid cell. SDF selects relevant scene data to update the Pedestrian-LSTM, which is used to predict the future locations. ⊗ denotes elementwise multiplication. ⊕ denotes vector addition. h_i and h_s are the hidden states of Pedestrian-LSTM and Scene-LSTM, respectively.
  • 22. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM Illustrations of the hard filter, which determines whether the scene data should be applied in predicting the future locations of a pedestrian. (a) the frame image is first divided into n × n grid cells (n = 4 in this example) to capture all human movements in each grid cell; (b) & (c) only non-linear grid cells are selected for further processing at the subgrid level; the scene data is not applied for pedestrians in a linear grid cell; (d) a non-linear grid cell is further divided into m × m subgrids (m = 4) and each trajectory is parsed into subgrid paths; (e) the common subgrids are those occupied by common subgrid paths; (f) at prediction time, the decision whether to use scene data depends on the current location of each pedestrian. If the pedestrian’s current location is in the common subgrids, the scene data is used (red pedestrian); otherwise, it is not used (green pedestrian).
  • 23. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM Illustrations of the soft filter. The relevant information of scene data (i.e. Scene-LSTM) is selected using each pedestrian’s walking behavior. The filtered grid-cell memory of each pedestrian is then used in combination with pedestrian movements (Pedestrian-LSTM) to predict the future trajectories.
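The two-level grid lookup and the hard-filter decision in step (f) can be sketched as below. This is a simplified illustration, not the paper's implementation: the function names are hypothetical, and `common_subgrids` is assumed to have been precomputed from training trajectories for non-linear grid cells only.

```python
def grid_index(x, y, width, height, n):
    """Map a pixel position to (row, col) in an n x n grid over the frame."""
    col = min(int(x / width * n), n - 1)
    row = min(int(y / height * n), n - 1)
    return row, col

def subgrid_index(x, y, width, height, n, m):
    """Return ((row, col), (sub_row, sub_col)) for an n x n grid whose
    cells are each split into m x m subgrids."""
    row, col = grid_index(x, y, width, height, n)
    cell_w, cell_h = width / n, height / n
    local_x = x - col * cell_w   # offset inside the grid cell
    local_y = y - row * cell_h
    sub_col = min(int(local_x / cell_w * m), m - 1)
    sub_row = min(int(local_y / cell_h * m), m - 1)
    return (row, col), (sub_row, sub_col)

def use_scene_data(x, y, width, height, n, m, common_subgrids):
    """Hard filter: apply scene data only when the pedestrian currently
    stands in one of the common subgrids."""
    return subgrid_index(x, y, width, height, n, m) in common_subgrids
```

For a 400 × 400 frame with n = m = 4, the point (150, 250) falls in grid cell (2, 1) and subgrid (2, 2), so scene data is used only if that pair is in `common_subgrids`.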
  • 24. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM
  • 25. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction • Human trajectory prediction is challenging and critical in various applications (e.g., autonomous vehicles and social robots). • Because pedestrian movement is continuous and forward-looking, pedestrians moving in crowded spaces consider both spatial and temporal interactions to avoid future collisions. • However, most existing methods ignore the temporal correlations of interactions with the other pedestrians in a scene. • The Spatial-Temporal Graph Attention network (STGAT) is based on a sequence-to-sequence architecture and predicts the future trajectories of pedestrians. • Besides the spatial interactions captured by the graph attention mechanism at each time step, an extra LSTM is adopted to encode the temporal correlations of interactions. • Tested on two publicly available crowd datasets (ETH and UCY), it produces more “socially” plausible trajectories for pedestrians.
  • 26. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction The architecture of the STGAT model. The framework is based on a seq2seq model and consists of 3 parts: Encoder, Intermediate State and Decoder. The Encoder module includes three components: 2 types of LSTMs and a Graph Attention Network (GAT). The Intermediate State encapsulates the spatial and temporal information of all observed trajectories. The Decoder module generates the future trajectories based on the Intermediate State.
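The graph attention step applied at each time step can be sketched with a single attention head in the standard GAT form (shared linear map, LeakyReLU-scored pairs, softmax over neighbours). This is a generic sketch, not STGAT's code: the names and the 0.2 slope are assumptions, and the output nonlinearity and multi-head averaging are omitted.

```python
import math

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(H, W, a, neighbors):
    """Single-head graph attention.

    H: list of node feature vectors (e.g. per-pedestrian LSTM states).
    W: shared weight matrix; a: attention vector of length 2 * out_dim.
    neighbors: neighbors[i] lists the nodes attended to by node i (include i).
    Returns (new node features, attention weights per node).
    """
    Z = [matvec(W, h) for h in H]
    out, attn = [], []
    for i in range(len(H)):
        # Score each neighbour j from the concatenation [Z_i || Z_j].
        scores = [leaky_relu(sum(p * q for p, q in zip(a, Z[i] + Z[j])))
                  for j in neighbors[i]]
        top = max(scores)                       # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attn.append(alphas)
        out.append([sum(al * Z[j][k] for al, j in zip(alphas, neighbors[i]))
                    for k in range(len(Z[i]))])
    return out, attn
```

In an STGAT-style encoder this layer would run once per observed time step, and a second LSTM would then consume the resulting sequence of attended features.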
  • 27. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction
  • 28. STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction
  • 29. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons • Deep Inverse Reinforcement Learning (D-IRL) based modelling paradigms offer more flexibility and robustness than their supervised learning counterparts when anticipating human behaviour across long time horizons, yet no existing state-of-the-art D-IRL method considers path planning in situations where there are multiple moving pedestrians in the environment. • To address this, a recurrent neural network based method is proposed for embedding pedestrian dynamics in a D-IRL setting with multiple moving agents. • The motion of the pedestrian of interest, as well as the motion of other pedestrians in the neighbourhood, is captured through Long Short-Term Memory networks. • The neighbourhood dynamics are encoded into a feature map, preserving the spatial integrity of the observed trajectories. • Using the maximum-entropy based non-linear inverse reinforcement learning framework, these features are mapped to a reward map. • The work also stresses the importance of capturing the dynamic evolution of the environment using the embedding scheme.
  • 30. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons The architecture used to embed the neighbourhood context: The trajectory of the pedestrian of interest is shown in blue, with three neighbours shown in green. Heading directions are indicated with circles. The trajectories are encoded using LSTMs: soft attention is utilised to embed the information from the pedestrian of interest, while the neighbours use hard-wired attention. Next, a feature map is generated to embed this information spatially, based on the Cartesian points of each trajectory.
  • 31. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons The architecture of the four-layer fully convolutional network used to map the feature map G to the reward map R. The first three layers each contain 32 1 × 1 convolution kernels with a ReLU activation, and the final layer contains a single 1 × 1 convolution kernel. The learned reward map covers all areas of the environment, encapsulating structural factors such as buildings and pathways that influence pedestrian behaviour.
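Since every kernel is 1 × 1, the network on this slide applies the same small per-location mapping to each cell of the feature map. The sketch below illustrates that with plain lists. The layer widths (32, 32, 32, 1) follow the slide; the input channel count, the random initialisation, and the function names are hypothetical.

```python
import random

def conv1x1(feature_map, weights, bias, relu=True):
    """Apply a 1 x 1 convolution to an H x W x C_in nested-list feature map.

    weights: C_out rows of C_in values; bias: C_out values.
    Each location is transformed independently of its neighbours.
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            vals = [sum(w * x for w, x in zip(w_row, pixel)) + b
                    for w_row, b in zip(weights, bias)]
            out_row.append([max(0.0, v) for v in vals] if relu else vals)
        out.append(out_row)
    return out

def init_layer(c_in, c_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(c_in)] for _ in range(c_out)]
    return weights, [0.0] * c_out

def reward_map(feature_map, layers):
    """Run the stack; ReLU on all layers except the last, then drop the channel axis."""
    x = feature_map
    for idx, (w, b) in enumerate(layers):
        x = conv1x1(x, w, b, relu=idx < len(layers) - 1)
    return [[pixel[0] for pixel in row] for row in x]

rng = random.Random(0)
c_in = 4  # hypothetical number of channels in the feature map G
layers = [init_layer(c_in, 32, rng), init_layer(32, 32, rng),
          init_layer(32, 32, rng), init_layer(32, 1, rng)]
G = [[[rng.random() for _ in range(c_in)] for _ in range(5)] for _ in range(3)]
R = reward_map(G, layers)  # 3 x 5 grid of scalar rewards
```

In the D-IRL setting the weights would be trained through the maximum-entropy IRL objective rather than left at their random initial values.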
  • 32. Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons
  • 33. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction • Trajectory prediction is a fundamental and challenging task: forecasting the future path of agents in autonomous applications with multi-agent interaction, where the agents need to predict the future movements of their neighbors to avoid collisions. • To respond to the environment in a timely and precise manner, the prediction must be both efficient and accurate. • Conventional approaches, e.g., LSTM-based models, incur considerable computation costs, especially for long sequence prediction. • To support more efficient and accurate trajectory prediction, GraphTCN is a CNN-based spatial-temporal graph framework that captures the spatial and temporal interactions in an input-aware manner. • The spatial interaction between agents at each time step is captured with an edge graph attention network (EGAT), and the temporal interaction across time steps is modeled with a modified gated convolutional network. • In contrast to conventional models, both the spatial and temporal modeling in GraphTCN are computed within each local time window. • Therefore, GraphTCN can be executed in parallel for much higher efficiency, while reaching accuracy comparable to the best-performing approaches.
  • 34. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction Overview of GraphTCN: EGAT captures the spatial interaction between agents at each time step and, based on the spatial and historical trajectory embedding, TCN further captures the temporal interaction across time steps. The decoder module then produces multiple socially acceptable trajectories for all the agents simultaneously.
  • 35. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction TCN with a stack of 3 causal convolution layers of kernel size 3. In each layer, the left padding is adopted based on the kernel size. The input contains the spatial information captured by preceding modules. The output of TCN is collected by concatenating all the outputs across time.
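The left-padded causal convolution described on this slide can be sketched for a scalar sequence as follows. This is a simplified illustration: it assumes undilated convolutions and leaves out GraphTCN's gating and channel dimensions, and the function names are hypothetical.

```python
def causal_conv1d(seq, kernel):
    """1-D causal convolution: left-pad with len(kernel) - 1 zeros so that
    output[t] depends only on seq[0..t] and the output keeps the input length."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)
    return [sum(kernel[i] * padded[t + i] for i in range(k))
            for t in range(len(seq))]

def tcn_stack(seq, kernels):
    """Apply several causal convolution layers in sequence."""
    x = list(seq)
    for kernel in kernels:
        x = causal_conv1d(x, kernel)
    return x

def receptive_field(num_layers, kernel_size):
    """Number of input steps visible to one output step (no dilation)."""
    return 1 + num_layers * (kernel_size - 1)
```

With 3 layers of kernel size 3, each output step sees the current and the 6 preceding input steps. Every output step can be computed independently, which is the source of the parallelism compared with a step-by-step LSTM.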
  • 36. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction
  • 37. Recursive Social Behavior Graph for Trajectory Prediction • Social interaction is an important topic in trajectory prediction for generating plausible paths. • Force-based models use distance to compute force, and they fail when the interaction is complicated. • For pooling methods, the distance between two persons at a single timestep is used as the criterion for the strength of their relationship. • Attention methods face the same problem, since Euclidean distance is used to guide the attention mechanism. • The insight is a group-based social interaction model that explores relationships among pedestrians. • Social representations are recursively extracted under the supervision of group-based annotations and formulated into a social behavior graph, called the Recursive Social Behavior Graph (RSBG). • The recursive mechanism makes fuller use of the representation power. • A graph CNN (GCN) is used to propagate social interaction information in this graph.
  • 38. Recursive Social Behavior Graph for Trajectory Prediction Overview. For individual representation, BiLSTMs are used to encode the historical trajectory feature, and CNNs are used to encode the human context feature. For relational social representation, the RSBG is first generated recursively and a GCN is then used to propagate social features. At the decoding stage, social features are concatenated with individual features, which are finally decoded by an LSTM-based decoder.
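The GCN propagation step over a social graph can be sketched with the widely used symmetric-normalisation rule, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). This is a generic sketch rather than the paper's exact layer: the normalisation choice and the function name are assumptions, and A here stands for whatever pairwise relation strengths the RSBG provides.

```python
import math

def gcn_layer(A, H, W):
    """One graph convolution layer.

    A: n x n non-negative adjacency (relation strengths between pedestrians).
    H: n x d node features; W: d x c weight matrix.
    """
    n = len(A)
    # Add self-loops so every node keeps part of its own feature.
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbour features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][j] * H[j][k] for j in range(n)) for k in range(len(H[0]))]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][k] * W[k][c] for k in range(len(W))))
             for c in range(len(W[0]))] for i in range(n)]
```

The propagated social features would then be concatenated with each pedestrian's individual features before decoding, as the overview describes.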
  • 39. Recursive Social Behavior Graph for Trajectory Prediction