SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Víctor Campos
victor.campos@bsc.es
PhD Candidate
Barcelona Supercomputing Center
Video Analysis
Day 3 Lectures 1 & 2
#DLUPC
http://bit.ly/dlcv2018
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCSelf-supervision: motivation
● Neural Networks generally need tons of annotated data, but
generating those annotations is expensive
● Can we create some tasks for which we can get free annotations?
○ Yes! And videos are very convenient for this
● We want to
○ Frame the problem as a supervised learning task…
○ … but collecting annotations in an unsupervised manner
bit.ly/DLCV2018
#DLUPCSelf-supervision: examples
Temporal coherence
Misra et al., Shuffle and learn: unsupervised learning using temporal order verification. ECCV 2016
bit.ly/DLCV2018
#DLUPCSelf-supervision: examples
Motion as a proxy for foreground segmentation
Pathak et al., Learning features by watching objects move. CVPR 2017
bit.ly/DLCV2018
#DLUPCSelf-supervision: examples
Correspondence between images and audio
Arandjelović & Zisserman. Look, Listen and Learn. ICCV 2017.
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCWhat is a video?
● Formally, a video is a 3D signal
○ Spatial coordinates: x, y
○ Temporal coordinate: t
● If we fix t, we obtain an image. We can understand videos as
sequences of images (a.k.a. frames)
bit.ly/DLCV2018
#DLUPCHow do we work with images?
● Convolutional Neural Networks (CNN) provide state of the art
performance on image analysis tasks
● How can we extend CNNs to image sequences?
bit.ly/DLCV2018
#DLUPCCNNs for sequences of images
Several approaches have been proposed
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
Each method leverages the temporal information in a different way
f ŷ
bit.ly/DLCV2018
#DLUPCSingle frame models
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a pooling operation
(e.g. max, sum, average).
Drawback: pooling is not
aware of the temporal order!
Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
bit.ly/DLCV2018
#DLUPCCNN + RNN
Recurrent Neural Networks are
well suited for processing
sequences.
Drawback: RNNs are sequential
and cannot be parallelized.
Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015
CNN CNN CNN...
RNN RNN RNN...
bit.ly/DLCV2018
#DLUPC3D Convolutions: C3D
We can add an extra dimension to standard CNNs:
● An image is a HxWxD tensor: MxNxD’ conv filters
● A video is a TxHxWxD tensor: KxMxNxD’ conv filters
The video needs to be split into chunks (also known as clips) with a number of
frames that fits the receptive field of the C3D. Usually clips have 16 frames.
Drawbacks:
● How can we handle longer videos?
● How can we capture longer temporal dependencies?
● How can we use pre-trained networks?
Tran et al., Learning spatiotemporal features with 3D convolutional networks, ICCV 2015
bit.ly/DLCV2018
#DLUPCFrom 2D CNNs to 3D CNNs
We can add an extra dimension to standard CNNs:
● An image is a HxWxD tensor: MxNxD’ conv filters
● A video is a TxHxWxD tensor: KxMxNxD’ conv filters
We can convert an MxNxD’ conv filter into a KxMxNxD’ filter by replicating it K
times in the time axis and scaling its values by 1/K.
● This allows to leverage networks pre-trained on ImageNet and alleviate
the computational burden associated to training from scratch
Feichtenhofer et al., Spatiotemporal Residual Networks for Video Action Recognition, NIPS 2016
Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, CVPR 2017
bit.ly/DLCV2018
#DLUPCTwo-stream CNNs
Single frame models do not take into account motion in videos. Proposal:
extract optical flow for a stack of frames and use it as an input to a CNN.
Drawback: not scalable due to computational requirements and memory
footprint.
Simonyan & Zisserman, Two-stream convolutional networks for action recognition in videos, ICCV 2015
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPC
● So far, we considered video-level predictions
● What about frame-level predictions?
● What about applications which require low latency?
Problem definition
f ŷ
f (ŷ1
, …, ŷN
)
bit.ly/DLCV2018
#DLUPCMinimizing latency
Not all methods are suited for real-time applications
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
bit.ly/DLCV2018
#DLUPCMinimizing latency
Not all methods are suited for real-time applications
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
bit.ly/DLCV2018
#DLUPCSingle frame models
CNN CNN CNN...
bit.ly/DLCV2018
#DLUPCSingle frame models: redundancy
Shellhamer et al., Clockwork Convnets for Video Semantic Segmentation, ECCV 2016
bit.ly/DLCV2018
#DLUPCSingle frame models: exploiting redundancy
Carreira et al., Massively Parallel Video Networks, arXiv 2018
bit.ly/DLCV2018
#DLUPCCNN+RNN: redundancy
CNN CNN CNN...
RNN RNN RNN...
bit.ly/DLCV2018
#DLUPCCNN+RNN: exploiting redundancy
CNN CNN CNN...
RNN RNN RNN... After processing a frame, let
the RNN decide how many
future frames can be skipped
In skipped frames, simply
copy the output and state
from the previous time step
There is no ground truth for
which frames can be skipped.
The RNN learns it by itself
during training!
Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018
bit.ly/DLCV2018
#DLUPCCNN+RNN: exploiting redundancy
Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018
Used Unused
CNN CNN CNN...
RNN RNN RNN...
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCActivity Recognition: Datasets
Abu-El-Haija et al., Youtube-8m: A large-scale video classification benchmark, arxiv 2016
bit.ly/DLCV2018
#DLUPCComputational burden
● The reference dataset for image classification, ImageNet, has ~1.3M
images
○ Training a state of the art CNN can take up to 2 weeks on a
single GPU
● Now imagine that we have an ‘ImageNet’ of 1.3M videos
○ Assuming videos of 30s at 24fps, we have 936M frames
○ This is 720x ImageNet!
● Videos exhibit a large redundancy in time
○ We can reduce the frame rate without losing too much
information
bit.ly/DLCV2018
#DLUPCMemory issues
● Current GPUs can fit batches of 32~64 images when training state of
the art CNNs
○ This means 32~64 video frames at once
● Memory footprint can be reduced in different ways if a pre-trained
CNN model is used
○ Freezing some of the lower layers, reducing the memory impact
of backprop
○ Extracting frame-level features and training a model on top of it
(e.g. RNN on top of CNN features). This is equivalent to freezing
the whole architecture, but the CNN part needs to be computed
only once.
bit.ly/DLCV2018
#DLUPCI/O bottleneck
● In practice, applying deep learning to video analysis requires from
multi-GPU or distributed settings
● In such settings it is very important to avoid starving the GPUs or we
will not obtain any speedup
○ The next batch needs to be loaded and preprocessed to keep
the GPU as busy as possible
○ Using asynchronous data loading pipelines is a key factor
○ Loading individual files is slow due to the introduced overhead,
so using other formats such as TFRecord/HDF5/LMDB is highly
recommended
bit.ly/DLCV2018
#DLUPCOutline
1. Self-supervision from videos
2. Architectures for video analysis
3. Exploiting redundancy in videos
4. Tips and tricks for applying deep learning to video
bit.ly/DLCV2018
#DLUPCCNN+RNN: redundancy
CNN CNN CNN...
RNN RNN RNN...
Frame-level predictions
are optional and depend
on the task
~99% of the computational
burden is here

Más contenido relacionado

Más de Universitat Politècnica de Catalunya

Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Universitat Politècnica de Catalunya
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Universitat Politècnica de Catalunya
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Universitat Politècnica de Catalunya
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Universitat Politècnica de Catalunya
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaUniversitat Politècnica de Catalunya
 
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaDeep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaUniversitat Politècnica de Catalunya
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 

Más de Universitat Politècnica de Catalunya (20)

Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaDeep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
 

Último

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 

Último (20)

(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 

Deep Video Analysis - Víctor Campos - UPC Barcelona 2018

  • 1. Víctor Campos victor.campos@bsc.es PhD Candidate Barcelona Supercomputing Center Video Analysis Day 3 Lectures 1 & 2 #DLUPC http://bit.ly/dlcv2018
  • 2. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 3. bit.ly/DLCV2018 #DLUPCSelf-supervision: motivation ● Neural Networks generally need tons of annotated data, but generating those annotations is expensive ● Can we create some tasks for which we can get free annotations? ○ Yes! And videos are very convenient for this ● We want to ○ Frame the problem as a supervised learning task… ○ … but collecting annotations in an unsupervised manner
  • 4. bit.ly/DLCV2018 #DLUPCSelf-supervision: examples Temporal coherence Misra et al., Shuffle and learn: unsupervised learning using temporal order verification. ECCV 2016
  • 5. bit.ly/DLCV2018 #DLUPCSelf-supervision: examples Motion as a proxy for foreground segmentation Pathak et al., Learning features by watching objects move. CVPR 2017
  • 6. bit.ly/DLCV2018 #DLUPCSelf-supervision: examples Correspondence between images and audio Arandjelović & Zisserman. Look, Listen and Learn. ICCV 2017.
  • 7. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 8. bit.ly/DLCV2018 #DLUPCWhat is a video? ● Formally, a video is a 3D signal ○ Spatial coordinates: x, y ○ Temporal coordinate: t ● If we fix t, we obtain an image. We can understand videos as sequences of images (a.k.a. frames)
  • 9. bit.ly/DLCV2018 #DLUPCHow do we work with images? ● Convolutional Neural Networks (CNN) provide state of the art performance on image analysis tasks ● How can we extend CNNs to image sequences?
  • 10. bit.ly/DLCV2018 #DLUPCCNNs for sequences of images Several approaches have been proposed 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. Two-stream CNN Each method leverages the temporal information in a different way f ŷ
  • 11. bit.ly/DLCV2018 #DLUPCSingle frame models CNN CNN CNN... Combination method Combination is commonly implemented as a small NN on top of a pooling operation (e.g. max, sum, average). Drawback: pooling is not aware of the temporal order! Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
  • 12. bit.ly/DLCV2018 #DLUPCCNN + RNN Recurrent Neural Networks are well suited for processing sequences. Drawback: RNNs are sequential and cannot be parallelized. Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015 CNN CNN CNN... RNN RNN RNN...
  • 13. bit.ly/DLCV2018 #DLUPC3D Convolutions: C3D We can add an extra dimension to standard CNNs: ● An image is a HxWxD tensor: MxNxD’ conv filters ● A video is a TxHxWxD tensor: KxMxNxD’ conv filters The video needs to be split into chunks (also known as clips) with a number of frames that fits the receptive field of the C3D. Usually clips have 16 frames. Drawbacks: ● How can we handle longer videos? ● How can we capture longer temporal dependencies? ● How can we use pre-trained networks? Tran et al., Learning spatiotemporal features with 3D convolutional networks, ICCV 2015
  • 14. bit.ly/DLCV2018 #DLUPCFrom 2D CNNs to 3D CNNs We can add an extra dimension to standard CNNs: ● An image is a HxWxD tensor: MxNxD’ conv filters ● A video is a TxHxWxD tensor: KxMxNxD’ conv filters We can convert an MxNxD’ conv filter into a KxMxNxD’ filter by replicating it K times in the time axis and scaling its values by 1/K. ● This allows to leverage networks pre-trained on ImageNet and alleviate the computational burden associated to training from scratch Feichtenhofer et al., Spatiotemporal Residual Networks for Video Action Recognition, NIPS 2016 Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, CVPR 2017
  • 15. bit.ly/DLCV2018 #DLUPCTwo-stream CNNs Single frame models do not take into account motion in videos. Proposal: extract optical flow for a stack of frames and use it as an input to a CNN. Drawback: not scalable due to computational requirements and memory footprint. Simonyan & Zisserman, Two-stream convolutional networks for action recognition in videos, ICCV 2015
  • 16. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 17. bit.ly/DLCV2018 #DLUPC ● So far, we considered video-level predictions ● What about frame-level predictions? ● What about applications which require low latency? Problem definition f ŷ f (ŷ1 , …, ŷN )
  • 18. bit.ly/DLCV2018 #DLUPCMinimizing latency Not all methods are suited for real-time applications 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. Two-stream CNN
  • 19. bit.ly/DLCV2018 #DLUPCMinimizing latency Not all methods are suited for real-time applications 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. Two-stream CNN
  • 21. bit.ly/DLCV2018 #DLUPCSingle frame models: redundancy Shellhamer et al., Clockwork Convnets for Video Semantic Segmentation, ECCV 2016
  • 22. bit.ly/DLCV2018 #DLUPCSingle frame models: exploiting redundancy Carreira et al., Massively Parallel Video Networks, arXiv 2018
  • 24. bit.ly/DLCV2018 #DLUPCCNN+RNN: exploiting redundancy CNN CNN CNN... RNN RNN RNN... After processing a frame, let the RNN decide how many future frames can be skipped In skipped frames, simply copy the output and state from the previous time step There is no ground truth for which frames can be skipped. The RNN learns it by itself during training! Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018
  • 25. bit.ly/DLCV2018 #DLUPCCNN+RNN: exploiting redundancy Campos et al., Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, ICLR 2018 Used Unused CNN CNN CNN... RNN RNN RNN...
  • 26. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 27. bit.ly/DLCV2018 #DLUPCActivity Recognition: Datasets Abu-El-Haija et al., Youtube-8m: A large-scale video classification benchmark, arxiv 2016
  • 28. bit.ly/DLCV2018 #DLUPCComputational burden ● The reference dataset for image classification, ImageNet, has ~1.3M images ○ Training a state of the art CNN can take up to 2 weeks on a single GPU ● Now imagine that we have an ‘ImageNet’ of 1.3M videos ○ Assuming videos of 30s at 24fps, we have 936M frames ○ This is 720x ImageNet! ● Videos exhibit a large redundancy in time ○ We can reduce the frame rate without losing too much information
  • 29. bit.ly/DLCV2018 #DLUPCMemory issues ● Current GPUs can fit batches of 32~64 images when training state of the art CNNs ○ This means 32~64 video frames at once ● Memory footprint can be reduced in different ways if a pre-trained CNN model is used ○ Freezing some of the lower layers, reducing the memory impact of backprop ○ Extracting frame-level features and training a model on top of it (e.g. RNN on top of CNN features). This is equivalent to freezing the whole architecture, but the CNN part needs to be computed only once.
  • 30. bit.ly/DLCV2018 #DLUPCI/O bottleneck ● In practice, applying deep learning to video analysis requires from multi-GPU or distributed settings ● In such settings it is very important to avoid starving the GPUs or we will not obtain any speedup ○ The next batch needs to be loaded and preprocessed to keep the GPU as busy as possible ○ Using asynchronous data loading pipelines is a key factor ○ Loading individual files is slow due to the introduced overhead, so using other formats such as TFRecord/HDF5/LMDB is highly recommended
  • 31. bit.ly/DLCV2018 #DLUPCOutline 1. Self-supervision from videos 2. Architectures for video analysis 3. Exploiting redundancy in videos 4. Tips and tricks for applying deep learning to video
  • 32.
  • 33. bit.ly/DLCV2018 #DLUPCCNN+RNN: redundancy CNN CNN CNN... RNN RNN RNN... Frame-level predictions are optional and depend on the task ~99% of the computational burden is here