Speaker: Zbigniew Wojna, Deep Learning Researcher and founder of TensorFlight Inc.
Title: Architectures for large-scale 2D imagery
Abstract: Zbigniew will present research he conducted during his Ph.D. at University College London and in collaboration with Google. His primary interest lies in the development of neural architectures for large-scale 2D imagery problems. He will present the recently published analysis of different upsampling methods in the decoder part of visual architectures, together with an ongoing extension to GANs. He will discuss the attention mechanism for text recognition and review the kinds of applications it can be useful for (such as automatically updating Google Maps based on Google Street View imagery). He will explain the idea behind Inception and the changes in Inception-v3 that made it the best single model on ImageNet 2015, and how it compares to the ResNet architecture published two weeks later. Alongside Inception, he will present his winning submission to the MS COCO 2016 detection challenge and an extensive analysis of the different models and backbone architectures inside it. At the end he will briefly review the UCL effort on 4096x4096 images at The Digital Mammography DREAM Challenge for breast cancer recognition, where the team placed 9th among 1375 teams worldwide and 2nd in the community phase.
Bio: Zbigniew Wojna is a deep learning researcher and founder of TensorFlight Inc., a company providing instant remote commercial property inspection (for risk factors for reinsurance enterprises) based on satellite and street-view-type imagery. Zbigniew is currently in the final stage of his Ph.D. (already with more than 1000 citations) at University College London under the supervision of Professor Iasonas Kokkinos and Professor John Shawe-Taylor. His primary interest lies in finding and solving research problems around large-scale 2D machine vision applications. During his Ph.D. he spent most of his time working across different groups at DeepMind, Google Research, and Facebook Research, including the DeepMind Health Team, the Deep Learning Team for Google Maps in collaboration with Google Brain, Machine Perception with Kevin Murphy, the Weak Localization Team with Vittorio Ferrari, and the Facebook AI Research lab in Paris. His company TensorFlight Inc. was featured as one of the top 2 AI startups among a few hundred by InnovatorsRace50 and closed seed funding last year.
Thanks to all TensorFlow London meetup organisers and supporters:
Seldon.io
Altoros
Rewired
Google Developers
Rise London
6. The Devil is in the Decoder:
Classification, Regression, GANs
Alireza Fathi, Nathan Silberman, Liang-Chieh Chen, Vittorio Ferrari, Sergio Guadarrama, Jasper Uijlings
10. Encoders are well studied...
Inception
Inception-ResNet
ResNet
PolyNet
ResNeXt
Squeeze-and-Excitation Net
DenseNet
Xception
Wider or Deeper
VGG
Residual Attention Net
11. ...but decoders not too much
Laina, Iro, et al. "Deeper Depth Prediction with Fully Convolutional Residual Networks."
Shi, Wenzhe, et al. "Is the deconvolution layer the same as a convolutional layer?"
13. Thorough evaluation of decoders
Tasks: Classification, Regression, Generation
24 decoder types - 120 experiments
Everingham, Mark, et al. "The PASCAL Visual Object Classes (VOC) Challenge."
Silberman, Nathan, et al. "Indoor segmentation and support inference from RGBD images."
Yu, Xin, and Fatih Porikli. "Ultra-resolving face images by discriminative generative networks."
Iizuka, Satoshi, Edgar Simo-Serra, and Hiroshi Ishikawa. "Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification."
Uijlings, Jasper RR, and Vittorio Ferrari. "Situational object boundary detection."
Lucic, Mario, et al. "Are GANs Created Equal? A Large-Scale Study."
15. Conv + Depth To Space
Convolution: [H, W, D] -> [H, W, D], then depth-to-space: [H, W, D] -> [2H, 2W, D/4]
Shi, Wenzhe, et al. "Is the deconvolution layer the same as a convolutional layer?"
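The depth-to-space rearrangement above can be sketched in a few lines of numpy. This is a minimal illustration of the shape bookkeeping (TensorFlow ships the op as `tf.nn.depth_to_space`), not the paper's implementation:

```python
import numpy as np

def depth_to_space(x, block=2):
    """Rearrange a [H, W, D] tensor into [block*H, block*W, D/block^2]."""
    h, w, d = x.shape
    assert d % (block * block) == 0
    # Split the depth into (block, block, D/block^2) and interleave it spatially.
    x = x.reshape(h, w, block, block, d // (block * block))
    x = x.transpose(0, 2, 1, 3, 4)  # -> [H, block, W, block, D/block^2]
    return x.reshape(h * block, w * block, d // (block * block))

y = depth_to_space(np.zeros((35, 35, 512)), 2)
print(y.shape)  # (70, 70, 128), matching the [2H, 2W, D/4] on the slide
```

A convolution that keeps the spatial size followed by this rearrangement gives a learned 2x upsampling without a transposed convolution.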
25. Updating Google Maps from
Google Street View Imagery
https://research.googleblog.com/2017/05/updating-google-maps-with-deep-learning.html?m=1
https://www.androidheadlines.com/2017/05/deep-learning-automatically-updates-listings-on-google-maps.html
https://venturebeat.com/2017/05/04/google-street-view-can-now-extract-street-names-numbers-and-businesses-to-keep-maps-up-to-date/
http://post.oreilly.com/form/oreilly/viewhtml/9z1zs4tlnu25k3oe50rmmg7vqq0jfvomebltrdk1kqg?imm_mid=0f18da&cmp=em-data-na-na-newsltr_ai_20170515
Kevin Murphy, Alex Gorban, Dar-Shyang Lee, Qian Yu, Julien Ibarz, Yeqing Li
48. Simple convolution
Activations in: 35 x 35 x 512 (height x width x feature depth)
Filters: 3 x 3 x 512 x 512 (height x width x input depth x output depth)
Activations out: 35 x 35 x 512
Operations: 3 x 3 x 35 x 35 x 512 x 512
http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
49. Still simple convolution
Activations in: 35 x 35 x 512
Four parallel filters: 3 x 3 x 512 x 128 each
Activations out: four 35 x 35 x 128 maps
Operations: 4 x 3 x 3 x 35 x 35 x 512 x 128
50. Inception block with dimensionality reduction
Activations in: 35 x 35 x 512
Four parallel 1 x 1 x 512 x 128 reductions, each followed by a 3 x 3 x 128 x 128 convolution
Activations out: four 35 x 35 x 128 maps
51. Simple convolution vs. one with dimensionality reduction
Calculations:
4 x 1 x 1 x 35 x 35 x 512 x 128 + 4 x 3 x 3 x 35 x 35 x 128 x 128
Memory:
additional 4 x 35 x 35 x 128 of activations
~2.7x fewer parameters (e.g. 60M -> 22.2M)
~2.7x fewer computations
2x more nonlinearities = 2x deeper
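The ~2.7x saving above follows directly from the multiply-add counts on slides 49 and 50; a short sanity check, assuming stride-1 convolutions on the 35 x 35 grid:

```python
def conv_ops(h, w, kh, kw, cin, cout):
    # Multiply-adds for a stride-1 convolution producing an h x w output map.
    return kh * kw * h * w * cin * cout

# Four parallel 3x3 convs applied directly to the 512-deep input (slide 49):
plain = 4 * conv_ops(35, 35, 3, 3, 512, 128)
# 1x1 reduction to 128 channels first, then the 3x3 convs (slide 50):
reduced = 4 * (conv_ops(35, 35, 1, 1, 512, 128) + conv_ops(35, 35, 3, 3, 128, 128))
print(plain / reduced)  # ~2.77x fewer multiply-adds with the reduction
```

The same ratio applies to the parameter counts, since the spatial 35 x 35 factor cancels.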
52. Polyak Averaging
Exponential moving average of the weights over many iterations (an implicit model ensemble).
Model weights used for evaluation:
(1 - 0.9999) * 0.9999^0 * (weights from the latest step) +
(1 - 0.9999) * 0.9999^1 * (weights from the latest step - 1) +
(1 - 0.9999) * 0.9999^2 * (weights from the latest step - 2) +
(1 - 0.9999) * 0.9999^3 * (weights from the latest step - 3) + ...
Update rule: ExpAveVar = 0.9999 * ExpAveVar + 0.0001 * (weights from the latest step)
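The update rule on the slide can be sketched as a shadow copy of the weights that is nudged toward the latest training step; a minimal illustration (the toy decay of 0.5 is exaggerated for visibility, the slide uses 0.9999):

```python
def update_ema(shadow, weights, decay=0.9999):
    """One step of the exponential moving average kept alongside training."""
    for name, w in weights.items():
        shadow[name] = decay * shadow[name] + (1 - decay) * w
    return shadow

# Toy usage with a single scalar "weight":
shadow = {"w": 0.0}
for _ in range(4):
    update_ema(shadow, {"w": 1.0}, decay=0.5)
print(shadow["w"])  # 0.9375: the shadow weights trail the latest step
```

At evaluation time the shadow weights are used instead of the raw latest-step weights; TensorFlow provides this as `tf.train.ExponentialMovingAverage`.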
53. Augmentation
- Random cropping
- Different resizing algorithms: bilinear, bicubic, area, nearest neighbour, lanczos4
- Random affine transformation (rotation + translation + scaling)
- Random vertical / horizontal reflection
- Random brightness, hue, contrast, saturation
Could add more:
- Random elastic transformations
- Random perspective transformation
- Noise: random, Gaussian blur, color coding, grayscale, posterize, erode, salt-and-pepper
- Stretching (changing the histogram / local histograms of pixel values)
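Two of the augmentations listed above (random cropping and horizontal reflection) can be sketched in plain numpy; the 299 and 224 sizes below are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=224):
    """Random crop plus random horizontal reflection (a minimal sketch)."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        img = img[:, ::-1]  # horizontal reflection
    return img

out = augment(np.zeros((299, 299, 3)))
print(out.shape)  # (224, 224, 3)
```

In practice one would chain several such transforms per example, each drawn randomly every epoch.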
59. Batch normalization for convolutions
- The bias is removed (it would anyway be cancelled by the centering)
- The gamma factor is a constant 1, as the following convolutional filter can take care
of the scaling
- Batch statistics are taken over the batch and the spatial resolution, i.e. averaged
over BATCH SIZE x HEIGHT x WIDTH
- During training, batch statistics are taken only over the current batch; in
evaluation mode they are constants obtained with an exponential moving average
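The training-mode statistics described above (per channel, averaged over batch and spatial axes, with gamma fixed to 1 and no bias) can be sketched in numpy:

```python
import numpy as np

def conv_batch_norm(x, eps=1e-3):
    """Normalize a [N, H, W, C] batch per channel, over batch and spatial axes.
    Training-mode statistics; gamma is fixed to 1 and the bias is removed,
    as on the slide."""
    mean = x.mean(axis=(0, 1, 2))  # one mean per channel, over N x H x W
    var = x.var(axis=(0, 1, 2))
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(2.0, 3.0, size=(8, 5, 5, 16))
y = conv_batch_norm(x)
print(np.allclose(y.mean(axis=(0, 1, 2)), 0.0, atol=1e-6))  # True
```

In evaluation mode, `mean` and `var` would instead come from exponential moving averages accumulated during training.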
60. Improvements
- Prevents vanishing gradients
- Allows for higher learning rates (stabilizes training)
- Allows for lower weight decay (stabilizes training)
- Allows for faster learning rate decay
- A much better substitute for local response normalization
- Benefits from shuffling the data every epoch (so the same examples don't appear
in the same batch)
- Doesn't require as much distortion of the input data
61. Recent alternatives
- Layer normalization
- Instance normalization
- Weight normalization (normalizes every weight vector to norm 1)
- Normalization propagation (statistics derived theoretically)
- Batch renormalization (forward pass as in inference, backward pass as in batch norm)
63. Inception-v3
5 billion multiply-adds per inference
Fewer than 25 million parameters
3.58% top-5 image classification error
https://arxiv.org/pdf/1512.00567v3.pdf
64. Guidelines
- Slowly decrease the representation size going from inputs to outputs, so as not to
create a bottleneck (representation size counted in number of floats)
- High-dimensional representations are easier to process; get quickly to
1000-dimensional feature depths
- High correlation between adjacent features allows for dimensionality reduction
before spatial aggregation
- Balance the width and depth of the network, where width means FEATURE
HEIGHT x FEATURE WIDTH x FEATURE DEPTH
65. Big filter factorization
Without factorization: 5 x 5 x 35 x 35 x 512 x 512
With factorization (two stacked 3 x 3 convolutions): 2 x 3 x 3 x 35 x 35 x 512 x 512
Gain: 18/25 of the calculations and parameters
Drawback: additional 35 x 35 x 512 of activation memory
66. Small filter factorization
Without factorization: 3 x 3 x 35 x 35 x 512 x 512
With factorization (a 1 x 3 followed by a 3 x 1 convolution): 2 x 1 x 3 x 35 x 35 x 512 x 512
Gain: 6/9 of the calculations and parameters
Drawback: additional 35 x 35 x 512 of activation memory
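Both factorization gains follow from counting multiply-adds, since the spatial and channel factors are shared; a quick check:

```python
def conv_ops(h, w, kh, kw, cin, cout):
    # Multiply-adds for a stride-1 convolution producing an h x w output map.
    return kh * kw * h * w * cin * cout

# 5x5 factorized into two stacked 3x3 convolutions (slide 65):
big = conv_ops(35, 35, 5, 5, 512, 512)
big_fact = 2 * conv_ops(35, 35, 3, 3, 512, 512)
print(big_fact / big)  # 0.72 = 18/25 of the original cost

# 3x3 factorized into a 1x3 followed by a 3x1 convolution (slide 66):
small = conv_ops(35, 35, 3, 3, 512, 512)
small_fact = conv_ops(35, 35, 1, 3, 512, 512) + conv_ops(35, 35, 3, 1, 512, 512)
print(small_fact / small)  # ~0.67 = 6/9 of the original cost
```

The extra 35 x 35 x 512 memory in both cases is the intermediate activation map between the two factorized convolutions.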
67. Inception Pooling Block
Left pooling block: causes an information bottleneck
Right pooling block: the inception/convolution block is expensive
69. Regularization with label smoothing
- 0.2% absolute error improvement
- Can be seen as a deterministic version of injecting noise into the labels, which
regularizes the network and helps generalization
- Prevents predictions from becoming too confident; there is no reason for logits to
approach the unachievable minus infinity
- Can also be seen as an additional loss for deviation from a uniform prior
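Label smoothing amounts to mixing the one-hot target with a uniform distribution over the classes; a minimal sketch with an illustrative smoothing factor of 0.1:

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Mix the one-hot target with a uniform distribution over the K classes."""
    k = onehot.shape[-1]
    return onehot * (1 - eps) + eps / k

y = smooth_labels(np.eye(1000)[42])  # one-hot target for class 42 of 1000
print(y[42])   # the true class keeps 1 - eps + eps/K = 0.9001 of the mass
print(y[0])    # every other class gets eps/K = 0.0001 instead of exactly 0
```

Because no target is exactly 0 or 1, the cross-entropy loss no longer pushes the logits toward infinity.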
70. Regularization with auxiliary classifier
A network with an auxiliary classifier trains almost identically to one without; the
only visible difference appears towards the end of training. Therefore we claim it
doesn't help with training but has a regularization effect. The gain: 0.4% absolute
drop in top-1 error.
71. Sync training with backup workers
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45187.pdf
76. Lower-resolution image performance
Trained with the same computational cost:
1. 299 x 299 receptive field with stride 2 and max pooling after the first layer.
2. 151 x 151 receptive field with stride 1 and max pooling after the first layer.
3. 79 x 79 receptive field with stride 1 and no pooling after the first layer.
Future opportunity: use as a post-classifier for detecting small objects.
78. Residual Blocks vs. Inception-ResNet Blocks
Inception-ResNet block: fewer calculations, and doesn't hurt performance.
0.2x multiplier on the residual branch ("residual scaling"): a trick to simplify
training; doesn't require a "warm-up" phase.
No batch norm in the block: saves memory.
Szegedy et al., "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning."
82. Guidelines for convolutional architectures
- Use batch normalization in all the conv layers (or at least in most of them)
- Increase the feature depth and decrease the spatial resolution in the forward transformation
- Use optimizers: momentum, Adam, or RMSProp
- Decay the learning rate after convergence
- Initialization: orthogonal or variance-preserving works best
- Data augmentation: random color distortion, random cropping, etc.
- Use only a little dropout, or none at all
- Maxout: probably overcomplicates the network
- Use separable convolutional layers only when the feature depth is being multiplied
- To be sure you are using an efficient and powerful network, use
Inception-ResNet-v2 or ResNet
83. Open Source
Inception-ResNet-v2 (code + checkpoint) is available on the TensorFlow website.
https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py
http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz
https://research.googleblog.com/2016/08/improving-inception-and-image.html
https://www.tensorflow.org/versions/r0.10/tutorials/image_recognition/index.html
84. Speed/accuracy trade-offs for modern convolutional object detectors
Jonathan Huang, Anoop Korattikara, Vivek Rathod, Kevin Murphy, Alireza Fathi,
Chen Sun, Menglong Zhu, Yang Song, Sergio Guadarrama, Ian Fischer
92. Model Selection for Ensembling
Take the best K models, or select a diverse K-subset of models?
         Model 1 mAP   Model 2 mAP   Model 3 mAP
Car          20%           23%           70%
Dog          81%           80%           15%
Bear         78%           81%           20%
Chair        10%           12%           71%
Similar models
93. Model Selection for Ensembling
Take the best K models, or select a diverse K-subset of models?
         Model 1 mAP   Model 2 mAP   Model 3 mAP
Car          20%           23%           70%
Dog          81%           80%           15%
Bear         78%           81%           20%
Chair        10%           12%           71%
Complementary models
94. Final ensemble selected for challenge submission
Individual mAP   Feature            Output   Location:Classification   Location loss
(on minival)     extractor          stride   loss ratio                function
32.93            ResNet-101         8        3:1                       SmoothL1
33.3             ResNet-101         8        1:1                       SmoothL1
34.75            Inception-ResNet   16       1:1                       SmoothL1
35               Inception-ResNet   16       2:1                       SmoothL1+IOU
35.64            Inception-ResNet   8        1:1                       SmoothL1
Diversity matters. Model NMS for diverse ensembling: greedily select a diverse model
collection for ensembling, pruning away models too similar to already selected models.
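The greedy "model NMS" idea can be sketched on the toy table from slides 92-93. This is a hypothetical illustration: the cosine similarity measure and the 0.95 threshold are assumptions, not the criterion used in the actual challenge submission:

```python
import numpy as np

def diverse_ensemble(per_class_ap, k=3, max_sim=0.95):
    """Greedy model NMS: walk models in order of mean AP and keep one only if
    its per-class AP profile is not too similar (cosine) to any already kept."""
    unit = per_class_ap / np.linalg.norm(per_class_ap, axis=1, keepdims=True)
    picked = []
    for i in np.argsort(-per_class_ap.mean(axis=1)):
        if all(unit[i] @ unit[j] < max_sim for j in picked):
            picked.append(int(i))
        if len(picked) == k:
            break
    return picked

# Toy table from slides 92-93 (rows: models 1-3; columns: Car, Dog, Bear, Chair):
ap = np.array([[0.20, 0.81, 0.78, 0.10],
               [0.23, 0.80, 0.81, 0.12],
               [0.70, 0.15, 0.20, 0.71]])
print(diverse_ensemble(ap, k=2))  # [1, 2]: best model plus the complementary one
```

Model 1 is pruned because its per-class profile nearly duplicates model 2's, so the ensemble pairs the best model with the complementary model 3.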
95. Race to the Top
[Chart: mAP over time for the COCO 2016 models, deadline 9/16/2016. Series, in increasing
order of performance: Inception V2 SSD, Inception Resnet SSD, Resnet Faster RCNN,
Inception Resnet Faster RCNN, Ensemble of Resnet Faster RCNN, Ensemble of Resnet/Inception
Resnet Faster RCNN, and the same ensemble with multicrop. Reference lines: the last place
from the 2015 leaderboard; 37.4% by MSRA, the best from the 2015 leaderboard; 34.7, our
best single-model performance before ensembling/multicrop, the best single-model result
reported in the literature that does not use multiscale or multicrop; 41.6%, the last
Google submission to the test-dev server, an intelligently selected ensemble of 5 Faster
RCNN models with Resnet and Inception-Resnet.]
96. Detection examples: Inception Resnet SSD, Resnet Faster RCNN, Inception Resnet
Faster RCNN, and the final ensemble with multicrop inference.
98. The Digital Mammography DREAM Challenge
● Dataset: 641k mammograms, but only ~3k positive, and no cancer location annotations
● Additional available datasets:
○ DDSM (~10k images) with location annotations
○ Surrey (~7k images) with location annotations
● More than 1300 global competitors with $1,000,000 in prizes
99. True Positive showing an ill-defined mass in the MLO view of the right breast of a
64-year-old woman. This FFDM was predicted as cancerous with 84% probability for the
whole image.
100. True Positive showing a microcalcification cluster in the MLO view of the right
breast of a 65-year-old woman. This was predicted as cancerous with 85% probability.
108. TensorFlight solution
Client requests an address or area -> web portal / RESTful API -> proprietary cloud
infrastructure and computer vision -> up-to-date data on over 90% of buildings within
seconds*.
*Only for supported US states. We can easily expand to new countries or states upon request.
109. Commercial properties
• Roof - building footprint, degradation
• Structure - number of stories, occupancy type, construction type
• Surrounding area - e.g. potential windborne debris
110. Residential properties
• Roof - footprint, shape
• Facade - windows, doors
• Surrounding area - trees in proximity, pool, patio, fences