Speaker: Zbigniew Wojna, Deep Learning Researcher and founder of TensorFlight Inc.
Title: Architectures for large-scale 2D imagery
Abstract: Zbigniew will present research he conducted during his Ph.D. at University College London and in collaboration with Google. His primary interest lies in the development of neural architectures for large-scale 2D imagery problems. He will present the recently published analysis of different upsampling methods in the decoder part of visual architectures, together with an ongoing extension to GANs. He will discuss the attention mechanism for text recognition and review the kinds of applications it can be useful for (such as automatically updating Google Maps based on Google Street View imagery). He will explain the idea behind Inception and the changes in Inception-v3 that made it the best single model on ImageNet 2015, and how it compares to the ResNet architecture published two weeks later. Alongside Inception, he will present his winning submission to the MS COCO 2016 detection challenge and an extensive analysis of the different models and backbone architectures inside it. At the end he will briefly review the UCL effort on 4096x4096 images at The Digital Mammography DREAM Challenge for breast cancer recognition, where the team placed 9th among 1375 teams worldwide and 2nd in the community phase.
Bio: Zbigniew Wojna is a deep learning researcher and founder of TensorFlight Inc., a company providing instant remote commercial property inspection (for risk factors for reinsurance enterprises) based on satellite and street-view-type imagery. Zbigniew is currently in the final stage of his Ph.D. (already with more than 1000 citations) at University College London under the supervision of Professor Iasonas Kokkinos and Professor John Shawe-Taylor. His primary interest lies in finding and solving research problems around large-scale 2D machine vision applications. During his Ph.D. he spent most of his time working across different groups at DeepMind, Google Research, and Facebook Research, including the DeepMind Health Team, the Deep Learning Team for Google Maps in collaboration with Google Brain, Machine Perception with Kevin Murphy, the Weak Localization Team with Vittorio Ferrari, and the Facebook AI Research lab in Paris. His company TensorFlight Inc. was featured as one of the top 2 AI startups among a few hundred by InnovatorsRace50 and closed seed funding last year.
Thanks to all TensorFlow London meetup organisers and supporters:
Seldon.io
Altoros
Rewired
Google Developers
Rise London
6. The Devil is in the Decoder:
Classification, Regression, GANs
Alireza Fathi, Nathan Silberman, Liang-Chieh Chen, Vittorio Ferrari, Sergio Guadarrama, Jasper Uijlings
10. Encoders are well studied...
Inception
Inception-ResNet
ResNet
PolyNet
ResNeXt
Squeeze-and-Excitation Net
DenseNet
Xception
Wider or Deeper
VGG
Residual Attention Net
11. ...but decoders not too much
Laina, Iro, et al. "Deeper Depth Prediction with Fully Convolutional Residual Networks."
Shi, Wenzhe, et al. "Is the deconvolution layer the same as a convolutional layer?"
13. Thorough evaluation of decoders
Tasks: Classification, Regression, Generation
24 decoder types - 120 experiments
Everingham, Mark, et al. "The PASCAL Visual Object Classes (VOC) Challenge."
Silberman, Nathan, et al. "Indoor segmentation and support inference from RGBD images."
Yu, Xin, and Fatih Porikli. "Ultra-resolving face images by discriminative generative networks."
Iizuka, Satoshi, Edgar Simo-Serra, and Hiroshi Ishikawa. "Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification."
Uijlings, Jasper RR, and Vittorio Ferrari. "Situational object boundary detection."
Lucic, Mario, et al. "Are GANs Created Equal? A Large-Scale Study."
15. Conv + Depth To Space
Convolution: [H, W, D] -> [H, W, D], then depth-to-space: [H, W, D] -> [2H, 2W, D/4]
Shi, Wenzhe, et al. "Is the deconvolution layer the same as a convolutional layer?"
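The depth-to-space rearrangement above can be sketched in a few lines of numpy. This is a minimal illustration of the shape bookkeeping (TensorFlow ships the op as `tf.nn.depth_to_space`), not the paper's implementation:

```python
import numpy as np

def depth_to_space(x, block=2):
    """Rearrange a [H, W, D] tensor into [block*H, block*W, D/block^2]."""
    h, w, d = x.shape
    assert d % (block * block) == 0
    # Split the depth into (block, block, D/block^2) and interleave it spatially.
    x = x.reshape(h, w, block, block, d // (block * block))
    x = x.transpose(0, 2, 1, 3, 4)  # -> [H, block, W, block, D/block^2]
    return x.reshape(h * block, w * block, d // (block * block))

y = depth_to_space(np.zeros((35, 35, 512)), 2)
print(y.shape)  # (70, 70, 128), matching the [2H, 2W, D/4] on the slide
```

A convolution that keeps the spatial size followed by this rearrangement gives a learned 2x upsampling without a transposed convolution.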
25. Updating Google Maps from
Google Street View Imagery
https://research.googleblog.com/2017/05/updating-google-maps-with-deep-learning.html?m=1
https://www.androidheadlines.com/2017/05/deep-learning-automatically-updates-listings-on-google-maps.html
https://venturebeat.com/2017/05/04/google-street-view-can-now-extract-street-names-numbers-and-businesses-to-keep-maps-up-to-date/
http://post.oreilly.com/form/oreilly/viewhtml/9z1zs4tlnu25k3oe50rmmg7vqq0jfvomebltrdk1kqg?imm_mid=0f18da&cmp=em-data-na-na-newsltr_ai_20170515
Kevin Murphy, Alex Gorban, Dar-Shyang Lee, Qian Yu, Julien Ibarz, Yeqing Li
48. Simple convolution
Activations in: 35 x 35 x 512 (height x width x feature depth)
Filters: 3 x 3 x 512 x 512 (height x width x input depth x output depth)
Activations out: 35 x 35 x 512
Operations: 3 x 3 x 35 x 35 x 512 x 512
http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
49. Still simple convolution
Activations in: 35 x 35 x 512
Four parallel filters: 3 x 3 x 512 x 128 each
Activations out: four 35 x 35 x 128 maps
Operations: 4 x 3 x 3 x 35 x 35 x 512 x 128
50. Inception block with dimensionality reduction
Activations in: 35 x 35 x 512
Four parallel 1 x 1 x 512 x 128 reductions, each followed by a 3 x 3 x 128 x 128 convolution
Activations out: four 35 x 35 x 128 maps
51. Simple convolution vs. one with dimensionality reduction
Calculations:
4 x 1 x 1 x 35 x 35 x 512 x 128 + 4 x 3 x 3 x 35 x 35 x 128 x 128
Memory:
additional 4 x 35 x 35 x 128 of activations
~2.7x fewer parameters (e.g. 60M -> 22.2M)
~2.7x fewer computations
2x more nonlinearities = 2x deeper
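The ~2.7x saving above follows directly from the multiply-add counts on slides 49 and 50; a short sanity check, assuming stride-1 convolutions on the 35 x 35 grid:

```python
def conv_ops(h, w, kh, kw, cin, cout):
    # Multiply-adds for a stride-1 convolution producing an h x w output map.
    return kh * kw * h * w * cin * cout

# Four parallel 3x3 convs applied directly to the 512-deep input (slide 49):
plain = 4 * conv_ops(35, 35, 3, 3, 512, 128)
# 1x1 reduction to 128 channels first, then the 3x3 convs (slide 50):
reduced = 4 * (conv_ops(35, 35, 1, 1, 512, 128) + conv_ops(35, 35, 3, 3, 128, 128))
print(plain / reduced)  # ~2.77x fewer multiply-adds with the reduction
```

The same ratio applies to the parameter counts, since the spatial 35 x 35 factor cancels.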
52. Polyak Averaging
Exponential moving average of the weights over many iterations (an implicit model ensemble).
Model weights used for evaluation:
(1 - 0.9999) * 0.9999^0 * (weights from the latest step) +
(1 - 0.9999) * 0.9999^1 * (weights from the latest step - 1) +
(1 - 0.9999) * 0.9999^2 * (weights from the latest step - 2) +
(1 - 0.9999) * 0.9999^3 * (weights from the latest step - 3) + ...
Update rule: ExpAveVar = 0.9999 * ExpAveVar + 0.0001 * (weights from the latest step)
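The update rule on the slide can be sketched as a shadow copy of the weights that is nudged toward the latest training step; a minimal illustration (the toy decay of 0.5 is exaggerated for visibility, the slide uses 0.9999):

```python
def update_ema(shadow, weights, decay=0.9999):
    """One step of the exponential moving average kept alongside training."""
    for name, w in weights.items():
        shadow[name] = decay * shadow[name] + (1 - decay) * w
    return shadow

# Toy usage with a single scalar "weight":
shadow = {"w": 0.0}
for _ in range(4):
    update_ema(shadow, {"w": 1.0}, decay=0.5)
print(shadow["w"])  # 0.9375: the shadow weights trail the latest step
```

At evaluation time the shadow weights are used instead of the raw latest-step weights; TensorFlow provides this as `tf.train.ExponentialMovingAverage`.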
53. Augmentation
- Random cropping
- Different resizing algorithms: bilinear, bicubic, area, nearest neighbour, lanczos4
- Random affine transformation (rotation + translation + scaling)
- Random vertical / horizontal reflection
- Random brightness, hue, contrast, saturation
Could add more:
- Random elastic transformations
- Random perspective transformation
- Noise: random, Gaussian blur, color coding, grayscale, posterize, erode, salt-and-pepper
- Stretching (changing the histogram / local histograms of pixel values)
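Two of the augmentations listed above (random cropping and horizontal reflection) can be sketched in plain numpy; the 299 and 224 sizes below are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=224):
    """Random crop plus random horizontal reflection (a minimal sketch)."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        img = img[:, ::-1]  # horizontal reflection
    return img

out = augment(np.zeros((299, 299, 3)))
print(out.shape)  # (224, 224, 3)
```

In practice one would chain several such transforms per example, each drawn randomly every epoch.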
59. Batch normalization for convolutions
- The bias is removed (it would anyway be cancelled by the centering)
- The gamma factor is a constant 1, as the following convolutional filter can take care
of the scaling
- Batch statistics are taken over the batch and the spatial resolution, i.e. averaged
over BATCH SIZE x HEIGHT x WIDTH
- During training, batch statistics are taken only over the current batch; in
evaluation mode they are constants obtained with an exponential moving average
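The training-mode statistics described above (per channel, averaged over batch and spatial axes, with gamma fixed to 1 and no bias) can be sketched in numpy:

```python
import numpy as np

def conv_batch_norm(x, eps=1e-3):
    """Normalize a [N, H, W, C] batch per channel, over batch and spatial axes.
    Training-mode statistics; gamma is fixed to 1 and the bias is removed,
    as on the slide."""
    mean = x.mean(axis=(0, 1, 2))  # one mean per channel, over N x H x W
    var = x.var(axis=(0, 1, 2))
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(2.0, 3.0, size=(8, 5, 5, 16))
y = conv_batch_norm(x)
print(np.allclose(y.mean(axis=(0, 1, 2)), 0.0, atol=1e-6))  # True
```

In evaluation mode, `mean` and `var` would instead come from exponential moving averages accumulated during training.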
60. Improvements
- Prevents vanishing gradients
- Allows for higher learning rates (stabilizes training)
- Allows for lower weight decay (stabilizes training)
- Allows for faster learning rate decay
- A much better substitute for local response normalization
- Benefits from shuffling the data every epoch (so the same examples don't appear
in the same batch)
- Doesn't require as much distortion of the input data
61. Recent alternatives
- Layer normalization
- Instance normalization
- Weight normalization (normalizes every weight vector to norm 1)
- Normalization propagation (statistics derived theoretically)
- Batch renormalization (forward pass as in inference, backward pass as in batch norm)
63. Inception-v3
5 billion multiply-adds per inference
Fewer than 25 million parameters
3.58% top-5 image classification error
https://arxiv.org/pdf/1512.00567v3.pdf
64. Guidelines
- Slowly decrease the representation size going from inputs to outputs, so as not to
create a bottleneck (representation size counted in number of floats)
- High-dimensional representations are easier to process; get quickly to
1000-dimensional feature depths
- High correlation between adjacent features allows for dimensionality reduction
before spatial aggregation
- Balance the width and depth of the network, where width means FEATURE
HEIGHT x FEATURE WIDTH x FEATURE DEPTH
65. Big filter factorization
Without factorization: 5 x 5 x 35 x 35 x 512 x 512
With factorization (two stacked 3 x 3 convolutions): 2 x 3 x 3 x 35 x 35 x 512 x 512
Gain: 18/25 of the calculations and parameters
Drawback: additional 35 x 35 x 512 of activation memory
66. Small filter factorization
Without factorization: 3 x 3 x 35 x 35 x 512 x 512
With factorization (a 1 x 3 followed by a 3 x 1 convolution): 2 x 1 x 3 x 35 x 35 x 512 x 512
Gain: 6/9 of the calculations and parameters
Drawback: additional 35 x 35 x 512 of activation memory
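Both factorization gains follow from counting multiply-adds, since the spatial and channel factors are shared; a quick check:

```python
def conv_ops(h, w, kh, kw, cin, cout):
    # Multiply-adds for a stride-1 convolution producing an h x w output map.
    return kh * kw * h * w * cin * cout

# 5x5 factorized into two stacked 3x3 convolutions (slide 65):
big = conv_ops(35, 35, 5, 5, 512, 512)
big_fact = 2 * conv_ops(35, 35, 3, 3, 512, 512)
print(big_fact / big)  # 0.72 = 18/25 of the original cost

# 3x3 factorized into a 1x3 followed by a 3x1 convolution (slide 66):
small = conv_ops(35, 35, 3, 3, 512, 512)
small_fact = conv_ops(35, 35, 1, 3, 512, 512) + conv_ops(35, 35, 3, 1, 512, 512)
print(small_fact / small)  # ~0.67 = 6/9 of the original cost
```

The extra 35 x 35 x 512 memory in both cases is the intermediate activation map between the two factorized convolutions.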
67. Inception Pooling Block
Left pooling block: causes an information bottleneck
Right pooling block: the inception/convolution block is expensive
69. Regularization with label smoothing
- 0.2% absolute error improvement
- Can be seen as a deterministic version of injecting noise into the labels, which
regularizes the network and helps generalization
- Prevents predictions from becoming too confident; there is no reason for logits to
approach the unachievable minus infinity
- Can also be seen as an additional loss for deviation from a uniform prior
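Label smoothing amounts to mixing the one-hot target with a uniform distribution over the classes; a minimal sketch with an illustrative smoothing factor of 0.1:

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Mix the one-hot target with a uniform distribution over the K classes."""
    k = onehot.shape[-1]
    return onehot * (1 - eps) + eps / k

y = smooth_labels(np.eye(1000)[42])  # one-hot target for class 42 of 1000
print(y[42])   # the true class keeps 1 - eps + eps/K = 0.9001 of the mass
print(y[0])    # every other class gets eps/K = 0.0001 instead of exactly 0
```

Because no target is exactly 0 or 1, the cross-entropy loss no longer pushes the logits toward infinity.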
70. Regularization with auxiliary classifier
A network with an auxiliary classifier trains almost identically to one without; the
only visible difference appears towards the end of training. Therefore we claim it
doesn't help with training but has a regularization effect. The gain: 0.4% absolute
drop in top-1 error.
71. Sync training with backup workers
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45187.pdf
76. Lower-resolution image performance
Trained with the same computational cost:
1. 299 x 299 receptive field with stride 2 and max pooling after the first layer.
2. 151 x 151 receptive field with stride 1 and max pooling after the first layer.
3. 79 x 79 receptive field with stride 1 and no pooling after the first layer.
Future opportunity: use as a post-classifier for detecting small objects.
78. Residual Blocks vs. Inception-ResNet Blocks
Inception-ResNet block: fewer calculations, and doesn't hurt performance.
0.2x multiplier on the residual branch ("residual scaling"): a trick to simplify
training; doesn't require a "warm-up" phase.
No batch norm in the block: saves memory.
Szegedy et al., "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning."
82. Guidelines for convolutional architectures
- Use batch normalization in all the conv layers (or at least in most of them)
- Increase the feature depth and decrease the spatial resolution in the forward transformation
- Use optimizers: momentum, Adam, or RMSProp
- Decay the learning rate after convergence
- Initialization: orthogonal or variance-preserving works best
- Data augmentation: random color distortion, random cropping, etc.
- Use only a little dropout, or none at all
- Maxout: probably overcomplicates the network
- Use separable convolutional layers only when the feature depth is being multiplied
- To be sure you are using an efficient and powerful network, use
Inception-ResNet-v2 or ResNet
83. Open Source
Inception-ResNet-v2 (code + checkpoint) is available on the TensorFlow website.
https://github.com/tensorflow/models/blob/master/slim/nets/inception_resnet_v2.py
http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz
https://research.googleblog.com/2016/08/improving-inception-and-image.html
https://www.tensorflow.org/versions/r0.10/tutorials/image_recognition/index.html
84. Speed/accuracy trade-offs for modern convolutional object detectors
Jonathan Huang, Anoop Korattikara, Vivek Rathod, Kevin Murphy, Alireza Fathi,
Chen Sun, Menglong Zhu, Yang Song, Sergio Guadarrama, Ian Fischer
92. Model Selection for Ensembling
Take the best K models, or select a diverse K-subset of models?
         Model 1 mAP   Model 2 mAP   Model 3 mAP
Car          20%           23%           70%
Dog          81%           80%           15%
Bear         78%           81%           20%
Chair        10%           12%           71%
Similar models
93. Model Selection for Ensembling
Take the best K models, or select a diverse K-subset of models?
         Model 1 mAP   Model 2 mAP   Model 3 mAP
Car          20%           23%           70%
Dog          81%           80%           15%
Bear         78%           81%           20%
Chair        10%           12%           71%
Complementary models
94. Final ensemble selected for challenge submission
Individual mAP   Feature            Output   Location:Classification   Location loss
(on minival)     extractor          stride   loss ratio                function
32.93            ResNet-101         8        3:1                       SmoothL1
33.3             ResNet-101         8        1:1                       SmoothL1
34.75            Inception-ResNet   16       1:1                       SmoothL1
35               Inception-ResNet   16       2:1                       SmoothL1+IOU
35.64            Inception-ResNet   8        1:1                       SmoothL1
Diversity matters. Model NMS for diverse ensembling: greedily select a diverse model
collection for ensembling, pruning away models too similar to already selected models.
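The greedy "model NMS" idea can be sketched on the toy table from slides 92-93. This is a hypothetical illustration: the cosine similarity measure and the 0.95 threshold are assumptions, not the criterion used in the actual challenge submission:

```python
import numpy as np

def diverse_ensemble(per_class_ap, k=3, max_sim=0.95):
    """Greedy model NMS: walk models in order of mean AP and keep one only if
    its per-class AP profile is not too similar (cosine) to any already kept."""
    unit = per_class_ap / np.linalg.norm(per_class_ap, axis=1, keepdims=True)
    picked = []
    for i in np.argsort(-per_class_ap.mean(axis=1)):
        if all(unit[i] @ unit[j] < max_sim for j in picked):
            picked.append(int(i))
        if len(picked) == k:
            break
    return picked

# Toy table from slides 92-93 (rows: models 1-3; columns: Car, Dog, Bear, Chair):
ap = np.array([[0.20, 0.81, 0.78, 0.10],
               [0.23, 0.80, 0.81, 0.12],
               [0.70, 0.15, 0.20, 0.71]])
print(diverse_ensemble(ap, k=2))  # [1, 2]: best model plus the complementary one
```

Model 1 is pruned because its per-class profile nearly duplicates model 2's, so the ensemble pairs the best model with the complementary model 3.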
95. Race to the Top
[Chart: mAP over time for the COCO 2016 models, deadline 9/16/2016. Series, in increasing
order of performance: Inception V2 SSD, Inception Resnet SSD, Resnet Faster RCNN,
Inception Resnet Faster RCNN, Ensemble of Resnet Faster RCNN, Ensemble of Resnet/Inception
Resnet Faster RCNN, and the same ensemble with multicrop. Reference lines: the last place
from the 2015 leaderboard; 37.4% by MSRA, the best from the 2015 leaderboard; 34.7, our
best single-model performance before ensembling/multicrop, the best single-model result
reported in the literature that does not use multiscale or multicrop; 41.6%, the last
Google submission to the test-dev server, an intelligently selected ensemble of 5 Faster
RCNN models with Resnet and Inception-Resnet.]
96. Detection examples: Inception Resnet SSD, Resnet Faster RCNN, Inception Resnet
Faster RCNN, and the final ensemble with multicrop inference.
98. The Digital Mammography DREAM Challenge
● Dataset: 641k mammograms, but only ~3k positive, and no cancer location annotations
● Additional available datasets:
○ DDSM (~10k images) with location annotations
○ Surrey (~7k images) with location annotations
● More than 1300 global competitors with $1,000,000 in prizes
99. True Positive showing an ill-defined mass in the MLO view of the right breast of a
64-year-old woman. This FFDM was predicted as cancerous with 84% probability for the
whole image.
100. True Positive showing a microcalcification cluster in the MLO view of the right
breast of a 65-year-old woman. This was predicted as cancerous with 85% probability.
108. TensorFlight solution
Client requests an address or area -> web portal / RESTful API -> proprietary cloud
infrastructure and computer vision -> up-to-date data on over 90% of buildings within
seconds*.
*Only for supported US states. We can easily expand to new countries or states upon request.
109. Commercial properties
• Roof - building footprint, degradation
• Structure - number of stories, occupancy type, construction type
• Surrounding area - e.g. potential windborne debris
110. Residential properties
• Roof - footprint, shape
• Facade - windows, doors
• Surrounding area - trees in proximity, pool, patio, fences