Making Computer Vision Real
Dr Ramine Tinati
Sr. AI/ML Specialist
Agenda
Introduction to Computer Vision
Advancements and Performance Testing
ML @ AWS
Use Case: Car Insurance Claims
Background + State-of-the-Art
Computer Vision
Computer Vision
The Goal: Computer-Driven Image Recognition
A Quick Intro: Artificial Neural Network
A Neuron
Takes the inputs and multiplies them by their weights
Sums them up
Applies the activation function (tanh, sigmoid, ReLU) to the sum
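A minimal sketch of that computation in NumPy (the function and variable names are illustrative, not from the slides):

```python
# A single artificial neuron: weighted sum of inputs plus bias, then activation.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    # Multiply inputs by their weights, sum them up, apply the activation.
    z = np.dot(inputs, weights) + bias
    return activation(z)

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
print(neuron(x, w, bias=0.2))
```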
A Neural Network
Initialise the weights of the neurons
Forward propagation: each neuron takes its inputs and calculates its output, and the network produces the final output
Back propagation: readjusts the weights by calculating the error using a chosen cost function (e.g. sum of squared errors)
The aim is to minimize the cost
Image from: https://medium.com/datathings/neural-networks-and-
backpropagation-explained-in-a-simple-way-f540a3611f5e
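A toy end-to-end sketch (assuming NumPy; the network size, data, and learning rate are made up) showing initialization, forward propagation, a sum-of-squared-errors cost, and backpropagation:

```python
# Train a tiny 2-4-1 sigmoid network on XOR with full-batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Initialise the weights of the neurons
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward propagation: each layer computes its output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    cost = 0.5 * np.sum((out - y) ** 2)          # sum of squared errors

    # Back propagation: readjust the weights using the cost gradient
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(3))  # should approach [0, 1, 1, 0]
```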
Cost Functions + Optimizers
Similar to traditional machine learning models, cost functions tell us 'how good' our model is at making predictions for a given set of parameters.
Different cost functions measure different types of error and are used for different predictive tasks (e.g. regression vs classification).
Optimizers are algorithms used to change the parameters of the neural network (e.g. learning rate, weights) in order to help reduce the overall calculated cost.
Plain gradient descent is not favourable in neural networks due to the complexity of the networks (updating weights, convergence, memory); instead we choose another optimizer suited to that complexity (e.g. Adam).
Image from: https://medium.com/datathings/neural-networks-and-
backpropagation-explained-in-a-simple-way-f540a3611f5e
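A sketch contrasting a plain gradient-descent update with an Adam update (assuming NumPy; the hyperparameter defaults follow the Adam paper, and `grad` stands in for a gradient from backpropagation):

```python
import numpy as np

def gd_step(w, grad, lr=0.01):
    # Plain gradient descent: step against the gradient at a fixed rate.
    return w - lr * grad

class Adam:
    def __init__(self, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = 0.0   # running 1st and 2nd moments of the gradient
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad       # 1st moment
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2  # 2nd moment
        m_hat = self.m / (1 - self.b1 ** self.t)               # bias correction
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```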
Convolutional Neural Networks
The concept of CNNs was introduced in the early 1980s (neural networks even earlier).
The premise of a CNN is to detect an object within an image independent of its position within the frame, its rotation, or its interaction with other objects.
The basic principle of a Convolutional Neural Network (CNN) is to transform an image into a matrix of values.
Using this numerical representation, the image is processed by a series of special hidden layers which expose both the depth of the image (e.g. the RGB channels) and the orientation of the pixels, and use these to detect patterns!
Convolutional Neural Networks
We want to start with an image and produce a numerical representation of it which can be used to detect repeatable patterns.
CNN Architecture - Convolutional Layers
A convolutional layer plus a kernel forms the foundation of a CNN architecture.
In a traditional neural network, fully connected layers are used, where each node is connected to every node in the immediately previous layer.
A convolutional layer, however, is locally connected: its nodes are only connected to a small subset of the previous layer, and they share the same weights (parameter sharing).
The processes of training, backpropagation, and the forward pass are very similar to traditional neural networks.
Tuning CNNs is a little trickier, as they have several more hyperparameters:
Filter size
Stride/Padding
Pooling layers
CNN Architecture - Convolutional Layers
Start with an initial image of size h × w (3 color channels).
Create a 5×5 filter (kernel) and slide it across the image using a specified stride size. We also need to take padding into consideration!
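A quick sketch of the standard output-size arithmetic for a convolutional layer (the function name is my own):

```python
# Output size of a convolution for a given filter, stride, and padding.
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    return (input_size - kernel_size + 2 * padding) // stride + 1

# A 224x224 image with a 5x5 filter, stride 1, and padding 2 keeps its size:
print(conv_output_size(224, kernel_size=5, stride=1, padding=2))  # 224
```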
CNN Architecture - Pooling Layers
The purpose of a pooling layer (or down-sampling layer) is to reduce the size of the layer, thus reducing the number of trainable parameters.
There are several parameters to set
when adding a pooling layer:
- Type (MaxPooling, AvgPooling, GlobalMax, GlobalAverage)
- Pooling size
- Stride
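A minimal max-pooling sketch in NumPy (the 2×2 window with stride 2 is an illustrative choice, and the input is assumed to divide evenly):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # Slide a window over the input and keep the maximum of each window.
    h, w = x.shape
    out = np.empty((h // stride, w // stride))
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            out[i // stride, j // stride] = x[i:i + size, j:j + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x))  # 4x4 -> 2x2, keeping the max of each 2x2 window
```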
CNN Architecture - Fully Connected Layer
The final layer in the CNN architecture is a fully connected layer.
This takes ALL the outputs of the previous layer (which may be a pooling or convolutional layer) and outputs an N-dimensional vector, where N represents the number of classes.
For a multi-class classification problem, a softmax activation function is used in the final layer, whereas for a binary classification problem, a sigmoid activation function is used.
When defining the network, depending on the problem and dataset, an appropriate loss function will need to be chosen, e.g. categorical cross-entropy.
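Putting the pieces together, a minimal Keras sketch (assuming TensorFlow 2.x; the layer sizes and input shape are illustrative) of a conv → pool → fully connected architecture with a softmax output trained on categorical cross-entropy:

```python
import tensorflow as tf

num_classes = 10
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (5, 5), activation="relu", padding="same",
                           input_shape=(224, 224, 3)),   # convolutional layer
    tf.keras.layers.MaxPooling2D((2, 2)),                # pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # N classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

For a binary problem, the final layer would instead be `Dense(1, activation="sigmoid")` with a binary cross-entropy loss.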
CNN Architecture - Activation Function
After a convolutional layer, an activation function is applied; typically the ReLU function is used (but not for the final layer*).
The purpose of the activation function is to introduce non-linearity into the process.
The ReLU activation function is favored over others such as tanh/sigmoid as it reduces training time and mitigates the vanishing gradient problem (improving gradient descent), but it does have its own problems, so use it wisely!
CNN Architecture - SoftMax Activation Function
Given a sample input vector $\mathbf{x}$ and weight vectors $\{\mathbf{w}_j\}$, the predicted probability of $y = j$ is

$P(y = j \mid \mathbf{x}) = \dfrac{e^{\mathbf{x}^\top \mathbf{w}_j}}{\sum_{k=1}^{K} e^{\mathbf{x}^\top \mathbf{w}_k}}$

A type of activation layer, usually applied to the outputs of the final FC layer
Can be viewed as a normalizer (a.k.a. the normalized exponential function)
Produces a discrete probability distribution vector
Very convenient when combined with cross-entropy loss
In practice, when building multi-class classifiers, this is used as the last output layer.
(Other final layers do exist, e.g. SVM or regression layers.)
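A minimal sketch of the normalized exponential in NumPy (subtracting the max is a standard numerical-stability trick, not something from the slides):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # stability: exp of large logits would overflow
    e = np.exp(z)
    return e / e.sum()          # normalize into a probability distribution

print(softmax(np.array([2.0, 1.0, 0.1])))  # a discrete probability vector
```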
CNN Architecture - Loss // Regularization Layers
Loss functions:
- L1, L2 loss
- Cross-entropy loss works well for classification
- Huber loss is more resilient to outliers, with a smooth gradient
- Mean squared error works well for regression tasks
Regularization layers:
- Dropout
- Batch norm
- Gradient clipping
- Max norm constraint
CNN Architecture - Dropout Layer
Whilst not always required, the dropout layer helps reduce the common problem of overfitting and improves generalization.
The dropout layer simply 'drops' a random set of activations in the preceding layer by setting their values to 0.
The aim is to force the network to produce the correct classification even though some of the network is 'deactivated', thus reducing the chance of overfitting to the original data.
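A sketch of 'inverted' dropout in NumPy (the rate and the rescaling convention are the common ones, labeled here as assumptions):

```python
import numpy as np

def dropout(activations, rate=0.5, rng=np.random.default_rng()):
    # Randomly 'deactivate' a set of units by zeroing them, then rescale the
    # survivors so the expected activation is unchanged at inference time.
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)
```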
CNN Architecture - Batch Normalization
Networks train faster - individual iterations will be slower, but convergence will be quicker
Allows higher learning rates - larger learning rates mean faster training time (careful experimentation is required!)
Makes weights easier to initialize - less effort is required on weight initialization, though it is still recommended to draw initial weights from some distribution
Makes more activation functions viable - used with ReLU, it reduces the problems the nonlinearity can cause
Provides some regularization - this reduces the amount of dropout required in the architecture
Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift, Ioffe and Szegedy (2015)
http://proceedings.mlr.press/v37/ioffe15.pdf
CNN Architecture
Training CNNs
Training CNNs is very similar to training any other neural network:
Perform a forward pass across all nodes
Then update the weights during the backward pass
The aim is to obtain the best weights, which can be considered the point with the lowest validation loss.
Hyperparameters play an important part in obtaining decent accuracy
Tuning HPs such as learning rate, batch size, filter size, etc. needs to reflect the training data and the task being attempted
(Deeper) architectures + hyperparameters > training speed*
*lots of (GPU) compute resources are helpful :)
Training CNNs – Data Augmentation
The size and quality of the data play an important part in the performance of a CNN. However, bigger doesn't always mean better!
One technique found to boost performance is to augment the original data sources to create a larger dataset (see the sketch below).
Additionally, there are several reference datasets which can be used to help train a model (and are extremely useful for transfer learning):
MNIST
CIFAR-10 / CIFAR-100
ImageNet
Caltech 101 / Caltech 256
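A minimal augmentation sketch in NumPy (the specific transforms are illustrative, the image is assumed square with pixel values in [0, 1]; real pipelines also shift, zoom, and adjust color):

```python
import numpy as np

def augment(image, rng=np.random.default_rng()):
    # Each transformed variant becomes an extra training example.
    out = [image]
    out.append(np.fliplr(image))                        # horizontal flip
    out.append(np.rot90(image, k=rng.integers(1, 4)))   # random 90° rotation
    noisy = image + rng.normal(0, 0.05, image.shape)    # pixel noise
    out.append(np.clip(noisy, 0.0, 1.0))
    return out
```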
Transfer Learning
1. “Forward” transfer: train on one
task, transfer to a new task
2. Multi-task transfer: train on many
tasks, transfer to a new task
3. Multi-task meta-learning: learn to
learn from many tasks
A Survey on Transfer Learning:
https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf
Transfer Learning Strategies
Transfer Learning in Practice
1. Take an existing trained model (preferably in the same domain).
2. (If manually creating a model) Fine-tune the existing model: this means training some of the layers within the architecture and freezing others.
Small dataset and lots of params - freeze more layers to avoid overfitting
Large dataset and fewer params - unfreeze more layers, as overfitting will be less of an issue
3. Depending on the type of the model and its purpose (CNN, LSTM), we can remove the last layer within the network (e.g. softmax) and replace it with an appropriate layer which matches our purpose. (A code sketch of these steps follows the table of approaches below.)
Transfer learning approaches:
Instance-Transfer - re-weight some labelled data in the source domain for use in the target domain
Feature-Representation-Transfer - find a "good" feature representation that reduces the difference between the source and target domains and the error of classification and regression models
Parameter-Transfer - discover shared parameters or priors between the source-domain and target-domain models, which can benefit transfer learning
Relational-Knowledge-Transfer - build a mapping of relational knowledge between the source and target domains. Both domains are relational, and the i.i.d. assumption is relaxed in each domain
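A sketch of steps 1-3 in code (assuming TensorFlow 2.x / Keras; the choice of ResNet50 and the target class count are illustrative):

```python
import tensorflow as tf

# 1. Take an existing trained model (here, ResNet50 trained on ImageNet).
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))

# 2. Freeze its layers: with a small dataset and lots of parameters,
#    freezing more layers helps avoid overfitting.
base.trainable = False

# 3. Replace the final softmax with a new head matching our own classes.
num_classes = 5   # illustrative target task
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # new head
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```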
Negative Transfer
Depending on the domain and task, transfer learning may not be the appropriate method for developing a predictive model.
The concept of negative transfer describes reduced learning capacity in the target domain due to a lack of transferability between the source domain and the target task.
Rosenstein et al. [1] discussed the challenges in transfer learning and the limits/bounds of task transferability.
One approach suggested [2] to overcome negative transfer is to cluster types of tasks into groups (which share a low-dimensional representation).
[1] M. T. Rosenstein, Z. Marx, and L. P. Kaelbling, "To transfer or not to transfer," in NIPS-05 Workshop on Inductive Transfer: 10 Years Later, December 2005.
[2] B. Bakker and T. Heskes, "Task clustering and gating for Bayesian multitask learning," Journal of Machine Learning Research, vol. 4, pp. 83-99, 2003.
[Figure: transfer learning from a source domain to a target domain. Task: predict car model. Inference confidence: { 'car': 0.1, 'letter_a': 0.6 }]
Deeper Dive into the techniques used to compare models
Measuring Performance
How to measure performance of models
Understanding how to measure the performance of a model (or how research papers rank CV models) is essential for comparing how different models perform on a given dataset/challenge.
Several metrics are used to measure the performance of a model, depending on the type of model (object detection, boundary detection).
Many reference datasets exist; most research papers use these as benchmarks for comparison (ImageNet, COCO, CIFAR-10, MNIST).
The PASCAL Visual Object Classes (VOC) Challenge http://homepages.inf.ed.ac.uk/ckiw/postscript/ijcv_voc09.pdf
Key metrics: precision, recall, AUC/ROC, mAP, IoU
Model Performance – Precision + Recall
Precision measures how accurate your predictions are, i.e. the percentage of your predictions that are correct.
Recall measures how well you find all the positives. For example, we may find 80% of the possible positive cases in our top K predictions.
The F1 score is the harmonic mean of precision and recall. Note: it doesn't take true negatives into account.
Image: scikit-learn classification metrics
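For example, with scikit-learn (the labels are made up):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```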
Model Performance – IoU (Intersection over union)
IoU is a metric which represents the overlap between two boundaries, comparing the predicted boundary region against the ground truth.
If the predicted bounding box p(x1, y1, x2, y2) were equal to the ground truth g(x1, y1, x2, y2), the IoU score would be 1.
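A sketch of the computation for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    # Intersection rectangle (empty if the boxes are disjoint).
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)   # intersection over union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```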
Model Performance – Average Precision (AP)
AP is a more complex metric which combines precision, recall, IoU, and some simple integrals.
It's commonly used for object-detection computer vision models, as it provides a measure of how well a model predicts classes of objects based on the ranking of prediction confidences.
Often there will be metrics such as AP50 and AP75, which represent the AP when the IoU is at least 50%, 75%, etc.
The AP summarises the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]:

$\mathrm{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\mathrm{interp}}(r)$

The precision at each recall level $r$ is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds $r$:

$p_{\mathrm{interp}}(r) = \max_{\tilde{r} : \tilde{r} \geq r} p(\tilde{r})$

where $p(\tilde{r})$ is the measured precision at recall $\tilde{r}$.
Model Performance – Average Precision (AP)
To calculate AP, first we generate all our predictions and then rank them in descending order of confidence score.
A prediction with IoU > 0.5 counts as a correct classification; IoU is the metric behind the correctness decision.
In this example there are only 5 objects to be detected. We then calculate precision and recall row by row.
Row 4:
Precision (TP / (TP + FP)) → 2/4 = 0.5
Recall (TP / (TP + FN)) → 2/5 = 0.4
Note: as the confidence score decreases, the recall increases, but the precision fluctuates up and down.
Rank  Conf  Correct?  Precision  Recall
1     0.99  True      1.0        0.2
2     0.97  True      1.0        0.4
3     0.80  False     0.67       0.4
4     0.78  False     0.5        0.4
5     0.76  False     0.4        0.4
6     0.75  True      0.5        0.6
7     0.75  True      0.57       0.8
8     0.74  False     0.5        0.8
9     0.71  False     0.44       0.8
10    0.70  True      0.5        1.0
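A sketch of the 11-point interpolated AP for exactly this table (assuming NumPy; `correct` follows the True/False column, and there are 5 objects in total):

```python
import numpy as np

correct = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 1])   # ranked by confidence
n_objects = 5
tp = np.cumsum(correct)
precision = tp / np.arange(1, len(correct) + 1)      # matches the table
recall = tp / n_objects

ap = 0.0
for r in np.linspace(0, 1, 11):                      # levels 0, 0.1, ..., 1
    candidates = precision[recall >= r]              # max precision at recall >= r
    ap += (candidates.max() if candidates.size else 0.0) / 11
print(round(ap, 3))
```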
Model Performance – Average Precision (AP)
The zig-zag effect can be seen more clearly using a precision-recall plot. At this point we can examine the integral for calculating the AP, which results in a single numerical value.
[Figure: precision-recall plot of the ranked predictions, with precision and recall both running from 0.0 to 1.0]
For the mathematicians: we can either smooth the curve or fit a polynomial to it for use in calculating the integral of the PR curve.
Model Performance – Average Precision (AP)
One approach to calculating the integral of the PR curve is to use the maximum precision at each 'step', which makes it less susceptible to small variations in the rankings.
The definition replacing the precision value at recall $\tilde{r}$ with the maximum precision is:

$p_{\mathrm{interp}}(r) = \max_{\tilde{r} : \tilde{r} \geq r} p(\tilde{r})$
Model Performance – COCO mAP (Mean Average Precision)
As the COCO dataset has become a gold-standard reference dataset for CV, under the COCO evaluation the AP is averaged over multiple IoU thresholds.
The mAP is the average of AP. In some contexts this means computing the AP for each class and averaging them; in other contexts, AP and mAP are the same thing.
For example, under the COCO context, there is no difference between AP and mAP.
Advancements in Computer Vision
Development of Network Architectures
Computer Vision - AlexNet
In 2012, the first CNN with an acceptable level of accuracy was published.
AlexNet was trained on the now-popular ImageNet dataset and achieved a top-5 test error of 15.4% (2012 ILSVRC); the next best entry had >25% test error.
Due to the computational complexity, the team split the processing across two GPU pipelines.
ImageNet Classification with Deep Convolutional Neural Networks:
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Computer Vision - VGG
In 2014, the Very Deep Convolutional Network (VGG-19) was produced.
A 19-layer CNN with small (3×3) filters compared to AlexNet.
Used 3 back-to-back convolutional layers before pooling.
First to demonstrate the use of very deep layers.
Achieved a top-5 test error of 7.3%.
Very Deep Convolutional Networks for Large-Scale Image Recognition
https://arxiv.org/pdf/1409.1556.pdf
Computer Vision - GoogleNet
Google announced GoogLeNet in 2015, which tackled the problem of the huge computational cost needed to train a CNN.
The architecture introduced a method of reducing the number of features (and thus trainable parameters) using 1×1 convolutional layers and running parallel convolutions (the Inception module).
Demonstrated that stacking is not the only approach to developing CNNs.
This achieved a very reasonable top-5 test error of 6.7%.
Later revisions of the Inception model introduced batch normalization as a layer to improve performance and reduce training time.
Going Deeper with Convolutions
https://arxiv.org/abs/1409.4842
Computer Vision - ResNet
In 2015, Microsoft produced ResNet, a 152-layer network.
The basis of the Residual Network is the residual block, where the output of a conv-relu-conv cycle is added to the original input.
Each block therefore learns a small change to its input, rather than forming a completely new representation of the image.
These changes feed into the next block, which also improves training during the back-propagation stage.
This architecture achieved a top-5 test error of 3.6% (humans are usually in the range of 5-10%). A residual block is sketched in code below.
Deep Residual Learning for Image Recognition
https://arxiv.org/abs/1512.03385
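A sketch of a residual block in Keras (assuming TensorFlow 2.x; the filter count and input shape are illustrative):

```python
import tensorflow as tf

def residual_block(x, filters=64):
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])   # add the small learned change
    return tf.keras.layers.ReLU()(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```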
Computer Vision – Region-CNNs
Region-based CNNs can be considered one of the recent advancements in the field of computer vision. R-CNNs aim to solve object-detection tasks.
Using the fundamentals of CNNs, regions which correspond to objects within an image can be detected, and bounding boxes can be drawn.
Search for Mask R-CNN or Fast/Faster R-CNN for more info on the applications and architectures.
YOLO (You Only Look Once)
YOLO is a network for object detection.
Compared to existing R-CNN approaches, which use a pipeline approach, YOLO uses a single NN to perform object detection (as a single regression problem).
Its speed allows for real-time processing of images → video!
The network uses a ResNet-style architecture, and various flavors exist to suit different computational needs.
The most recent YOLOv3 uses 75 convolutional layers, with no fully connected layers, no pooling, and no softmax layer.
https://arxiv.org/pdf/1506.02640.pdf
Single Shot MultiBox Detector (SSD)
SSD only requires a single shot to detect multiple objects within an image. This means only one forward pass, whereas other region-based models require multiple shots.
The single pass means SSD is great for object detection in video!
For each region, k bounding boxes are identified, with different sizes and aspect ratios. For each box, c class scores are computed along with 4 offsets relative to the original default bounding-box shape (hence "MultiBox").
The architecture is built on VGG-16 and, for smaller objects, achieves a higher level of accuracy compared to YOLO.
https://arxiv.org/pdf/1512.02325.pdf
https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/
Deeper Dive into the Technology
Machine Learning @ AWS
The AWS ML Stack
Broadest and most complete set of Machine Learning capabilities
AI SERVICES (vision, speech, text, search, chatbots, personalization, forecasting, fraud, development, contact centers):
Amazon Rekognition, Amazon Polly, Amazon Transcribe (+Medical), Amazon Comprehend (+Medical), Amazon Translate, Amazon Lex, Amazon Personalize, Amazon Forecast, Amazon Fraud Detector, Amazon CodeGuru, Amazon Textract, Amazon Kendra, Contact Lens for Amazon Connect
ML SERVICES (Amazon SageMaker, with the SageMaker Studio IDE):
Ground Truth, Augmented AI, ML Marketplace, Neo, built-in algorithms, Notebooks, Experiments, model training & tuning, Debugger, Autopilot, model hosting, Model Monitor
ML FRAMEWORKS & INFRASTRUCTURE:
Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Inferentia, FPGA
Amazon SageMaker helps you build, train, and deploy models
Prepare: collect and prepare training data - fully managed data processing jobs and data labeling workflows; add human review of predictions
Build: choose or build an ML algorithm - one-click collaborative notebooks; built-in, high-performance algorithms and models; automatically build and train models; web-based IDE for machine learning
Train & Tune: set up and manage environments for training - one-click training; debugging and optimization; visually track and compare experiments; manage training runs
Deploy & Manage: deploy models in production - one-click deployment and autoscaling; monitor models and automatically spot concept drift; fully managed with auto-scaling for 75% less
AMAZON SAGEMAKER IS FULLY MANAGED
One-click model deployment
Auto-scaling
Python SDK
Bring your own model
Low latency and high throughput
Deploy multiple models on an endpoint
Amazon SageMaker Notebooks
Fast-start sharable notebooks (in preview)
Access your notebooks in seconds
Start your notebooks without spinning up compute resources
Dial compute resources up or down (coming soon)
Share notebooks with a single click
Administrators manage access and permissions
AMAZON SAGEMAKER IS THE BEST PLACE TO RUN TENSORFLOW
• Fully managed training and hosting
• Near-linear scaling across 100s of GPUs
• 3x faster network throughput with EC2 P3
Scaling efficiency: 65% with stock TensorFlow vs. 90% with AWS-optimized TensorFlow
AMAZON SAGEMAKER HAS BUILT-IN ALGORITHMS, OR BRING YOUR OWN
Classification: Linear Learner, XGBoost, KNN
Regression: Linear Learner, XGBoost, KNN
Computer Vision: Image Classification, Object Detection, Semantic Segmentation
Working with Text: BlazingText (supervised, unsupervised), Object2Vec
Topic Modeling: LDA, NTM
Recommendation: Factorization Machines
Forecasting: DeepAR
Anomaly Detection: Random Cut Forests, IP Insights
Sequence Translation: Seq2Seq
Clustering: KMeans
Feature Reduction: PCA
GluonCV: Deep Learning Toolkit for Computer Vision
GluonCV is an open-source deep learning toolkit for quickly building computer vision models without compromising performance.
Benefits:
• Training with SOTA results from the latest papers
• Large set of pre-trained models
• Carefully designed APIs and easy-to-understand implementations
• Community support
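A usage sketch (assuming gluoncv and mxnet are installed; the model name and image path are illustrative):

```python
from gluoncv import model_zoo, data

# Pull a pre-trained detector from the model zoo and run one forward pass.
net = model_zoo.get_model("yolo3_darknet53_coco", pretrained=True)
x, img = data.transforms.presets.yolo.load_test("street.jpg")
class_ids, scores, boxes = net(x)   # object detections for the image
```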
AWS Mask R-CNN Example
https://github.com/aws-samples/mask-rcnn-tensorflow
The primary focus was on increasing training throughput without sacrificing any accuracy. We do this by training with a batch size > 1 per GPU, using FP16 and two custom TF ops.
Dataset: COCO 2017
Pre-trained model: ResNet-50
EC2 instance type: p3dn.24xlarge
Num_GPUs x Images_Per_GPU   Training time   Box mAP   Mask mAP
8x4                         9.78h           38.25%    35.08%
16x4                        5.60h           38.44%    35.18%
32x4                        3.33h           38.33%    35.12%
AWS MARKETPLACE
You can shop for algorithms, models, and data in AWS Marketplace:
Browse or search AWS Marketplace → Subscribe in a single click → Available in Amazon SageMaker
HUNDREDS OF ALGORITHMS, MODELS, AND DATA
Categories include: natural language processing, text-to-speech, object detection, speech recognition, grammar and parsing, text generation, speaker identification, regression, text OCR, text classification, text clustering, computer vision, 3D images, handwriting recognition, named entity recognition, anomaly detection, ranking, video classification
SELLERS: automatic labeling via machine learning, IP protection, automated billing and metering
BUYERS: broad selection of paid, free, and open-source algorithms and models; data protection; discoverable on your AWS bill
60+ Computer Vision Models and Algorithms
Example Use Case
Using SageMaker for Detecting
False Insurance Claims Images
Detecting False Claims
Global car insurance organizations receive tens of thousands of claims per day, which require significant human resources to review, investigate, and approve.
The use of computer vision can help reduce the overhead on the claims team by providing an automated mechanism for detecting potentially false or spam insurance claims.
In this session we're going to build a custom solution which uses computer vision models to detect cars and damage on cars.
Detecting False Claims: Solution Architecture
Detecting False Claims: Inferencing
Custom Trained Model
Detecting False Claims: Using SageMaker
Amazon SageMaker is the first step in producing a custom image classification model.
At this stage, data preparation and exploration are performed to ensure the initial data used for training a model is suitable.
Model training is an iterative process, and the first model will be supported by a cleansed dataset.
Once model performance is acceptable, the model can be deployed, and the code then wrapped up for deployment on Kubernetes. A sketch of launching a training job follows.
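A sketch of launching a training job for the built-in Image Classification algorithm with the SageMaker Python SDK v2 (the role ARN, bucket paths, and hyperparameter values are placeholders):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
# Look up the container for the built-in image-classification algorithm.
image = image_uris.retrieve("image-classification", session.boto_region_name)

estimator = Estimator(image_uri=image,
                      role="arn:aws:iam::123456789012:role/SageMakerRole",
                      instance_count=1,
                      instance_type="ml.p3.2xlarge",
                      output_path="s3://my-bucket/output")
estimator.set_hyperparameters(num_classes=2, num_training_samples=1000,
                              epochs=10)
# Each channel maps to an S3 prefix holding the prepared training data.
estimator.fit({"train": "s3://my-bucket/train",
               "validation": "s3://my-bucket/validation"})
```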
Detecting False Claims: Using SageMaker
Car Image Detection Workflow
Demo
Wrap up…phew!
We’ve covered A LOT of content:
Neural Networks for CV
Architecture Advances
Performance Measuring
AWS Services
Demos of using SageMaker for Image Classification
…Hopefully you can take something from this and go explore!
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Build computer vision models to perform object detection and classification with AWS

  • 10. 11© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Convolutional Layers A convolutional layer plus a kernel forms the foundation of a CNN architecture. In a traditional Neural Network, Fully Connected layers are used, where each node is connected to every node in the immediately previous layer. A convolutional layer, however, is locally connected: its nodes are connected only to a small subset of the previous layer, and they share the same weights. The process of training, backpropagation and the forward pass is very similar to traditional neural networks. Tuning CNNs is a little trickier as they have several more hyperparameters: Filter size, Stride/Padding, Pooling Layers
  • 11. 12© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Convolutional Layers Start with an initial image of size h×w (3 color channels). Create a 5x5 filter (kernel) and slide it across the image using a specified stride size. We also need to take padding into consideration! The sketch below shows how these interact.
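As a quick illustration of how filter size, stride, and padding determine the output size (a minimal sketch in plain Python; the 224x224 input is an arbitrary example value):

    def conv_output_size(h, w, kernel, stride=1, padding=0):
        """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
        out_h = (h - kernel + 2 * padding) // stride + 1
        out_w = (w - kernel + 2 * padding) // stride + 1
        return out_h, out_w

    # A 5x5 filter over a 224x224 image, stride 1, no padding:
    print(conv_output_size(224, 224, kernel=5))             # (220, 220)
    # 'Same' padding of 2 preserves the spatial size:
    print(conv_output_size(224, 224, kernel=5, padding=2))  # (224, 224)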
  • 12. 13© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Pooling Layers The purpose of a pooling layer (or down-sampling layer) is to reduce the size of the layer, and thus the number of trainable parameters in the layers that follow. There are several parameters to set when adding a pooling layer: - Type (MaxPooling, AvgPooling, GlobalMax, GlobalAverage) - Pooling size - Stride
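A minimal NumPy sketch of the most common configuration, 2x2 max pooling with stride 2, on a single-channel feature map:

    import numpy as np

    def max_pool_2x2(x):
        """2x2 max pooling with stride 2; x is an (H, W) feature map with even dims."""
        h, w = x.shape
        # Split into 2x2 blocks and keep the maximum of each block
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)
    print(max_pool_2x2(x))  # [[6. 8.] [3. 4.]] -- a quarter of the original size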
  • 13. 14© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Fully Connected Layer The final layer in the CNN architecture is a fully connected layer. This takes ALL the outputs of the previous layer (which may be a pooling or convolutional layer), and outputs an N-dimensional layer, where N represents the number of classes. For a multi-class classification problem, a soft-max activation function is used in the final layer, whereas for a binary classification problem, a sigmoid activation function is used. When defining the network, depending on the problem and dataset, the type of loss function will also need to be defined, e.g. Categorical Cross Entropy.
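To make the layer ordering concrete, here is a minimal Keras sketch (TensorFlow 2.x assumed; the 32x32 RGB input and 10 classes are placeholder values) ending in a fully connected soft-max layer trained with categorical cross-entropy:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, kernel_size=3, activation="relu", input_shape=(32, 32, 3)),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),                        # flatten the feature maps for the FC layer
        layers.Dense(10, activation="softmax"),  # N = 10 classes
    ])
    # Categorical cross-entropy pairs naturally with a soft-max output
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.summary()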
  • 14. 15© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Activation Function After a convolutional layer, an activation function is applied; typically the ReLU function is used (but not for the final layer*). The purpose of the activation function is to introduce non-linearity into the process. The ReLU activation function is favored over others such as tanh/sigmoid as it reduces training time and mitigates the vanishing gradient problem (it improves gradient descent) – but it does have its problems, so use wisely!
  • 15. 16© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - SoftMax Activation Function Given a sample input vector x and weight vectors {w_i}, the predicted probability of y = j is P(y = j | x) = exp(xᵀw_j) / Σ_k exp(xᵀw_k). A type of activation layer, usually applied to the outputs of the final FC layer. Can be viewed as a normalizer (a.k.a. the normalized exponential function). Produces a discrete probability distribution vector. Very convenient when combined with cross-entropy loss. In practice, when building multi-class classifiers, this is used as the last output layer. (Other output layers do exist: SVM, regression layers.)
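A NumPy sketch of the soft-max itself; subtracting the maximum logit before exponentiating is the standard numerical-stability trick and does not change the result:

    import numpy as np

    def softmax(z):
        """Normalized exponential: maps logits to a discrete probability distribution."""
        z = z - np.max(z)   # soft-max is shift-invariant, so this avoids overflow
        e = np.exp(z)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; the largest logit wins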
  • 16. 17© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Loss // Regularization Layers Loss functions: L1, L2 loss - Cross-Entropy loss works well for classification - Huber Loss is more resilient to outliers and has a smooth gradient - Mean Squared Error works well for regression tasks. Regularization Layers: - Dropout - Batch norm - Gradient clipping - Max norm constraint
  • 17. 18© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Dropout Layer Whilst not always required, the Dropout Layer helps reduce the common problem of overfitting and improves generalization. The dropout layer simply 'drops' a random set of activations in the preceding layer by setting their values to 0. The aim here is to force the network to produce the correct classification even though some of the network is 'deactivated', which reduces the chances of overfitting to the original data.
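A sketch of 'inverted' dropout as applied at training time (the keep probability of 0.5 is an arbitrary example); rescaling by the keep probability means nothing special is needed at inference:

    import numpy as np

    def dropout(activations, keep_prob=0.5, training=True):
        """Randomly zero activations during training; identity at inference."""
        if not training:
            return activations
        mask = np.random.rand(*activations.shape) < keep_prob
        return activations * mask / keep_prob  # rescale so the expected value is unchanged

    a = np.ones((2, 4))
    print(dropout(a))  # roughly half the entries zeroed, survivors scaled up to 2.0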
  • 18. 19© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture - Batch Normalization Networks train faster - Iterations will be slower, however convergence will be quicker. Allows higher learning rates – Larger learning rates mean faster training (careful experimentation is required!). Makes weights easier to initialize – Less effort is required on the initialization of weights, but it is still recommended to use some form of distribution to set them. Makes more activation functions viable - Used with ReLU, it reduces issues with the nonlinearities. Provides some regularization – This reduces the amount of dropout required in the architecture. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe and Szegedy (2015) http://proceedings.mlr.press/v37/ioffe15.pdf
  • 19. 20© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | CNN Architecture
  • 20. 21© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Training CNNs Training CNNs is very similar to training any other Neural Network: Perform a forward pass across all nodes. Then update the weights during the backward pass. The aim is to obtain the best weights, which can be taken as those at the point of lowest validation loss. Hyperparameters play an important part in obtaining a decent accuracy. Tuning HPs such as learning rate, batch size, filter size, etc. needs to reflect the training data and the task to be achieved. (Deeper) Architectures + Hyperparameters > Training Speed* *lots of (GPU) compute resources are helpful :)
  • 21. 22© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Training CNNs – Data Augmentation The size and quality of the data play an important part in the performance of a CNN. However, bigger doesn't always mean better! One technique found to boost performance is to augment the original data sources to create a larger dataset (see the sketch below). Additionally, there are several reference datasets which can be used to help train a model (and are extremely useful for transfer learning): MNIST, CIFAR-10 / CIFAR-100, ImageNet, Caltech 101 / Caltech 256
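A hedged sketch of augmentation with Keras' ImageDataGenerator (TensorFlow 2.x assumed; the rotation, shift, and zoom ranges are illustrative rather than tuned):

    import numpy as np
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rotation_range=20,       # random rotations up to 20 degrees
        width_shift_range=0.1,   # random horizontal shifts
        height_shift_range=0.1,  # random vertical shifts
        horizontal_flip=True,    # mirror images left-to-right
        zoom_range=0.1,
    )

    x_train = np.random.rand(8, 64, 64, 3)  # stand-in for a real image batch
    for batch in datagen.flow(x_train, batch_size=8):
        break  # each iteration yields a freshly augmented batch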
  • 22. 23© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Transfer Learning 1. “Forward” transfer: train on one task, transfer to a new task 2. Multi-task transfer: train on many tasks, transfer to a new task 3. Multi-task meta-learning: learn to learn from many tasks A Survey on Transfer Learning: https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf
  • 23. 24© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Transfer Learning Strategies
  • 24. 25© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Transfer Learning in Practice 1. Take an existing trained model (preferably from the same domain). 2. (If manually creating a model) Fine-tune the existing model: train some of the layers within the architecture and freeze the others. Small dataset and lots of parameters – freeze more layers to avoid overfitting. Large dataset and fewer parameters – unfreeze more layers, as overfitting will be less of an issue. 3. Depending on the type and purpose of the model (CNN, LSTM), we can remove the last layer within the network (e.g. the SoftMax layer) and replace it with a layer that matches our task (see the sketch below). Transfer learning approaches: Instance-Transfer – re-weight some labelled data in the source domain for use in the target domain. Feature-Representation-Transfer – find a "good" feature representation that reduces the difference between the source and target domains and the error of classification and regression models. Parameter-Transfer – discover shared parameters or priors between the source-domain and target-domain models, which can benefit transfer learning. Relational-Knowledge-Transfer – build a mapping of relational knowledge between the source and target domains; both are relational domains, and the i.i.d. assumption is relaxed in each.
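A sketch of steps 1–3 in Keras, assuming an ImageNet-pretrained ResNet50 as the source model and a hypothetical 5-class target task: take the trained model, freeze the convolutional base, and replace the final soft-max with one matching the new task.

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import ResNet50

    # 1. Existing trained model, original classification layer removed (include_top=False)
    base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                    input_shape=(224, 224, 3))

    # 2. Small dataset, many parameters: freeze the whole base to avoid overfitting
    base.trainable = False

    # 3. Attach a new final layer matching our (hypothetical) 5-class task
    model = models.Sequential([
        base,
        layers.Dense(5, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])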
  • 25. 26© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Negative Transfer Depending on the domain and task, transfer learning may not be the appropriate method for developing a predictive model. The concept of Negative Transfer describes the process of reducing the learning capacity in the target domain due to a lack of transferability between the source domain and the task. Rosenstein et al. [1] discussed the challenges in transfer learning and the limits/bounds of task transferability. One approach suggested [2] to overcome negative transfer is to cluster types of tasks into groups (which share a low-dimensional representation). [1] M. T. Rosenstein, Z. Marx, and L. P. Kaelbling, "To transfer or not to transfer," in a NIPS-05 Workshop on Inductive Transfer: 10 Years Later, December 2005. [2] B. Bakker and T. Heskes, "Task clustering and gating for bayesian multitask learning," Journal of Machine Learning Research, vol. 4, pp. 83–99, 2003. [Illustration: a model transferred from an unrelated source domain to the task "predict car model" returns confidences such as { 'car': 0.1, 'letter_a': 0.6 } at inference.]
  • 26. 27© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | 27© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Deeper Dive into the techniques used to compare models Measuring Performance
  • 27. 28© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | How to measure performance of models Understanding how to measure the performance of a model (or how research papers rank CV models) is essential for comparing how different models perform on a given dataset/challenge. Several metrics are used to measure the performance of a model, depending on its type (object detection, boundary detection). Many reference datasets exist; most research papers use these as benchmarks for comparison (ImageNet, COCO, CIFAR10, MNIST). The PASCAL Visual Object Classes (VOC) Challenge http://homepages.inf.ed.ac.uk/ckiw/postscript/ijcv_voc09.pdf PRECISION! RECALL! AUC ROC! mAP! IoU!
  • 28. 29© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Model Performance – Precision + Recall Precision measures how accurate your predictions are, i.e. the percentage of your predictions that are correct. Recall measures how well you find all the positives; for example, we can find 80% of the possible positive cases in our top K predictions. The F1 Score is the harmonic mean of precision and recall. Note: it doesn't take into account True Negatives. (See the scikit-learn sketch below.)
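These metrics are one line each in scikit-learn (the binary labels below are toy values):

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

    print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
    print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/5 = 0.60
    print(f1_score(y_true, y_pred))         # harmonic mean of the two ≈ 0.67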
  • 29. 30© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Model Performance – IoU (Intersection over Union) IoU is a metric which represents the overlap between 2 boundaries, comparing the predicted boundary region against the ground truth. If the predicted bounding box p(x1,y1,x2,y2) were equal to the ground truth g(x1,y1,x2,y2), the IoU score would be 1.
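A straightforward implementation over (x1, y1, x2, y2) corner coordinates:

    def iou(box_a, box_b):
        """Intersection over Union of two (x1, y1, x2, y2) boxes."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
    print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes -> 1.0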
  • 30. 31© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Model Performance – Average Precision (AP) AP is a more complex metric which combines Precision + Recall, IoU and some simple integrals. It is commonly used for object detection computer vision models, as it provides a measure of how well a model predicts classes of objects, based on the ranking of the confidence of predictions. Often there will be metrics such as AP50 or AP75, which represent the AP when the IoU is at least 50%, 75%, etc. The AP summarises the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels {0, 0.1, ..., 1}: AP = (1/11) · Σ_{r ∈ {0, 0.1, ..., 1}} p_interp(r). The precision at each recall level r is interpolated by taking the maximum precision measured at any recall exceeding r: p_interp(r) = max_{r̃ : r̃ ≥ r} p(r̃), where p(r̃) is the measured precision at recall r̃.
  • 31. 32© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Model Performance – Average Precision (AP) To calculate AP, first we generate all our predictions and rank them in descending order based on their confidence score. A prediction counts as correct (a true positive) when its IoU with the ground truth exceeds 0.5. In this example, there are only 5 objects to be detected. We then calculate Precision and Recall row by row. Row 4: Precision (TP / (TP + FP)) = 2/4 = 0.5; Recall (TP / (TP + FN)) = 2/5 = 0.4. Note: as the confidence score decreases, the recall increases, but the precision fluctuates up and down.
Rank | Conf | Correct? | Precision | Recall
1 | 0.99 | True | 1.00 | 0.2
2 | 0.97 | True | 1.00 | 0.4
3 | 0.80 | False | 0.67 | 0.4
4 | 0.78 | False | 0.50 | 0.4
5 | 0.76 | False | 0.40 | 0.4
6 | 0.75 | True | 0.50 | 0.6
7 | 0.75 | True | 0.57 | 0.8
8 | 0.74 | False | 0.50 | 0.8
9 | 0.71 | False | 0.44 | 0.8
10 | 0.70 | True | 0.50 | 1.0
  • 32. 33© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Model Performance – Average Precision (AP) The zig-zag effect can be seen more clearly using a precision-recall plot. At this point we can examine the integral for calculating the AP, which results in a single numerical value. [Precision–recall plot of the table above: precision on the y-axis against recall on the x-axis.] For the mathematicians: we can either smooth the curve, or fit a polynomial to it for use in calculating the integral of the PR curve.
  • 33. 34© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Model Performance – Average Precision (AP) One approach to calculating the integral of the PR curve is to use the maximum precision at each 'step', which makes it less susceptible to smaller variations in the rankings. The definition replacing the precision value at recall r with the maximum precision is: p_interp(r) = max_{r̃ ≥ r} p(r̃). A sketch of the full 11-point calculation follows.
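Putting the two formulas together, a plain-Python sketch of 11-point interpolated AP, evaluated on the precision/recall pairs from the ranking table above:

    def eleven_point_ap(precisions, recalls):
        """PASCAL VOC 11-point AP: mean of the max precision at recall >= r."""
        ap = 0.0
        for r in (i / 10 for i in range(11)):  # r = 0.0, 0.1, ..., 1.0
            candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
            ap += max(candidates, default=0.0) / 11
        return ap

    precisions = [1.0, 1.0, 0.67, 0.5, 0.4, 0.5, 0.57, 0.5, 0.44, 0.5]
    recalls    = [0.2, 0.4, 0.4, 0.4, 0.4, 0.6, 0.8, 0.8, 0.8, 1.0]
    print(eleven_point_ap(precisions, recalls))  # ≈ 0.75 for this ranking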
  • 34. 35© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Model Performance – COCO mAP (Mean Average Precision) As the COCO dataset has become a gold-standard reference dataset for CV, note that under COCO, AP is averaged over multiple IoU thresholds. The mAP is the average of the AP. In some contexts this means computing the AP for each class and averaging them; in other contexts, AP and mAP are the same thing. For example, under the COCO context, there is no difference between AP and mAP.
  • 35. 36© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Advancements in Computer Vision Development of Network Architectures
  • 36. 37© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Computer Vision - AlexNet In 2012, the first CNN with an acceptable level of accuracy was published. AlexNet, trained on the now popular ImageNet dataset, achieved a top-5 test error of 15.4% (2012 ILSVRC); the next best entry had >25% test error. Due to the computational complexity, the team split the processing across two GPU pipelines. ImageNet Classification with Deep Convolutional Neural Networks: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
  • 37. 38© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Computer Vision - VGG In 2014, the Very Deep Convolutional Network (VGG-19) was produced: a 19-layer CNN with a small filter (3x3) compared to AlexNet. Used 3 back-to-back convolutional layers before pooling. First to demonstrate the use of very deep layers. Achieved a top-5 test error of 7.3%. Very deep convolutional networks for large-scale image recognition https://arxiv.org/pdf/1409.1556.pdf
  • 38. 39© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Computer Vision - GoogLeNet Google announced GoogLeNet in 2015, which tackled the problem of the huge computational cost needed to train a CNN. The architecture introduced a method of reducing the number of features (and thus trainable parameters) using 1x1 convolutional layers and running parallel convolutions (the Inception module). Demonstrated that stacking is not the only approach to developing CNNs. This achieved a very reasonable top-5 test error of 6.7%. Later revisions of the Inception model introduced batch normalization as a layer to improve performance and reduce training time. Going Deeper with Convolutions https://arxiv.org/abs/1409.4842
  • 39. 40© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Computer Vision - ResNet In 2015, Microsoft produced ResNet, a 152-layer network. The basis of the Residual Network is the Residual Block, where the output of a conv-relu-conv cycle is added back to the block's original input. Each block therefore only has to learn a small change (a residual) to its input, rather than forming a completely new representation of the image. These residuals feed the next block, and the shortcut connections also improve training during the back-propagation stage (see the sketch below). This architecture achieved a top-5 test error of 3.6% (humans are usually in the range of 5-10%). Deep Residual Learning for Image Recognition https://arxiv.org/abs/1512.03385
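A sketch of a basic identity residual block in Keras (assuming the input and output channel counts match, so the shortcut is a plain addition):

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters):
        """conv-relu-conv, then add the block's input back in (identity shortcut)."""
        shortcut = x
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.Add()([shortcut, y])  # learn a small residual, not a whole new representation
        return layers.ReLU()(y)

    inputs = tf.keras.Input(shape=(32, 32, 64))
    outputs = residual_block(inputs, filters=64)
    model = tf.keras.Model(inputs, outputs)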
  • 40. 41© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Computer Vision – Region-CNNs Region-based CNNs can be considered one of the recent advancements in the field of computer vision. R-CNNs aim to solve object detection tasks. Using the fundamentals of CNNs, regions which correspond to objects within an image can be detected, and bounding boxes drawn. Search for Mask R-CNN or Fast/Faster R-CNN for more info on the applications and architectures.
  • 41. 42© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | YOLO (You Only Look Once) YOLO is a network for object detection. Compared to existing R-CNN approaches, which use a pipelined approach, YOLO uses a single NN to perform object detection (framing it as a single regression problem). The speed allows for real-time processing of images, and therefore of video! The backbone uses ResNet-style residual connections, and various flavors exist to suit different computational needs. The most recent YOLOv3 uses 75 convolutional layers, with no fully connected layers, no pooling, and no SoftMax layer. https://arxiv.org/pdf/1506.02640.pdf
  • 42. 43© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Single Shot MultiBox Detector (SSD) SSD only requires a single shot to detect multiple objects within an image, i.e. only one forward pass, whereas region-based models require multiple shots. The single pass means SSD is great for object detection in video! For each region, k bounding boxes b are identified; these k bounding boxes have different sizes and aspect ratios. For each b, c class scores are computed along with 4 offsets relative to the original default bounding box shape (hence MultiBox). The architecture is built on VGG-16 and, for smaller objects, achieves a higher level of accuracy compared to YOLO. https://arxiv.org/pdf/1512.02325.pdf https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/
  • 43. 44© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | 44© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Deeper Dive into the Technology Machine Learning @ AWS
  • 44. 45© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | The AWS ML Stack Broadest and most complete set of Machine Learning capabilities. AI SERVICES (VISION, SPEECH, TEXT, SEARCH, CHATBOTS, PERSONALIZATION, FORECASTING, FRAUD, DEVELOPMENT, CONTACT CENTERS): Amazon Rekognition, Amazon Polly, Amazon Transcribe +Medical, Amazon Comprehend +Medical, Amazon Translate, Amazon Lex, Amazon Textract, Amazon Kendra, Amazon Personalize, Amazon Forecast, Amazon Fraud Detector, Amazon CodeGuru, Contact Lens For Amazon Connect. ML SERVICES (Amazon SageMaker): Ground Truth, Augmented AI, ML Marketplace, Neo, Built-in algorithms, Notebooks, Experiments, Model training & tuning, Debugger, Autopilot, Model hosting, Model Monitor, SageMaker Studio IDE. ML FRAMEWORKS & INFRASTRUCTURE: Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Inferentia, FPGA.
  • 45. 46© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Amazon SageMaker helps you build, train, and deploy models. Prepare: fully managed data processing jobs and data labeling workflows; collect and prepare training data. Build: one-click collaborative notebooks and built-in, high-performance algorithms and models; choose or build an ML algorithm; web-based IDE for machine learning; automatically build and train models. Train & Tune: one-click training; debugging and optimization; set up and manage environments for training; train, debug, and tune models; manage training runs; visually track and compare experiments. Deploy & Manage: one-click deployment and auto-scaling; deploy model in production; monitor models; automatically spot concept drift; add human review of predictions; fully managed with auto-scaling for 75% less.
  • 46. 47© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | AMAZON SAGEMAKER IS FULLY MANAGED One click model deployment Auto-scaling Python SDK Bring your own model Low latency and high throughput Deploy multiple models on an endpoint
  • 47. 48© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Amazon SageMaker Notebooks Access your notebooks in seconds Administrators manage access and permissions Share notebooks with a single click Dial up or down compute resources (Coming soon) Start your notebooks without spinning up compute resources Fast-start sharable notebooks (in preview)
  • 48. 49© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | AMAZON SAGEMAKER IS THE BEST PLACE TO RUN TENSORFLOW • Fully-managed training and hosting • Near-linear scaling across 100s of GPUs • 3x faster network throughput with EC2 P3 [Chart: scaling efficiency of 65% with stock TensorFlow vs 90% with AWS-optimized TensorFlow]
  • 49. 50© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | AMAZON SAGEMAKER HAS BUILT-IN ALGORITHMS OR BRING YOUR OWN Classification: Linear Learner, XGBoost, KNN. Regression: Linear Learner, XGBoost, KNN. Computer Vision: Image Classification, Object Detection, Semantic Segmentation. Working with Text: BlazingText (supervised and unsupervised), Object2Vec. Topic Modeling: LDA, NTM. Recommendation: Factorization Machines. Forecasting: DeepAR. Anomaly Detection: Random Cut Forests, IP Insights. Sequence Translation: Seq2Seq. Clustering: KMeans. Feature Reduction: PCA.
  • 50. 51© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | GluonCV: Deep Learning Toolkit for Computer Vision GluonCV – an open-source deep learning interface for quickly building computer vision models without compromising performance. Benefits: • Training scripts that reproduce SOTA results from the latest papers • A large set of pre-trained models • Carefully designed APIs, easy-to-understand implementations • Community support. A minimal usage sketch follows.
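A minimal GluonCV sketch (assuming GluonCV and MXNet are installed; the model-zoo name and the local image path "car.jpg" are illustrative) that loads a pre-trained COCO detector and runs inference on one image:

    from gluoncv import model_zoo, data, utils

    # Pre-trained YOLOv3 detector (Darknet-53 backbone) from the GluonCV model zoo
    net = model_zoo.get_model("yolo3_darknet53_coco", pretrained=True)

    # Load and transform a local test image (the path is a placeholder)
    x, img = data.transforms.presets.yolo.load_test("car.jpg", short=512)

    class_ids, scores, bboxes = net(x)
    utils.viz.plot_bbox(img, bboxes[0], scores[0], class_ids[0], class_names=net.classes)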
  • 51. 52© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | AWS Mask R-CNN Example https://github.com/aws-samples/mask-rcnn-tensorflow The primary focus was on increasing training throughput without sacrificing any accuracy. We do this by training with a batch size > 1 per GPU, using FP16 and two custom TF ops. Dataset: COCO 2017. Pre-trained model: ResNet-50. EC2 instance type: p3dn.24xlarge.
Num_GPUs x Images_Per_GPU | Training time | Box mAP | Mask mAP
8x4 | 9.78h | 38.25% | 35.08%
16x4 | 5.60h | 38.44% | 35.18%
32x4 | 3.33h | 38.33% | 35.12%
  • 52. 53© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | You can shop for algorithms, models, and data in AWS Marketplace AWS MARKETPLACE Browse or search AWS Marketplace Subscribe in a single click Available in Amazon SageMaker
  • 53. 54© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | HUNDREDS OF ALGORITHMS, MODELS, AND DATA Categories: Natural language processing, Text-to-speech, Object detection, Speech recognition, Grammar and parsing, Text generation, Speaker identification, Regression, Text OCR, Text classification, Text clustering, Computer vision, 3D images, Handwriting recognition, Named entity recognition, Anomaly detection, Ranking, Video classification. SELLERS: automatic labeling via machine learning, IP protection, automated billing and metering. BUYERS: broad selection of paid, free, and open-source algorithms and models; data protection; discoverable on your AWS bill.
  • 54. 55© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | 60+ Computer Vision Models and Algorithms
  • 55. 56© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | 56© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Example Use Case Using SageMaker for Detecting False Insurance Claim Images
  • 56. 57© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved |
  • 57. 58© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved | Detecting False Claims Global car insurance organizations receive tens of thousands of claims per day, which require significant human resources to review, investigate, and approve. The use of computer vision can help reduce the overhead on the claims team by providing an automated mechanism for detecting potentially false or spam insurance claims. In this session we're going to build a custom solution which uses computer vision models to detect cars and damage on cars.
  • 58. 59© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Detecting False Claims: Solution Architecture
  • 59. 60© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Detecting False Claims: Inferencing Custom Trained Model
  • 60. 61© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Detecting False Claims: Using SageMaker Amazon SageMaker is the first step in producing a custom Image Classification Model. At this stage, data preparation and exploration are performed to ensure the initial data used for training a model is suitable. Model training is an iterative process, and the first model will be supported by a cleansed dataset. Once model performance is acceptable, the model can be deployed, and the code can then be wrapped up for deployment on Kubernetes. A sketch of the training and deployment steps follows.
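A hedged sketch of that flow with the SageMaker Python SDK (v2) and the built-in image classification algorithm; the bucket paths, instance types, and hyperparameter values below are placeholders, and exact API details may vary by SDK version:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()

    # Container image for the built-in image classification algorithm in this region
    container = image_uris.retrieve("image-classification", session.boto_region_name)

    estimator = Estimator(
        container, role,
        instance_count=1, instance_type="ml.p3.2xlarge",
        output_path="s3://my-bucket/claims-model/output",  # placeholder bucket
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(num_classes=2, num_training_samples=10000, epochs=10)
    estimator.fit({"train": "s3://my-bucket/claims/train",            # placeholder channels
                   "validation": "s3://my-bucket/claims/validation"})

    # Deploy the trained model behind a managed HTTPS endpoint
    predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")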
  • 61. 62© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Detecting False Claims: Using SageMaker
  • 62. 63© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | 63© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Car Image Detection Workflow Demo
  • 63. 65© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved | Wrap up…phew! We’ve covered A LOT of content: Neural Networks for CV Architecture Advances Performance Measuring AWS Services Demos of using SageMaker for Image Classification …Hopefully you can take something from this and go explore!