DroidCon Cluj 2018 - Hands on machine learning on android

Outline
● Why machine learning on Android?
● Mostly:
○ Some insights about Object Detection algorithms
○ Practical example in Tensorflow
○ Data gathering and labeling
○ Model training
● Hopefully:
○ It will inspire you to dig deeper
○ It won’t confuse you too much :)

Machine learning
Why machine learning on Android?

● Object detection
○ is a very common Computer Vision problem
○ identifies the objects in the image and provides their precise location

● Why is it useful?
○ StreetView, e.g. face blurring
○ self-driving cars, e.g. pedestrian detection

● Object detection: impact of deep learning
○ deep convnets significantly improved both accuracy and processing time
● Why on Android?
○ we live in an era in which mobile has taken over
○ running on mobile makes it possible to deliver interactive, real-time applications
○ the latest phones have great computing power

Machine learning
Some insights about Object Detection

Intuition about the convolution
Input image * Convolution kernel (weights) = Feature map
● the kernel weights are the network’s parameters
● the feature map is the output of the convolution layer
(figure: another way to understand the convolution operation)
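The "input image * kernel = feature map" picture can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the talk: the function name `conv2d`, the toy image and the edge-detecting kernel are all assumptions, and the operation shown is the no-padding, stride-1 variant (no kernel flip), which is what convnet layers usually compute.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the feature map of weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Weighted sum of the image patch under the kernel.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image: dark left half, bright right half (hypothetical data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the column where the edge sits. In a trained network the kernel values are not hand-picked like here, they are the parameters learned during training.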

Image classification with convnets
● Dataset
○ e.g. CIFAR-10 dataset:
■ consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class
■ there are 50,000 training images and 10,000 test images
● Training phase
○ e.g. VGG-16 network
○ input: labeled images (x, y)
○ training loop:
■ forward propagation: given the weights w_l, compute the predictions
■ loss function: measure the error between the predictions and the labels y
■ backward propagation: compute w_{l+1} by minimizing the loss
■ repeat until convergence => w*
● Testing phase
○ use the trained model to classify new instances
○ output: predicted class
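The forward / loss / backward / repeat cycle can be illustrated on the smallest possible "network": a single weight `w` in the model `y = w * x`, trained with plain gradient descent on a mean squared error loss. Everything here (the model, the learning rate, the stopping tolerance, the toy data) is an assumption chosen for readability; a real convnet such as VGG-16 follows the same loop with millions of weights and a framework computing the gradients.

```python
def train(xs, ys, w=0.0, lr=0.05, tol=1e-12, max_steps=10_000):
    """Gradient descent on the one-parameter model y = w * x."""
    n = len(xs)
    loss = None
    for _ in range(max_steps):
        # Forward propagation: given the current w, compute predictions.
        preds = [w * x for x in xs]
        # Loss function: mean squared error between predictions and labels.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # Backward propagation: gradient of the loss with respect to w.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        w_next = w - lr * grad
        # Repeat until convergence: stop once w no longer moves.
        if abs(w_next - w) < tol:
            return w_next, loss
        w = w_next
    return w, loss


# Toy labeled data (x, y) generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_star, final_loss = train(xs, ys)
print(round(w_star, 6))  # 2.0
```

At test time only the forward step is needed: a new input `x` is classified (here, regressed) with the learned `w_star`.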

Relation between classification and object detection
● We have an accurate way of classifying images
○ e.g.: does this image contain a pedestrian?
● But how can we say WHERE this pedestrian is?

Solution:
● Sliding window
○ strategy:
■ split the image into fragments and classify each fragment independently
■ (figure: all fragments vs. fragments classified as pedestrian)
○ challenges:
■ how to deal with various object sizes, various aspect ratios, object overlap, or multiple responses
○ problem: the CNN must be applied at a huge number of locations and scales, which is very computationally expensive!
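To get a feeling for how quickly the number of fragments grows, here is a small sketch that enumerates sliding windows for a few window sizes and counts them. The image size, the window sizes and the stride are arbitrary assumptions, and the helper names are hypothetical.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every window that fits fully inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


def count_windows(img_w, img_h, sizes, stride):
    """Total number of fragments the classifier would have to process."""
    return sum(
        1
        for (w, h) in sizes
        for _ in sliding_windows(img_w, img_h, w, h, stride)
    )


# Three window shapes to cover different object sizes and aspect ratios.
sizes = [(64, 128), (128, 256), (128, 128)]
total = count_windows(640, 480, sizes, stride=8)
print(total)  # 8095
```

Even with only three window shapes and a coarse 8-pixel stride, one 640x480 frame needs about eight thousand classifier runs, which is what motivates the region-proposal approaches that follow.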

R-CNN (Region-based convolutional neural network)
Two steps:
● Select object proposals: Selective Search algorithm
○ its precision is too low for it to be used as an object detector on its own, but it works fine as a first step in the detection pipeline
● Apply a strong CNN classifier to the selected proposals
It outperformed all previous object detection algorithms
Limitations:
● Depends on an external algorithm to generate the hypotheses
● Needs to rescale object proposals to a fixed resolution
● Redundant computation: features are computed independently for every proposal, even for overlapping proposal regions
Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014

Fast R-CNN
From R-CNN to Fast R-CNN:
● input: image + region proposals
● RoI pooling on the “conv5” feature map for feature extraction
● softmax classifier instead of SVM classifier
● End-to-end multi-task training:
○ the last FC layer branches into two sibling output layers:
■ one that produces softmax probability estimates over K object classes
■ another that outputs the bounding box coordinates for each object
Advantages:
● Higher detection quality (mAP) than R-CNN
● Training is single-stage
● Training can update all network layers at once
● No disk storage is required for feature caching
Girshick, “Fast R-CNN”, ICCV 2015
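The region pooling step is what lets proposals of any size share one feature map and still feed fixed-size FC layers. A minimal sketch of the idea, assuming a single-channel feature map stored as a list of lists and RoIs already expressed in feature-map cells (the function names and the toy data are hypothetical, and real implementations work on multi-channel tensors):

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1), end-exclusive,
    into a fixed out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i (bins may overlap by one cell when sizes do not divide).
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map (hypothetical values).
fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
pooled_a = roi_max_pool(fm, (0, 0, 4, 4), 2, 2)  # 4x4 proposal
pooled_b = roi_max_pool(fm, (1, 1, 6, 4), 2, 2)  # 5x3 proposal
print(pooled_a)  # [[8, 10], [20, 22]]
print(pooled_b)  # [[16, 18], [22, 24]]
```

Both proposals come out as 2x2 grids although their sizes differ, and the feature map itself is computed only once per image.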

Faster R-CNN
Faster R-CNN = Fast R-CNN + RPN (Region Proposal Network)
● RPN
○ removes the dependency on an external RoI hypothesis generation method
○ is a convolutional network trained end-to-end
○ generates a list of high-quality region proposals (bbox coordinates + objectness scores)
● RPN and Fast R-CNN are then merged into a single network by sharing their convolutional features
○ the merged network predicts the class of each object + a refined bbox position
○ sharing the convolutional features enables nearly cost-free region proposals
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015

SSD (Single shot detector)
● Extra feature layers
○ additional convolutional feature layers of different sizes are placed at the end of the base net
○ each added feature layer produces a set of detection predictions, allowing predictions at multiple scales
○ this design leads to simple end-to-end training
● RoI proposals
○ the output space of region proposals contains a fixed set of default boxes over different aspect ratios and scales per feature map location
○ for each default bounding box, predict:
■ the shape offsets Δ(cx, cy, w, h) and
■ the confidence for all object categories (c1, …, cp)
● Non-maximum suppression
(figure: default boxes on the 8x8 and 4x4 feature maps)
Wei Liu et al., SSD: Single Shot MultiBox Detector, ECCV 2016
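The last two bullets can be sketched together: first the predicted offsets Δ(cx, cy, w, h) are applied to a default box, then non-maximum suppression removes overlapping duplicates. This is a simplified illustration under stated assumptions: centre offsets are scaled by the default box size and width/height offsets are log-scale, as in the SSD paper's encoding, but the extra "variance" scaling used by many implementations is left out, and the boxes, scores and threshold are made up.

```python
import math


def decode(default_box, offsets):
    """Apply offsets (dcx, dcy, dw, dh) to a default box (cx, cy, w, h).
    Returns corner coordinates (x0, y0, x1, y1)."""
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w
    cy = d_cy + t_cy * d_h
    w = d_w * math.exp(t_w)
    h = d_h * math.exp(t_h)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two default boxes on nearly the same object plus one elsewhere (toy data).
defaults = [(0.5, 0.5, 0.4, 0.4), (0.52, 0.5, 0.4, 0.4), (0.2, 0.2, 0.2, 0.2)]
offsets = [(0.0, 0.0, 0.0, 0.0)] * 3
scores = [0.9, 0.8, 0.7]
boxes = [decode(d, t) for d, t in zip(defaults, offsets)]
keep = nms(boxes, scores)
print(keep)  # [0, 2]
```

The second box overlaps the first by about 90% IoU and has a lower score, so it is suppressed as a duplicate response to the same object.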

Compare modern convolutional object detectors
Lots of variables to set up ...
● base net:
○ VGG16
○ ResNet101
○ InceptionV2
○ InceptionV3
○ ResNet
○ MobileNet
● Object detection architecture:
○ R-CNN
○ Fast R-CNN
○ Faster R-CNN
○ SSD
● Input image resolution
● Number of region proposals
● Frozen weights - for fine tuning

Speed/accuracy trade-offs
Takeaways:
● Faster R-CNN is slower but more accurate
● SSD is not as accurate but much faster, which makes it a good choice for mobile apps
Jonathan Huang et al., Speed/accuracy trade-offs for modern convolutional object detectors, CVPR 2017

Coding time
Problem to solve:
- a mobile app for real-time clothes detection
- class categories: Top, Pants, Shorts, Skirt and Dress
Frameworks:
● TensorFlow Object Detection API
- made by Google
- an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models
- input: images + labels
- output: inference graph (.pb format)
● LabelImg
- an open source graphical image annotation tool
- annotations are saved as XML files in PASCAL VOC format, the format used by the ImageNet dataset
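To make the annotation format concrete, here is a small sketch that reads one PASCAL VOC style XML annotation and flattens it into one row per labeled object. The sample XML, the file name `img_001.jpg`, the function name `voc_to_rows` and the column order are assumptions for illustration; they are not taken from the talk's scripts.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the PASCAL VOC layout (hypothetical values).
SAMPLE_XML = """<annotation>
  <filename>img_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>Dress</name>
    <bndbox><xmin>120</xmin><ymin>60</ymin><xmax>380</xmax><ymax>450</ymax></bndbox>
  </object>
  <object>
    <name>Top</name>
    <bndbox><xmin>400</xmin><ymin>80</ymin><xmax>600</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>"""


def voc_to_rows(xml_text):
    """Return one (filename, width, height, class, xmin, ymin, xmax, ymax)
    tuple per <object> in the annotation."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append((
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ))
    return rows


rows = voc_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Rows of this shape are easy to write out with the standard `csv` module, one line per bounding box.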

Coding time: step by step
● Create the dataset and split it into train (70%) and test (30%) folders
● Label the images with the LabelImg tool (output: one .xml file per image in the dataset)
● Convert .xml to .csv (use the dataset/xml_to_csv.py script; output: train_labels.csv, test_labels.csv)
● Convert to TFRecord format
○ set paths (from ../models/research):
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/object_detection
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
○ edit the generate_tfrecord.py file and change the label map + the path to the train/test folder
○ finally, run the generate_tfrecord.py script in a terminal:
python generate_tfrecord.py --csv_input=data/train_labels.csv --output_path=data/train.record
python generate_tfrecord.py --csv_input=data/test_labels.csv --output_path=data/test.record
○ output: train.record, test.record
● Training
○ create a label map: label_map.pbtxt
○ optional, but recommended :), choose a pretrained model to start from
○ prepare the .config file: .../models/research/object_detection/samples/configs/ssd_mobilenet_v2_coco.config
○ run the training script (from ../models/research/object_detection):
python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=ssd_mobilenet_v1_pets.config
● Export the inference graph:
python export_inference_graph.py --input_type image_tensor --pipeline_config_path pipeline.config --trained_checkpoint_prefix=training/model.ckpt-10750 --output_directory=inference_graph
○ output: the model in .pb format
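The label map mentioned in the training step is a small text file that maps each class name to an integer id. A minimal sketch that generates its content for the five clothing classes of this example is shown below; the helper name `build_label_map` is hypothetical, and the ids start at 1 because id 0 is reserved for the background class in the Object Detection API.

```python
CLASSES = ["Top", "Pants", "Shorts", "Skirt", "Dress"]


def build_label_map(class_names):
    """Return the text of a label_map.pbtxt with one item per class."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(items)


label_map_text = build_label_map(CLASSES)
print(label_map_text)
```

The printed text can be saved as label_map.pbtxt. The ids used here have to match the ids assigned to each class when the TFRecord files are generated, otherwise the trained model reports the wrong class names.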

e-mail: anca.ciurte@softvision.ro
Q&A

Integrating with Android
Speaker:
MIHALY NAGY - Android Community Influencer at Softvision

Android + TensorFlow

● Model File
● [Labels File]
● tensorflow-android dependency
● Boilerplate
● Integrate TF to process each frame

Each frame: Bitmap => Recognition

Follow Along:
http://goo.gl/SYHSb7
https://github.com/code-twister/tf_example

Thank You!
