1© Cloudera, Inc. All rights reserved.
Improving computer vision models at scale
Jan Kunigk | Principal Solutions Architect
Dr. Mirko Kämpf | Senior Solutions Architect
Marton Balassi | Solutions Architect
This slide deck is an updated version of our talk at Strata 2018 in London.
Motivation
Imagine the possibilities...
• detect dangerous situations in traffic
• detect fires in forests or landfills early using infrared drones
• detect extremely hard-to-find tumors
• detect combatants in satellite data
• detect violence in subway stations
• detect broken parts in a manufacturing line
... all that @ scale!
Requirements
• Fast random access to images
• Free text search for labels
• Visual user interface
• Execute existing Python and Scala deep learning pipelines at scale
• Automatic indexing of labels
• Easy model comparison
• Search for complex scenarios
Building blocks of our solution
• Fast random access to images
  - HBase is used for storing both the images and the corresponding labels
• Free text search for labels
  - Solr indexes are used to query the data
• Enrichment and augmentation with secondary data sources (e.g. GPS, CAN bus)
  - Hive/Impala tables are used to store enrichment data
• Visual interface
  - A Hue dashboard provides the UI
• Execute existing Python and Scala deep learning pipelines at scale
  - (Py)Spark is used to scale out the computation
• Automatic indexing of labels
  - The Lily indexer is used to automatically populate the Solr collection
Solution overview
Main users: data scientists and domain experts
Data Engineering and Model Lifecycle
Classifying an image
• Image classification
  - InceptionV3 [1]
• Semantic segmentation
  - SegNet [2]
• Bounding boxes
  - RetinaNet [3]
• Object masking
  - Detectron [4]
[1] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna: "Rethinking the Inception Architecture for Computer Vision", https://arxiv.org/abs/1512.00567
[2] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio: "The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation", https://arxiv.org/abs/1611.09326; SegNet demo: http://mi.eng.cam.ac.uk/projects/segnet/demo.php#demo
[3] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár: "Focal Loss for Dense Object Detection", https://arxiv.org/abs/1708.02002; implementation: https://github.com/fizyr/keras-retinanet
[4] https://github.com/facebookresearch/Detectron
Typical use case for model improvement
Consider using a tool that visualizes layer activation, like https://github.com/raghakot/keras-vis.
Data Science workflow with CDSW
• Models are trained on GPUs
• Cloudera Data Science Workbench has supported GPUs natively since version 1.1
• Your Ops people will appreciate it
https://blog.cloudera.com/blog/2017/07/prophecy-fulfilled-keras-and-cloudera-data-science-workbench/
https://blog.cloudera.com/blog/2017/09/customizing-docker-images-in-cloudera-data-science-workbench/
Access Patterns & Model Lifecycle with CDSW
[Figure: CDH ("IT-Crowd") and CDSW ("Data Crowd") side by side, with GPUs, image stores in HBase, Solr and augmentation data. Labels recoverable from the diagram:]
• Cloudera Data Science Workbench: algorithm prototyping & model training
• HUE Query Editor / Dashboards: ad-hoc analysis using SQL and Search
• HUE Query Editor / Dashboards: data engineering & curation
• Search for images by properties and context
• Search for time series patterns
• Access compliance and governance data
• GPU: depends on YARN resource types
• Other labels: tags, awesome.py, Tenzing
Full Data Pipeline
[Figure: end-to-end data pipeline. Labels recoverable from the diagram:]
• Image data: ffmpeg, jpg images in HBase column family CF:img_all, row keys are timestamps such as 20180428152330
• CF:tags: labels per model (imagenet, retinanet, tiny-yolo, …), e.g. stop-sign, person, truck, bicycle, traffic light, boat
• GPS: NMEA converted to Avro (gps2avro, pynmea2) with columns lat, lon, timestamp (sample values: lat 48.1 to 48.5, lon 9.0 to 9.2)
• Geodata: OpenStreetMap / Overpass API (overpy) with columns area | tunnel | bridge (sample rows: Stadium | no | no; B14 | yes | no; B14 | no | yes)
• CF:geo
• HBaseStorageHandler
• Lily (hbase-indexer-mr-job.jar)
• Tenzing
Time domains / resolution
Time domains shown: GPS, Image, Speed
Today's SQL, picking the closest speed sample per system:
SELECT rnk, system_id, speed, time_gap
FROM (
  SELECT
    row_number() OVER (PARTITION BY system_id ORDER BY time_gap ASC) AS rnk,
    system_id, speed, time_gap
  FROM (
    SELECT
      img_domain.system_id AS system_id,
      speed_domain.speed AS speed,
      abs(img_domain.time_s - speed_domain.time_s) AS time_gap
    FROM img_domain JOIN speed_domain
      ON img_domain.system_id = speed_domain.car_id
  ) pairs
) ranked
WHERE rnk = 1
A cool SQL dialect that supports range queries (it does not exist yet):
SELECT system_id, speed
FROM img_domain.time_s, speed_domain.time_s
WHERE WITHIN (img_domain.time_s, speed_domain.time_s, 1.5s)
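The range join wished for on this slide can be approximated outside SQL. A minimal sketch in plain Python, assuming each domain is a list of (timestamp in seconds, value) tuples sorted by time; the function name and the sample values are hypothetical, and only the 1.5 s tolerance comes from the slide:

```python
from bisect import bisect_left

def join_within(images, speeds, tolerance_s=1.5):
    """For each image, attach the speed sample closest in time, if it lies
    within tolerance_s seconds. Both inputs must be sorted by timestamp."""
    speed_times = [t for t, _ in speeds]
    joined = []
    for img_time, img_id in images:
        pos = bisect_left(speed_times, img_time)
        # Candidates: the sample just before and just after the image time.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(speeds)]
        if not candidates:
            continue
        best = min(candidates, key=lambda i: abs(speed_times[i] - img_time))
        gap = abs(speed_times[best] - img_time)
        if gap <= tolerance_s:
            joined.append((img_id, speeds[best][1], gap))
    return joined

# Hypothetical sample data: (time in seconds, payload).
images = [(10.0, "img-a"), (20.0, "img-b"), (30.0, "img-c")]
speeds = [(9.2, 48.0), (19.0, 52.5), (27.0, 50.0)]
matches = join_within(images, speeds)
```

Unlike the ranked SQL query, this keeps one match per image rather than one per system, and images without a sample inside the tolerance are dropped.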
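PySpark implementation (Keras)
from keras.applications.inception_v3 import InceptionV3
from pyspark import SparkContext

# conf, FLAGS, MODEL_NAME, common and run_inference_on_image are defined elsewhere in the project.

def predict(iterator):
    # Build the model once per partition, not once per image.
    model = InceptionV3(weights=None)
    model.load_weights(FLAGS.weights_file)
    return [(x[0], run_inference_on_image(model, x[1])) for x in iterator]

def main():
    sc = SparkContext(conf=conf)
    hbase_io = common.HbaseIO(FLAGS)
    out_format = common.OutputFormatter(FLAGS, MODEL_NAME)
    hbase_images = hbase_io.load_from_hbase(sc)  # (key, image) pairs
    classified_images = (hbase_images.mapPartitions(predict)
                         .map(out_format.imagenet_format))
    classified_images.foreachPartition(hbase_io.put_to_hbase)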
PySpark implementation (Keras)
• The Python environment with TensorFlow is distributed to the executors at runtime; it is not preinstalled on the nodes
• The individual models only need to implement the following functions:
  - prepare
  - predict
  - output_format
• Conceptually this is very close to the scikit-learn or Spark ML Pipelines approach
• Deep Learning Pipelines can be a way to streamline the implementation
https://databricks.com/blog/2017/06/06/databricks-vision-simplify-large-scale-deep-learning.html
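The three-function contract named on this slide (prepare, predict, output_format) could look roughly like this. A minimal sketch, assuming a plain Python base class; the class names, the toy brightness "model" and the output dictionary layout are hypothetical and not taken from the project:

```python
from abc import ABC, abstractmethod

class ImageModel(ABC):
    """Hypothetical contract: every model supplies these three steps."""

    @abstractmethod
    def prepare(self, raw):
        """Turn stored bytes into the model's input representation."""

    @abstractmethod
    def predict(self, prepared):
        """Return raw predictions for one prepared input."""

    @abstractmethod
    def output_format(self, key, prediction):
        """Shape a prediction into the record written back to storage."""

class BrightnessModel(ImageModel):
    """Toy stand-in for a real network: labels an image by mean pixel value."""

    def prepare(self, raw):
        return list(raw)  # bytes -> list of ints in 0..255

    def predict(self, prepared):
        mean = sum(prepared) / len(prepared)
        return "bright" if mean >= 128 else "dark"

    def output_format(self, key, prediction):
        return {"row_key": key, "tags:brightness": prediction}

def run_partition(model, records):
    """Apply the three steps to every (key, raw image) pair of one partition."""
    for key, raw in records:
        yield model.output_format(key, model.predict(model.prepare(raw)))

# Hypothetical records: (row key, raw image bytes).
records = [("20180428152330", bytes([250, 240, 200])),
           ("20180428152831", bytes([10, 20, 30]))]
results = list(run_partition(BrightnessModel(), records))
```

With the model bound through functools.partial, a generator like run_partition has the iterator-in, iterable-out shape that mapPartitions expects.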
Spark implementation (dl4j / Scala)
import org.apache.spark.SparkContext
import org.deeplearning4j.util.ModelSerializer
import org.nd4j.linalg.api.ndarray.INDArray

// conf, common, modelLoc and run_inference_on_image are defined elsewhere in the project.

def predict(pairs: Iterator[(String, (INDArray, Int, Int))]) = {
  // Restore the model once per partition.
  val model = ModelSerializer.restoreComputationGraph(modelLoc)
  pairs.map { case (name, image) =>
    (name, run_inference_on_image(model, image))
  }
}

def main(args: Array[String]) = {
  val sc = new SparkContext(conf)
  val hbase_io = common.HbaseIO(args)
  val out_format = common.OutputFormatter(args)
  val hbase_images = hbase_io.load_from_hbase(sc)
  val classified_images = hbase_images
    .mapPartitions(predict)
    .map(out_format.imagenet_format)
  classified_images.foreachPartition(hbase_io.put_to_hbase)
}
Demo
Moving further
Label Quality Inspection
Visual label inspection via HUE:
Label quality & relations between objects
Index contains:
- object relations
- predicted labels
- object statistics
Rendered bounding boxes are key to visual inspection: they allow easy comparison of multiple model classes (A, B) or model versions (C1, C2).
[Screenshots: Model A | Model B]
From labels to meaning ...
Person in front of car: when their bounding boxes overlap, a property of the object pair becomes a fact.
Facts are added to a multivalued field of a document in a Solr index.
Query:
q=overlap_category:car-person OR overlap_category:person-car
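How a pair of overlapping boxes becomes a searchable fact can be sketched as follows. This is an illustration in plain Python, assuming boxes given as (x_min, y_min, x_max, y_max) and a simple intersection-area test; the function names and the document layout are hypothetical, and only the overlap_category field name comes from the query on this slide:

```python
from itertools import combinations

def intersection_area(a, b):
    """Area shared by two boxes given as (x_min, y_min, x_max, y_max)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)

def overlap_facts(detections):
    """Return sorted 'labelA-labelB' facts for every overlapping pair."""
    facts = set()
    for (label_a, box_a), (label_b, box_b) in combinations(detections, 2):
        if intersection_area(box_a, box_b) > 0:
            # Sort the labels so car/person and person/car give one fact.
            facts.add("-".join(sorted((label_a, label_b))))
    return sorted(facts)

# Hypothetical detections for one image: (label, bounding box).
detections = [
    ("person", (40, 30, 80, 120)),
    ("car", (60, 50, 200, 130)),
    ("truck", (300, 40, 420, 140)),
]
doc = {"id": "20180428152330", "overlap_category": overlap_facts(detections)}
```

Normalizing the label order is a design choice of this sketch: a single term such as overlap_category:car-person would then match, whereas the query on the slide covers both orders with OR.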
Moving further
Semantic Search
How to identify relations?
1. Build an ontology for traffic scenes or any domain you work on.
2. Map statistical object properties to an RDF graph using heuristics.
3. Combine scene graphs in a triple store.
4. Enable search with SPARQL.
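Steps 2 to 4 can be illustrated without a real triple store. A toy sketch in plain Python, assuming facts are stored as (subject, predicate, object) tuples and queried with None as a wildcard; the predicate names and the match helper are hypothetical, and a production setup would use an RDF store queried with SPARQL instead:

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Scene graphs from two images, merged into one store (step 3).
store = {
    ("img1/person0", "type", "person"),
    ("img1/car0", "type", "car"),
    ("img1/person0", "inFrontOf", "img1/car0"),
    ("img2/truck0", "type", "truck"),
}

# Step 4: which objects are in front of a car?
cars = {s for (s, _, _) in match(store, predicate="type", obj="car")}
in_front_of_car = sorted(
    s for (s, _, o) in match(store, predicate="inFrontOf") if o in cars
)
```

The two-step lookup mirrors a two-pattern graph query: first bind everything typed as a car, then keep the inFrontOf edges that point at one of them.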
How to identify semantic relations?
• Build an ontology for traffic scenes
• Map statistical object properties to an RDF graph using heuristics
• Combine scene graphs (triple store)
• Search with SPARQL
• Object detection
  - Deep neural networks
• Bounding box analysis
  - Rendering of bounding boxes with labels
  - Geometry-based heuristics
  - Overlap ratios
  - Orientation analysis
• Solr search by
  - Label
  - Relation
Why search on a knowledge base?
• This approach makes it easy to search for complex scenarios:
THINGS (pedestrian, stop sign, hot spot, gun, …)
RELATIONSHIPS (close by, in front of, above, underneath, ...)
ACTIVITIES (danger, theft, evasion, escape)
SITUATIONS (combinations of THINGS, RELATIONSHIPS, and ACTIVITIES)
• Searches are very fast, even in huge image collections.
• Knowledge graphs remove the need to know Solr schema details.
Implementation of complementary search channels ...
Triplification using local graphs
Summary
What we can do with images today:
• Search for combinations and amounts of objects at scale: "at least 5 cars and 2 trucks"
• Search for basic relationships among those objects: "in front of", "in a line"
• Enrich the search experience with other domains: geospatial, sensor data, etc.
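The first kind of search, object counts per image, reduces to a simple filter once labels are indexed. A small sketch in plain Python, assuming each image carries a list of predicted labels; the function name and the sample data are hypothetical, and at scale this filter would run in the search index rather than in application code:

```python
from collections import Counter

def images_with_at_least(labels_by_image, minimums):
    """Return ids of images whose label counts meet every minimum."""
    hits = []
    for image_id, labels in labels_by_image.items():
        counts = Counter(labels)
        if all(counts[label] >= n for label, n in minimums.items()):
            hits.append(image_id)
    return sorted(hits)

# Hypothetical predicted labels per image.
labels_by_image = {
    "img1": ["car"] * 6 + ["truck"] * 2 + ["person"],
    "img2": ["car"] * 5 + ["truck"],
    "img3": ["car"] * 2 + ["truck"] * 3,
}
hits = images_with_at_least(labels_by_image, {"car": 5, "truck": 2})
```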
This helps to:
• Gain better understanding of the quality of our CV models/apps
• Discover corner cases, improve model-lifecycle and build new (data) products faster
In the future:
• Focus on semantic search, advanced visualization and improved model lifecycles
Thank you
jk@cloudera.com
mirko@cloudera.com
mbalassi@cloudera.com
Appendix: Getting data
There are many great datasets out there for research purposes:
• Cityscapes, https://www.cityscapes-dataset.com/
• COCO, http://cocodataset.org/#home
• YouTube-8M, https://research.google.com/youtube8m/

办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 

Improving computer vision models at scale presentation

  • 1. Improving computer vision models at scale. © Cloudera, Inc. All rights reserved.
      Jan Kunigk | Principal Solutions Architect
      Dr. Mirko Kämpf | Senior Solutions Architect
      Marton Balassi | Solutions Architect
  • 2. The slide deck is an updated version of our talk @Strata2018 in London
  • 3. Motivation
  • 4. Imagine the possibilities...
      • detect dangerous situations in traffic
      • detect a fire in a forest or a landfill early via infrared drones
      • detect extremely hard-to-find tumors
      • detect combatants in satellite data
      • detect violence in a subway station
      • detect broken parts in a manufacturing line
      ... all that @ scale!
  • 5. Requirements
      • Fast random access to images
      • Free text search for labels
      • Visual user interface
      • Execute existing Python and Scala deep learning pipelines at scale
      • Automatic indexing of labels
      • Easy model comparison
      • Search for complex scenarios
  • 6. Building blocks of our solution
      • Fast random access to images: HBase is used for storing both the images and the corresponding labels
      • Free text search of labels: Solr indexes are used to query the data
      • Enrichment and augmentation with secondary data sources (e.g. GPS, CANbus): Hive/Impala tables are used to store enrichment data
      • Visual interface: a Hue dashboard provides the UI
      • Execute existing Python and Scala deep learning pipelines at scale: (Py)Spark is used to scale out the computation
      • Automatic indexing of labels: the Lily indexer is used to automatically populate the Solr collection
  • 7. Solution overview. Main users: Data Scientists and Domain Experts
  • 8. Data Engineering and Model Lifecycle
  • 9. Classifying an image
      • Object detection: InceptionV3 [1]
      • Semantic segmentation: SegNet [2]
      • Bounding boxes: RetinaNet [3]
      • Object masking: Detectron [4]
      [1] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna: "Rethinking the Inception Architecture for Computer Vision", https://arxiv.org/abs/1512.00567
      [2] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio: "The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation", https://arxiv.org/abs/1611.09326, http://mi.eng.cam.ac.uk/projects/segnet/demo.php#demo
      [3] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár: "Focal Loss for Dense Object Detection", https://arxiv.org/abs/1708.02002, https://github.com/fizyr/keras-retinanet
      [4] https://github.com/facebookresearch/Detectron
  • 10. Typical use case for model improvement
      Consider using a tool that visualizes layer activations, such as https://github.com/raghakot/keras-vis
  • 11. Data Science workflow with CDSW
      • Models are trained on GPUs
      • Since version 1.1, Cloudera Data Science Workbench natively supports GPUs
      • Your Ops people will appreciate it
      https://blog.cloudera.com/blog/2017/07/prophecy-fulfilled-keras-and-cloudera-data-science-workbench/
      https://blog.cloudera.com/blog/2017/09/customizing-docker-images-in-cloudera-data-science-workbench/
  • 12. Access Patterns & Model Lifecycle with CDSW
      [Diagram; recoverable labels: CDH, CDSW, IT-Crowd, Data Crowd, GPU, img, tags, Solr, Augmentation, HBase, awesome.py, Tenzing]
      • Search for images by properties and context
      • Access compliance and governance data
      • Search for time series patterns
      • Cloudera Data Science Workbench: algorithm prototyping & model training
      • HUE Query Editor / Dashboards: ad-hoc analysis using SQL and Search
      • HUE Query Editor / Dashboards: data engineering & curation
      • GPU: depends on YARN resource types
  • 13. Full Data Pipeline
      [Diagram; recoverable labels, grouped:]
      • Image data: ffmpeg, jpg, HBase column family CF:img_all
      • Labels: HBase column family CF:tags, filled by imagenet, retinanet, tiny-yolo, … (e.g. stop-sign, person, truck, boat, bicycle, traffic light)
      • Geodata: NMEA, gps2avro, pynmea2, AVRO, overpy, OpenStreetMap / Overpass API, HBaseStorageHandler, HBase column family CF:geo
      • Geodata columns: lon (9.0, 9.1, 9.2), lat (48.1, 48.3, 48.5), timestamp (20180428152138, 20180428152330, 20180428152831), area | tunnel | bridge (Stadium | no | no; B14 | yes | no; B14 | no | yes)
      • Key: timestamp, e.g. 20180428152330
      • Indexing: Lily, hbase-indexer-mr-job.jar
  • 14. Time domains / resolution
      [Figure: timelines for GPS, Image, Speed]
      SQL:
          SELECT rnk, system_id, speed, time_gap
          FROM (
            SELECT row_number() OVER (PARTITION BY system_id ORDER BY time_gap ASC) AS rnk,
                   system_id, speed, time_gap
            FROM (
              SELECT img_domain.system_id AS system_id,
                     speed_domain.speed AS speed,
                     abs(img_domain.time_s - speed_domain.time_s) AS time_gap
              FROM img_domain
              JOIN speed_domain ON img_domain.system_id = speed_domain.car_id
            ) t
          ) t
          WHERE rnk = 1
      A cool language that supports range queries (it does not exist yet):
          SELECT system_id, speed
          FROM img_domain.time_s, speed_domain.time_s
          WHERE WITHIN (img_domain.time_s, speed_domain.time_s, 1.5s)
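The time-domain alignment on slide 14 can also be sketched outside SQL. The following is a minimal illustration, not the authors' implementation: it assumes the speed samples are available as a time-sorted list of `(time_s, speed)` tuples, and the function name `nearest_within` and the 1.5 s default are chosen here only to mirror the `WITHIN (..., 1.5s)` idea.

```python
from bisect import bisect_left


def nearest_within(image_times, speed_samples, tolerance_s=1.5):
    """Pair each image timestamp with the closest speed sample.

    speed_samples must be a list of (time_s, speed) tuples sorted by time_s.
    Returns a list of (image_time, speed_or_None); None means no sample
    lies within tolerance_s seconds of the image.
    """
    times = [t for t, _ in speed_samples]
    matches = []
    for img_t in image_times:
        i = bisect_left(times, img_t)
        # Only the neighbours left and right of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - img_t), default=None)
        if best is not None and abs(times[best] - img_t) <= tolerance_s:
            matches.append((img_t, speed_samples[best][1]))
        else:
            matches.append((img_t, None))
    return matches


speed = [(0.0, 10.0), (2.0, 20.0), (10.0, 50.0)]
aligned = nearest_within([0.4, 1.9, 5.0], speed)
```

Unlike the ranked SQL join, this variant drops matches that are too far apart in time instead of always returning the closest one.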
  • 15. PySpark implementation (Keras)
          from keras.applications.inception_v3 import InceptionV3
          from pyspark import SparkContext

          # FLAGS, conf, MODEL_NAME, common and run_inference_on_image
          # are defined elsewhere in the project.

          def predict(iterator):
              # Load the model once per partition, not once per image.
              model = InceptionV3(weights=None)
              model.load_weights(FLAGS.weights_file)
              return [(x[0], run_inference_on_image(model, x[1])) for x in iterator]

          def main():
              sc = SparkContext(conf=conf)
              hbase_io = common.HbaseIO(FLAGS)
              out_format = common.OutputFormatter(FLAGS, MODEL_NAME)
              hbase_images = hbase_io.load_from_hbase(sc)
              # Parentheses allow the chained call to continue on the next line.
              classified_images = (hbase_images.mapPartitions(predict)
                                   .map(out_format.imagenet_format))
              classified_images.foreachPartition(hbase_io.put_to_hbase)
  • 16. PySpark implementation (Keras)
      • The Python environment with TensorFlow is distributed to the executors at runtime; it is not preinstalled on the nodes
      • The individual models only need to implement the following functions:
          • prepare
          • predict
          • output_format
      • Conceptually this is very close to the scikit-learn or Spark ML Pipelines approach
      • Deep Learning Pipelines can be a way to streamline the implementation
      https://databricks.com/blog/2017/06/06/databricks-vision-simplify-large-scale-deep-learning.html
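The three-function contract from slide 16 could look roughly like the sketch below. The slides only name the functions, so the signatures, the `ToyModel` class, and the `classify_partition` helper are assumptions made for illustration; the "inference" is a placeholder that involves no real network.

```python
class ToyModel:
    """Hypothetical stand-in for a model wrapper; not taken from the slides."""

    name = "toy-model"

    def prepare(self):
        # A real wrapper would load weights here, once per partition.
        self.loaded = True

    def predict(self, image_bytes):
        # Placeholder inference: derive a fake label from the payload size.
        return ["size-%d" % len(image_bytes)]

    def output_format(self, key, labels):
        # Shape the result as (row key, cell value) for the storage layer.
        return (key, {"model": self.name, "labels": labels})


def classify_partition(model, records):
    """Call prepare once, then predict and output_format for every record."""
    model.prepare()
    for key, image_bytes in records:
        yield model.output_format(key, model.predict(image_bytes))


results = list(classify_partition(ToyModel(), [("k1", b"abc"), ("k2", b"")]))
```

With Spark, the same helper could be passed as `rdd.mapPartitions(lambda it: classify_partition(ToyModel(), it))`, so the expensive `prepare` step runs once per partition rather than once per image.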
  • 17. Spark implementation (dl4j / Scala)
          import org.apache.spark.SparkContext
          import org.deeplearning4j.util.ModelSerializer
          import org.nd4j.linalg.api.ndarray.INDArray

          // conf, modelLoc, common and run_inference_on_image are defined elsewhere in the project.

          def predict(pairs: Iterator[(String, (INDArray, Int, Int))]) = {
            // Restore the model once per partition.
            val model = ModelSerializer.restoreComputationGraph(modelLoc)
            pairs.map { case (name, image) =>
              (name, run_inference_on_image(model, image))
            }
          }

          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(conf)
            val hbase_io = common.HbaseIO(args)
            val out_format = common.OutputFormatter(args)
            val hbase_images = hbase_io.load_from_hbase(sc)
            val classified_images = hbase_images.mapPartitions(predict)
              .map(out_format.imagenet_format)
            classified_images.foreachPartition(hbase_io.put_to_hbase)
          }
  • 18. Demo
  • 19. Moving further: Label Quality Inspection
  • 20. Visual label inspection via HUE: label quality & relations between objects
      • The index contains: object relations, predicted labels, object statistics
      • Rendered bounding boxes are key to visual inspection
      • They allow easy comparison of multiple model classes (A, B) or model versions (C1, C2)
      [Figure: Model A and Model B side by side]
  • 21. From labels to meaning
      • Person in front of car: the bounding boxes overlap, and this property of the object pair becomes a fact
      • Facts are added to a multivalued field of a document in a Solr index
      • Query: q=overlap_category:car-person OR overlap_category:person-car
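How overlapping bounding boxes might be turned into `overlap_category` values (slide 21) is sketched below. The box layout `(x1, y1, x2, y2)`, the 0.1 threshold, and the choice to divide by the smaller box area are assumptions for illustration; the slides do not specify the heuristic.

```python
from itertools import combinations


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0  # the boxes do not intersect
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (w * h) / min(area_a, area_b)


def overlap_facts(detections, threshold=0.1):
    """Return sorted, de-duplicated 'labelA-labelB' facts for overlapping pairs.

    detections is a list of (label, box) tuples.
    """
    facts = set()
    for (label_a, box_a), (label_b, box_b) in combinations(detections, 2):
        if overlap_ratio(box_a, box_b) >= threshold:
            # Sorting the labels gives one canonical spelling per pair.
            facts.add("-".join(sorted((label_a, label_b))))
    return sorted(facts)


detections = [
    ("person", (10, 10, 20, 30)),
    ("car", (15, 5, 60, 40)),
    ("truck", (100, 100, 150, 150)),
]
# A document shaped for a multivalued Solr field; "frame-0001" is a made-up id.
doc = {"id": "frame-0001", "overlap_category": overlap_facts(detections)}
```

Because the labels are sorted before joining, a single term such as `overlap_category:car-person` would suffice here, whereas the query on the slide covers both orderings.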
  • 22. Moving further: Semantic Search
  • 23. How to identify relations?
      1. Build an ontology for traffic scenes, or for any domain you work on
      2. Map statistical object properties to an RDF graph using heuristics
      3. Combine scene graphs in a triple store
      4. Enable search with SPARQL
  • 24. How to identify semantic relations?
      • Build an ontology for traffic scenes
      • Map statistical object properties to an RDF graph using heuristics
      • Combine scene graphs (triple store)
      • Search with SPARQL
      • Object detection: deep neural networks
      • Bounding box analysis: rendering of BBs with labels
      • Geometry-based heuristics: overlap ratios, orientation analysis
      • Solr search by: label, relation
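The search step on slides 23 and 24 can be illustrated without a real triple store. The sketch below keeps triples as plain Python tuples and matches a list of patterns against them, which is roughly what a SPARQL basic graph pattern does. The predicate names (`type`, `inFrontOf`) and the object identifiers are made up for this example, and a production setup would use an actual RDF store and SPARQL instead.

```python
def match(triples, patterns, binding=None):
    """Yield variable bindings that satisfy every pattern.

    A pattern is a (subject, predicate, object) tuple of strings;
    terms starting with '?' are variables, all others must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(triples, rest, candidate)


# Two tiny scene graphs combined into one list of triples.
triples = [
    ("s1/o1", "type", "person"),
    ("s1/o2", "type", "car"),
    ("s1/o1", "inFrontOf", "s1/o2"),
    ("s2/o1", "type", "person"),
]

# "Find a person in front of a car."
query = [
    ("?p", "type", "person"),
    ("?c", "type", "car"),
    ("?p", "inFrontOf", "?c"),
]
hits = list(match(triples, query))
```

The appeal of this style of query is that it is phrased in terms of things and relations rather than index field names.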
  • 25. Why search on a knowledge base?
      • This approach makes it easy to search for complex scenarios:
          • THINGS (pedestrian, stop sign, hot spot, gun, …)
          • RELATIONSHIPS (close by, in front of, above, underneath, ...)
          • ACTIVITIES (danger, theft, evasion, escape)
          • SITUATIONS (combinations of THINGS, RELATIONSHIPS, and ACTIVITIES)
      • ... very fast, even in huge image collections
      • Knowledge graphs remove the need to know Solr schema details
  • 26. Implementation of complementary search channels: triplification using local graphs
  • 27. Summary
      What we can do with images today:
      • Search for combinations and amounts of objects at scale: "at least 5 cars and 2 trucks"
      • Search for basic relationships among those things: "in front of", "in a line"
      • Enrich the search experience with other domains: geospatial, sensor data, etc.
      This helps to:
      • Gain a better understanding of the quality of our CV models/apps
      • Discover corner cases, improve the model lifecycle, and build new (data) products faster
      In the future:
      • Focus on semantic search, advanced visualization, and improved model lifecycles
  • 28. Thank you
      jk@cloudera.com, mirko@cloudera.com, mbalassi@cloudera.com
  • 29. Appendix: Getting data
      There are many great datasets out there for research purposes:
      • Cityscapes, https://www.cityscapes-dataset.com/
      • COCO, http://cocodataset.org/#home
      • YouTube-8M, https://research.google.com/youtube8m/