SlideShare una empresa de Scribd logo
1 de 21
1© Cloudera, Inc. All rights reserved.
Oryx 2 Overview
Sean Owen | Cloudera | @sean_r_owen
2© Cloudera, Inc. All rights reserved.
Consider the Music Recommender
Collect
Play Data
& Do Data
Science
Build
Taste
Model
Offline
Learn
Quickly
from
New Plays
Recommend
New Songs
Now
3© Cloudera, Inc. All rights reserved.
From Exploratory to Operational?
 Exploratory Analytics Operational Analytics 
Explore Data
Pick Model
Build Model
at Scale, Offline
Continuously
Update Model
?
Score Model in
Real-Time
?
4© Cloudera, Inc. All rights reserved.
Large Scale or Real-Time?
Large-Scale
Offline
Batch
Real-Time
Online
Streaming
vs
Why Don’t We Have Both?
λ!
5© Cloudera, Inc. All rights reserved.
• Batch Layer
• High latency, high throughput
• Compute official result
• Speed Layer
• Low latency
• Compute approximate update to
last known result
• Serving Layer
• Real-time
• Merge batch/speed results
The Lambda Architecture
www.ymc.ch/en/lambda-architecture-part-1
6© Cloudera, Inc. All rights reserved.
• Batch Layer
• Train, evaluate, tune model
over all data in hours
• Speed Layer
• Update model approximately in
minutes or seconds
• Serving Layer
• Make prediction, recommendation
from model in milliseconds
λ Architecture fits ML + Hadoop
Streaming MLlib
7© Cloudera, Inc. All rights reserved.
www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg
8© Cloudera, Inc. All rights reserved.
History (or: 5th time’s a charm)
Taste
2005 – 2009
- Recommender
toolkit in Java
- Local only
- Serves results
Apache Mahout
2009 – 2014
- Adds Hadoop-based
model building
at scale
- But no serving
Myrrix
2011-2013
- Mahout recs
reimagined
- Adds serving to
Hadoop-based
model build
Oryx 1
2013 –
- Extends to
classification,
clustering
- PMML
- Merge with
cloudera/ml
Oryx 2
2014 –
- Same APIs / goals
- Rewrite
- Full lambda
architecture
- Kafka + Spark + YARN
9© Cloudera, Inc. All rights reserved.
Complementary, Not Competitive
Most ML-on-Hadoop tools are
for building models only, and
excel at this.
Oryx and similar projects do
everything else around this:
continuous update, serving
10© Cloudera, Inc. All rights reserved.
Architecture
HDFS
Input (Kafka topic)
Spark
Streaming
Batch
Layer
Recent
Input
Historical
Input
Models + Updates (Kafka topic)
Model
Spark
Streaming
Speed
Layer
Recent
Input
Model
Updates
Model
Input
Serving
Layer
Input
Serving
LayerServing
LayerServing
Layer
Model
Updates
Model
Serving
LayerQueries
Input
11© Cloudera, Inc. All rights reserved.
• Input Kafka topic
• Any type; usually strings
• From external or Serving Layer
• Update Kafka topic
• Serialized models (PMML)
produced by Batch Layer
• Model updates / deltas
produced by Speed Layer
Data Transport
Input (Kafka topic)
Recent
Input
Models + Updates (Kafka topic)
Model
Recent
Input
Model
Updates
Model
Input
Input
Model
Updates
Model
12© Cloudera, Inc. All rights reserved.
• Spark Streaming
• Persists input topic data
to HDFS from Kafka
• Builds “model” occasionally from
historical and new data
• Hours
• ML: can use MLlib
• ML: tunes hyperparameters
• Publishes models as PMML to
update topic
Batch Layer
HDFS
Spark
Streaming
Batch
Layer
Recent
Input
Historical
Input
Model
13© Cloudera, Inc. All rights reserved.
• Spark Streaming
• Listens for new PMML models
• Listens to input topic too
• Computes approximate updates to
model implied by input and publishes
to update topic
• Seconds
Speed Layer
Spark
Streaming
Speed
Layer
Recent
Input
Model
Updates
Model
14© Cloudera, Inc. All rights reserved.
• Tomcat + JAX-RS
• (Can deploy on YARN)
• REST API
• Listens for new PMML models and
updates from update topic
• Scores model / answers queries
• Writes to input topic too
• No shared state; scales horizontally
• Milliseconds
Serving Layer
Serving
Layer
Input
Serving
LayerServing
LayerServing
Layer
Model
Updates
Model
Serving
LayerQueries
Input
15© Cloudera, Inc. All rights reserved.
Logical Architecture
Serving Layer Speed Layer Batch Layer
App Tier oryx-app-serving oryx-app-mllib
oryx-app
oryx-app-mllib
oryx-app
ML Tier oryx-ml oryx-ml
Lambda Tier oryx-lambda-serving oryx-lambda oryx-lambda
Generic Lambda-Architecture support
ML-specific specialization
Prebuilt recommender, clustering,
classification implementations
16© Cloudera, Inc. All rights reserved.
• Scoring on the fly is not cheap
• 1M user/items ≈ 1GB heap
at scale (≈ 200 features)
• Feature, item count determines
latency, throughput
• Java 8 + 16-core 2.3GHz Xeon
• Smallish models ≈
100s QPS, 10s ms latency
• Huge models ≈
Single digit QPS, 100s ms latency
Recommendation Benchmarks
17© Cloudera, Inc. All rights reserved.
• Spark 1.3.1
• MLlib
• Streaming
• Kafka 0.8.2.1
• Hadoop 2.6
• HDFS
• YARN
• JavaEE 7
• JAX-RS 2
• Jersey 2
• Servlet 3.1
• Tomcat 8
• JPMML + PMML 4.2.1
Key Technology Roster
CDH 5.4+
18© Cloudera, Inc. All rights reserved.
• Cloudera Labs project
• Partial collaboration with Intel
• Not shipped with CDH
• Not supported, no plans to yet
• 2.0.0 beta 3
• Suitable for POCs
• 2.0.0 by end of year
• Best For
• Recommender engines
• Real-time anomaly detection
• Real-time classification
• Problems where both scale and
latency are important
• CDH users
Status
19© Cloudera, Inc. All rights reserved.
Get Started in ~1 Hour
http://oryx.io
20© Cloudera, Inc. All rights reserved.
Thank you
@sean_r_owen
sowen@cloudera.com
21© Cloudera, Inc. All rights reserved.
The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce,
Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco

Más contenido relacionado

La actualidad más candente

Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Databricks
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Databricks
 

La actualidad más candente (20)

Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Container Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleContainer Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycle
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Similar a Lambda architecture on Spark, Kafka for real-time large scale ML

Similar a Lambda architecture on Spark, Kafka for real-time large scale ML (20)

Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 

Más de huguk

Más de huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaming
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

Lambda architecture on Spark, Kafka for real-time large scale ML

  • 1. 1© Cloudera, Inc. All rights reserved. Oryx 2 Overview Sean Owen | Cloudera | @sean_r_owen
  • 2. 2© Cloudera, Inc. All rights reserved. Consider the Music Recommender Collect Play Data & Do Data Science Build Taste Model Offline Learn Quickly from New Plays Recommend New Songs Now
  • 3. 3© Cloudera, Inc. All rights reserved. From Exploratory to Operational?  Exploratory Analytics Operational Analytics  Explore Data Pick Model Build Model at Scale, Offline Continuously Update Model ? Score Model in Real-Time ?
  • 4. 4© Cloudera, Inc. All rights reserved. Large Scale or Real-Time? Large-Scale Offline Batch Real-Time Online Streaming vs Why Don’t We Have Both? λ!
  • 5. 5© Cloudera, Inc. All rights reserved. • Batch Layer • High latency, high throughput • Compute official result • Speed Layer • Low latency • Compute approximate update to last known result • Serving Layer • Real-time • Merge batch/speed results The Lambda Architecture www.ymc.ch/en/lambda-architecture-part-1
  • 6. 6© Cloudera, Inc. All rights reserved. • Batch Layer • Train, evaluate, tune model over all data in hours • Speed Layer • Update model approximately in minutes or seconds • Serving Layer • Make prediction, recommendation from model in milliseconds λ Architecture fits ML + Hadoop Streaming MLlib
  • 7. 7© Cloudera, Inc. All rights reserved. www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg
  • 8. 8© Cloudera, Inc. All rights reserved. History (or: 5th time’s a charm) Taste 2005 – 2009 - Recommender toolkit in Java - Local only - Serves results Apache Mahout 2009 – 2014 - Adds Hadoop-based model building at scale - But no serving Myrrix 2011-2013 - Mahout recs reimagined - Adds serving to Hadoop-based model build Oryx 1 2013 – - Extends to classification, clustering - PMML - Merge with cloudera/ml Oryx 2 2014 – - Same APIs / goals - Rewrite - Full lambda architecture - Kafka + Spark + YARN
  • 9. 9© Cloudera, Inc. All rights reserved. Complementary, Not Competitive Most ML-on-Hadoop tools are for building models only, and excel at this. Oryx and similar projects do everything else around this: continuous update, serving
  • 10. 10© Cloudera, Inc. All rights reserved. Architecture HDFS Input (Kafka topic) Spark Streaming Batch Layer Recent Input Historical Input Models + Updates (Kafka topic) Model Spark Streaming Speed Layer Recent Input Model Updates Model Input Serving Layer Input Serving LayerServing LayerServing Layer Model Updates Model Serving LayerQueries Input
  • 11. 11© Cloudera, Inc. All rights reserved. • Input Kafka topic • Any type; usually strings • From external or Serving Layer • Update Kafka topic • Serialized models (PMML) produced by Batch Layer • Model updates / deltas produced by Speed Layer Data Transport Input (Kafka topic) Recent Input Models + Updates (Kafka topic) Model Recent Input Model Updates Model Input Input Model Updates Model
  • 12. 12© Cloudera, Inc. All rights reserved. • Spark Streaming • Persists input topic data to HDFS from Kafka • Builds “model” occasionally from historical and new data • Hours • ML: can use MLlib • ML: tunes hyperparameters • Publishes models as PMML to update topic Batch Layer HDFS Spark Streaming Batch Layer Recent Input Historical Input Model
  • 13. 13© Cloudera, Inc. All rights reserved. • Spark Streaming • Listens for new PMML models • Listens to input topic too • Computes approximate updates to model implied by input and publishes to update topic • Seconds Speed Layer Spark Streaming Speed Layer Recent Input Model Updates Model
  • 14. 14© Cloudera, Inc. All rights reserved. • Tomcat + JAX-RS • (Can deploy on YARN) • REST API • Listens for new PMML models and updates from update topic • Scores model / answers queries • Writes to input topic too • No shared state; scales horizontally • Milliseconds Serving Layer Serving Layer Input Serving LayerServing LayerServing Layer Model Updates Model Serving LayerQueries Input
  • 15. 15© Cloudera, Inc. All rights reserved. Logical Architecture Serving Layer Speed Layer Batch Layer App Tier oryx-app-serving oryx-app-mllib oryx-app oryx-app-mllib oryx-app ML Tier oryx-ml oryx-ml Lambda Tier oryx-lambda-serving oryx-lambda oryx-lambda Generic Lambda-Architecture support ML-specific specialization Prebuilt recommender, clustering, classification implementations
  • 16. 16© Cloudera, Inc. All rights reserved. • Scoring on the fly is not cheap • 1M user/items ≈ 1GB heap at scale (≈ 200 features) • Feature, item count determines latency, throughput • Java 8 + 16-core 2.3GHz Xeon • Smallish models ≈ 100s QPS, 10s ms latency • Huge models ≈ Single digit QPS, 100s ms latency Recommendation Benchmarks
  • 17. 17© Cloudera, Inc. All rights reserved. • Spark 1.3.1 • MLlib • Streaming • Kafka 0.8.2.1 • Hadoop 2.6 • HDFS • YARN • JavaEE 7 • JAX-RS 2 • Jersey 2 • Servlet 3.1 • Tomcat 8 • JPMML + PMML 4.2.1 Key Technology Roster CDH 5.4+
  • 18. 18© Cloudera, Inc. All rights reserved. • Cloudera Labs project • Partial collaboration with Intel • Not shipped with CDH • Not supported, no plans to yet • 2.0.0 beta 3 • Suitable for POCs • 2.0.0 by end of year • Best For • Recommender engines • Real-time anomaly detection • Real-time classification • Problems where both scale and latency are important • CDH users Status
  • 19. 19© Cloudera, Inc. All rights reserved. Get Started in ~1 Hour http://oryx.io
  • 20. 20© Cloudera, Inc. All rights reserved. Thank you @sean_r_owen sowen@cloudera.com
  • 21. 21© Cloudera, Inc. All rights reserved. The conference for and by Data Scientists, from startup to enterprise wrangleconf.com Public registration is now open! Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more When: Thursday, October 22, 2015 Where: Broadway Studios, San Francisco