From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera

1
One Other Thing About Me

2
Data Science: Another Definition

3
Data Scientists Build Data Products.

4
A Shift In Perspective

Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine

Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions

5
All* Products Become Data Products

6
Identifying the Bottlenecks

7
Oryx: Model Building and Serving
• Algorithms
  • ALS Recommenders
  • K-Means Parallel
  • RDF
• Batch model building via MapReduce*
• Server for real-time scoring and updates
• PMML 4.1 Models

8
Oryx Design

9
Generational Thinking

10
The Limits of Our Models

11
Space Exploration

12
Data Science Needs DevOps

13
Introducing Gertrude
• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request

14
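One way to picture "multiple independent experiments on every request" is layered diversion in the spirit of Tang et al. (2010): each layer hashes the request with its own salt, so assignment in one layer does not determine assignment in another. The sketch below is only an illustration under that assumption. The layer names, experiment names, bucket ranges, and the `bucket` and `assign` helpers are hypothetical and are not Gertrude's API.

```python
import hashlib

# Hypothetical layers: each layer splits 1000 buckets among its experiments.
LAYERS = {
    "ranking": [("rank_control", 0, 500), ("rank_new_model", 500, 1000)],
    "ui": [("ui_control", 0, 900), ("ui_big_button", 900, 1000)],
}


def bucket(request_id: str, layer: str, num_buckets: int = 1000) -> int:
    """Map a request to a bucket, salted by layer so layers divert independently."""
    digest = hashlib.sha256(f"{layer}:{request_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def assign(request_id: str) -> dict:
    """Return one experiment per layer for this request."""
    assignments = {}
    for layer, experiments in LAYERS.items():
        b = bucket(request_id, layer)
        for name, lo, hi in experiments:
            if lo <= b < hi:
                assignments[layer] = name
                break
    return assignments
```

Because the hash is deterministic, the same request identifier always lands in the same experiments, while the per-layer salt keeps the ranking and UI assignments from moving together.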
Simple Conditional Logic
• Declare experiment flags in compiled code
  • Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion

15
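The split between flags declared in code and rules kept in a config file could look roughly like the sketch below. It assumes a made-up rule format (a list of `when`/`value` entries per flag) and treats the diverted experiment as just another request attribute. `Flag`, `CONFIG`, and `flag_value` are hypothetical names, not Gertrude's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    """A setting declared in code, with a default used when no rule matches."""
    name: str
    default: object


# Flags live in compiled code...
MAX_RESULTS = Flag("max_results", 10)
USE_NEW_RANKER = Flag("use_new_ranker", False)

# ...while the rules that compute their values live in config (shown inline here).
CONFIG = {
    "max_results": [{"when": {"country": "US"}, "value": 20}],
    "use_new_ranker": [{"when": {"experiment": "rank_new_model"}, "value": True}],
}


def flag_value(flag: Flag, request: dict, config: dict = CONFIG):
    """Return the value of the first rule whose conditions all match the request."""
    for rule in config.get(flag.name, []):
        if all(request.get(k) == v for k, v in rule["when"].items()):
            return rule["value"]
    return flag.default
```

For example, `flag_value(MAX_RESULTS, {"country": "US"})` uses the configured rule, while a request from any other country falls back to the default declared in code.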
Separate Data Push from Code Push
• Validate config files and push updates to servers
  • ZooKeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations

16
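For the file-based variant, a server-side reload loop might be sketched as follows: poll the file's modification time, validate the candidate config, and swap it in only if validation passes. This is an assumption-laden illustration. The JSON rule format, `validate`, and `ConfigHolder` are hypothetical, and a ZooKeeper-backed version would react to watch notifications instead of polling.

```python
import json
import os
import threading


def validate(config) -> None:
    """Reject configs that do not match the (hypothetical) rule format."""
    if not isinstance(config, dict):
        raise ValueError("config must be a mapping of flag name to rules")
    for flag, rules in config.items():
        if not isinstance(rules, list):
            raise ValueError(f"rules for {flag!r} must be a list")
        for rule in rules:
            if not isinstance(rule, dict) or "when" not in rule or "value" not in rule:
                raise ValueError(f"rule for {flag!r} needs 'when' and 'value'")


class ConfigHolder:
    """Holds the active config and replaces it only with validated updates."""

    def __init__(self, path: str):
        self._path = path
        self._lock = threading.Lock()
        self._mtime = None
        self._config = {}

    def current(self) -> dict:
        with self._lock:
            return self._config

    def reload_if_changed(self) -> bool:
        """Load the file if its mtime changed; keep the old config on failure."""
        mtime = os.stat(self._path).st_mtime_ns
        if mtime == self._mtime:
            return False
        with open(self._path, encoding="utf-8") as f:
            candidate = json.load(f)
        validate(candidate)  # raises before the swap, so a bad push is ignored
        with self._lock:
            self._config = candidate
            self._mtime = mtime
        return True
```

Because validation happens before the swap, a malformed push leaves the previous experiment space and flag rules in place, which is the point of decoupling data pushes from code pushes.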
The Experiments Dashboard

17
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

18
Thank you!
Josh Wills, Director of Data Science, Cloudera

@josh_wills

Cloudera User Group - From the Lab to the Factory
