Maintainable Machine
Learning Products
ApacheCon Roadshow, Chicago 2019
Andrew Musselman
akm@apache.org
State of the Art in ML Development
So many tools
• Scikit-learn
• Spark MLlib
• Keras
• PyTorch
• DL4J
• Mahout
• MXNet
• SystemML
• PredictionIO
• #justRthings
• …
• Vendor solutions
• Kitchen sink
• Auto-magic
And that is just for the ML pieces; also need:
• Data ingest
• Data engineering
• Plotting, charting
• UX/Publishing
• Sidecar functions:
  • Search
  • Model management/data versioning
  • Monitoring/performance metrics
All to do a glorified regression or similar
About Me and Why I'm Here
Corp
• Chief Data Scientist at Accenture
• Senior Director at Lucidworks
• Chief Analytics Officer at A2Go
ASF
• Mahout 0.9 release
• Committer
• PMC member
• Chair
Corp/OSS
• Bootstrapped open-source contribution program at ACN
• Similar program to A2Go
Fun
• Adversarial Learning podcast
• Sailing, snowboarding, amateur radio (KI7KQA)
In the course of doing work I have seen some bad things
Motivation
Moving data through the assembly line* to production requires
beating several bosses:
* There is no "assembly line" the first several times
Ingest → Clean and Transform → Train/Test/Tweak → Publish
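The four stages can be sketched as small, decoupled Python functions. This is only an illustration under assumptions: the function names, the toy "x,y" input, and the mean-value "model" are all hypothetical and not taken from the talk.

```python
import json
import statistics


def ingest(raw_lines):
    """Ingest: split raw 'x,y' text lines into lists of strings."""
    return [line.split(",") for line in raw_lines if line.strip()]


def clean_and_transform(rows):
    """Clean and Transform: keep rows that parse as exactly two floats."""
    cleaned = []
    for row in rows:
        try:
            x, y = (float(v) for v in row)
        except ValueError:
            continue  # malformed row: wrong field count or non-numeric value
        cleaned.append((x, y))
    return cleaned


def train_test_tweak(rows, holdout=0.25):
    """Train/Test/Tweak: fit a trivial mean-of-y model, score it on a holdout."""
    split = max(1, int(len(rows) * (1 - holdout)))
    train, test = rows[:split], rows[split:]
    mean_y = statistics.fmean([y for _, y in train])
    mae = statistics.fmean([abs(y - mean_y) for _, y in test]) if test else None
    return {"mean_y": mean_y, "holdout_mae": mae}


def publish(model):
    """Publish: serialize the result for a downstream consumer."""
    return json.dumps(model, sort_keys=True)


raw = ["1,2", "2,4", "bad,row", "3,6", "4,8"]
report = publish(train_test_tweak(clean_and_transform(ingest(raw))))
print(report)
```

Because each stage only depends on the output of the previous one, any stage can be replaced or tested on its own.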
Zooming Out
Before a project begins, there are multiple other bosses to beat:
Have an Idea → Design Solution → Convince Team → Prototype → Get Priority → Get Budget → Then the assembly line
Why Projects Fail
Things can die at any stage, but most poignantly at the end,
when it's "finished"
• Results/findings/"insights" need a total re-write or port to "production" lang/infra
• E.g., a nice tidy model to predict customer behavior needs to be re-written in Java to run in the "web service farm"
• Add six months!
• Priority battles! Unproven ML/AI pet project less urgent than:
• Ongoing maintenance
• Shifted business priorities
The Best Reason Projects Fail
No established approach/workflow to incorporate results into existing infrastructure
ML/AI Has a Lot of Attention
In the face of these troubles, ML/AI is a stated priority of many, many, many orgs
• Leadership team: "we need an ML/AI story immediately; everyone is doing it and we are behind the competition" 🤔
• Countless teams: "we need sentiment analysis of [our medical records | social media about us | the stock market]" 😬
• "Can't machine learning fix this problem?" 🤔
• "Machine learning is a commodity now" 😂
Result: URGENCY + LARGE AND WRONG SCOPE
Combatting Urgency and Bad Scope
People minimize risk by:
• Hiring consultants
• Building it all from scratch
• Buying a vendor solution (and paying their professional services team to build all the hard parts)
• Researching/benchmarking/assembling some OSS libraries/frameworks
Sometimes People Do Dumb Things
"Let's migrate off this vendor and use an open source solution" (Vendor → Apache)
"Dump the entire Teradata warehouse"
"Into HDFS"
"But let's not keep any of the metadata about the tables"
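One way to avoid losing table metadata in a dump is to write a schema file next to every exported table. The sketch below is a hedged illustration, not the speaker's method: an in-memory SQLite database stands in for the warehouse, a temporary directory stands in for HDFS, and the function name `export_table_with_schema` and the `.schema.json` sidecar convention are hypothetical.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def export_table_with_schema(conn, table, out_dir):
    """Write <table>.csv plus a <table>.schema.json sidecar describing its columns.

    `table` must be a trusted identifier; it is interpolated into SQL.
    """
    # PRAGMA table_info is SQLite-specific; a real warehouse has its own catalog.
    columns = [
        {"name": name, "type": col_type, "nullable": not notnull}
        for _cid, name, col_type, notnull, _default, _pk
        in conn.execute(f"PRAGMA table_info({table})")
    ]
    out_dir = Path(out_dir)
    with open(out_dir / f"{table}.csv", "w", newline="") as f:
        csv.writer(f).writerows(conn.execute(f"SELECT * FROM {table}"))
    (out_dir / f"{table}.schema.json").write_text(json.dumps(columns, indent=2))
    return columns


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER NOT NULL, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

with tempfile.TemporaryDirectory() as tmp:
    schema = export_table_with_schema(conn, "customers", tmp)
    files = sorted(p.name for p in Path(tmp).iterdir())

print(files)
print(schema)
```

The data file alone would not say which columns are typed as integers or which allow nulls; the sidecar keeps that information with the export.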
Lies We Tell Ourselves
"Let's clean up all this legacy not invented here (NIH) code and move to that vendor solution" (NIH → Vendor)
"The vendor says migration should take less than a month"
"Our IT team says they can integrate the vendor solution next fiscal year"
"Our summer intern says they think they can write all the connectors we need by September"
How to Choose Tools
People think about tech/infra decisions on a 1-D spectrum:
Vendor ↔ NIH
NIH ↔ OSS
But it's a multi-dimensional problem
Trade-offs
• Vendor-heavy: $$, less control
• NIH-heavy: tribal knowledge pluses and minuses
• OSS-heavy: config and extend, hiring pool pluses
How Not to Choose Tools
Diagram, captioned "A real workflow": IPython NB, Python REPL, and Jupyter; AWS EC2, S3, DB, and cron; external data APIs; bash and curl; data in/out between components; one box marked "?"
Ideal Workflow
Each phase is decoupled, so it is easier to maintain and to hire for the skills it needs
• Encourage small, low-risk prototypes
• Promote the successes to real projects/features/apps
• Avoiding:
  • Re-write
  • IT Debate Club
  • Budget Debate Club
The Six Moving Pieces of a Platform
Input Data
API layers: Load, Data, Results, Publish, Serve
(1) Data Engineering
(2) Analytic Jobs, Splits, Runs
(3) Output, Results, Performance
(4) Look and Feel, Scoring, Monitoring, Display of (1) - (3)
(5) All APIs
(6) Packaging (1) - (5) for deployment
Skill Sets:
(1) Spark, SQL, Python, Linux, Databases, Key-Value Stores
(2) Python, Spark, SQL
(3) Python, SQL
(4) React, HTML, JavaScript, CSS
(5) React, Python, Linux, Redis
(6) Docker, Chef, Jenkins/Travis
The same diagram repeats three times, labeled in turn UI/UX, DevOps, and Data Sci/ML.
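One possible reading of the platform diagram is that each piece talks to its neighbors only through a narrow API, so a team can replace its own piece without touching the others. The sketch below illustrates that idea under assumptions: the `Platform` class, its three callables, and the toy jobs are hypothetical names, and it collapses the five API layers into three function boundaries for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]
Results = Dict[str, float]


@dataclass
class Platform:
    # Each field is a boundary that another team can implement independently.
    load: Callable[[List[str]], List[Record]]   # input data -> data engineering
    analyze: Callable[[List[Record]], Results]  # engineered data -> analytic job
    publish: Callable[[Results], str]           # results -> display

    def run(self, raw: List[str]) -> str:
        return self.publish(self.analyze(self.load(raw)))


def load_numbers(lines: List[str]) -> List[Record]:
    return [{"value": float(line)} for line in lines]


def mean_job(records: List[Record]) -> Results:
    values = [r["value"] for r in records]
    return {"mean": sum(values) / len(values)}


def max_job(records: List[Record]) -> Results:
    return {"max": max(r["value"] for r in records)}


def as_text(results: Results) -> str:
    return ", ".join(f"{k}={v:.1f}" for k, v in sorted(results.items()))


platform = Platform(load=load_numbers, analyze=mean_job, publish=as_text)
first = platform.run(["1", "2", "3"])

# Swap only the analytic job; loading and publishing are untouched.
platform.analyze = max_job
second = platform.run(["1", "2", "3"])

print(first)
print(second)
```

Swapping `mean_job` for `max_job` changes one field and nothing else, which is the maintenance property that decoupled phases are meant to give.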
Encourage/Enforce Good Behavior
• Central notebook repository (e.g., Apache Zeppelin)
• Quick dashboard prototyping (e.g., Apache Superset, Zeppelin)
• Use a model server (e.g., Apache PredictionIO)
• APIs for all stages
• Code reviews
• Unit and integration tests
• "Definition of done"
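As an illustration of the unit-test item, here is a minimal sketch using Python's standard `unittest` module. The `normalize` function is a hypothetical data-preparation step invented for this example; the point is only that each stage of a pipeline can carry its own small test suite.

```python
import unittest


def normalize(values):
    """Scale values to the range [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)  # raises ValueError on empty input
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class NormalizeTest(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_does_not_divide_by_zero(self):
        self.assertEqual(normalize([5, 5]), [0.0, 0.0])

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            normalize([])


# Run the suite programmatically so the example works in a script or notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A "definition of done" could then include a rule such as "the stage's tests pass in CI before promotion", though the exact rule is a team decision.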
Future State
Productivity at scale!
Getting Involved in Open Source
• Fix documentation problems as you're using it
• Fix bugs
• Add features
• Make it an internal team effort
• Grow skills
• Adapt the software to real-life demands
• Give back
Thank You
Q&A

  • 2. State of the Art in ML Development So many tools • Scikit-learn • Spark MLLib • Keras • PyTorch • DL4J • Mahout • MXNet • SystemML • PredictionIO • #justRthings • … • Vendor solutions • Kitchen sink • Auto-magic
  • 3. State of the Art in ML Development And that is just for the ML pieces; also need: • Data ingest • Data engineering • Plotting, charting • UX/Publishing • Sidecar functions: • Search • Model management/data versioning • Monitoring/performance metrics
  • 4. State of the Art in ML Development All to do a glorified regression or similar
  • 5. About Me and Why I'm Here Corp • Chief Data Scientist at Accenture • Senior Director at Lucidworks • Chief Analytics Officer at A2Go ASF • Mahout 0.9 release • Committer • PMC member • Chair Corp/OSS • Bootstrapped open-source contribution program at ACN • Similar program to A2Go Fun • Adversarial Learning podcast • Sailing, snowboarding, amateur radio (KI7KQA)
  • 6. About Me and Why I'm Here In the course of doing work I have seen some bad things
  • 7. Motivation Moving data through the assembly line* to production requires beating several bosses: Ingest → Clean and Transform → Train/Test/Tweak → Publish * There is no "assembly line" the first several times
  • 8. Zooming Out Before a project begins, there are multiple other bosses to beat: Have an Idea | Design Solution | Convince Team | Prototype | Get Priority | Get Budget | Then
  • 9. Why Projects Fail Things can die at any stage, but most poignantly at the end, when it's "finished" • Results/findings/"insights" need a total re-write or port to "production" lang/infra • E.g., a nice tidy model to predict customer behavior needs to be re-written in Java to run in the "web service farm" • Add six months! • Priority battles! Unproven ML/AI pet project less urgent than: • Ongoing maintenance • Shifted business priorities
  • 10. The Best Reason Projects Fail No established approach/workflow to incorporate results into existing infrastructure
  • 11. ML/AI Has a Lot of Attention In the face of these troubles, ML/AI is a stated priority of many, many, many orgs • Leadership team: "we need an ML/AI story immediately; everyone is doing it and we are behind the competition" 🤔 • Countless teams: "we need sentiment analysis of [our medical records | social media about us | the stock market]" 😬 • "Can't machine learning fix this problem?" 🤔 • "Machine learning is a commodity now" 😂
  • 12. ML/AI Has a Lot of Attention Result: URGENCY + LARGE AND WRONG SCOPE
  • 13. Combatting Urgency and Bad Scope People minimize risk by: • Hiring consultants • Building it all from scratch • Buying a vendor solution (and paying their professional services team to build all the hard parts) • Researching/benchmarking/assembling some OSS libraries/frameworks
  • 14. Sometimes People Do Dumb Things "Let's migrate off this vendor and use an open source solution" Vendor Apache
  • 15. Sometimes People Do Dumb Things "Dump the entire Teradata warehouse"
  • 16. Sometimes People Do Dumb Things "Into HDFS"
  • 17. Sometimes People Do Dumb Things "But let's not keep any of the metadata about the tables"
  • 18. Lies We Tell Ourselves "Let's clean up all this legacy not invented here (NIH) code and move to that vendor solution" NIH Vendor
  • 19. Lies We Tell Ourselves "The vendor says migration should take less than a month"
  • 20. Lies We Tell Ourselves "Our IT team says they can integrate the vendor solution next fiscal year"
  • 21. Lies We Tell Ourselves "Our summer intern says they think they can write all the connectors we need by September"
  • 22. How to Choose Tools People think about tech/infra decisions on a 1-D spectrum (Vendor ↔ NIH, NIH ↔ OSS), but it's a multi-dimensional problem
  • 23. How to Choose Tools Trade-offs • Vendor-heavy: $$, less control • NIH-heavy: tribal knowledge +s and –s • OSS-heavy: config and extend, hiring pool +s
  • 24. How Not to Choose Tools A real workflow (diagram labels): IPython NB | Python REPL | Jupyter | AWS EC2, S3, DB, cron | External Data APIs | ? | bash and curl | Data in/out
  • 25. Ideal Workflow Each phase decoupled and easier to maintain, hire for skills
  • 26. Ideal Workflow • Encourage small, low-risk prototypes • Promote the successes to real projects/features/apps • Avoiding: • Re-write • IT Debate Club • Budget Debate Club
  • 27. The Six Moving Pieces of a Platform Diagram labels: Input Data; Load API Layer; Data API Layer; Results API Layer; Publish API Layer; Serve API Layer. Pieces: (1) Data Engineering; (2) Analytic Jobs, Splits, Runs; (3) Output, Results, Performance; (4) Look and Feel, Scoring, Monitoring, Display of (1) - (3); (5) All APIs; (6) Packaging (1) - (5) for deployment. Skill Sets: (1) Spark, SQL, Python, Linux, Databases, Key-Value Stores; (2) Python, Spark, SQL; (3) Python, SQL; (4) React, HTML, JavaScript, CSS; (5) React, Python, Linux, Redis; (6) Docker, Chef, Jenkins/Travis
  • 28. The Six Moving Pieces of a Platform (same diagram and skill sets as slide 27) UI/UX
  • 29. The Six Moving Pieces of a Platform (same diagram and skill sets as slide 27) DevOps
  • 30. The Six Moving Pieces of a Platform (same diagram and skill sets as slide 27) Data Sci/ML
  • 31. Encourage/Enforce Good Behavior • Central notebook repository (e.g., Apache Zeppelin) • Quick dashboard prototyping (e.g., Apache Superset, Zeppelin) • Use a model server (e.g., Apache PredictionIO) • APIs for all stages • Code reviews • Unit and integration tests • "Definition of done"
  • 34. Getting Involved in Open Source • Fix documentation problems as you're using it • Fix bugs • Add features • Make it an internal team effort • Grow skills • Adapt the software to real-life demands • Give back
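
The "Ideal Workflow" and "APIs for all stages" slides can be illustrated with a small sketch. This is not code from the talk: the function names (ingest, clean, train, publish, run_pipeline), the "x,y" input format, and the rule that drops negative x values are all hypothetical, chosen only to show each phase sitting behind its own small interface so it can be tested and swapped independently. The model is a plain one-feature least-squares fit, in the spirit of "a glorified regression".

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, float]


def ingest(raw_lines: List[str]) -> List[Row]:
    """Load stage (hypothetical format): parse 'x,y' text lines into rows."""
    rows = []
    for line in raw_lines:
        x, y = line.split(",")
        rows.append({"x": float(x), "y": float(y)})
    return rows


def clean(rows: List[Row]) -> List[Row]:
    """Clean/transform stage: drop rows with negative x (an assumed rule)."""
    return [r for r in rows if r["x"] >= 0]


@dataclass
class LinearModel:
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


def train(rows: List[Row]) -> LinearModel:
    """Train stage: ordinary least squares on a single feature."""
    n = len(rows)
    mean_x = sum(r["x"] for r in rows) / n
    mean_y = sum(r["y"] for r in rows) / n
    sxx = sum((r["x"] - mean_x) ** 2 for r in rows)
    sxy = sum((r["x"] - mean_x) * (r["y"] - mean_y) for r in rows)
    slope = sxy / sxx
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)


def publish(model: LinearModel) -> Dict[str, float]:
    """Publish stage: expose the fitted parameters as a plain dict."""
    return {"slope": model.slope, "intercept": model.intercept}


def run_pipeline(raw_lines: List[str]) -> Dict[str, float]:
    """Compose the stages; each one can be replaced without touching the rest."""
    return publish(train(clean(ingest(raw_lines))))


if __name__ == "__main__":
    # Three points on y = 2x + 1, plus one row the clean stage removes.
    print(run_pipeline(["0,1", "1,3", "2,5", "-1,99"]))
```

Because each stage takes and returns plain data, a unit test can target one stage (for example, clean) while an integration test exercises run_pipeline end to end, which is one way to read the "Unit and integration tests" bullet on slide 31.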