Machine learning applications are typically stitched together from hopes and dreams, shell scripts, cron jobs, home-grown schedulers, snippets of configuration clipped from multiple blog posts, thousands of hard-coded business rules, a.k.a. "our SQL corpus," and a few lines of training and testing code. Organizing all the moving parts into something maintainable and supportive of ongoing development is a challenge most teams have on their TODO list, roadmap, or tech debt pile. Getting ahead of the day-to-day demands and settling into a sane architecture often seems like an unattainable goal. The past several years have seen an explosion of tool-building in the data engineering and analytics area, including in Apache projects spanning the areas of search and information retrieval, job orchestration, file and stream formats, and machine learning libraries. In this talk we will cover our product and development teams' choices of architecture and tools, from data ingestion and storage, through transformations and processing, to presentation of results and publishing to web services, reports, and applications.
2. State of the Art in ML Development
So many tools
• Scikit-learn
• Spark MLlib
• Keras
• PyTorch
• DL4J
• Mahout
• MXNet
• SystemML
• PredictionIO
• #justRthings
• …
• Vendor solutions
• Kitchen sink
• Auto-magic
3. State of the Art in ML Development
And that is just the ML pieces; you also need:
• Data ingest
• Data engineering
• Plotting, charting
• UX/Publishing
• Sidecar functions:
• Search
• Model management/data versioning
• Monitoring/performance metrics
4. State of the Art in ML Development
All to do a glorified regression or similar
5. About Me and Why I'm Here
Corp
• Chief Data Scientist at Accenture
• Senior Director at Lucidworks
• Chief Analytics Officer at A2Go
ASF
• Mahout 0.9 release
• Committer
• PMC member
• Chair
Corp/OSS
• Bootstrapped open-source contribution program at ACN
• Similar program at A2Go
Fun
• Adversarial Learning podcast
• Sailing, snowboarding, amateur radio (KI7KQA)
6. About Me and Why I'm Here
In the course of doing
work I have seen
some bad things
7. Motivation
Moving data through the assembly line* to production requires
beating several bosses:
* There is no "assembly line" the first several times
Ingest → Clean and Transform → Train/Test/Tweak → Publish
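The four stages on this slide can be sketched as plain functions chained together. This is only an illustration under stated assumptions, not the speaker's implementation: the function names, the toy records, and the mean-threshold "model" are all hypothetical.

```python
from statistics import mean


def ingest():
    """Stand-in for reading from a file, database, or external API (toy data)."""
    return [{"spend": "10"}, {"spend": "30"}, {"spend": None}, {"spend": "50"}]


def clean_and_transform(rows):
    """Drop records with missing values and cast the rest to floats."""
    return [float(r["spend"]) for r in rows if r["spend"] is not None]


def train(values):
    """Toy 'model' (an assumption): a threshold at the mean of the values."""
    return {"threshold": mean(values)}


def publish(model, values):
    """Stand-in for writing scores to a report, web service, or application."""
    return [{"spend": v, "high": v > model["threshold"]} for v in values]


def run_pipeline():
    """Chain the stages: ingest, clean and transform, train, publish."""
    values = clean_and_transform(ingest())
    model = train(values)
    return publish(model, values)


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Keeping each stage behind its own function is what makes it possible to swap a toy stage for a real one later without touching the others.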
8. Zooming Out
Before a project begins, there are multiple other bosses to beat:
Have an Idea → Design Solution → Convince Team → Prototype → Get Priority → Get Budget
Then
9. Why Projects Fail
Things can die at any stage, but most poignantly at the end, when it's "finished"
• Results/findings/"insights" need a total rewrite or port to the "production" lang/infra
• E.g., a nice tidy model to predict customer behavior needs to be rewritten in Java to run in the "web service farm"
• Add six months!
• Priority battles! Unproven ML/AI pet project less urgent than:
• Ongoing maintenance
• Shifted business priorities
10. The Best Reason Projects Fail
No established
approach/workflow to
incorporate results into
existing infrastructure
11. ML/AI Has a Lot of Attention
In the face of these troubles, ML/AI is a stated priority of many, many, many orgs
• Leadership team: "we need an ML/AI story immediately; everyone is doing it and we are behind the competition" 🤔
• Countless teams: "we need sentiment analysis of [our medical records | social media about us | the stock market]" 😬
• "Can't machine learning fix this problem?" 🤔
• "Machine learning is a commodity now" 😂
12. ML/AI Has a Lot of Attention
Result: URGENCY
+ LARGE AND
WRONG SCOPE
13. Combatting Urgency and Bad Scope
People minimize risk by:
• Hiring consultants
• Building it all from scratch
• Buying a vendor solution (and paying their professional services team to build all the hard parts)
• Researching/benchmarking/assembling some OSS libraries/frameworks
14. Sometimes People Do Dumb Things
"Let's migrate off this vendor and use an open
source solution"
Vendor → Apache
17. Sometimes People Do Dumb Things
"But let's not keep any of the metadata about the
tables"
18. Lies We Tell Ourselves
"Let's clean up all this legacy not-invented-here (NIH) code and move to that vendor solution"
NIH → Vendor
19. Lies We Tell Ourselves
"The vendor says migration should take less
than a month"
20. Lies We Tell Ourselves
"Our IT team says they can integrate the vendor
solution next fiscal year"
21. Lies We Tell Ourselves
"Our summer intern says they think they can
write all the connectors we need by September"
22. How to Choose Tools
People think about tech/infra decisions on a 1-D spectrum:
Vendor ↔ NIH
NIH ↔ OSS
But it's a multi-dimensional problem
23. How to Choose Tools
Trade-offs
• Vendor-heavy: $$, less control
• NIH-heavy: tribal knowledge (pluses and minuses)
• OSS-heavy: configure and extend; hiring-pool advantages
24. How Not to Choose Tools
[Diagram, "A real workflow": IPython NB, Python REPL, and Jupyter; AWS EC2, S3, DB, and cron; external data APIs; bash and curl; "Data in/out" labels (twice); and a "?"]
26. Ideal Workflow
• Encourage small, low-risk prototypes
• Promote the successes to real projects/features/apps
• Avoid:
• Re-write
• IT Debate Club
• Budget Debate Club
27. The Six Moving Pieces of a Platform
[Diagram: Input Data feeding a row of API layers: Load, Data, Results, Publish, Serve]
(1) Data Engineering
(2) Analytic Jobs, Splits, Runs
(3) Output, Results, Performance
(4) Look and Feel, Scoring, Monitoring, Display of (1) - (3)
(5) All APIs
(6) Packaging (1) - (5) for deployment
Skill Sets:
(1) Spark, SQL, Python, Linux, Databases, Key-Value Stores
(2) Python, Spark, SQL
(3) Python, SQL
(4) React, HTML, JavaScript, CSS
(5) React, Python, Linux, Redis
(6) Docker, Chef, Jenkins/Travis
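One way to read the API layers in this diagram is that each stage talks to its neighbors only through a small, stable interface. The sketch below illustrates that idea and nothing more: `ApiLayer`, `run_stage`, the in-memory dictionary store, and the sample data are hypothetical and are not taken from the talk.

```python
from typing import Any, Callable, Dict


class ApiLayer:
    """Hypothetical in-memory stand-in for one API layer in the diagram."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._store: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store a value under a key for downstream stages to read."""
        self._store[key] = value

    def get(self, key: str) -> Any:
        """Fetch a value previously stored under a key."""
        return self._store[key]


def run_stage(source: ApiLayer, sink: ApiLayer, key: str,
              fn: Callable[[Any], Any]) -> None:
    """Read from the upstream layer, apply the stage's work, write downstream."""
    sink.put(key, fn(source.get(key)))


load, data, results = ApiLayer("load"), ApiLayer("data"), ApiLayer("results")
load.put("orders", [3, None, 5, 4])

# (1) Data engineering: drop missing values.
run_stage(load, data, "orders", lambda rows: [r for r in rows if r is not None])
# (2) Analytic job: compute the average of the cleaned values.
run_stage(data, results, "orders", lambda rows: sum(rows) / len(rows))

print(results.get("orders"))
```

Because each stage only sees `get` and `put`, the storage behind a layer could change (a database, a key-value store) without rewriting the stages on either side.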
28. The Six Moving Pieces of a Platform
[Same diagram and skill sets as slide 27, with the added label "UI/UX"]
29. The Six Moving Pieces of a Platform
[Same diagram and skill sets as slide 27, with the added label "DevOps"]
30. The Six Moving Pieces of a Platform
[Same diagram and skill sets as slide 27, with the added label "Data Sci/ML"]
31. Encourage/Enforce Good Behavior
• Central notebook repository (e.g., Apache Zeppelin)
• Quick dashboard prototyping (e.g., Apache Superset, Zeppelin)
• Use a model server (e.g., Apache PredictionIO)
• APIs for all stages
• Code reviews
• Unit and integration tests
• "Definition of done"
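As a small illustration of the "unit and integration tests" bullet, the sketch below tests one pipeline step in isolation and then two steps together, using Python's standard `unittest` module. The `clean` and `score` functions are hypothetical stand-ins for real pipeline stages, not code from the talk.

```python
import unittest


def clean(rows):
    """Hypothetical transform under test: drop missing values, cast to float."""
    return [float(r) for r in rows if r is not None]


def score(values, threshold):
    """Hypothetical scoring step: flag values above a threshold."""
    return [v > threshold for v in values]


class CleanUnitTest(unittest.TestCase):
    """Unit test: exercises one step on its own."""

    def test_drops_missing_values(self):
        self.assertEqual(clean(["1", None, "2.5"]), [1.0, 2.5])


class PipelineIntegrationTest(unittest.TestCase):
    """Integration test: exercises two steps chained together."""

    def test_clean_then_score(self):
        flags = score(clean(["1", None, "5"]), threshold=2.0)
        self.assertEqual(flags, [False, True])


def run_tests():
    """Build and run both test cases, returning the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(CleanUnitTest),
        loader.loadTestsFromTestCase(PipelineIntegrationTest),
    ])
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

A passing run of checks like these is one concrete item a team could put in its "definition of done."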
34. Getting Involved in Open Source
• Fix documentation problems as you're using it
• Fix bugs
• Add features
• Make it an internal team effort
• Grow skills
• Adapt the software to real-life demands
• Give back