Data Science @ PMI
Tools of The Trade
Best Practices to Start, Develop and Ship a Data Science Product
Manuel Valverde
Tokyo WebHack, 17th January 2019
• PhD.@Granada U. Spain: Physics modelling and MC simulations for
SuperKamiokande
• PostDoc@Osaka U. Osaka: Nuclear Structure Calculations. Think Gaussian
processes
• DataScientist@Rakuten, Tokyo: Search Relevancy for e-commerce
• DataScientist@PMI, Tokyo: Fraud prevention
About Me
About Philip Morris International
• Founded in 1847
• No. 108 in the 2018 Fortune 500
• 80,000 employees, 180+ markets, 150M consumers
• 6 of the world's top 15 international brands
• Shifting from combustible cigarettes to smoke-free, reduced-risk products (RRP)
https://www.pmi.com/smoke-free-products
• We are part of PMI's Enterprise Analytics and Data (EAD) group
• 40+ Data Scientists across 4 hubs
• Offices in Amsterdam (NL), Kraków (PL), Lausanne (CH) and Tokyo (JP)
• Profiles
• Education: 30% PhD, 70% MSc/BSc
• Data Science Experience: 7.4 yrs on average
• Experience in PMI: 88% under 2yrs
• Expertise in Machine Learning, Big Data Engineering, Insights Communication
• SCRUM certified (Professional Scrum Developer)
Data Science @ PMI
[Figure: world map of Data Science labs by region (LA, North America, EU, EE, Asia), showing existing labs and planned additions]
A. ( Data x Science x Communication ) = Insight
Data is only one part of the equation. We bring the scientific method. It materializes in the analytical code we write. It is as valuable as the data itself.
B. We are business driven
Whatever we do, it contributes to the business. We are diligent about making an impact.
C. We invest in people
We invest in the ability to ask questions. It can’t be achieved with tools only. Tools are for generating answers, but questions are posed by people.
D. We self-organize
We choose coordination & cultivation over command & control. We believe this approach allows for the best solutions to emerge.
E. We iterate and improve
We embrace lean development, we learn from mistakes and we do it together with business.
F. We co-create
A data insights ecosystem requires collaboration among all parties. We want to be active contributors.
Data Science Principles @ PMI
PMI’s Data Ocean
Why are we here?
Because a data scientist is not just someone who knows more statistics than a programmer
Data Science is Software.
The product of a Data Science effort (a Model or a Report)
is essentially a small but critical part of a large,
sophisticated business software. Data Products must
therefore be designed to play well with systems up- and
downstream.
Remember that the system can work without a model, but
a model is pretty much worthless without the system.
Writing code for implementing machine learning
algorithms is getting easier every year.
Building a scikit-learn Pipeline to implement a Random
Forest model with GridSearch is less than twenty lines of
code today. AutoML is around the corner.
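As a rough illustration of the "less than twenty lines" claim, here is a minimal sketch. The synthetic dataset, the step names (`scale`, `forest`) and the parameter grid are assumptions chosen for illustration; they are not taken from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data (assumption: stands in for real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pipeline: preprocessing step followed by the Random Forest.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Grid keys use the "<step name>__<parameter>" convention.
param_grid = {"forest__n_estimators": [20, 50], "forest__max_depth": [3, None]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)  # F1 on held-out data
```

The point of the slide still stands: the modelling code is short, and most of the effort goes into the system around it.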
We need to acknowledge and understand two things:
• The code, or even the model, is not our end goal.
• We're in the business of building intelligent applications, or data products.
Why are we here?
Because a data scientist is not just someone who knows more programming than a statistician
• Obtain: connect to DBs, download flat files
• Scrub: outliers/missing data, aggregations
• Explore: statistical analysis, feature engineering
• Model: learning algorithms, parameter optimization
• INdustrialize: reports, APIs
[Figure labels: Exploratory, Production, Smart Application]
An OSEMN Data Science Process
Explore, Model, Iterate.
Create a Data Product.
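The five OSEMN stages can be sketched as small functions chained together. This is only a toy under stated assumptions: the CSV content, the outlier cutoff `max_sales` and the mean-based "model" are all made up for illustration.

```python
import csv
import io
import statistics

# Made-up raw data: one missing value (day 2) and one outlier (day 4).
RAW_CSV = "day,sales\n1,10\n2,\n3,14\n4,400\n5,12\n"


def obtain(text):
    """Obtain: read a flat file (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows, max_sales=100.0):
    """Scrub: drop rows with missing values or implausible outliers."""
    clean = []
    for row in rows:
        if row["sales"] == "":
            continue
        sales = float(row["sales"])
        if sales <= max_sales:
            clean.append({"day": int(row["day"]), "sales": sales})
    return clean


def explore(rows):
    """Explore: compute simple summary statistics."""
    values = [r["sales"] for r in rows]
    return {"n": len(values), "mean": statistics.mean(values)}


def model(rows):
    """Model: a deliberately trivial model that always predicts the mean."""
    mean = statistics.mean([r["sales"] for r in rows])
    return lambda day: mean


def industrialize(predict, day):
    """iNdustrialize: wrap the prediction in a report-like payload."""
    return {"day": day, "predicted_sales": predict(day)}


rows = scrub(obtain(RAW_CSV))
summary = explore(rows)
report = industrialize(model(rows), day=6)
```

Keeping each stage a separate function makes it easier to move a stage from exploratory code into a tested module later.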
We define a data product as a system that
• takes raw data as input, 📲
• applies a machine-learned model to it, 🤖
• produces data as output to another system 💻
Additionally, a data product must
• be dynamic and maintainable, allowing periodic updates 🏃
• be responsive, performant and scalable
What is a Data Product? 🤔
In a nutshell, it’s a software product with an ML Engine.
Examples
Amazon’s Product Recommendation Engine
LinkedIn’s “People You May Know”
Autonomous Vehicles
The Classic Data Science Workflow
Data Product Development Workflow
Challenges in Data Product Development 🤔
“Team programming isn’t a divide and conquer problem.
It is a divide, conquer, and integrate problem.”
1. The Process
Infrastructure Setup > Code > Build > Test >
Package > Release > Monitor
2. The Team
Cross-functional group of businesspeople, data
scientists, engineers and developers.
3. The Challenge
As an example, consider two groups:
• Team A consists of data engineers and scientists and works on the Prediction Engine.
• Team B consists of software engineers and front-end developers working on the UI.
The goal is that every piece in the product should integrate well into a larger codebase. 🍻
Continuous Integration (CI), Delivery (CD) and Deployment
Development practices for overcoming integration challenges and moving faster to delivery
The CI/CD Cycle
• Continuous Integration requires multiple developers to integrate code into a shared repository frequently. Requested merges are automatically tested and reviewed.
  • Enabled by git-flow, code standards and automated testing
• Continuous Delivery makes sure that the code that we integrate is always in a deploy-ready state.
  • Enabled by agile (iterative) methods, testing and build automation
• Continuous Deployment is the actual act of pushing updates out to the user: think of your iPhone apps or desktop browser that prompt for updates to be installed periodically.
The Role of Data Scientists
Learn best practices to contribute effectively to data products
Write code that is
• Readable, so others can understand and add to it
• Testable, so others can verify it does what it advertises and integrate it into their work
• Reusable, so it may be included in other projects
• Reproducible, so it uses libraries/packages that are available on production environments
• Usable: don’t write code in SAS or R, because most engineers don’t speak those languages
Joel’s Tests
• Do you use source control?
• Can you make a build in one step?
• Do you make daily builds?
• Do you have a bug database?
• Do you fix bugs before writing new code?
Data Science Best Practices @ PMI
• Python Style Guides
• Notebooks to Modules
• Testing
• Code Reviews
• Docker
• Virtual Environments
• Version Control
• Project Templates
Agile Data Science Workflow
Our building blocks
Ocean Components
To create a workflow that is …
Our Vision
• Flexible
Adapts to specific needs of every use-case
Accommodates changing requirements
• Inspectable
Transparency at all times
Artifacts can be audited at any time
• Reproducible
Out-of-the-box dependency management
No more ‘But-it-works-on-my-machine’ or ‘Please-industrialize-this-model’
• Easy to use
Frictionless development experience
Freedom to experiment
Some things we always need to be mindful of.
Our Principles
• Sensitive Data must never leave the Ocean
• Restricted Open-Source libraries must be avoided
• Every use-case must be industrialization-ready
[Figure: system architecture diagram. Labels: DS Prod Lab, Scanned by BlackDuck, Automation, On-demand infrastructure, Data Read/Write, Data Product, Reproducible Containers, Version Control]
System Architecture
The dots, connected.
We organize our workflow in 3 phases – Start, Develop and Ship
3 Steps to a Data Product
Start
• Get Infrastructure
  • DS Prod Lab
  • Docker Container
  • Python Environments
• Get Data
  • Flat Files
  • Database Connections
• Get Code
  • Project repo
  • Cookiecutter template
Develop
• Start Docker container
• Check out a branch
• For each task in OSEMN, write exploratory code in notebooks
• Standard code styles
• Documentation, tests
• Maintain dependencies
• Refactor into modules
• Push
• Review, merge
Ship
• Package Python code, publish to PyPI on Artifactory
• Persist models
• Build an API to industrialize the model
• Provide endpoints for app-health checks
• Set up Jenkins pipeline for continuous integration
• Plan for the next iteration
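The Ship phase mentions an API around the model plus an endpoint for app-health checks. A minimal standard-library sketch of that idea follows. The paths `/health` and `/predict`, the `predict` rule and the port are hypothetical, and a real service would load a persisted model and likely use a web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a persisted model: flags large amounts (made-up rule)."""
    return {"fraud": features.get("amount", 0) > 1000}


def route(method, path, body=b""):
    """Pure routing logic returning (status_code, payload); easy to unit test."""
    if method == "GET" and path == "/health":
        return 200, {"status": "ok"}
    if method == "POST" and path == "/predict":
        features = json.loads(body or b"{}")
        return 200, predict(features)
    return 404, {"error": "not found"}


class ModelHandler(BaseHTTPRequestHandler):
    """Thin HTTP layer that delegates every request to route()."""

    def _respond(self, status, payload):
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._respond(*route("GET", self.path))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(*route("POST", self.path, self.rfile.read(length)))


def serve(port=8000):
    """Run the API until interrupted (hypothetical default port)."""
    HTTPServer(("127.0.0.1", port), ModelHandler).serve_forever()
```

Calling `serve()` starts the service. Keeping `route` free of HTTP details means the health check and prediction logic can be covered by ordinary unit tests in the CI pipeline.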
For Reproducibility
Docker Containers
Docker for Containerized Data Science
All your dependencies in one place.
Code guaranteed to run anywhere.
A container is a lightweight, stand-alone software package that includes everything needed to run it: code, runtime, system tools, system libraries, settings.
Containerized software will always run the same, regardless of the environment.
Benefits for Data Scientists
• Freedom: install all your favorite tools and libraries
• Ease of installation: set up your toolbox once and it will always work
• Reproducibility and Portability: your development environment can be reproduced anywhere
• Isolation: your Py2 setup doesn’t mess up your Py3 setup, and installing a new library doesn’t mess up system Python
• Speed: get up and running in minutes with images optimized for specific applications like time-series analysis or deep learning
For organization and predictability
Project Templates
CookieCutter
Everything has a place and a purpose
The idea is borrowed from popular web-frameworks like Rails and Django
where each developer uses the same template when starting a new project.
This makes it easier for everyone on the team to figure out where they
would find or put the various moving parts.
We will use a standard project skeleton tailored for data science projects so that every scientist knows where to put their code, notebooks, data, models, figures and references.
Benefits of a standardized directory structure:
• allows people to collaborate more easily
• empowers reproducible analysis
• enforces a "data as immutable" design philosophy
Cookiecutters help us generate this folder structure automatically.
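To make the idea concrete, the sketch below creates a comparable skeleton with the standard library only. It is a stand-in for what a cookiecutter template would generate, not the template itself: the folder names in `SKELETON` are an assumed layout based on the items listed above (code, notebooks, data, models, figures, references).

```python
from pathlib import Path

# Assumed layout for illustration; a real cookiecutter template may differ.
SKELETON = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "tests",
    "models",
    "reports/figures",
    "references",
]


def create_skeleton(project_root):
    """Create the standard folders and return them as sorted relative paths."""
    root = Path(project_root)
    for folder in SKELETON:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        # Empty marker file so version control keeps otherwise empty folders.
        (path / ".gitkeep").touch()
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )
```

For example, `create_skeleton("fraud-model")` would create the folders under a new `fraud-model/` directory (a hypothetical project name).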
CookieCutter
The standard folder structure enforces a design philosophy for faster delivery
Treat Data as Immutable
Raw data should be stored inside /data/raw and should never be modified
by hand. The code you write should ingest the data from /raw and cleaned
or processed data should be written to /processed.
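A minimal sketch of this rule, assuming CSV inputs: the raw file is only ever opened for reading, and the cleaned copy goes to a separate processed folder. The function name `process_raw_file` and the cleaning step (dropping rows with empty fields) are hypothetical.

```python
import csv
from pathlib import Path


def process_raw_file(raw_path, processed_dir):
    """Read a raw CSV, drop rows with empty fields, write a processed copy."""
    raw_path = Path(raw_path)
    processed_dir = Path(processed_dir)
    processed_dir.mkdir(parents=True, exist_ok=True)
    out_path = processed_dir / raw_path.name

    # The raw file is opened read-only and never modified.
    with raw_path.open(newline="") as src:
        reader = csv.DictReader(src)
        rows = [
            row for row in reader
            if all(value not in ("", None) for value in row.values())
        ]
        fieldnames = reader.fieldnames

    # Cleaned data is written to the processed folder under the same name.
    with out_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Because the raw data stays untouched, anyone can rerun the code and regenerate everything in the processed folder.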
Reproducibility
Everyone on the team should be able to reproduce your analysis with
• the code in src/
• the data in data/raw/
• the dependencies in Dockerfile, requirements file
Notebooks for Exploration, Scripts for Production Code
Jupyter is great for exploratory analysis, but notebooks are quite challenging for version control because they are stored as JSON files. Once your code works well, move it from notebooks/ to src/ and package the functions and classes into modules.
For being deploy-ready
Moving code from
Notebooks to Source Code
Notebooks for Exploration. Files for Production.
The case against Notebooks
• The main cause of unmaintainable code and bad structure in Data Science is the mixing of exploratory "throw away" code with production code. Notebooks are being used to write code that ultimately would be deployed in production.
• This is not what notebooks were invented for; they are essentially browser-based shells and presentation tools with charts and code blocks.
• Notebooks lack refactoring and code structuring tools, and are notoriously difficult to manage under version control.
Motivation for Organizing Code
• Extract text and plots from notebooks into Markdown Reports for a business audience
• Notebooks with minimal code and clear narrative can be used as Technical Reports
• Move the core functionality into Python modules to speed up subsequent exploration
In the exploratory phase,
the code base is expanded through data analysis, feature
engineering and modelling.
In the refactoring phase,
the most useful results and tools from the exploratory phase are
translated into modules and packages.
The Production Codebase grows across sprints.
For integration and deployment
Automated Testing
• If your code is not performing as expected, will you know?
• If your data are corrupted, do you notice?
• If you re-run your analysis on different data, are the methods you used still valid?
Automated Testing
“Why do most developers fear to make continuous changes to their code? They are afraid they’ll break
it!
Why are they afraid they’ll break it? Because they don’t have tests!”
Two Types of Tests useful for DS
• Unit Testing to make sure individual pieces of code work
• Integration Testing to make sure your code works with everyone else's
Challenge with writing Tests for Data Science
For most software, the output is deterministic: a function for averaging numbers can be unit tested with a simple function that checks whether the result is accurate. You can then check your changes in, and integration tests can run against the new build with a fabricated set of results to ensure that everything works as expected.
But not so with Data Science work: the output is probabilistic.
You can't always put in a 2 and 4 and expect a 3 to come out.
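The deterministic case described above might look like the sketch below. The function name `average` and the test names are illustrative; the tests are written as plain functions so that a runner such as pytest would discover them.

```python
def average(numbers):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not numbers:
        raise ValueError("average() requires at least one number")
    return sum(numbers) / len(numbers)


def test_average_of_two_and_four_is_three():
    # Deterministic: the same input always gives the same output.
    assert average([2, 4]) == 3


def test_average_rejects_empty_input():
    try:
        average([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```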
Automated Testing for Data Science
• First, implement a unit test framework within your code; use pytest or nose
• In some cases, you can check a deterministic value, like the number of rows or the expected data type returned from a function, and write a test for it.
• But if you can't, pick a performance metric (p-value, F1-score, AUC, etc.) and check if it lies within an acceptable range.
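Both kinds of check can be sketched with the standard library. Everything here is an assumption for illustration: the toy data generator, the threshold classifier, and the 0.8 lower bound on F1, which a real project would choose from its own baseline.

```python
import random


def make_dataset(n_rows, seed=0):
    """Toy data: one noisy feature; the label is 1 when the signal is positive."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        signal = rng.uniform(-1, 1)
        feature = signal + rng.gauss(0, 0.2)
        rows.append((feature, int(signal > 0)))
    return rows


def predict(feature, threshold=0.0):
    """Trivial classifier standing in for a trained model."""
    return int(feature > threshold)


def f1_score(y_true, y_pred):
    """F1 for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def test_dataset_shape_and_types():
    # Deterministic properties: row count and data types.
    rows = make_dataset(500)
    assert len(rows) == 500
    assert all(isinstance(f, float) and label in (0, 1) for f, label in rows)


def test_f1_within_acceptable_range():
    # Probabilistic output: assert a range, not an exact value.
    rows = make_dataset(500, seed=42)
    y_true = [label for _, label in rows]
    y_pred = [predict(feature) for feature, _ in rows]
    assert 0.8 <= f1_score(y_true, y_pred) <= 1.0
```

Fixing the random seed keeps the test repeatable, while the range check still tolerates small changes in the model or data.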
Test-Driven Development (TDD)
“First the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test.” So, before actually writing any code, you should write your tests.
All tests should go into the tests/ subdirectory of the specific package. Write tests in three steps:
• Get/make the input data
• Manually construct the result you expect
• Compare the actual result to the expected correct result
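The three steps can be sketched as follows. The function under test, `fill_missing_with_median`, is a hypothetical scrubbing helper invented for this example.

```python
def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to compute a median from")
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None else v for v in values]


def test_fill_missing_with_median():
    # 1. Get/make the input data
    raw = [1.0, None, 3.0, 10.0, None]
    # 2. Manually construct the result you expect (median of 1, 3, 10 is 3)
    expected = [1.0, 3.0, 3.0, 10.0, 3.0]
    # 3. Compare the actual result to the expected correct result
    assert fill_missing_with_median(raw) == expected
```

In a test-driven workflow the test function would be written first, fail, and then drive the implementation above it.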
In Conclusion
• Engineering smart systems around a machine-learned core is difficult.
• It requires teams of exceptionally talented individuals to work together.
• What makes data scientists special is their ability to work with both business leaders and technology experts.
• We must acknowledge that we are a part of something much bigger and learn to play well with each other and with all parties involved.
Our hope is that these systems, principles and best practices will help you take the first steps in that direction.
Questions?

Data science tools of the trade

  • 1. Data Science @ PMI: Tools of the Trade. Best Practices to Start, Develop and Ship a Data Science Product. Manuel Valverde, Tokyo WebHack, 17th January 2019
  • 2. About Me • PhD @ Granada U., Spain: physics modelling and MC simulations for SuperKamiokande • PostDoc @ Osaka U., Osaka: nuclear structure calculations (think Gaussian processes) • Data Scientist @ Rakuten, Tokyo: search relevancy for e-commerce • Data Scientist @ PMI, Tokyo: fraud prevention
  • 3. About Philip Morris International • Founded in 1847 • No. 108 in the 2018 Fortune 500 • 80,000 employees, 180+ markets, 150M consumers • 6 of the world's top 15 international brands • Shifting from combustible cigarettes to smoke-free, reduced-risk products (RRP): https://www.pmi.com/smoke-free-products
  • 4. Data Science @ PMI • We are part of PMI's Enterprise Analytics and Data (EAD) group • 40+ Data Scientists across 4 hubs • Offices in Amsterdam (NL), Kraków (PL), Lausanne (CH) and Tokyo (JP) • Profiles • Education: 30% PhD, 70% MSc/BSc • Data Science experience: 7.4 yrs on average • Experience in PMI: 88% under 2 yrs • Expertise in Machine Learning, Big Data Engineering, Insights Communication • SCRUM certified (Professional Scrum Developer)
  • 5. Data Science Principles @ PMI. (Map labels: 2 Labs, LA; 2 Labs, North America; Add 1 Lab, EU; 2 Labs, EE; Add 2 Labs, Asia.) A. (Data x Science x Communication) = Insight: data is only one part of the equation. We bring the scientific method. It materializes in the analytical code we write. It is as valuable as the data itself. B. We are business driven: whatever we do, it contributes to the business. We are diligent about making an impact. C. We invest in people: we invest in the ability to ask questions. It can't be achieved with tools only. Tools are for generating answers, but questions are posed by people. D. We self-organize: we choose coordination & cultivation over command & control. We believe this approach allows for the best solutions to emerge. E. We iterate and improve: we embrace lean development, we learn from mistakes and we do it together with business. F. We co-create: a data insights ecosystem requires collaboration among all parties. We want to be active contributors.
  • 7. Why are we here? Because a data scientist is not just someone who knows more statistics than a programmer. Data Science is Software. The product of a Data Science effort (a Model or a Report) is essentially a small but critical part of a large, sophisticated piece of business software. Data Products must therefore be designed to play well with systems up- and downstream. Remember that the system can work without a model, but a model is pretty much worthless without the system.
  • 8. Why are we here? Because a data scientist is not just someone who knows more programming than a statistician. Writing code to implement machine learning algorithms is getting easier every year. Building a scikit-learn Pipeline to implement a Random Forest model with GridSearch takes less than twenty lines of code today. AutoML is around the corner. We need to acknowledge and understand two things: • The code, or even the model, is not our end goal. • We're in the business of building intelligent applications, or data products.
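As a rough illustration of how compact such a pipeline can be, here is a minimal sketch using scikit-learn's `Pipeline`, `RandomForestClassifier` and `GridSearchCV`. The synthetic dataset, the scaling step and the parameter grid are assumptions made for the example; they are not taken from the talk.

```python
# Minimal sketch: a scikit-learn Pipeline with a Random Forest tuned by grid search.
# The synthetic data, the scaler step and the grid values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=0)),
])
# Step-name prefixes ("forest__") route each parameter to the right pipeline step.
param_grid = {"forest__n_estimators": [10, 50], "forest__max_depth": [3, None]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)  # F1 on the held-out split
print(search.best_params_, round(test_score, 3))
```

The model code really is short; everything around it (data access, packaging, deployment, monitoring) is where most of the effort goes.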
  • 9. An OSEMN Data Science Process: Explore, Model, Iterate. Create a Data Product. • Obtain: connect to DBs, download flat files • Scrub: outliers/missing data, aggregations • Explore: statistical analysis, feature engineering • Model: learning algorithms, parameter optimization • iNdustrialize: reports, APIs (Diagram labels: Exploratory, Production, Smart Application.)
  • 10. What is a Data Product? 🤔 In a nutshell, it's a software product with an ML engine. We define a data product as a system that • takes raw data as input 📲 • applies a machine-learned model to it 🤖 • produces data as output to another system 💻 Additionally, a data product must • be dynamic and maintainable, allowing periodic updates 🏃 • be responsive, performant and scalable. Examples: Amazon's Product Recommendation Engine, LinkedIn's "People You May Know", autonomous vehicles. (Figure titles: The Classic Data Science Workflow; Data Product Development Workflow.)
  • 11. Challenges in Data Product Development 🤔 "Team programming isn't a divide and conquer problem. It is a divide, conquer, and integrate problem." 1. The Process: Infrastructure Setup > Code > Build > Test > Package > Release > Monitor 2. The Team: a cross-functional group of businesspeople, data scientists, engineers and developers. 3. The Challenge: as an example, consider two groups: • Team A consists of data engineers and scientists and works on the Prediction Engine. • Team B consists of software engineers and front-end developers working on the UI. The goal is that every piece of the product should integrate well into a larger codebase. 🍻
  • 12. Continuous Integration (CI), Delivery (CD) and Deployment: development practices for overcoming integration challenges and moving faster to delivery. The CI/CD Cycle • Continuous Integration requires multiple developers to integrate code into a shared repository frequently. Requested merges are automatically tested and reviewed. Enabled by git-flow, code standards and automated testing. • Continuous Delivery makes sure that the code we integrate is always in a deploy-ready state. Enabled by agile (iterative) methods, testing and build automation. • Continuous Deployment is the actual act of pushing updates out to the user; think of your iPhone apps or desktop browser that periodically prompt for updates to be installed.
  • 13. The Role of Data Scientists: learn best practices to contribute effectively to data products. Write code that is • Readable, so others can understand and add to it • Testable, so others can verify it does what it advertises and integrate it into their work • Reusable, so it may be included in other projects • Reproducible: use libraries/packages that are available in production environments • Usable: don't write code in SAS or R; most engineers don't speak those languages. Joel's Tests • Do you use source control? • Can you make a build in one step? • Do you make daily builds? • Do you have a bug database? • Do you fix bugs before writing new code?
  • 14. Data Science Best Practices @ PMI: Python Style Guides • Notebooks to Modules • Testing • Code Reviews • Docker • Virtual Environments • Version Control • Project Templates
  • 16. Agile Data Science Workflow
  • 18. Our Vision: to create a workflow that is … • Flexible: adapts to the specific needs of every use case and accommodates changing requirements • Inspectable: transparency at all times; artifacts can be audited at any time • Reproducible: out-of-the-box dependency management; no more 'But-it-works-on-my-machine' or 'Please-industrialize-this-model' • Easy to use: frictionless development experience and freedom to experiment 🔥
  • 19. Our Principles: some things we always need to be mindful of. • Sensitive data must never leave the Ocean • Restricted open-source libraries must be avoided • Every use case must be industrialization-ready
  • 20. System Architecture: the dots, connected. (Diagram labels: DS Prod Lab; Scanned by BlackDuck; Automation; On-demand infrastructure; Data Read/Write; Data Product; Reproducible Containers; Version Control.)
  • 21. 3 Steps to a Data Product: we organize our workflow in 3 phases: Start, Develop and Ship. Start • Get Infrastructure: DS Prod Lab, Docker container, Python environments • Get Data: flat files, database connections • Get Code: project repo, Cookiecutter template. Develop • Start the Docker container • Check out a branch • For each task in OSEMN, write exploratory code in notebooks • Follow standard code styles • Write documentation and tests • Maintain dependencies • Refactor into modules • Push • Review, merge. Ship • Package Python code and publish to PyPI on Artifactory • Persist models • Build an API to industrialize the model • Provide endpoints for app-health checks • Set up a Jenkins pipeline for continuous integration • Plan for the next iteration
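To make the "build an API" and "health-check endpoint" steps of the Ship phase concrete, here is a minimal sketch using only the Python standard library. `StubModel`, the `/health` and `/predict` routes and the JSON payload shapes are hypothetical; the talk does not say which web framework or routes PMI uses.

```python
# Minimal sketch of a prediction API with a health-check endpoint (stdlib only).
# StubModel, the /health and /predict routes and the JSON shapes are hypothetical.
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubModel:
    """Stands in for a persisted model that would be loaded at startup."""

    def predict(self, features):
        return sum(features) > 1.0


MODEL = StubModel()


class Handler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/predict":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))["features"]
        self._send_json(200, {"prediction": bool(MODEL.predict(features))})

    def log_message(self, *args):
        pass  # keep the example quiet


def call(port, method, path, payload=None):
    """Send one request to the local server and decode the JSON reply."""
    body = None if payload is None else json.dumps(payload)
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request(method, path, body=body,
                           headers={"Content-Type": "application/json"})
        return json.loads(connection.getresponse().read())
    finally:
        connection.close()


# Port 0 asks the OS for any free port; a real service would use a fixed one.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

health = call(server.server_port, "GET", "/health")
prediction = call(server.server_port, "POST", "/predict", {"features": [0.7, 0.6]})
server.shutdown()
server.server_close()
print(health, prediction)
```

A separate health route lets monitoring and deployment tooling check that the service is alive without having to send it real prediction traffic.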
  • 23. Docker for Containerized Data Science: all your dependencies in one place, code guaranteed to run anywhere. A container is a lightweight, stand-alone software package that includes everything needed to run it: code, runtime, system tools, system libraries and settings. Containerized software will always run the same, regardless of the environment. Benefits for Data Scientists • Freedom: install all your favorite tools and libraries • Ease of installation: set up your toolbox once and it will always work • Reproducibility and portability: your development environment can be reproduced anywhere • Isolation: your Py2 setup doesn't mess up your Py3 setup, and installing a new library doesn't mess up the system Python • Speed: get up and running in minutes with images optimized for specific applications like time-series analysis or deep learning.
  • 24. Project Templates: for organization and predictability
  • 25. CookieCutter: everything has a place and a purpose. The idea is borrowed from popular web frameworks like Rails and Django, where each developer uses the same template when starting a new project. This makes it easier for everyone on the team to figure out where to find or put the various moving parts. We will use a standard project skeleton tailored for data science projects so that every scientist knows where to put their code, notebooks, data, models, figures and references. Benefits of a standardized directory structure: • allows people to collaborate more easily • empowers reproducible analysis • enforces a "data as immutable" design philosophy. Cookiecutters help us generate this folder structure automatically.
  • 26. CookieCutter: the standard folder structure enforces a design philosophy for faster delivery. Treat Data as Immutable: raw data should be stored inside data/raw/ and should never be modified by hand. The code you write should ingest the data from data/raw/, and cleaned or processed data should be written to data/processed/. Reproducibility: everyone on the team should be able to reproduce your analysis with • the code in src/ • the data in data/raw/ • the dependencies in the Dockerfile and requirements file. Notebooks for Exploration, Scripts for Production Code: Jupyter is great for exploratory analysis, but notebooks are quite challenging for version control (they're stored as JSON files). Once your code works well, move it from notebooks/ to src/ and package the functions and classes into modules.
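The "data as immutable" rule can be sketched as a small processing function that only ever reads from the raw folder and writes to the processed folder. The file name `transactions.csv`, the `amount` column and the cleaning rule are hypothetical, and the demo runs in a temporary directory rather than a real project.

```python
# Sketch of the "data as immutable" rule: read from data/raw, write to data/processed.
# The file name, the "amount" column and the cleaning rule are hypothetical.
import csv
import tempfile
from pathlib import Path


def process(raw_file: Path, processed_dir: Path) -> Path:
    """Drop rows with a missing amount and write the result to processed_dir."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    out_file = processed_dir / raw_file.name
    with raw_file.open(newline="") as src, out_file.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["amount"].strip():
                writer.writerow(row)
    return out_file


# Demo in a throwaway project directory so no real data is touched.
project = Path(tempfile.mkdtemp())
raw_dir = project / "data" / "raw"
raw_dir.mkdir(parents=True)
raw_file = raw_dir / "transactions.csv"
raw_file.write_text("id,amount\n1,10.5\n2,\n3,7.0\n")
raw_before = raw_file.read_bytes()

out_file = process(raw_file, project / "data" / "processed")
with out_file.open(newline="") as handle:
    rows = list(csv.DictReader(handle))

print([row["id"] for row in rows])          # ids of the rows that were kept
print(raw_file.read_bytes() == raw_before)  # the raw file is byte-for-byte unchanged
```

Because the raw file is never overwritten, anyone can rerun `process` from scratch and get the same processed output.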
  • 27. Moving code from Notebooks to Source Code: for being deploy-ready
  • 28. Notebooks for Exploration. Files for Production. The case against Notebooks • The main cause of unmaintainable code and bad structure in Data Science is the mixing of exploratory "throw-away" code with production code. Notebooks are being used to write code that ultimately gets deployed in production. • This is not what notebooks were invented for; they are essentially browser-based shells and presentation tools with charts and code blocks. • Notebooks lack refactoring and code-structuring tools and are notoriously hard to manage under version control. Motivation for Organizing Code • Extract text and plots from notebooks into Markdown reports for a business audience • Notebooks with minimal code and a clear narrative can be used as technical reports • Move the core functionality into Python modules to speed up subsequent exploration
  • 29. In the exploratory phase, the code base is expanded through data analysis, feature engineering and modelling. In the refactoring phase, the most useful results and tools from the exploratory phase are translated into modules and packages. The Production Codebase grows across sprints.
  • 30. Automated Testing: for integration and deployment
  • 31. • If your code is not performing as expected, will you know? • If your data are corrupted, will you notice? • If you re-run your analysis on different data, are the methods you used still valid?
  • 32. Automated Testing. "Why do most developers fear to make continuous changes to their code? They are afraid they'll break it! Why are they afraid they'll break it? Because they don't have tests!" Two Types of Tests useful for DS • Unit testing, to make sure individual pieces of code work • Integration testing, to make sure your code works with everyone else's. Challenge with writing Tests for Data Science: for most software, the output is deterministic. A function for averaging numbers can be unit tested with a simple function that checks whether the result is accurate. You can then check your changes in, and integration tests can run against the new build with a fabricated set of results to ensure that everything works as expected. Not so with Data Science work, where the output is probabilistic: you can't always put in a 2 and a 4 and expect a 3 to come out.
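The contrast between deterministic and probabilistic outputs can be sketched with two small tests. `mean` and `noisy_estimate` are hypothetical helpers written for this example; the second one stands in for any model output that depends on random sampling.

```python
# Sketch contrasting a deterministic unit test with a test on a stochastic result.
# mean() and noisy_estimate() are hypothetical helpers, not taken from the talk.
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def noisy_estimate(true_value, n_samples, seed):
    """Estimate true_value from noisy draws, as a stand-in for a model output."""
    rng = random.Random(seed)
    return mean([rng.gauss(true_value, 1.0) for _ in range(n_samples)])


def test_mean_is_exact():
    # Deterministic: putting in a 2 and a 4 always gives a 3.
    assert mean([2, 4]) == 3


def test_estimate_is_close_not_exact():
    # Probabilistic: the estimate changes with the seed, so check a tolerance instead.
    estimates = [noisy_estimate(3.0, n_samples=2000, seed=s) for s in range(5)]
    assert len(set(estimates)) > 1
    assert all(abs(e - 3.0) < 0.2 for e in estimates)


test_mean_is_exact()
test_estimate_is_close_not_exact()
```

The tolerance of 0.2 is an arbitrary choice for the example; in practice it would come from how much variation the team is willing to accept.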
  • 33. Automated Testing for Data Science • First, implement a unit test framework within your code; use pytest or nose. • In some cases, you can set a deterministic value, like the number of rows or the expected data type returned by a function, and write a test for it. • If you can't, pick a performance metric (p-value, F1-score, AUC, etc.) and check that it lies within an acceptable range. Test-Driven Development (TDD): first the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test. So, before actually writing any code, you should write your tests. All tests should go into the tests/ subdirectory of the specific package. Write tests in three steps • Get/make the input data • Manually construct the result you expect • Compare the actual result to the expected correct result
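Both testing styles described on this slide can be sketched as pytest-style functions: one that follows the three steps on a deterministic result, and one that checks a metric against an acceptable range. `drop_missing`, `threshold_classifier`, the synthetic data and the 0.8 lower bound on F1 are all assumptions made for the example.

```python
# Sketch of the two testing styles, written as pytest-style test functions.
# drop_missing(), threshold_classifier(), the synthetic data and the 0.8 F1
# lower bound are illustrative assumptions.
import random


def drop_missing(rows):
    """Remove records that contain a None value."""
    return [row for row in rows if None not in row.values()]


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def threshold_classifier(x, threshold=1.5):
    """Toy model: predict class 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0


def test_drop_missing_rows_and_types():
    # 1. Get/make the input data
    rows = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": None}, {"id": 3, "amount": 7.0}]
    # 2. Manually construct the result you expect
    expected = [{"id": 1, "amount": 10.5}, {"id": 3, "amount": 7.0}]
    # 3. Compare the actual result to the expected result
    actual = drop_missing(rows)
    assert actual == expected
    assert all(isinstance(row["amount"], float) for row in actual)


def test_f1_within_acceptable_range():
    rng = random.Random(42)
    y_true = [0] * 200 + [1] * 200
    # Two overlapping classes centred at 0 and 3
    xs = [rng.gauss(3.0 * label, 1.0) for label in y_true]
    y_pred = [threshold_classifier(x) for x in xs]
    assert 0.8 <= f1_score(y_true, y_pred) <= 1.0


test_drop_missing_rows_and_types()
test_f1_within_acceptable_range()
```

In a project laid out as described above, these functions would live in the package's tests/ subdirectory and be collected by the test runner instead of being called at the bottom of the file.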
  • 35. • Engineering smart systems around a machine-learned core is difficult. • It requires teams of exceptionally talented individuals to work together. • What makes data scientists special is their ability to work with both business leaders and technology experts. • We must acknowledge that we are a part of something much bigger and learn to play well with each other and with all parties involved. Our hope is that these systems, principles and best practices will help you take the first steps in that direction.