Reproducibility (and automation) of
Machine Learning process
Dzianis Dus
dzianisdus@gmail.com
Data Scientist at InData Labs
What is this talk about?
1. Data mining / Machine learning process
2. Workflow automation
3. Basic design concepts
4. Data pipelines
5. Available instruments
6. About my own experience
Process overview
1. Data Engineering – 80%
– Data extraction
– Data cleaning
– Data transformation
– Data normalization
– Feature extraction
2. Machine Learning – 20%
– Model fitting
– Hyperparameter tuning
– Model evaluation
CRISP-DM
Why automation?
1. You want to update models on a regular basis
2. Make your data workflows more trustworthy
3. You can perform a data freeze (possibly)
4. A step towards (more) reproducible experiments
5. Write once and enjoy every day
How: Conceptual requirements
1. Reuse code between training and evaluation
phases (as much as possible)
2. It's easier to log features than to extract them
from data retrospectively (if you can)
3. A solid environment is more important for the
first iteration than the quality of your model
4. Better to use the same language everywhere
(integration becomes much easier)
5. Every model requires support after deployment
6. You’d better know the rules of the game…
Feel free to download from author’s personal web page:
http://martin.zinkevich.org/rules_of_ml/
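Requirement 1 above (reuse code between training and evaluation) can be sketched roughly as follows. This is a minimal illustration, not the author's code: the feature names, `baseline_model` and the sample cycle lengths are all hypothetical. The point is only that a single extractor function feeds both phases, so the two cannot drift apart.

```python
from statistics import mean


def extract_features(cycle_lengths):
    """Single source of truth for features, used in training and prediction."""
    return {
        "mean_length": mean(cycle_lengths),
        "last_length": cycle_lengths[-1],
        "n_cycles": len(cycle_lengths),
    }


def build_training_row(history):
    # Training: features come from all but the last cycle, the label is the last one.
    return extract_features(history[:-1]), history[-1]


def predict_next(history, model):
    # Prediction: the very same extractor runs on the full history.
    return model(extract_features(history))


def baseline_model(features):
    # Hypothetical stand-in for a fitted model: predicts the historical mean.
    return features["mean_length"]


features, label = build_training_row([28, 30, 29, 27])
prediction = predict_next([28, 30, 29, 27], baseline_model)
```

Any change to `extract_features` automatically applies to both phases, which is the property the requirement asks for.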
A taste of
How: Technical requirements
1. Simple way to define DAGs of batch tasks
2. Task parameterization
3. Ability to store intermediate results
(checkpointing)
4. Task dependency resolution
5. Automatic failure handling
6. Logging, notifications
7. Execution state monitoring
8. Python-based solution (we are at PyCon)
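To make requirements 1, 3 and 4 concrete, here is a small standard-library sketch of a DAG of batch tasks with dependency resolution and checkpointing. It is not how any of the tools below are implemented; the task names, the `checkpoints` dictionary and the `actions` mapping are assumptions made for illustration. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract_raw": set(),
    "extract_features": {"extract_raw"},
    "fit_model": {"extract_features"},
    "evaluate": {"fit_model", "extract_features"},
}


def run_pipeline(dag, checkpoints, actions):
    """Run tasks in dependency order, skipping those with a stored result."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        if task in checkpoints:
            continue  # intermediate result already stored, nothing to do
        checkpoints[task] = actions[task]()
        executed.append(task)
    return executed


# Each action is a placeholder that just reports completion.
actions = {name: (lambda name=name: f"{name}: done") for name in dag}
checkpoints = {"extract_raw": "extract_raw: done"}  # pretend this step ran earlier
executed = run_pipeline(dag, checkpoints, actions)
rerun = run_pipeline(dag, checkpoints, actions)
```

The second call executes nothing because every task already has a checkpoint, which is the behaviour that makes re-running a failed pipeline cheap.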
https://github.com/pinterest/pinball
Pinball (Pinterest)
1. Nice UI
2. Dynamic pipeline
generation
3. Pipeline configuration in
Python code (?)
4. Parameterization through
shipping Python dicts (?)
5. In fact, not documented
6. Seems like no other big
players use it
https://github.com/apache/incubator-airflow
Airflow (Airbnb, Apache Incubator)
1. Very nice UI
2. Dynamic pipeline
generation
3. Orchestration through
message queue
4. Code shipping
5. Scheduler spawns workers
6. Pipeline configuration in
Python code
7. Parameterization through
task templates using Jinja
(Hmm…)
8. In my opinion, not as elegant as
the documentation suggests
https://github.com/spotify/luigi
Luigi (Spotify, Foursquare)
1. Simple UI
2. Dynamic pipeline
generation
3. Orchestration through
central scheduling (no
external components)
4. No code shipping
5. No built-in time-based scheduler
6. Pipeline configuration in
Python code (very elegant!)
7. Parameterization through
Parameters
8. Simple, well-tested
9. Good documentation
About … Luigi!
Luigi …
… is a Python module that helps you build complex pipelines
of batch jobs. It handles dependency resolution, workflow
management, visualization etc. It also comes with Hadoop
support built in.
… helps you stitch many tasks together, where each task can
be a Hive query, a Hadoop job in Java, a Spark job in Scala or
Python, a Python snippet, dumping a table from a database,
or anything else…
Luigi facts
1. Inspired by GNU Make
2. Everything in Luigi is in Python
3. Extremely simple (has only three main
classes: Target, Task, Parameter)
4. Each task must consume some input data
and may produce some output
5. Based on assumption of atomic writes
Luigi facts
1. Has no built-in scheduler (use crontab / run
manually from CLI)
2. You cannot trigger any tasks from the UI (it's only
for monitoring purposes)
3. The master only handles orchestration
4. The master does not ship your code to workers
Luigi fundamentals
Target corresponds to:
• file on local FS
• file on HDFS
• entry in DB
• any other kind of a checkpoint
Task:
• this is where execution takes place
• consumes Targets that were created by other Tasks
• usually also outputs Target
• could depend on one or more other Tasks
• could have Parameters
Luigi Targets
• Must implement the exists() method
• Write must be atomic
• Luigi comes with a toolbox of useful Targets:
luigi.LocalTarget('/home/path/to/some/file/')
luigi.contrib.hdfs.HdfsTarget('/reports/%Y-%m-%d')
luigi.contrib.postgres.PostgresTarget(…)
luigi.contrib.mysqldb.MySqlTarget(…)
luigi.contrib.ftp.RemoteTarget(…)
… and many others …
• Built-in formats (GzipFormat is useful)
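The two Target requirements (an exists check and atomic writes) can be illustrated without Luigi. The class below is a toy stand-in, not Luigi's `LocalTarget`; the class name, file name and contents are assumptions. It writes to a temporary file in the same directory and renames it into place, so the target either does not exist or is complete.

```python
import os
import tempfile


class LocalFileTarget:
    """Toy stand-in for a file target: exists() plus an atomic write."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

    def write(self, text):
        # Write to a temporary file in the same directory, then rename it.
        # os.replace is atomic on the same filesystem, so readers see either
        # no file or the complete file, never a half-written one.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as tmp:
                tmp.write(text)
            os.replace(tmp_path, self.path)
        except BaseException:
            os.remove(tmp_path)  # do not leave partial files behind
            raise


workdir = tempfile.mkdtemp()
target = LocalFileTarget(os.path.join(workdir, "features.csv"))
existed_before = target.exists()
target.write("cycle_id,length\n1,28\n")
existed_after = target.exists()
```

Because completion is judged by existence alone, a non-atomic write that crashes halfway would leave a file that looks like a finished result.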
Luigi Tasks
• Main methods: run(), output(), requires()
• Write your code in run()
• Define your Target in output()
• Define dependencies using requires()
• Task is complete() if output Target exists()
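The contract on this slide can be modelled in a few lines of plain Python. This is a toy model that borrows the method names from the slide, not Luigi's implementation: here `output()` returns a file path instead of a Target object, the writes are not atomic for brevity, and `ExtractRaw`, `MeanLength` and `build` are hypothetical names.

```python
import os
import tempfile

WORKDIR = tempfile.mkdtemp()


class Task:
    """Toy model of the Task contract from the slide; not Luigi's implementation."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output already exists.
        return os.path.exists(self.output())


class ExtractRaw(Task):
    def output(self):
        return os.path.join(WORKDIR, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("28\n30\n29\n")


class MeanLength(Task):
    def requires(self):
        return [ExtractRaw()]

    def output(self):
        return os.path.join(WORKDIR, "mean.txt")

    def run(self):
        with open(ExtractRaw().output()) as f:
            values = [int(line) for line in f]
        with open(self.output(), "w") as f:
            f.write(str(sum(values) / len(values)))


def build(task, log):
    """Run dependencies first; skip anything whose output already exists."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


first_run, second_run = [], []
build(MeanLength(), first_run)
build(MeanLength(), second_run)
```

The second `build` call does nothing, since both outputs exist; deleting `mean.txt` would re-run only `MeanLength`.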
Luigi Parameters
• Task that runs a Hadoop job every night?
• Luigi provides a lot of them:
luigi.parameter.Parameter
luigi.parameter.DateParameter
luigi.parameter.IntParameter
luigi.parameter.EnumParameter
luigi.parameter.ListParameter
luigi.parameter.DictParameter
… etc. …
• And they are parsed automatically from the CLI!
Execute from CLI: $ luigi MyTask --module your.cool.module --param 999
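As a rough plain-Python analogue of what typed parameters give you, the sketch below turns command-line strings into typed values with `argparse`. It does not use Luigi; the `--freeze-date` option, the program name and the default values are hypothetical, and only `--param 999` comes from the command above.

```python
import argparse
from datetime import date


def parse_task_args(argv):
    """Turn CLI strings into typed values, roughly what typed parameters do."""
    parser = argparse.ArgumentParser(prog="my_task")
    # Integer parameter, comparable in spirit to an IntParameter.
    parser.add_argument("--param", type=int, default=1)
    # Date parameter in ISO format, comparable in spirit to a DateParameter.
    parser.add_argument(
        "--freeze-date", type=date.fromisoformat, default=date(2017, 1, 1)
    )
    return parser.parse_args(argv)


args = parse_task_args(["--param", "999", "--freeze-date", "2017-02-03"])
defaults = parse_task_args([])
```

With Luigi the declarations live on the Task class itself, so the same definition serves both the CLI and programmatic construction.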
Central scheduling
• Luigi central scheduler (luigid)
– Doesn’t do any data processing
– Doesn’t execute any tasks
– Worker synchronization
– Task dependency resolution
– Prevents the same task from running multiple times
– Provides administrative web interface
– Retries in case of failures
– Sends notifications (emails only)
• Luigi worker (luigi)
– Starts via cron / by hand
– Connects to central scheduler
– Defines tasks for execution
– Waits for permission to execute Task.run()
– Processes data, populates Targets
Web interface
Execution model
Simplified process:
1. Some workers started
2. Each submits DAG of Tasks
3. Recursive check of Tasks completion
4. Worker receives Task to execute
5. Data processing!
6. Repeat
Client-server API:
1. add_task(task_id, worker_id, status)
2. get_work(worker_id)
3. ping(worker_id)
http://www.arashrouhani.com/luigid-basics-jun-2015/
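The three calls listed above can be illustrated with an in-memory toy scheduler. This is a sketch of the idea only, not luigid's code: the status strings, the extra `deps` argument and the internal dictionaries are assumptions added so the example is self-contained.

```python
class ToyScheduler:
    """In-memory sketch of the add_task / get_work / ping idea; not luigid's code."""

    def __init__(self):
        self.status = {}  # task_id -> "PENDING" | "RUNNING" | "DONE"
        self.deps = {}    # task_id -> list of task_ids it requires
        self.owner = {}   # task_id -> worker_id that was given the task
        self.pings = {}   # worker_id -> number of times the worker checked in

    def ping(self, worker_id):
        self.pings[worker_id] = self.pings.get(worker_id, 0) + 1

    def add_task(self, task_id, worker_id, status="PENDING", deps=()):
        self.ping(worker_id)
        self.deps.setdefault(task_id, list(deps))
        # A repeated submission must not reset a task that is already running.
        if status == "DONE" or task_id not in self.status:
            self.status[task_id] = status

    def get_work(self, worker_id):
        self.ping(worker_id)
        for task_id, status in self.status.items():
            ready = all(self.status.get(d) == "DONE" for d in self.deps[task_id])
            if status == "PENDING" and ready:
                self.status[task_id] = "RUNNING"
                self.owner[task_id] = worker_id
                return task_id
        return None  # nothing runnable right now


scheduler = ToyScheduler()
for worker in ("w1", "w2"):  # both workers submit the same two-task DAG
    scheduler.add_task("extract", worker)
    scheduler.add_task("fit", worker, deps=["extract"])

first = scheduler.get_work("w1")   # "extract"
second = scheduler.get_work("w2")  # None: "extract" is taken, "fit" is not ready
scheduler.add_task("extract", "w1", status="DONE")
third = scheduler.get_work("w2")   # "fit"
```

Note that the scheduler never touches data: it only tracks state and hands out permission to run, which matches the division of roles described on the central scheduling slide.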
Task dependencies
• Using requires() method
• yielding at runtime!
Easy parallelization recipe
1. Do not use multiprocessing inside Task
2. Split huge Task into smaller ones and yield
them inside run() method
3. Run luigi with --workers N parameter
4. Make a separate job to combine all the
Targets (if you want)
5. Also it helps to minimize your possible data
loss in case of failures (atomic writes)
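The recipe can be sketched with the standard library alone. In this illustration a thread pool plays the role of `--workers N`, and the chunking, file names and sum-of-squares workload are hypothetical. Each small task writes its own target atomically and is skipped if the target already exists, so a failure costs at most one chunk.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

WORKDIR = tempfile.mkdtemp()


def chunk_path(index):
    return os.path.join(WORKDIR, f"chunk_{index}.txt")


def process_chunk(index, numbers):
    """One small task: skip if its target exists, else write it atomically."""
    path = chunk_path(index)
    if os.path.exists(path):
        return path
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(sum(n * n for n in numbers)))
    os.replace(tmp_path, path)  # the target appears only when fully written
    return path


def combine(paths):
    """Separate final step that merges the per-chunk targets."""
    total = 0
    for path in paths:
        with open(path) as f:
            total += int(f.read())
    return total


data = list(range(10))
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
# The pool stands in for `--workers N`; each task itself stays single-threaded.
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(process_chunk, range(len(chunks)), chunks))
total = combine(paths)
```

Re-running the same code after a crash would recompute only the chunks whose targets are missing.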
Luigi notifications
• luigi.notifications
• Built-in support for email notifications:
– SMTP
– Sendgrid
– Amazon SES / Amazon SNS
• Side projects for other channels:
– Slack (https://github.com/bonzanini/luigi-slack)
– …
About … Flo!
Flo is the first period & ovulation tracker that uses neural networks*.
* OWHEALTH, INC. is the first company to publicly announce using neural networks for
menstrual cycle analysis and prediction.
• Top-level App in Apple Store and Google Play
• More than 6.5 million registered users
• More than 17.5 million tracked cycles
• Integration with wearable devices
• A lot of (partially) structured information
• Quite a lot of work with data & machine learning
• And even more!
• About 450 GB of useful information:
– Cycle length history
– Ovulation and pregnancy test results
– User profile data (Age, Height, Weight, …)
– Manually tracked events (Symptoms, Mood, …)
– Lifestyle statistics (Sleep, Activity, Nutrition, …)
– Biometric data (Heart rate, Basal temperature, …)
– Textual data
– …
• Periodic model updates
Key points
• Base class for all models (sklearn-like interface)
• Shared code base for data and feature extraction
during training and prediction phases
• Currently 450+ features extracted for each cycle
• Using individual-level submodel predictions (weak
predictors) as features for network input (strong
predictor)
• Semi-automatic model updates
• Model unit testing before deployment
• In practice, heuristics are combined with machine
learning
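Two of the key points above (a common base class with an sklearn-like interface, and model unit testing before deployment) might look roughly like this. Everything here is hypothetical: the class names, the mean-based weak predictor, the sample data and the error threshold are illustrations, not Flo's actual models or tests.

```python
from abc import ABC, abstractmethod
from statistics import mean


class BaseModel(ABC):
    """Hypothetical base class with an sklearn-like fit / predict interface."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


class MeanLengthModel(BaseModel):
    """Weak predictor: always predicts the mean cycle length seen in training."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def check_before_deployment(model, X, y, max_abs_error):
    """Pre-deployment check: output length and a worst-case error bound."""
    predictions = model.predict(X)
    if len(predictions) != len(X):
        return False
    return all(abs(p - t) <= max_abs_error for p, t in zip(predictions, y))


X_train, y_train = [[28, 30], [29, 27], [31, 28]], [29, 28, 30]
model = MeanLengthModel().fit(X_train, y_train)
ok = check_before_deployment(model, X_train, y_train, max_abs_error=2)
too_strict = check_before_deployment(model, X_train, y_train, max_abs_error=0.5)
```

A shared interface lets the same check run against every model, weak or strong, before it is deployed by hand.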
Model update in Flo =
• (Me) Trigger pipeline execution from CLI
• (Luigi) Executes ETL tasks (on live Postgres replica)
• (Luigi) Persists raw data on disk (data freeze)
• (Luigi) Executes feature extraction tasks
• (Luigi) Persists dataset on disk
• (Luigi) Executes Neural Network fitting task
• (Tensorflow) A lot of operations with tensors
• (Me) Monitoring with TensorBoard and Luigi Web Interface
• (Me) Working on other tasks, reading Slack notifications
• (Me) Deploying model by hand (after unit testing)
• (Luigi, Me) Looking after model accuracy in production
Triggering pipeline
1. Class of model:
• Provides basic architecture of network
• Has predefined set of hyperparameters
2. Model build parameters:
• Sizes of some named layers
• Weight decay amount (L2 regularization technique)
• Dropout regularization amount
• Or whatever else is needed to compile the TensorFlow / Theano computation graph
3. Model fit parameters:
• Number of fitting epochs
• Mini-batch size
• Learning rate
• Specific paths to store intermediate results
4. Data extraction parameters:
• Date of data freeze (raw data on disk)
• Segment of users for which we want to fit the model
• Many others (default values used)
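One way to hold the four groups of parameters above is a set of immutable dataclasses plus a deterministic run identifier derived from them. This is only a sketch: every field name and default value below (layer sizes, epochs, the freeze date, the `CycleNet` class name) is an assumption for illustration and does not come from the talk.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class BuildParams:
    layer_sizes: tuple = (64, 32)  # hypothetical sizes of named layers
    weight_decay: float = 1e-4     # L2 regularization amount
    dropout: float = 0.5


@dataclass(frozen=True)
class FitParams:
    epochs: int = 20
    batch_size: int = 128
    learning_rate: float = 1e-3


@dataclass(frozen=True)
class DataParams:
    freeze_date: str = "2017-01-01"  # date of the raw data freeze
    user_segment: str = "all"


@dataclass(frozen=True)
class RunConfig:
    model_class: str = "CycleNet"  # hypothetical model class name
    build: BuildParams = field(default_factory=BuildParams)
    fit: FitParams = field(default_factory=FitParams)
    data: DataParams = field(default_factory=DataParams)

    def run_id(self):
        # Same parameters always give the same id, so results can be found again.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


default_run = RunConfig()
other_run = RunConfig(fit=FitParams(epochs=40))
```

Using the identifier in output paths ties every stored dataset and model to the exact parameters that produced it.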
Model update in Flo: DAG
Fit network → Extract features → Fit submodels → Extract Raw Data → Train / Test split
Model update in Flo: Dashboard
Track DAG execution status in Luigi scheduler web interface:
Model update in Flo: TensorBoard
Track model fitting progress in TensorBoard:
Model update in Flo: Notifications
• Everything is OK:
• Some trouble with connection:
• Do I need to update the model?
Conclusion
Reproducibility and automation are about:
1. Process design (conceptual aspect)
– Think not only about experiments, but about further
integration too
– Known best practices
2. Process realization (technical aspect)
– Build a solid data science environment
– Search for convenient instruments (Luigi seems like a
good starting point)
– Make your pipelines simple and easily extensible
– Do everything you can to make your pipelines trustworthy
– Monitoring is an important aspect
I hope you’ve enjoyed it!
Questions, please.

Más contenido relacionado

Destacado

.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014Mark Tabladillo
 
Is Machine learning for your business? - Girls in Tech Luxembourg
Is Machine learning for your business? - Girls in Tech LuxembourgIs Machine learning for your business? - Girls in Tech Luxembourg
Is Machine learning for your business? - Girls in Tech LuxembourgMarie-Adélaïde Gervis
 
Directions towards a cool consumer review platform using machine learning (ml...
Directions towards a cool consumer review platform using machine learning (ml...Directions towards a cool consumer review platform using machine learning (ml...
Directions towards a cool consumer review platform using machine learning (ml...Dhwaj Raj
 
Requirements for next generation of Cloud Computing: Case study with multiple...
Requirements for next generation of Cloud Computing: Case study with multiple...Requirements for next generation of Cloud Computing: Case study with multiple...
Requirements for next generation of Cloud Computing: Case study with multiple...David Lary
 
Technical Area: Machine Learning and Pattern Recognition
Technical Area: Machine Learning and Pattern RecognitionTechnical Area: Machine Learning and Pattern Recognition
Technical Area: Machine Learning and Pattern Recognitionbutest
 
Machine Learning part 2 - Introduction to Data Science
Machine Learning part 2 -  Introduction to Data Science Machine Learning part 2 -  Introduction to Data Science
Machine Learning part 2 - Introduction to Data Science Frank Kienle
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 

Destacado (7)

.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014
 
Is Machine learning for your business? - Girls in Tech Luxembourg
Is Machine learning for your business? - Girls in Tech LuxembourgIs Machine learning for your business? - Girls in Tech Luxembourg
Is Machine learning for your business? - Girls in Tech Luxembourg
 
Directions towards a cool consumer review platform using machine learning (ml...
Directions towards a cool consumer review platform using machine learning (ml...Directions towards a cool consumer review platform using machine learning (ml...
Directions towards a cool consumer review platform using machine learning (ml...
 
Requirements for next generation of Cloud Computing: Case study with multiple...
Requirements for next generation of Cloud Computing: Case study with multiple...Requirements for next generation of Cloud Computing: Case study with multiple...
Requirements for next generation of Cloud Computing: Case study with multiple...
 
Technical Area: Machine Learning and Pattern Recognition
Technical Area: Machine Learning and Pattern RecognitionTechnical Area: Machine Learning and Pattern Recognition
Technical Area: Machine Learning and Pattern Recognition
 
Machine Learning part 2 - Introduction to Data Science
Machine Learning part 2 -  Introduction to Data Science Machine Learning part 2 -  Introduction to Data Science
Machine Learning part 2 - Introduction to Data Science
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar a Reproducibility and automation of machine learning process

Reproducible research concepts and tools
Reproducible research concepts and toolsReproducible research concepts and tools
Reproducible research concepts and toolsC. Tobin Magle
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Yury Leonychev
 
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning StudioIntroduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning StudioMuralidharan Deenathayalan
 
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning StudioIntroduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning StudioMuralidharan Deenathayalan
 
Apresentação - Minicurso de Introdução a Python, Data Science e Machine Learning
Apresentação - Minicurso de Introdução a Python, Data Science e Machine LearningApresentação - Minicurso de Introdução a Python, Data Science e Machine Learning
Apresentação - Minicurso de Introdução a Python, Data Science e Machine LearningArthur Emanuel
 
Reproducibility in artificial intelligence
Reproducibility in artificial intelligenceReproducibility in artificial intelligence
Reproducibility in artificial intelligenceCarlos Toxtli
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible ResearchC. Tobin Magle
 
PyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsPyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsGraham Dumpleton
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Luciano Resende
 
01 Metasploit kung fu introduction
01 Metasploit kung fu introduction01 Metasploit kung fu introduction
01 Metasploit kung fu introductionMostafa Abdel-sallam
 
Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015Mirco Hering
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Practical automation for beginners
Practical automation for beginnersPractical automation for beginners
Practical automation for beginnersSeoweon Yoo
 

Similar a Reproducibility and automation of machine learning process (20)

Reproducible research concepts and tools
Reproducible research concepts and toolsReproducible research concepts and tools
Reproducible research concepts and tools
 
Python ml
Python mlPython ml
Python ml
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
 
Django
DjangoDjango
Django
 
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning StudioIntroduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
 
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning StudioIntroduction to Jupyter notebook and MS Azure Machine Learning Studio
Introduction to Jupyter notebook and MS Azure Machine Learning Studio
 
Apresentação - Minicurso de Introdução a Python, Data Science e Machine Learning
Apresentação - Minicurso de Introdução a Python, Data Science e Machine LearningApresentação - Minicurso de Introdução a Python, Data Science e Machine Learning
Apresentação - Minicurso de Introdução a Python, Data Science e Machine Learning
 
Introduction to Google Colaboratory.pdf
Introduction to Google Colaboratory.pdfIntroduction to Google Colaboratory.pdf
Introduction to Google Colaboratory.pdf
 
Reproducibility in artificial intelligence
Reproducibility in artificial intelligenceReproducibility in artificial intelligence
Reproducibility in artificial intelligence
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible Research
 
PyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsPyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web Applications
 
Top 10 python ide
Top 10 python ideTop 10 python ide
Top 10 python ide
 
Node.js
Node.jsNode.js
Node.js
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
GDSC Cloud Jam.pptx
GDSC Cloud Jam.pptxGDSC Cloud Jam.pptx
GDSC Cloud Jam.pptx
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
 
01 Metasploit kung fu introduction
01 Metasploit kung fu introduction01 Metasploit kung fu introduction
01 Metasploit kung fu introduction
 
Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Practical automation for beginners
Practical automation for beginnersPractical automation for beginners
Practical automation for beginners
 

Más de Denis Dus

Probabilistic modeling in deep learning
Probabilistic modeling in deep learningProbabilistic modeling in deep learning
Probabilistic modeling in deep learningDenis Dus
 
Generative modeling with Convolutional Neural Networks
Generative modeling with Convolutional Neural NetworksGenerative modeling with Convolutional Neural Networks
Generative modeling with Convolutional Neural NetworksDenis Dus
 
Sequence prediction with TensorFlow
Sequence prediction with TensorFlowSequence prediction with TensorFlow
Sequence prediction with TensorFlowDenis Dus
 
word2vec (часть 2)
word2vec (часть 2)word2vec (часть 2)
word2vec (часть 2)Denis Dus
 
word2vec (part 1)
word2vec (part 1)word2vec (part 1)
word2vec (part 1)Denis Dus
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraDenis Dus
 

Más de Denis Dus (6)

Probabilistic modeling in deep learning
Probabilistic modeling in deep learningProbabilistic modeling in deep learning
Probabilistic modeling in deep learning
 
Generative modeling with Convolutional Neural Networks
Generative modeling with Convolutional Neural NetworksGenerative modeling with Convolutional Neural Networks
Generative modeling with Convolutional Neural Networks
 
Sequence prediction with TensorFlow
Sequence prediction with TensorFlowSequence prediction with TensorFlow
Sequence prediction with TensorFlow
 
word2vec (часть 2)
word2vec (часть 2)word2vec (часть 2)
word2vec (часть 2)
 
word2vec (part 1)
word2vec (part 1)word2vec (part 1)
word2vec (part 1)
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
 

Último

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Último (20)

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

Reproducibility and automation of machine learning process

  • 1. Reproducibility (and automation) of Machine Learning process Dzianis Dus dzianisdus@gmail.com Data Scientist at InData Labs
  • 2. What is this talk about? 1. Data mining / Machine learning process 2. Workflow automation 3. Basic design concepts 4. Data pipelines 5. Available instruments 6. About my own experience
  • 3. Process overview 1. Data Engineering – 80% – Data extraction – Data cleaning – Data transformation – Data normalization – Feature extraction 2. Machine Learning – 20% – Model fitting – Hyperparameter tuning – Model evaluation CRISP-DM
  • 4. Why automation? 1. You want to update models on a regular basis 2. Make your data workflows more trustworthy 3. You can perform a data freeze (possibly) 4. A step to (more) reproducible experiments 5. Write once and enjoy every day
  • 5. How: Conceptual requirements 1. Reuse code between training and evaluation phases (as much as possible) 2. It's easier to log features than to extract them from data retrospectively (if you can) 3. A solid environment is more important for the first iteration than the quality of your model 4. Better to use the same language everywhere (integration becomes much easier) 5. Every model requires support after deployment 6. You’d better know the rules of the game…
  • 6. Feel free to download from author’s personal web page: http://martin.zinkevich.org/rules_of_ml/
  • 8. How: Technical requirements 1. Simple way to define DAGs of batch tasks 2. Task parameterization 3. Ability to store intermediate results (checkpointing) 4. Task dependency resolution 5. Automatic failure handling 6. Logging, notifications 7. Execution state monitoring 8. Python-based solution (we are at PyCon)
  • 9. https://github.com/pinterest/pinball Pinball (Pinterest) 1. Nice UI 2. Dynamic pipeline generation 3. Pipeline configuration in Python code (?) 4. Parameterization through shipping Python dicts (?) 5. In fact, not documented 6. Seems like no other big players use this
  • 10. https://github.com/apache/incubator-airflow Airflow (AirBnB, Apache Incubator) 1. Very nice UI 2. Dynamic pipeline generation 3. Orchestration through message queue 4. Code shipping 5. Scheduler spawns workers 6. Pipeline configuration in Python code 7. Parameterization through task templates using Jinja (Hmm…) 8. In my opinion, not as elegant as the documentation suggests
  • 11. https://github.com/spotify/luigi Luigi (Spotify, Foursquare) 1. Simple UI 2. Dynamic pipeline generation 3. Orchestration through central scheduling (no external components) 4. No code shipping 5. No scheduler 6. Pipeline configuration in Python code (very elegant!) 7. Parameterization through Parameters 8. Simple, well-tested 9. Good documentation
  • 13. Luigi … … is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in. … helps you stitch many tasks together, where each task can be a Hive query, a Hadoop job in Java, a Spark job in Scala or Python, a Python snippet, dumping a table from a database, or anything else…
  • 14. Luigi facts 1. Inspired by GNU Make 2. Everything in Luigi is in Python 3. Extremely simple (has only three main classes: Target, Task, Parameter) 4. Each task must consume some input data and may produce some output 5. Based on the assumption of atomic writes
  • 15. Luigi facts 1. Has no built-in time-based scheduler (use crontab or run manually from the CLI) 2. You cannot trigger any tasks from the UI (it's only for monitoring purposes) 3. Master takes only the orchestration role 4. Master does not ship your code to workers
  • 16. Luigi fundamentals Target corresponds to: • file on local FS • file on HDFS • entry in DB • any other kind of a checkpoint Task: • this is where execution takes place • consumes Targets that were created by other Tasks • usually also outputs a Target • could depend on one or more other Tasks • could have Parameters
  • 17. Luigi Targets • Have to implement the exists method • Writes must be atomic • Luigi comes with a toolbox of useful Targets: luigi.LocalTarget('/home/path/to/some/file/'), luigi.contrib.hdfs.HdfsTarget('/reports/%Y-%m-%d'), luigi.postgres.PostgresTarget(…), luigi.contrib.mysqldb.MySqlTarget(…), luigi.contrib.ftp.RemoteTarget(…) … and many others … • Built-in formats (GzipFormat is useful)
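
The two Target requirements on this slide (an existence check and atomic writes) can be illustrated without Luigi itself. The following is a minimal standard-library sketch, not Luigi's implementation: the class name `AtomicFileTarget` and its `write_text` method are hypothetical. It writes to a temporary file in the same directory and renames it into place, so a reader should never observe a half-written checkpoint.

```python
import os
import tempfile


class AtomicFileTarget:
    """Hypothetical file checkpoint: exists() plus an atomic write."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        # The target counts as "done" only once the final path is present.
        return os.path.exists(self.path)

    def write_text(self, text):
        directory = os.path.dirname(os.path.abspath(self.path))
        os.makedirs(directory, exist_ok=True)
        # Write to a temp file in the same directory, then rename it.
        # os.replace is atomic when both paths are on the same filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write(text)
            os.replace(tmp_path, self.path)
        except BaseException:
            # Remove the partial file; the final path is left untouched.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
```

If the process dies before `os.replace`, `exists()` keeps returning False, so a rerun would simply redo the work instead of consuming a truncated file.
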
  • 18. Luigi Tasks • Main methods: run(), output(), requires() • Write your code in run() • Define your Target in output() • Define dependencies using requires() • Task is complete() if output Target exists()
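
The Task contract on this slide (requires(), output(), run(), and completeness defined by the existence of the output) can be mimicked in a few lines of plain Python. This is a toy sketch of the Make-like idea, not Luigi's API: the classes `FileTask`, `Numbers`, `Total` and the `build` helper are hypothetical, and real Luigi adds scheduling, parameters, and atomic targets on top.

```python
import os


class FileTask:
    """Toy task: complete when its output file exists."""

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self):
        raise NotImplementedError  # path of the checkpoint file

    def run(self):
        raise NotImplementedError  # the actual work

    def complete(self):
        return os.path.exists(self.output())


def build(task, log):
    """Run missing dependencies first, then the task itself."""
    if task.complete():
        return  # checkpoint found, nothing to do
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


class Numbers(FileTask):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return os.path.join(self.workdir, "numbers.txt")

    def run(self):
        with open(self.output(), "w") as handle:
            handle.write("1\n2\n3\n")


class Total(FileTask):
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [Numbers(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "total.txt")

    def run(self):
        with open(self.requires()[0].output()) as handle:
            total = sum(int(line) for line in handle)
        with open(self.output(), "w") as handle:
            handle.write(str(total))
```

Calling `build(Total(workdir), log)` on an empty directory runs `Numbers` and then `Total`; calling it a second time does nothing, because both output files already exist.
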
  • 19. Luigi Parameters • Task that runs a Hadoop job every night? • Luigi provides a lot of them: luigi.parameter.Parameter, luigi.parameter.DateParameter, luigi.parameter.IntParameter, luigi.parameter.EnumParameter, luigi.parameter.ListParameter, luigi.parameter.DictParameter … and so on … • And they are parsed from the CLI automatically!
  • 20. Execute from CLI: $ luigi MyTask --module your.cool.module --param 999
  • 21. Central scheduling • Luigi central scheduler (luigid) – Doesn’t do any data processing – Doesn’t execute any tasks – Synchronizes workers – Resolves task dependencies – Prevents the same task from running multiple times – Provides administrative web interface – Retries in case of failures – Sends notifications (emails only) • Luigi worker (luigi) – Starts via cron / by hand – Connects to central scheduler – Defines tasks for execution – Waits for permission to execute Task.run() – Processes data, populates Targets
  • 23. Execution model Simplified process: 1. Some workers started 2. Each submits DAG of Tasks 3. Recursive check of Tasks completion 4. Worker receives Task to execute 5. Data processing! 6. Repeat Client-server API: 1. add_task(task_id, worker_id, status) 2. get_work(worker_id) 3. ping(worker_id) http://www.arashrouhani.com/luigid-basics-jun-2015/
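
The client-server API listed on this slide can be made concrete with a toy in-memory scheduler. This is only a sketch of the idea under stated assumptions, not luigid's code: the class `ToyScheduler`, the status strings, and the extra `deps` argument are hypothetical additions for illustration. Like the central scheduler described above, it executes nothing itself; it only decides which task a worker may run next.

```python
import time


class ToyScheduler:
    """In-memory stand-in for a central scheduler. It only tracks state."""

    def __init__(self):
        self.tasks = {}      # task_id -> {"status", "deps", "worker"}
        self.last_seen = {}  # worker_id -> time of the last ping

    def ping(self, worker_id):
        self.last_seen[worker_id] = time.time()

    def add_task(self, task_id, worker_id, status="PENDING", deps=()):
        self.ping(worker_id)
        if task_id not in self.tasks:
            self.tasks[task_id] = {
                "status": status, "deps": set(deps), "worker": None,
            }
            return
        task = self.tasks[task_id]
        task["deps"].update(deps)
        # A re-submission as PENDING must not reset a RUNNING or DONE task.
        if status != "PENDING":
            task["status"] = status

    def get_work(self, worker_id):
        self.ping(worker_id)
        for task_id, task in self.tasks.items():
            deps_done = all(
                self.tasks.get(dep, {}).get("status") == "DONE"
                for dep in task["deps"]
            )
            if task["status"] == "PENDING" and deps_done:
                # Mark as RUNNING so no other worker gets the same task.
                task["status"] = "RUNNING"
                task["worker"] = worker_id
                return task_id
        return None  # nothing runnable right now
```

Two workers that submit the same DAG therefore never receive the same task: the first `get_work` call claims it, and the second worker gets the next runnable task or `None`.
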
  • 24. Tasks dependencies • Using requires() method • yielding at runtime!
  • 25. Easy parallelization recipe 1. Do not use multiprocessing inside a Task 2. Split a huge Task into smaller ones and yield them inside the run() method 3. Run luigi with the --workers N parameter 4. Make a separate job to combine all the Targets (if you want) 5. This also helps to minimize possible data loss in case of failures (atomic writes)
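
The recipe on this slide can be sketched with the standard library alone. This is not Luigi code: the helpers `chunk_path`, `process_chunk`, and `run_all` are hypothetical, and a thread pool merely stands in for running several workers. Each small task writes its own output atomically, a separate step combines the outputs, and a rerun after a failure only redoes the chunks whose output is missing.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_path(workdir, index):
    return os.path.join(workdir, f"chunk-{index:03d}.txt")


def process_chunk(workdir, index, numbers):
    """One small task: skip if its output exists, else write it atomically."""
    path = chunk_path(workdir, index)
    if os.path.exists(path):
        return False  # finished in an earlier run
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write(str(sum(n * n for n in numbers)))
    os.replace(tmp_path, path)  # publish the result in one step
    return True


def run_all(workdir, data, chunk_size, workers):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # The pool plays the role of N workers; each task stays single-threaded.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ran = list(pool.map(
            lambda item: process_chunk(workdir, item[0], item[1]),
            enumerate(chunks),
        ))
    # Separate combining step that reads every chunk output.
    total = 0
    for index in range(len(chunks)):
        with open(chunk_path(workdir, index)) as handle:
            total += int(handle.read())
    return total, sum(ran)
```

`run_all` returns the combined result and the number of chunks that actually ran, so deleting one chunk file and calling it again recomputes just that chunk.
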
  • 26. Luigi notifications • luigi.notifications • Built-in support for email notifications: – SMTP – Sendgrid – Amazon SES / Amazon SNS • Side projects for other channels: – Slack (https://github.com/bonzanini/luigi-slack) – …
  • 28. Flo is the first period & ovulation tracker that uses neural networks*. * OWHEALTH, INC. is the first company to publicly announce using neural networks for menstrual cycle analysis and prediction.
  • 29. • Top-level App in Apple Store and Google Play • More than 6.5 million registered users • More than 17.5 million tracked cycles • Integration with wearable devices • A lot of (partially) structured information • Quite a lot of work with data & machine learning • And even more!
  • 30. • About 450 GB of useful information: – Cycle length history – Ovulation and pregnancy test results – User profile data (Age, Height, Weight, …) – Manually tracked events (Symptoms, Mood, …) – Lifestyle statistics (Sleep, Activity, Nutrition, …) – Biometrics data (Heart rate, Basal temperature, …) – Textual data – … • Periodic model updates
  • 31. Key points • Base class for all models (sklearn-like interface) • Shared code base for data and features extraction during training and prediction phases • Currently 450+ features extracted for each cycle • Using individual-level submodels predictions (weak predictors) as features for network input (strong predictor) • Semi-automatic model updates • Model unit testing before deployment • In practice heuristics combined with machine learning
  • 32. Model update in Flo = • (Me) Trigger pipeline execution from CLI • (Luigi) Executes ETL tasks (on live Postgres replica) • (Luigi) Persists raw data on disk (data freeze) • (Luigi) Executes feature extraction tasks • (Luigi) Persists dataset on disk • (Luigi) Executes Neural Network fitting task • (TensorFlow) A lot of operations with tensors • (Me) Monitoring with TensorBoard and Luigi Web Interface • (Me) Working on other tasks, reading Slack notifications • (Me) Deploying model by hand (after unit testing) • (Luigi, Me) Looking after model accuracy in production
  • 33. Triggering pipeline 1. Class of model: • Provides basic architecture of network • Has predefined set of hyperparameters 2. Model build parameters: • Sizes of some named layers • Weight decay amount (L2 regularization technique) • Dropout regularization amount • Or whatever else is needed to compile the TensorFlow / Theano computation graph 3. Model fit parameters: • Number of fitting epochs • Mini-batch size • Learning rate • Specific paths to store intermediate results 4. Data extraction parameters: • Date of data freeze (raw data on disk) • Segment of users for which we want to fit the model • Many others (default values used)
  • 34. Model update in Flo: DAG (figure; node labels: Fit network, Extract features, Fit submodels, Extract Raw Data, Train / Test split)
  • 35. Model update in Flo: Dashboard Track DAG execution status in Luigi scheduler web interface:
  • 36. Model update in Flo: Tensorboard Track model fitting progress in Tensorboard:
  • 37. Model update in Flo: Notifications • Everything is OK: • Some trouble with connection: • Do I need to update the model?
  • 38. Conclusion Reproducibility and automation are about: 1. Process design (conceptual aspect) – Think not only about experiments, but about further integration too – Known best practices 2. Process realization (technical aspect) – Build a solid data science environment – Search for convenient instruments (Luigi seems like a good starting point) – Make your pipelines simple and easily extensible – Do everything you can to make your pipelines trustworthy – Monitoring is an important aspect
  • 39. I hope you’ve enjoyed it! Questions, please.