RAPIDS 2018 - Machine Learning in Production - headstart.io

Headstart takes reproducibility very seriously. Our system needs to be fully auditable: our “match score” is a crucial element for candidate selection. At any point in time we need to be able to:
- Access the models that were being used in production when the match score was computed;
- Examine their code (including all upstream ETL/preprocessing pipelines);
- Examine the data they were trained on;
- Deserialize the models and run diagnostics/tests on them.

To support our requirements, we developed our own internal model versioning system using Git, Docker, CircleCI, AWS S3 and Pipenv. This presentation will share the design, implementation and functionality of our versioning system, with a detailed walkthrough using our skill recommendation engine as a streamlined running example.

A practical approach to continuous deployment of Machine Learning pipelines
Machine Learning in Production
Christos Dimitroulas
Fullstack developer & DevOps
Luca Palmieri
Data Scientist & Machine Learning
#rapids2018

Production
Operational requirements
● Robustness
● Availability
● Monitoring
● Debugging & auditability
=> Reproducibility!

Reproducibility
Ingredients
● Software
  ● Code
  ● Environment
● Data

Reproducibility
Data
● Training
  ● Input
    ● Serialized dataset
    ● Database queries
  ● Output
    ● Serialized models
    ● Serialized transformers
● Inference...

Use case
Skill recommendation
● Training
  ● Input
    ● Users’ skills (MongoDB)
  ● Output
    ● Serialized SVD model
    ● Serialized transformers
● Automated retraining
● Automated deployment on changes
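
To illustrate what a "serialized transformer" might look like, here is a minimal sketch, not Headstart's implementation. The `SkillIndexer` class, its methods, and the JSON serialization format are all hypothetical; the slides do not say how the real transformers are built or stored. The point is only that a restored transformer must behave exactly like the one that was fitted during training.

```python
import json


class SkillIndexer:
    """Hypothetical transformer: maps skill names to integer column indices."""

    def __init__(self, vocabulary=None):
        self.vocabulary = dict(vocabulary or {})

    def fit(self, users_skills):
        # Build a stable (sorted) vocabulary from every user's skill list.
        skills = sorted({skill for skills in users_skills for skill in skills})
        self.vocabulary = {skill: index for index, skill in enumerate(skills)}
        return self

    def transform(self, skills):
        # Skills unseen during training are dropped rather than raising.
        return [self.vocabulary[s] for s in skills if s in self.vocabulary]

    def dumps(self) -> str:
        # Serialize the fitted state so it can be stored next to the model.
        return json.dumps({"vocabulary": self.vocabulary}, sort_keys=True)

    @classmethod
    def loads(cls, payload: str) -> "SkillIndexer":
        return cls(json.loads(payload)["vocabulary"])


training_data = [["python", "sql"], ["docker", "python"]]
indexer = SkillIndexer().fit(training_data)

# Round trip: the restored transformer should reproduce the fitted one.
restored = SkillIndexer.loads(indexer.dumps())
print(restored.transform(["python", "kubernetes", "docker"]))
```

Storing the fitted state in a plain, inspectable format is one possible design choice; it keeps the artifact readable when auditing an old model.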

Automation
Data Science Culture
● Code quality delivers value
● DataOps?
● Data scientist ownership needs to get closer to production

Requirements
● Strict dependency management
● Model performance analysis
● Model versioning
● Infrastructure (as code)
● Data versioning
● Continuous integration/delivery

Dependency Management
Lockfile! Managed virtualenv! Reproducible environments!
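
As a rough illustration of why a lockfile helps, the sketch below scans a lockfile-style JSON document and reports packages that are not pinned to an exact version. It assumes the layout used by Pipenv's `Pipfile.lock` (top-level `default` and `develop` sections whose entries carry a `version` string such as `==1.15.0`); the function name `unpinned_packages` and the example contents are hypothetical.

```python
import json


def unpinned_packages(lock_text: str) -> list[str]:
    """Return names of packages that lack an exact '==' version pin."""
    lock = json.loads(lock_text)
    missing = []
    for section in ("default", "develop"):
        for name, spec in lock.get(section, {}).items():
            # Entries without a version (e.g. VCS dependencies) also count as unpinned.
            if not spec.get("version", "").startswith("=="):
                missing.append(name)
    return sorted(missing)


# Hypothetical lockfile contents, for illustration only.
example_lock = json.dumps({
    "_meta": {"requires": {"python_version": "3.6"}},
    "default": {
        "numpy": {"version": "==1.15.0"},
        "scipy": {"version": ">=1.0"},
    },
    "develop": {"pytest": {"version": "==3.7.1"}},
})

print(unpinned_packages(example_lock))
```

A check like this could run as a CI step so that a loosely pinned dependency fails the build before a training job starts.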

Model Performance Analysis
1. Does the model meet the basic performance requirements?
2. How does the performance compare to the latest model in production?
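
The two questions above can be sketched as a single promotion gate. This is an assumption-laden example rather than the authors' code: `should_promote`, `minimum_score` and `tolerance` are hypothetical names, and the metric is assumed to be "higher is better".

```python
def should_promote(candidate_score: float,
                   production_score: float,
                   minimum_score: float,
                   tolerance: float = 0.0) -> bool:
    """Decide whether a newly trained model may replace the production one."""
    # Question 1: does the candidate meet the basic performance requirement?
    meets_baseline = candidate_score >= minimum_score
    # Question 2: is it at least as good as production, within a tolerance?
    not_worse_than_production = candidate_score >= production_score - tolerance
    return meets_baseline and not_worse_than_production


# Candidate clears the floor and beats production.
print(should_promote(0.82, 0.80, minimum_score=0.75))
# Candidate clears the floor but is worse than production.
print(should_promote(0.78, 0.80, minimum_score=0.75))
# Same scores, but a small regression is tolerated.
print(should_promote(0.78, 0.80, minimum_score=0.75, tolerance=0.05))
```

The tolerance parameter is one way to avoid rejecting a retrained model over noise-level differences; whether to allow any regression at all is a policy decision.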

Model Versioning
Two levels of versioning:
1. Which version of the code
produced this model?
2. Which data produced this model?
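
One way to record both levels of versioning is to attach a code identifier and a data identifier to every stored model. The sketch below is a hypothetical illustration: the `ModelVersion` class, the storage prefix layout, and the example commit and snapshot values are assumptions, not the scheme described in the talk.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelVersion:
    code_commit: str    # Git commit SHA of the training code
    data_snapshot: str  # identifier of the data snapshot used for training

    def storage_prefix(self, model_name: str) -> str:
        # Hypothetical layout: one prefix per (code version, data version) pair.
        return f"{model_name}/code-{self.code_commit[:7]}/data-{self.data_snapshot}/"

    def to_metadata(self) -> str:
        # JSON document that could be stored next to the serialized model.
        return json.dumps(asdict(self), sort_keys=True)


# Made-up values, for illustration only.
version = ModelVersion(
    code_commit="0123456789abcdef0123456789abcdef01234567",
    data_snapshot="2018-06-01",
)
print(version.storage_prefix("skill-recommender"))
print(version.to_metadata())
```

With both identifiers stored alongside the artifact, answering "which code and which data produced this model?" becomes a lookup rather than an investigation.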

Infrastructure
● Needs to scale (ideally automatically)
● Should be easy to maintain, change or upgrade
● Must be easily reproducible (for other environments, e.g. staging)
● Should make it easier for other team members to develop infrastructure

Data Versioning
Depends largely on how the data input for the model is created.
In our use case, input comes from a database query.
● MongoDB automated backups (snapshots)
● Save relevant snapshots alongside models

Continuous Integration tool
● Trigger a pipeline from code changes
● Schedule jobs
● Run tests & scripts

Continuous Delivery
[Pipeline diagram; labels as extracted from the slide]
● serve-skill-recommender-system
● Build training container
● skill-recommender-system
● Private PyPI server
● Train new models
● Assess models
● Build serving API container
● Upload models

● S3 is “primitive” - where is git diff for data?
● Painful transition: prototyping is chaos, production is order, the sancta sanctorum
● Generalize our process to a proper internal tool
