SlideShare una empresa de Scribd logo
1 de 38
DevOps and Machine Learning
Jasjeet Thind
VP, Data Science & Engineering, Zillow Group
@JasjeetThind
Agenda
Overview of Zillow Group (ZG)
Machine Learning (ML) at ZG
Architecture
DevOps for ML
@JasjeetThind
Zillow Group Composed of 9 Brands
Build the world's largest, most trusted and vibrant home-related marketplace.
@JasjeetThind
RealEstate.com
Machine Learning at ZG
Personalization
Ad Targeting
Zestimate (AVMs)
Premier Agent (B2B)
Mortgages
Deep Learning
Demographics & Community
Business Analytics
Forecasting home price trends
@JasjeetThind
High Level Architecture
@JasjeetThind
DATA LAKE MACHINE LEARNING
ML WEB SERVICES
DATA SCIENCE
ANALYTICS
DATA QUALITYDATA COLLECTION
DevOps & ML at ZG
Tight collaboration, fast iterations, and lots of automation.
CONTINUOUS INTEGRATION (CI) CONTINUOUS DELIVERY (CD)
@JasjeetThind
Research Dev Test Ship ML Workflow
Real-time
ML
Research
Data scientists prototype features and models to solve business problems
Models
● Random Forest
● Gradient Boosted Machines
● CNN / Deep Learning
● NLP / TF-IDF / Word2vec / Bag of Words
@JasjeetThind
● Linear Regression
● K-means clustering
● Collaborative Filtering
DevOps Challenges in Research
(1) Data scientists iterate through lots of prototype code, models, features and datasets
that never get into production
(2) How do data scientists validate experiments on production system?
@JasjeetThind
How to apply DevOps to experiments?
Check-in prototypes
● Organize around research area
● Document heavily. Experiments often revisited
● Check-in subsets of training & scoring datasets
For large training datasets, store in the data lake
● Pointers to data in README.md
@JasjeetThind
DevOps enables experiments in production
Build tools for scientists to package and deploy ML models and features into
production A/B tests
● Once submitted via the tool the model goes through the same release / validation
pipeline as ML production models go through.
Real world prototyping can be done before production quality ML code.
@JasjeetThind
DEV
ML engineers operationalize features and models for scale.
Automatically build and upload artifacts
ML
Developer
Machine
Git Jenkins
- build
- unit tests
- upload artifacts
AWS EMR
Build
Cluster
Local
Artifact
Repository
AWS Cloud
Repository
(S3)
- AWS S3- Squid proxy
Upload
Artifact
GIT
push
Trigger
Check-in unit tests,
source code
Run unit tests to test
machine learning
models
@JasjeetThind
DevOps ML Challenges in Dev
(1) Unit tests non-trivial as model outputs not deterministic
output = f (inputs)
(2) CI requires fast builds. But accurate predictions require large training data sets, and
training models is time consuming.
@JasjeetThind
Unit Tests w/ ML
Don’t test “real” model outputs.
● Leave that to integration /
regression tests
● Focus on code coverage, build
and unit tests are fast.
Train models w/ mock training data
● Test schema, type, # records,
expected values
Predictions on mock scoring data sets
// load mock small training dataset
training_data = read(“training_small.csv”)
// model trained on mock training data
model = train(training_data)
// schema test
assertHasColumn(model.trainingData, “zipcode”)
// type test
assertIsInstance(model.trainingData, Vector)
// # of records test
assertEqual(model.trainingData.size, 32)
// expected value test
assertEqual(model.trainingData[0].key, 207)
// load mock small scoring dataset
predictions = score(model, scoring_data)
@JasjeetThind
Mock training data facilitates fast builds
Model prediction accuracy tested in the Test phase
@JasjeetThind
TEST
After artifact uploaded by build process, deploy and auto-trigger its regression tests
DevOps
API
Jenkins
Deployment
ShipIt
- REST API to to
deploy packages
- Hosts config
Jenkins host that
deploys & saves history
Upload
new job
AWS EMR
Regression
Cluster
Tool that executes
deployments
Unified
Versioning
SystemPin
@JasjeetThind
Kick off regression tests after deployment
DevOps ML Challenges in Test
(1) Training and scoring models can take many hours or more
(2) Models have very large memory footprint
(3) Regression tests non-trivial as model outputs not deterministic
(4) Real time model outputs vs that of backend systems
@JasjeetThind
How to train and score in Test?
For regression testing, create “golden” datasets, smaller subsets of training and scoring
data, for training and scoring models.
Makes training and scoring time much faster
@JasjeetThind
Reducing Model Memory Footprint in Test
Model dynamically generated using golden datasets
Model itself not in GIT, treated as binary
@JasjeetThind
Regression Tests and ML
Adjust with fuzz factor
● May need to run multiple
times and average results
Model validation tests to
ensure model metrics meet
minimum bar
● More on this later
// load “real” subset training data from data lake
training_data = read(“training_golden_dataset.csv”)
// model trained on golden training data
model = train(training_data)
// generate model output
scoring_data = read(“scoring_golden_dataset.csv”)
predictions = score(model, scoring_data)
// fuzz factor
id = 5
zestimate = predictions[id]
assertAlmostEqual(zestimate, 238000, delta=7140)
@JasjeetThind
Verify Real-time Matches Backend
Real-time machine learning services thousands / millions of concurrent requests
// get model output from web service API for scoring row
id = 5
prediction1 = getFromMLApi(id)
// get model output from batch pipeline for same scoring row
prediction2 = getFromBatchSystem(id)
// get model output from near real time pipeline for same scoring row
prediction3 = getFromNRTSystem(id)
// verify the values are equal
assertAlmostEqual(prediction1, prediction2, prediction3, delta=7140)
@JasjeetThind
SHIP
Automated deployment of ML code to production
DevOps
API
Jenkins
Deployment
ShipIt
Upload
new job
AWS EMR
Production
Cluster
@JasjeetThind
DevOps ML Challenges in Ship
(1) Deploy starts during model training and scoring
(2) Backward compatibility issues with code and models for real-time ML API
@JasjeetThind
4 Ways to Handle Deploy during Training
Terminate ML workflow and start the deploy
Wait for ML workflow to finish and start the deploy
Deploy at certain times (e.g. once per day at midnight)
Deploy to another cluster, shutdown the previous cluster
@JasjeetThind
Shipping real-time ML services
Complete successful run of batch pipeline.
● Then ship code and models to web services hosts.
● More on deployment to web services hosts later
@JasjeetThind
ML Workflow
Production ML backends generate models and predictions.
@JasjeetThind
Raw Data
Sources
Data
Collection
Generate
datasets
and
features
Model
Learning
Evaluate
Model
metrics
Scoring Deployment
DevOps Challenges w/ML Workflow
(1) Data quality given high velocity petabyte scale data continually being pushed into
the system
(2) Ensuring models aren’t stale as data continues to evolve
(3) Partial failures due to intermediate model training pipelines failing
(4) Model deployments don’t cause bad user experience
@JasjeetThind
Build Analytical Systems for Data Quality
Monitor
● what is expected data
● outlier detection
● missing data
● expected # of records
● latency
Reports / alerts that drive action
@JasjeetThind
Consume
Generate
Metrics
Models Data
Quality DB
Historical
Metrics
Write
Alert
for
Issues
Raw Data
Sources
Data Quality Architecture
Automate ML to Keep Models Fresh
Run continuously or frequently
Workflow engine
● Oozie, Airflow, Azkaban, SWF, etc
Logging, monitoring, and alerts throughout the pipeline
@JasjeetThind
Persist Intermediate Train Data for Failures
Breakup ML pipelines into appropriate abstraction layers
Leverage Spark which can recover from intermediate failures by replaying the DAG
@JasjeetThind
Model validation tests to verify quality
Don’t replace existing model if one or more metrics fail. Trigger alert instead.
Replace if all tests pass.
if (mean_squared_error(y_actual, y_pred) > 3) // mean squared error
alert()
else if (median_error(y_actual, y_pred) > 6) // median error
alert()
else if (auc(y_actual, y_pred)) // area under curve
alert()
else
deploy_models() // all model metrics passed so replace old models
@JasjeetThind
Real-time ML
Models need to be deployed into production
Models need to be executed per request with concurrent users under strict SLA
@JasjeetThind
DevOps ML Challenges w/ Real-time ML
(1) Deployment of models can take many seconds or minutes
(2) Latency to execute ML model for each request and return responses must meet
strict SLA (< 100 milliseconds)
(3) Need to ensure features at serving time same as backend systems
@JasjeetThind
Automate Deployment of Models
Replicate hosts
During deploy, take host offline, update models, put back into rotation, repeat for
replicated hosts.
For small models, do in-memory swap
@JasjeetThind
CPU is Bottleneck for Real-time ML
Real-time feature generation requires data serialization
● Minimize as much as possible
● If C++ code needs to prep data for ML in Python, consider re-writing Python code
into C++
Hyperparameters - balance latency and accuracy
● 1000 trees for RF that meets SLA might be a better business decision than 2000
trees with “slight” accuracy gain
@JasjeetThind
Features at serving should equal backend
Save feature set used at serving time and pipe into a log or queue
● Use features directly in training
● Or at least verify they match training sets
@JasjeetThind
Free Zillow Data
@JasjeetThind
Zillow Home Value Index (ZHVI)
• Top / Middle / Bottom Thirds
• Single Family / Condo / Co-op
• Median Home Value Per Sq Ft
Zillow Rent Index (ZRI)
• Multi-family / SFR / Condo / Co-op
• Median ZRI Per Sq ft
• Median Rent List Price
Other Metrics
• Median List Price
• Price-to-Rent ratio
• Homes Foreclosed
• For-sale Inventory / Age Inventory
• Negative Equity
• And many more…
Time Series: national, state, metro, county, city and ZIP
code levels
ZTRAX: Zillow Transaction and Assessment Dataset
Previously inaccessible or prohibitively expensive housing
data for academic and institutional researchers FOR FREE.
∙ More than 100 gigabytes
∙ 374 million detailed public records across more than
2,750 U.S. counties
∙ 20+ years of deed transfers, mortgages, foreclosures,
auctions, property tax delinquencies and more for
residential and commercial properties.
∙ Assessor data including property characteristics,
geographic information, and prior valuations on
approximately 200 million parcels in more than 3,100
counties.
Email ZTRAX@zillow.com for more information
Zillow.com/data
We’re Hiring!!!
Roles
● Principal DevOps Engineer
● Machine Learning Engineer
● Data Scientist
● Product Manager
● Data Engineer
@JasjeetThind
Related Blogs
● Zillow.com/data-science
● Trulia.com/blog/tech/
Zillow Prize - $1 million dollars
The brightest scientific minds have a unique opportunity to work on the algorithm that
changed the world of real estate - the Zestimate® home valuation algorithm.
www.zillow.com/promo/zillow-prize
@JasjeetThind

Más contenido relacionado

La actualidad más candente

Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architectureStepan Pushkarev
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Databricks
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingDatabricks
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOpsDatabricks
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataRobert Dempsey
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Databricks
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksDatabricks
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing ChallengeTEST Huddle
 
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Shirshanka Das
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Databricks
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Productioniguazio
 
The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi Databricks
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinDatabricks
 
Saving Energy in Homes with a Unified Approach to Data and AI
Saving Energy in Homes with a Unified Approach to Data and AISaving Energy in Homes with a Unified Approach to Data and AI
Saving Energy in Homes with a Unified Approach to Data and AIDatabricks
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsJohann Schleier-Smith
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scaleHenry Saputra
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformDatabricks
 

La actualidad más candente (20)

Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time Bidding
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOps
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of Data
 
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ...
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing Challenge
 
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
NextGenML
NextGenML NextGenML
NextGenML
 
The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi The Power of Unified Analytics with Ali Ghodsi
The Power of Unified Analytics with Ali Ghodsi
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Saving Energy in Homes with a Unified Approach to Data and AI
Saving Energy in Homes with a Unified Approach to Data and AISaving Energy in Homes with a Unified Approach to Data and AI
Saving Energy in Homes with a Unified Approach to Data and AI
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time Applications
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 

Similar a DevOps and Machine Learning Workflow at Zillow Group

Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Gülden Bilgütay
 
Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)Amazon Web Services
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenGoDataDriven
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Sri Ambati
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...All Things Open
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)dtz001
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018Adam Gibson
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Databricks
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
Getting Started with Azure AutoML
Getting Started with Azure AutoMLGetting Started with Azure AutoML
Getting Started with Azure AutoMLVivek Raja P S
 
국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장
국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장
국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장BESPIN GLOBAL
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
[DSC Europe 22] Smart approach in development and deployment process for vari...
[DSC Europe 22] Smart approach in development and deployment process for vari...[DSC Europe 22] Smart approach in development and deployment process for vari...
[DSC Europe 22] Smart approach in development and deployment process for vari...DataScienceConferenc1
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-useltonrodriguez11
 
201909 Automated ML for Developers
201909 Automated ML for Developers201909 Automated ML for Developers
201909 Automated ML for DevelopersMark Tabladillo
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated MLMark Tabladillo
 

Similar a DevOps and Machine Learning Workflow at Zillow Group (20)

Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21
 
Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDriven
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
Getting Started with Azure AutoML
Getting Started with Azure AutoMLGetting Started with Azure AutoML
Getting Started with Azure AutoML
 
국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장
국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장
국내 건설 기계사 도입 사례를 통해 보는 AI가 적용된 수요 예측 관리 - 베스핀글로벌 조창윤 AI/ML팀 팀장
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
[DSC Europe 22] Smart approach in development and deployment process for vari...
[DSC Europe 22] Smart approach in development and deployment process for vari...[DSC Europe 22] Smart approach in development and deployment process for vari...
[DSC Europe 22] Smart approach in development and deployment process for vari...
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-us
 
201909 Automated ML for Developers
201909 Automated ML for Developers201909 Automated ML for Developers
201909 Automated ML for Developers
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated ML
 

Último

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Último (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

DevOps and Machine Learning Workflow at Zillow Group

  • 1. DevOps and Machine Learning Jasjeet Thind VP, Data Science & Engineering, Zillow Group @JasjeetThind
  • 2. Agenda Overview of Zillow Group (ZG) Machine Learning (ML) at ZG Architecture DevOps for ML @JasjeetThind
  • 3. Zillow Group Composed of 9 Brands Build the world's largest, most trusted and vibrant home-related marketplace. @JasjeetThind RealEstate.com
  • 4. Machine Learning at ZG Personalization Ad Targeting Zestimate (AVMs) Premier Agent (B2B) Mortgages Deep Learning Demographics & Community Business Analytics Forecasting home price trends @JasjeetThind
  • 5. High Level Architecture @JasjeetThind DATA LAKE MACHINE LEARNING ML WEB SERVICES DATA SCIENCE ANALYTICS DATA QUALITYDATA COLLECTION
  • 6. DevOps & ML at ZG Tight collaboration, fast iterations, and lots of automation. CONTINUOUS INTEGRATION (CI) CONTINUOUS DELIVERY (CD) @JasjeetThind Research Dev Test Ship ML Workflow Real-time ML
  • 7. Research Data scientists prototype features and models to solve business problems Models ● Random Forest ● Gradient Boosted Machines ● CNN / Deep Learning ● NLP / TF-IDF / Word2vec / Bag of Words @JasjeetThind ● Linear Regression ● K-means clustering ● Collaborative Filtering
  • 8. DevOps Challenges in Research (1) Data scientists iterate through lots of prototype code, models, features and datasets that never get into production (2) How do data scientists validate experiments on production system? @JasjeetThind
  • 9. How to apply DevOps to experiments? Check-in prototypes ● Organize around research area ● Document heavily. Experiments often revisited ● Check-in subsets of training & scoring datasets For large training datasets, store in the data lake ● Pointers to data in README.md @JasjeetThind
  • 10. DevOps enables experiments in production Build tools for scientists to package and deploy ML models and features into production A/B tests ● Once submitted via the tool the model goes through the same release / validation pipeline as ML production models go through. Real world prototyping can be done before production quality ML code. @JasjeetThind
  • 11. DEV ML engineers operationalize features and models for scale. Automatically build and upload artifacts ML Developer Machine Git Jenkins - build - unit tests - upload artifacts AWS EMR Build Cluster Local Artifact Repository AWS Cloud Repository (S3) - AWS S3- Squid proxy Upload Artifact GIT push Trigger Check-in unit tests, source code Run unit tests to test machine learning models @JasjeetThind
  • 12. DevOps ML Challenges in Dev (1) Unit tests non-trivial as model outputs not deterministic output = f (inputs) (2) CI requires fast builds. But accurate predictions require large training data sets, and training models is time consuming. @JasjeetThind
  • 13. Unit Tests w/ ML Don’t test “real” model outputs. ● Leave that to integration / regression tests ● Focus on code coverage, build and unit tests are fast. Train models w/ mock training data ● Test schema, type, # records, expected values Predictions on mock scoring data sets // load mock small training dataset training_data = read(“training_small.csv”) // model trained on mock training data model = train(training_data) // schema test assertHasColumn(model.trainingData, “zipcode”) // type test assertIsInstance(model.trainingData, Vector) // # of records test assertEqual(model.trainingData.size, 32) // expected value test assertEqual(model.trainingData[0].key, 207) // load mock small scoring dataset predictions = score(model, scoring_data) @JasjeetThind
  • 14. Mock training data facilitates fast builds Model prediction accuracy tested in the Test phase @JasjeetThind
  • 15. TEST After artifact uploaded by build process, deploy and auto-trigger its regression tests DevOps API Jenkins Deployment ShipIt - REST API to to deploy packages - Hosts config Jenkins host that deploys & saves history Upload new job AWS EMR Regression Cluster Tool that executes deployments Unified Versioning SystemPin @JasjeetThind Kick off regression tests after deployment
  • 16. DevOps ML Challenges in Test (1) Training and scoring models can take many hours or more (2) Models have very large memory footprint (3) Regression tests non-trivial as model outputs not deterministic (4) Real time model outputs vs that of backend systems @JasjeetThind
  • 17. How to train and score in Test? For regression testing, create “golden” datasets, smaller subsets of training and scoring data, for training and scoring models. Makes training and scoring time much faster @JasjeetThind
  • 18. Reducing Model Memory Footprint in Test Model dynamically generated using golden datasets Model itself not in GIT, treated as binary @JasjeetThind
  • 19. Regression Tests and ML Adjust with fuzz factor ● May need to run multiple times and average results Model validation tests to ensure model metrics meet minimum bar ● More on this later // load “real” subset training data from data lake training_data = read(“training_golden_dataset.csv”) // model trained on golden training data model = train(training_data) // generate model output scoring_data = read(“scoring_golden_dataset.csv”) predictions = score(model, scoring_data) // fuzz factor id = 5 zestimate = predictions[id] assertAlmostEqual(zestimate, 238000, delta=7140) @JasjeetThind
  • 20. Verify Real-time Matches Backend Real-time machine learning services thousands / millions of concurrent requests // get model output from web service API for scoring row id = 5 prediction1 = getFromMLApi(id) // get model output from batch pipeline for same scoring row prediction2 = getFromBatchSystem(id) // get model output from near real time pipeline for same scoring row prediction3 = getFromNRTSystem(id) // verify the values are equal assertAlmostEqual(prediction1, prediction2, prediction3, delta=7140) @JasjeetThind
  • 21. SHIP Automated deployment of ML code to production DevOps API Jenkins Deployment ShipIt Upload new job AWS EMR Production Cluster @JasjeetThind
  • 22. DevOps ML Challenges in Ship (1) Deploy starts during model training and scoring (2) Backward compatibility issues with code and models for real-time ML API @JasjeetThind
  • 23. 4 Ways to Handle Deploy during Training Terminate ML workflow and start the deploy Wait for ML workflow to finish and start the deploy Deploy at certain times (e.g. once per day at midnight) Deploy to another cluster, shutdown the previous cluster @JasjeetThind
  • 24. Shipping real-time ML services Complete successful run of batch pipeline. ● Then ship code and models to web services hosts. ● More on deployment to web services hosts later @JasjeetThind
  • 25. ML Workflow Production ML backends generate models and predictions. @JasjeetThind Raw Data Sources Data Collection Generate datasets and features Model Learning Evaluate Model metrics Scoring Deployment
  • 26. DevOps Challenges w/ML Workflow (1) Data quality given high velocity petabyte scale data continually being pushed into the system (2) Ensuring models aren’t stale as data continues to evolve (3) Partial failures due to intermediate model training pipelines failing (4) Model deployments don’t cause bad user experience @JasjeetThind
  • 27. Build Analytical Systems for Data Quality Monitor ● what is expected data ● outlier detection ● missing data ● expected # of records ● latency Reports / alerts that drive action @JasjeetThind Consume Generate Metrics Models Data Quality DB Historical Metrics Write Alert for Issues Raw Data Sources Data Quality Architecture
  • 28. Automate ML to Keep Models Fresh Run continuously or frequently Workflow engine ● Oozie, Airflow, Azkaban, SWF, etc Logging, monitoring, and alerts throughout the pipeline @JasjeetThind
  • 29. Persist Intermediate Train Data for Failures Breakup ML pipelines into appropriate abstraction layers Leverage Spark which can recover from intermediate failures by replaying the DAG @JasjeetThind
  • 30. Model validation tests to verify quality Don’t replace existing model if one or more metrics fail. Trigger alert instead. Replace if all tests pass. if (mean_squared_error(y_actual, y_pred) > 3) // mean squared error alert() else if (median_error(y_actual, y_pred) > 6) // median error alert() else if (auc(y_actual, y_pred)) // area under curve alert() else deploy_models() // all model metrics passed so replace old models @JasjeetThind
  • 31. Real-time ML Models need to be deployed into production Models need to be executed per request with concurrent users under strict SLA @JasjeetThind
  • 32. DevOps ML Challenges w/ Real-time ML (1) Deployment of models can take many seconds or minutes (2) Latency to execute ML model for each request and return responses must meet strict SLA (< 100 milliseconds) (3) Need to ensure features at serving time same as backend systems @JasjeetThind
  • 33. Automate Deployment of Models Replicate hosts During deploy, take host offline, update models, put back into rotation, repeat for replicated hosts. For small models, do in-memory swap @JasjeetThind
  • 34. CPU is Bottleneck for Real-time ML Real-time feature generation requires data serialization ● Minimize as much as possible ● If C++ code needs to prep data for ML in Python, consider re-writing Python code into C++ Hyperparameters - balance latency and accuracy ● 1000 trees for RF that meets SLA might be a better business decision than 2000 trees with “slight” accuracy gain @JasjeetThind
  • 35. Features at serving should equal backend Save feature set used at serving time and pipe into a log or queue ● Use features directly in training ● Or at least verify they match training sets @JasjeetThind
  • 36. Free Zillow Data @JasjeetThind Zillow Home Value Index (ZHVI) • Top / Middle / Bottom Thirds • Single Family / Condo / Co-op • Median Home Value Per Sq Ft Zillow Rent Index (ZRI) • Multi-family / SFR / Condo / Co-op • Median ZRI Per Sq ft • Median Rent List Price Other Metrics • Median List Price • Price-to-Rent ratio • Homes Foreclosed • For-sale Inventory / Age Inventory • Negative Equity • And many more… Time Series: national, state, metro, county, city and ZIP code levels ZTRAX: Zillow Transaction and Assessment Dataset Previously inaccessible or prohibitively expensive housing data for academic and institutional researchers FOR FREE. ∙ More than 100 gigabytes ∙ 374 million detailed public records across more than 2,750 U.S. counties ∙ 20+ years of deed transfers, mortgages, foreclosures, auctions, property tax delinquencies and more for residential and commercial properties. ∙ Assessor data including property characteristics, geographic information, and prior valuations on approximately 200 million parcels in more than 3,100 counties. Email ZTRAX@zillow.com for more information Zillow.com/data
  • 37. We’re Hiring!!! Roles ● Principal DevOps Engineer ● Machine Learning Engineer ● Data Scientist ● Product Manager ● Data Engineer @JasjeetThind Related Blogs ● Zillow.com/data-science ● Trulia.com/blog/tech/
  • 38. Zillow Prize - $1 million dollars The brightest scientific minds have a unique opportunity to work on the algorithm that changed the world of real estate - the Zestimate® home valuation algorithm. www.zillow.com/promo/zillow-prize @JasjeetThind

Notas del editor

  1. DevOps and Machine Learning How do you test and deploy real-time machine learning services given the challenge that machine learning algorithms produce nondeterministic behaviors even for the same input.
  2. Zillow Group - conglomerate of real-estate websites, mobile apps, and services. 170 million monthly unique users Hundreds of petabytes of data in our data lake Brands Trulia - 2005, unprecedented information on comps, prior sale, and listing data Naked Apartments - rentals in NYC StreetEasy - sales and rentals in NYC Zillow - 2006, Zestimate - first freely available home valuation Hotpads - 2005, rentals nationwide Mortech - loan pricing from multiple vendors Dotloop - allows paperwork in real-estate transaction to be paper-less Retsly - platform and APIs that power real-estate software
  3. Email - homes for sale / for rent Home Details - homes for sale / homes like this Personalized Search Mobile - smart SMS and push notifications Home owner predictions Pre-Seller predictions Lender selection algorithm Real-estate content personalization Similar photos / video
  4. Here is a list of models that we use, everything from RF, to deep learning, and collab filtering
  5. - Master r3.2xlarge - Slave r3.4xlarge
  6. https://stash.atl.zillow.net/projects/ZDA/repos/generic-models/browse/generic_models/tests/test_generic_models.py https://stash.atl.zillow.net/projects/ZDA/repos/tax-valuation/browse/tax_valuation/tests/test_tax_valuation_model.py#46 We can use the same artifact repository for trained models that we use for our services and libraries. It gives us the ability to use a set of coordinates (group ID, artifact ID, and version) to define an artifact, which gets stored in our artifact repository that is backed by S3. Here's how it looks for our built code (link points to a lightweight frontend to S3): http://repo-storage-proxy.in.zillow.net/release/com/zillow/datascience/prior-sale-model/0.0.27-master.71390e6/ group ID = com.zillow.datascience artifact ID = prior-sale-model version = 0.0.27-master.71390e6
  7. https://zwiki.zillowgroup.net/display/AT/Automated+deployments+for+Z6
  8. training_data[] = get_train_data() // features with dependent variable scoring_data[] = get_score_data() // features model = train(training_data, RF, 1000) // create Random Forest model, 1000 trees predictions[] = generate_preditions(model, scoring_data) // generate predictions foreach(prediction) { TEST_RANGE(prediction[i], expected[i] - expected[i]*0.5, expected[i] + expected[i]*0.5) }
  9. training_data[] = get_train_data() // features with dependent variable scoring_data[] = get_score_data() // features model = train(training_data, RF, 1000) // create Random Forest model, 1000 trees predictions[] = generate_preditions(model, scoring_data) // generate predictions foreach(prediction) { TEST_RANGE(prediction[i], expected[i] - expected[i]*0.5, expected[i] + expected[i]*0.5) }
  10. def evaluate_regression_metrics(y_true, y_pred): return dict([ ("mape", calculate_mape(y_true, y_pred)), ("mnape", round(np.mean(np.abs((y_pred - y_true) / y_true)) * 100, 3)), ("rmse", round(np.sqrt(mean_squared_error(y_true, y_pred)), 3)), ("mean_absolute_error", round(mean_absolute_error(y_true, y_pred), 3)), ("median_absolute_error", round(median_absolute_error(y_true, y_pred), 3)) ])
  11. Should I cover real-time learning (another trend, re-enforcement learning)
  12. Know the content Show enthusiasm personality