ZESTIMATE + LAMBDA ARCHITECTURE
Steven Hoelscher, Machine Learning Engineer
How we produce low-latency, high-quality home estimates
Goals of the Zestimate
• Independent
• Transparent
• High accuracy
• Low bias
• Stable over time
• Responds quickly to data updates
• High coverage (about 100M homes)
www.zillow.com/zestimate
In early 2015, we shared the original architecture of the
Zestimate…
…but a lot has changed
So, what’s changed?

Then (2015)
• Languages: R and Python
• Data Storage: on-prem RDBMSs
• Compute: on-prem hosts
• Framework: in-house parallelization library (ZPL)
• People: Data Analysts and Scientists

Now (2017)
• Languages: Python and R
• Data Storage: AWS Simple Storage Service (S3), Redis
• Compute: AWS Elastic MapReduce (EMR)
• Framework: Apache Spark
• People: Data Analysts, Scientists, and Engineers
Lambda Architecture
• Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015)
• An architecture for scalable, fault-tolerant, low-latency big data systems
Latency-Accuracy Tradeoff
[Diagram: a spectrum from “Low Latency, Accuracy” to “High Latency, Accuracy”]
www.databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
>>> review_lengths.approxQuantile("lengths", quantiles, relative_error)
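The Databricks post above illustrates this tradeoff with Spark’s `approxQuantile`. As a rough, self-contained illustration of the same idea (plain Python, not the actual Spark implementation), a sampled quantile trades a bounded amount of accuracy for far less work than an exact, fully sorted computation:

```python
import random

def exact_quantile(values, q):
    """Exact q-quantile: sorts everything (accurate but expensive at scale)."""
    s = sorted(values)
    return s[min(int(q * len(s)), len(s) - 1)]

def approx_quantile(values, q, sample_size=1000, seed=0):
    """Approximate q-quantile from a random sample: far cheaper, slightly off."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return exact_quantile(sample, q)
```

On 100,000 values, the sampled median lands close to the true median while only ever sorting 1,000 elements.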
High-level Lambda Architecture
• We can process new data with only a batch layer, but for computationally expensive queries, the results will be out-of-date
• The speed layer compensates for this lack of timeliness by computing, generally, approximate views
Master Data Architecture
Lock down permissions to prevent data deletes and updates!
Data is immutable
Below, we see the evolution of a home over time:
• Constructed in 2010 with 2 bedrooms and 1 bath
• A full bath was added five years later, increasing the square footage
• Finally, another bedroom is added, as well as a half-bath

PropertyId  Bedrooms  Bathrooms  SquareFootage  UpdateDate
1           2.0       1.0        1450           2010-03-13
1           2.0       2.0        1500           2015-05-15
1           3.0       2.5        1800           2016-06-24
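Because rows are only ever appended, any past state of a home can be reconstructed on demand. A minimal sketch in plain Python (the dataset and helper names are illustrative, not Zillow’s actual schema):

```python
from datetime import date

# Append-only master data: rows are never updated or deleted.
HOME_FACTS = [
    {"property_id": 1, "bedrooms": 2.0, "bathrooms": 1.0, "sqft": 1450, "update_date": date(2010, 3, 13)},
    {"property_id": 1, "bedrooms": 2.0, "bathrooms": 2.0, "sqft": 1500, "update_date": date(2015, 5, 15)},
    {"property_id": 1, "bedrooms": 3.0, "bathrooms": 2.5, "sqft": 1800, "update_date": date(2016, 6, 24)},
]

def facts_as_of(rows, property_id, as_of):
    """Reconstruct a property's state on a given date from the append-only log."""
    history = [r for r in rows
               if r["property_id"] == property_id and r["update_date"] <= as_of]
    return max(history, key=lambda r: r["update_date"], default=None)
```

For example, `facts_as_of(HOME_FACTS, 1, date(2015, 12, 31))` returns the 2-bed, 2-bath, 1,500 sqft row, even though the home was later remodeled again.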
Data is eternally true

PropertyId  Bathrooms  UpdateTime
1           2.0        2015-05-15
1           2.5        2016-06-24

PropertyId  SaleValue  SaleTime
1           450000     2015-08-19

• The 2.0-bathroom value would have been overwritten in a mutable data view.
• Without the immutable history, this transaction in our training data would erroneously use a bathroom upgrade from the future.
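The point-in-time (“as-of”) join that avoids this leakage can be sketched as follows (plain Python with illustrative data; it assumes at least one fact precedes each sale):

```python
from datetime import date

BATHROOM_UPDATES = [
    {"property_id": 1, "bathrooms": 2.0, "update_time": date(2015, 5, 15)},
    {"property_id": 1, "bathrooms": 2.5, "update_time": date(2016, 6, 24)},
]
SALES = [{"property_id": 1, "sale_value": 450_000, "sale_time": date(2015, 8, 19)}]

def training_rows(sales, updates):
    """Join each sale to the latest fact known at sale time (an 'as-of' join),
    so a future remodel can never leak into a past training example."""
    rows = []
    for sale in sales:
        history = [u for u in updates
                   if u["property_id"] == sale["property_id"]
                   and u["update_time"] <= sale["sale_time"]]
        latest = max(history, key=lambda u: u["update_time"])
        rows.append({**sale, "bathrooms": latest["bathrooms"]})
    return rows
```

The August 2015 sale correctly picks up 2.0 bathrooms, not the 2.5 from the 2016 upgrade.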
Batch Layer Architecture

Batch Layer Highlights
ETL
• Ingests master data
• Standardizes data across many sources
• Dedupes, cleanses, and performs sanity checks on data
• Stores partitioned training and scoring sets in Parquet format
Train
• Large memory requirements (caching training sets for various models)
Score
• Scoring set partitioned into uniform chunks for parallelization
Responding to data changes quickly
• The number one source of Zestimate error is the facts that flow into it: bedrooms, bathrooms, and square footage.
• To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate.
• Beyond that, we want to recalculate Zestimates when homes are listed on the market.
Speed Layer Architecture: Kinesis Consumer
• The Kinesis consumer is responsible for low-latency transformations to the data.
• Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so we cannot afford those computations here.
• The consumer looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API.
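The consumer’s decide-and-update loop might look like the following sketch (hypothetical names, not Zillow’s actual code; an in-memory dict stands in for Redis and a callable stands in for the Zestimate API):

```python
def handle_fact_update(event, property_store, zestimate_api):
    """Hypothetical sketch of the consumer's decision logic. `property_store`
    stands in for Redis; `zestimate_api` for the speed-layer scoring call."""
    known = property_store.get(event["property_id"], {})
    merged = {**known, **event["updated_facts"]}
    if merged == known:
        return None                      # no real change: skip rescoring
    property_store[event["property_id"]] = merged
    return zestimate_api(merged)         # trigger a fresh speed-layer Zestimate
```

Only genuinely new facts trigger a call to the API; replays of already-known facts are dropped.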
Speed Layer Architecture: Zestimate API
• Uses the latest, pre-trained models from the batch layer to avoid costly retraining
• All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer
• Relies on sharding of pre-trained region models due to individual model memory requirements
Remember: Eventual Accuracy
• The speed layer is not meant to be perfect; it’s meant to be lightning fast. Your batch layer will correct mistakes, eventually.
• As a result, we can think of the speed layer view as ephemeral.

Toy Example: Square Feet or Acres?

PropertyId  LotSize
0           21
1           16
2           5

Imagine a GIS model for validating lot size by looking at a given property’s parcel and its neighboring parcels. But what happens if that model is slow to compute?
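While the slow GIS validation runs in the batch layer, the speed layer could fall back on a cheap unit heuristic. The sketch below is purely illustrative (not Zillow’s actual model): values too large to be a plausible acreage are assumed to be square feet.

```python
def normalize_lot_size(value, max_plausible_acres=100):
    """Fast, hypothetical heuristic: a lot size above any plausible acreage
    is probably square feet; convert it, otherwise treat it as acres."""
    SQFT_PER_ACRE = 43_560
    if value > max_plausible_acres:
        return value / SQFT_PER_ACRE  # assume square feet, convert to acres
    return float(value)               # assume the value is already acres
```

The heuristic will sometimes be wrong; eventually the batch layer’s parcel-based model overwrites its answer, which is exactly the eventual-accuracy bargain above.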
Serving Layer Architecture
• We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com
• Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer’s calculation
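The reconciliation rule can be captured in a few lines (an illustrative sketch; parameter names are assumptions, not the production interface):

```python
from datetime import datetime

def serve_zestimate(batch_value, speed_value, batch_start, last_fact_update):
    """Serving-layer reconciliation: if a home fact arrived after the batch
    run began, the batch view is stale for that home, so prefer the speed
    layer's calculation when one exists."""
    if speed_value is not None and last_fact_update > batch_start:
        return speed_value
    return batch_value
```

Homes with no post-batch updates (the common case) are served straight from the batch view.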
The Big Picture
(1) Data is immutable and human-fault tolerant
(2) Performs heavy-lifting cleaning and training
(3) Reduces latency and improves timeliness
(4) Reconciles views to ensure the better estimate is chosen
SO DID YOU FIX MY ZESTIMATE?
Andrew Martin, Zestimate Research Manager
Accuracy Metrics for Real-Estate Valuation
• Median Absolute Percent Error (MAPE)
  • Measures the “average” amount of error in a prediction, in terms of percentage off the correct answer in either direction
  • Measuring error in percentages is more natural for home prices, since they are heteroscedastic
• Percent Error Within 5%, 10%, 20%
  • Measures how many predictions fell within +/-X% of the true value
MAPE = Median_i [ Abs( (SalePrice_i − Zestimate_i) / SalePrice_i ) ]

Within X% = (1 / |Sales|) · Σ_i 1[ Abs( (SalePrice_i − Zestimate_i) / SalePrice_i ) < X% ]
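Both metrics are simple to compute; a plain-Python sketch:

```python
def mape(sale_prices, zestimates):
    """Median absolute percent error across paired sales and predictions."""
    errors = sorted(abs(s - z) / s for s, z in zip(sale_prices, zestimates))
    n = len(errors)
    mid = n // 2
    return errors[mid] if n % 2 else (errors[mid - 1] + errors[mid]) / 2

def within_pct(sale_prices, zestimates, x):
    """Share of predictions within +/- x (e.g. x = 0.05) of the sale price."""
    hits = sum(abs(s - z) / s < x for s, z in zip(sale_prices, zestimates))
    return hits / len(sale_prices)
```

For sales of $100k, $200k, $300k predicted at $110k, $190k, $300k, the percent errors are 10%, 5%, and 0%, so MAPE is 5% and two of the three predictions fall within 6%.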
Did you know we keep a public scorecard? www.zillow.com/zestimate/
Comparing Accuracy at 10,000 ft
• Let’s focus on King County, WA, since the new architecture has been live there since January 2017
• We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction
  • i.e., if a home sold in Kent for $300,000 on April 10th, we’d use the Zestimate from March 31st
• We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016
  • We compare architectures by looking at error on the same set of sales
Architecture MAPE Within 5% Within 10% Within 20%
2015 (Z5.4) 5.1% 49.0% 75.0% 92.5%
2017 (Z6) 4.5% 54.1% 81.0% 94.9%
Breaking Accuracy out by Price
[Chart: MAPE by sale price, 2015 (Z5.4) vs. 2017 (Z6); left axis MAPE (0.0%–8.0%), right axis sales count (0–6,000)]
Breaking Accuracy out by Home Type
Architecture  Home Type  MAPE  Within 5%  Within 10%  Within 20%
2015 (Z5.4)   SFR        5.1%  49.2%      74.8%       92.4%
2015 (Z5.4)   Condo      5.1%  49.5%      76.8%       93.7%
2017 (Z6)     SFR        4.5%  54.6%      81.1%       94.6%
2017 (Z6)     Condo      4.6%  53.4%      81.6%       96.0%
Think you might have an idea for how to improve the Zestimate? We’re all ears...
www.zillow.com/promo/zillow-prize
We are hiring!
• Data Scientist
• Machine Learning Engineer
• Data Scientist, Computer Vision and Deep Learning
• Software Development Engineer, Computer Vision
• Economist
• Data Analyst
www.zillow.com/jobs

Más contenido relacionado

La actualidad más candente

Vector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdfVector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdfConnorShorten2
 
An Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDBAn Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDBWilliam LaForest
 
Introduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdfIntroduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdfHeather Hedden
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Neo4j
 
Neo4j y GenAI
Neo4j y GenAI Neo4j y GenAI
Neo4j y GenAI Neo4j
 
연구데이터 관리와 데이터 관리 계획서 (DMP) - part02
연구데이터 관리와 데이터 관리 계획서 (DMP) - part02연구데이터 관리와 데이터 관리 계획서 (DMP) - part02
연구데이터 관리와 데이터 관리 계획서 (DMP) - part02Suntae Kim
 
Graph Database 101- What, Why and How?.pdf
Graph Database 101- What, Why and How?.pdfGraph Database 101- What, Why and How?.pdf
Graph Database 101- What, Why and How?.pdfNeo4j
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseConnected Data World
 
How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...Neo4j
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기NAVER D2
 
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionApache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionJoão Gabriel Lima
 

La actualidad más candente (20)

Vector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdfVector Search for Data Scientists.pdf
Vector Search for Data Scientists.pdf
 
An Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDBAn Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDB
 
KorQuAD v2.0 소개
KorQuAD v2.0 소개KorQuAD v2.0 소개
KorQuAD v2.0 소개
 
Introduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdfIntroduction to Knowledge Graphs for Information Architects.pdf
Introduction to Knowledge Graphs for Information Architects.pdf
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
 
Hbase hivepig
Hbase hivepigHbase hivepig
Hbase hivepig
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
 
Vector database
Vector databaseVector database
Vector database
 
Neo4j y GenAI
Neo4j y GenAI Neo4j y GenAI
Neo4j y GenAI
 
연구데이터 관리와 데이터 관리 계획서 (DMP) - part02
연구데이터 관리와 데이터 관리 계획서 (DMP) - part02연구데이터 관리와 데이터 관리 계획서 (DMP) - part02
연구데이터 관리와 데이터 관리 계획서 (DMP) - part02
 
Introduction à Hadoop
Introduction à HadoopIntroduction à Hadoop
Introduction à Hadoop
 
Graph Database 101- What, Why and How?.pdf
Graph Database 101- What, Why and How?.pdfGraph Database 101- What, Why and How?.pdf
Graph Database 101- What, Why and How?.pdf
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
 
How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...How the Neanex digital twin solution delivers on both speed and scale to the ...
How the Neanex digital twin solution delivers on both speed and scale to the ...
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기
 
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionApache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
 

Similar a Zestimate Lambda Architecture

Rsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI PlatformRsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI PlatformSanjana Chowdhury
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionDmitry Anoshin
 
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...Amazon Web Services
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceSense Corp
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...Amazon Web Services
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevAltinity Ltd
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Web Services
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
AWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data AnalyticsAWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data AnalyticsAmazon Web Services
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Amazon Web Services
 
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with KinesisAWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with KinesisAmazon Web Services
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
Which Database is Right for My Workload?
Which Database is Right for My Workload?Which Database is Right for My Workload?
Which Database is Right for My Workload?Amazon Web Services
 
Which Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SFWhich Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SFAmazon Web Services
 
Which Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San FranciscoWhich Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San FranciscoAmazon Web Services
 

Similar a Zestimate Lambda Architecture (20)

Rsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI PlatformRsqrd AI: Zestimates and Zillow AI Platform
Rsqrd AI: Zestimates and Zillow AI Platform
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matilli...
 
ABD217_From Batch to Streaming
ABD217_From Batch to StreamingABD217_From Batch to Streaming
ABD217_From Batch to Streaming
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with Salesforce
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
Migrate from Netezza to Amazon Redshift: Best Practices with Financial Engine...
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
AWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data AnalyticsAWS Sydney Summit 2013 - Big Data Analytics
AWS Sydney Summit 2013 - Big Data Analytics
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
 
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with KinesisAWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
Which Database is Right for My Workload?
Which Database is Right for My Workload?Which Database is Right for My Workload?
Which Database is Right for My Workload?
 
Which Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SFWhich Database is Right for My Workload: Database Week SF
Which Database is Right for My Workload: Database Week SF
 
Which Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San FranciscoWhich Database is Right for My Workload?: Database Week San Francisco
Which Database is Right for My Workload?: Database Week San Francisco
 

Último

Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxBipin Adhikari
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 

Último (20)

Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptx
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 

Zestimate Lambda Architecture

  • 1. 1 ZESTIMATE + LAMBDA ARCHITECTURE Steven Hoelscher, Machine Learning Engineer How we produce low-latency, high-quality home estimates
  • 2. Goals of the Zestimate • Independent • Transparent • High Accuracy • Low Bias • Stable over time • Respond quickly to data updates • High coverage (about 100M homes) www.zillow.com/zestimate
  • 3. In early 2015, we shared the original architecture of the Zestimate… …but a lot has changed
  • 4. Then (2015) • Languages: R and Python • Data Storage: on-prem RDBMSs • Compute: on-prem hosts • Framework: in-house parallelization library (ZPL) • People: Data Analysts and Scientists Now (2017) • Languages: Python and R • Data Storage: AWS Simple Storage Service (S3), Redis • Compute: AWS Elastic MapReduce (EMR) • Framework: Apache Spark • People: Data Analysts, Scientists, and Engineers So, what’s changed?
  • 5. Lambda Architecture • Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015) • An architecture for scalable, fault-tolerant, low-latency big data systems [Diagram: the latency-accuracy tradeoff, ranging from low latency with lower accuracy to high latency with higher accuracy]
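The latency-accuracy tradeoff can be illustrated with a toy sketch. This is not Spark's actual approxQuantile algorithm, just a sampling-based stand-in: estimating a quantile from a random sample is much cheaper than an exact sort over all the data, at the cost of some error.

```python
import random

def approx_quantile(values, q, sample_fraction=0.1, seed=42):
    """Estimate the q-th quantile from a random sample instead of
    sorting the full dataset: cheaper, but only approximately right."""
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < sample_fraction]
    if not sample:
        sample = list(values)
    sample.sort()
    idx = min(int(q * len(sample)), len(sample) - 1)
    return sample[idx]

# The exact median of 0..9999 is about 5000; the sample-based
# estimate lands close to it while touching ~10% of the data.
data = list(range(10000))
print(approx_quantile(data, 0.5))
```

Shrinking `sample_fraction` makes the call faster and noisier, much like loosening the `relativeError` parameter of Spark's `DataFrame.approxQuantile`.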
  • 7. High-level Lambda Architecture • We can process new data with only a batch layer, but for computationally expensive queries, the results will be out-of-date • The speed layer compensates for this lack of timeliness by computing, generally, approximate views
  • 8. Master Data Architecture Lock down permissions to prevent data deletes and updates!
  • 9. Data is immutable
PropertyId | Bedrooms | Bathrooms | SquareFootage | UpdateDate
1 | 2.0 | 1.0 | 1450 | 2010-03-13
1 | 2.0 | 2.0 | 1500 | 2015-05-15
1 | 3.0 | 2.5 | 1800 | 2016-06-24
Below, we see the evolution of a home over time:
• Constructed in 2010 with 2 bedrooms and 1 bath
• A full-bath added five years later, increasing the square footage
• Finally, another bedroom is added as well as a half-bath
  • 10. Data is eternally true
PropertyId | Bathrooms | UpdateTime
1 | 2.0 | 2015-05-15
1 | 2.5 | 2016-06-24
PropertyId | SaleValue | SaleTime
1 | 450000 | 2015-08-19
The 2.0-bathroom value would have been overwritten in our mutable data view, and this transaction in our training data would erroneously use a bathroom upgrade from the future.
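The "eternally true" lookup this slide motivates can be sketched as an as-of query over an append-only fact log (a minimal illustration; the field names are ours, not Zillow's schema):

```python
from datetime import date

# Append-only fact log: rows are only ever added, never updated.
bathroom_facts = [
    {"property_id": 1, "bathrooms": 2.0, "update_time": date(2015, 5, 15)},
    {"property_id": 1, "bathrooms": 2.5, "update_time": date(2016, 6, 24)},
]

def facts_as_of(facts, property_id, as_of):
    """Return the latest fact row at or before `as_of` for a property."""
    candidates = [
        f for f in facts
        if f["property_id"] == property_id and f["update_time"] <= as_of
    ]
    return max(candidates, key=lambda f: f["update_time"], default=None)

# The 2015-08-19 sale trains against 2.0 bathrooms, not the
# 2.5-bath upgrade from the future.
print(facts_as_of(bathroom_facts, 1, date(2015, 8, 19))["bathrooms"])  # 2.0
```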
  • 12. ETL • Ingests master data • Standardizes data across many sources • Dedupes, cleanses and performs sanity checks on data • Stores partitioned training and scoring sets in Parquet format Train • Large memory requirements (caching training sets for various models) Score • Scoring set partitioned in uniform chunks for parallelization Batch Layer Highlights
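One kind of sanity check the ETL step performs, catching fat-fingered facts like 50 square feet where 5000 was meant, might look roughly like this (the thresholds and field names are illustrative, not the production rules):

```python
def sanity_check_sqft(sqft, low=120, high=50000):
    """Reject obviously implausible square footage, e.g. a fat-fingered
    50 instead of 5000. Thresholds here are made up for illustration."""
    return sqft is not None and low <= sqft <= high

records = [{"id": 1, "sqft": 1450}, {"id": 2, "sqft": 50}, {"id": 3, "sqft": None}]
clean = [r for r in records if sanity_check_sqft(r["sqft"])]
print([r["id"] for r in clean])  # [1]
```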
  • 13. • The number one source of Zestimate error is the facts that flow into it – about bedrooms, bathrooms, and square footage. • To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate • Beyond that, we want to recalculate Zestimates when homes are listed on the market Responding to data changes quickly
  • 14. Speed Layer Architecture: Kinesis Consumer • The Kinesis consumer is responsible for low-latency transformations to the data • Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so the speed layer cannot afford these computations • It looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API
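A rough sketch of the consumer's decision logic, with a plain dict standing in for the Redis lookup and a stub for the Zestimate API call (all names here are illustrative, not the actual implementation):

```python
def handle_record(record, property_store, call_zestimate_api):
    """Process one stream record: look up cached facts for the property
    and trigger a Zestimate recalculation only when a fact changed.
    `property_store` stands in for Redis; `call_zestimate_api` for the
    HTTP call. Both names are illustrative."""
    prop = property_store.get(record["property_id"])
    if prop is None:
        return False  # unknown property: leave it to the batch layer
    changed = any(prop.get(k) != v for k, v in record["facts"].items())
    if changed:
        call_zestimate_api(record["property_id"], {**prop, **record["facts"]})
    return changed

calls = []
store = {1: {"bedrooms": 2.0, "bathrooms": 1.0}}
record = {"property_id": 1, "facts": {"bathrooms": 2.0}}
print(handle_record(record, store, lambda pid, facts: calls.append(pid)))  # True
```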
  • 15. Speed Layer Architecture: Zestimate API • Uses latest, pre-trained models from batch layer to avoid costly retraining • All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer • Relies on sharding of pre-trained region models due to individual model memory requirements
  • 16. Remember: Eventual Accuracy • The speed layer is not meant to be perfect; it’s meant to be lightning fast. Your batch layer will correct mistakes, eventually. • As a result, we can think of the speed layer view as ephemeral
PropertyId | LotSize
0 | 21
1 | 16
2 | 5
Toy Example: Square feet or Acres? Imagine a GIS model for validating lot size by looking at a given property’s parcel and its neighboring parcels. But what happens if that model is slow to compute?
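To make the toy example concrete, a speed-layer stand-in for the slow GIS model might simply compare a reported lot size against cached neighbor values: fast but approximate (purely illustrative code, not the actual model):

```python
from statistics import median

def quick_lot_size_check(lot_size, neighbor_lot_sizes, max_ratio=20):
    """Cheap stand-in for the slow GIS parcel model: flag a lot size as
    suspicious (say, acres recorded where square feet were expected)
    when it is wildly out of line with neighboring lots. Illustrative only."""
    if not neighbor_lot_sizes:
        return True  # nothing to compare against; accept for now
    ratio = lot_size / median(neighbor_lot_sizes)
    return (1 / max_ratio) <= ratio <= max_ratio

# A lot size of 5 next to ~5000-sqft neighbors looks like acres, not feet.
print(quick_lot_size_check(5, [4800, 5200, 5100]))     # False
print(quick_lot_size_check(5000, [4800, 5200, 5100]))  # True
```

The batch layer's full parcel-geometry model would eventually correct any mistakes this heuristic makes, which is exactly the eventual-accuracy bargain the slide describes.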
  • 17. • We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com • Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer’s calculation Serving Layer Architecture
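The reconciliation rule described here, serving the speed layer's value only when its fact update arrived after the batch run began, can be sketched as (field names are illustrative):

```python
from datetime import datetime

def serve_zestimate(batch_view, speed_view, batch_start):
    """Prefer the speed layer's estimate only when its underlying fact
    update arrived after the batch run began; otherwise serve the more
    thoroughly computed batch result. Field names are illustrative."""
    if speed_view is not None and speed_view["updated_at"] >= batch_start:
        return speed_view["zestimate"]
    return batch_view["zestimate"]

batch_start = datetime(2017, 4, 1)
batch_view = {"zestimate": 300000}
fresh_speed = {"zestimate": 310000, "updated_at": datetime(2017, 4, 2)}
stale_speed = {"zestimate": 295000, "updated_at": datetime(2017, 3, 1)}
print(serve_zestimate(batch_view, fresh_speed, batch_start))  # 310000
print(serve_zestimate(batch_view, stale_speed, batch_start))  # 300000
```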
  • 18. The Big Picture (1) Data is immutable and human-fault tolerant (2) Performs heavy-lifting cleaning and training (3) Reduces latency and improves timeliness (4) Reconciles views to ensure the better estimate is chosen
  • 19. SO DID YOU FIX MY ZESTIMATE? Andrew Martin, Zestimate Research Manager
  • 20. Accuracy Metrics for Real-Estate Valuation • Median Absolute Percent Error (MAPE) • Measures the “average” amount of error in prediction in terms of percentage off the correct answer in either direction • Measuring error in percentages is more natural for home prices since they are heteroscedastic • Percent Error Within 5%, 10%, 20% • Measure of how many predictions fell within +/-X% of the true value
MAPE = Median( |Saleprice − Zestimate| / Saleprice )
Within X% = fraction of sales with |Saleprice_i − Zestimate_i| / Saleprice_i < X%
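The two metrics on this slide translate directly into code; a minimal sketch:

```python
from statistics import median

def mape(sale_prices, zestimates):
    """Median Absolute Percent Error, per the slide's definition."""
    return median(abs(s - z) / s for s, z in zip(sale_prices, zestimates))

def within_pct(sale_prices, zestimates, x):
    """Share of sales whose absolute percent error is below x (e.g. 0.05)."""
    errors = [abs(s - z) / s for s, z in zip(sale_prices, zestimates)]
    return sum(e < x for e in errors) / len(errors)

# Tiny made-up sample of sale prices and the prior month-end Zestimates.
sales = [300000, 450000, 500000, 250000]
preds = [309000, 440000, 430000, 255000]
print(round(mape(sales, preds), 4))   # 0.0261
print(within_pct(sales, preds, 0.05)) # 0.75
```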
  • 21. Did you know we keep a public scorecard? www.zillow.com/zestimate/
  • 22. Comparing Accuracy at 10,000FT • Let’s focus on King County, WA since the new architecture has been live here since January 2017 • We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction • i.e. if a home sold in Kent for $300,000 on April 10th we’d use the Zestimate from March 31st • We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016 • We compare architectures by looking at error on the same set of sales
Architecture | MAPE | Within 5% | Within 10% | Within 20%
2015 (Z5.4) | 5.1% | 49.0% | 75.0% | 92.5%
2017 (Z6) | 4.5% | 54.1% | 81.0% | 94.9%
  • 23. Breaking Accuracy out by Price [Chart: MAPE and sale counts by price bucket, comparing 2015 (Z5.4) vs 2017 (Z6)]
  • 24. Breaking Accuracy out by Home Type
Architecture | Home Type | MAPE | Within 5% | Within 10% | Within 20%
2015 (Z5.4) | SFR | 5.1% | 49.2% | 74.8% | 92.4%
2015 (Z5.4) | Condo | 5.1% | 49.5% | 76.8% | 93.7%
2017 (Z6) | SFR | 4.5% | 54.6% | 81.1% | 94.6%
2017 (Z6) | Condo | 4.6% | 53.4% | 81.6% | 96.0%
  • 25. Think that you might have an idea for how to improve the Zestimate? We’re all ears... + www.zillow.com/promo/zillow-prize
  • 26. We are hiring! • Data Scientist • Machine Learning Engineer • Data Scientist, Computer Vision and Deep Learning • Software Developer Engineer, Computer Vision • Economist • Data Analyst www.zillow.com/jobs

Editor’s notes

  1. Hi everyone, thanks for joining me here at Zillow for today’s meetup. My name is Steven Hoelscher, and I’m a machine learning engineer on the data science and engineering team. I’ve been with Zillow for 2.5 years now and had the opportunity to work on the team responsible for building and rearchitecting a new Zestimate pipeline, largely inspired by Lambda Architecture. It’s my hope that you’ll walk away from this presentation with a better understanding of what lambda architecture means and will have seen an in-production example for actually realizing it.
  2. Without further ado, let’s start with the Zestimate itself and its goals. For those who aren’t familiar, the Zestimate is simply our estimated market value for individual homes nationwide. We strive to put a Zestimate on every rooftop, just as we see in this screenshot. Every day, the Zestimate team thinks about how we can improve our algorithm, and from a data science perspective, improvement is based on whether we achieve these goals. To talk about a few: obviously, we would like our Zestimates to have high accuracy; when a home sells, it’s our goal for the Zestimate to be near that sale price. The Zestimate, as an algorithm, should also be stable over time and not exhibit erratic behavior day-to-day. The Zestimate should also be able to respond quickly to data updates. Users can supply us with more accurate data to improve our estimates, and their Zestimate should immediately reflect fact updates. In a sense, these are the goals that our pipeline must support, and we’re going to spend some more time talking about how to balance these goals in a big data system.
  3. In early 2015, right around the time I started at Zillow, a few of my colleagues presented on the Zestimate architecture…as it was then. But a lot has changed since that presentation, only just 2 years ago.
  4. At the core, the Zestimate in 2015 was largely written in R. Our team was composed of R language experts and we even built an in-house R framework for parallelization a la MapReduce. We were a smaller team back then, mostly data scientists who also had a knack for engineering. We relied on collaboration with other teams, especially our database administrators, to interface with on-premises relational databases. Two years later, we’ve made a hiring push across all skill sets and invited engineers to join the fray. Python has become the new language of choice, thanks mostly to its long history of support in Apache Spark. We started leveraging more and more cloud-based services, such as Amazon’s Simple Storage Service for storing our data and Elastic MapReduce for compute. No longer are we bottlenecked by the size of a single machine. With all of these changes, we had the opportunity to start afresh and design a system that would handle large amounts of data in the cloud, that would rely on horizontal scaling, and most importantly would meet the goals of the Zestimate.
  5. Enter Lambda Architecture. The idea of Lambda Architecture was introduced by Nathan Marz, the creator of Apache Storm. I highly recommend the book he published in 2015 with the title *Big Data*. This book, for the uninitiated, provided the foundations for Lambda Architecture, with great case studies for understanding how to achieve this architecture. Simply put, Lambda Architecture is a generic data processing architecture that is horizontally scalable, fault-tolerant (in the face of both human and hardware failures), and capable of low latency responses. Shortly, we’ll see what a high-level lambda architecture looks like. But before we dive into that, I want to talk about making a tradeoff between latency and accuracy. In some cases, we cannot expect to have low latency responses when dealing with enormous amounts of data. As such, we have to trade off some degree of accuracy to reduce our latency. This idea underpins Lambda Architecture.
  6. Let’s look at an example, highlighted by the Databricks team. Apache Spark implements an algorithm for calculating approximate percentiles of numerical data, with a function called approxQuantile. This algorithm requires a user to specify a target error bound and the result is guaranteed to be within this bound. This algorithm can be adjusted to trade accuracy against computation time and memory. In the example here, the Databricks team studies the length of the text in each Amazon review. On the x-axis, we have the targeted residual. As we would guess, the higher the residual, the less computationally expensive our calculation becomes, but the tradeoff is accuracy.
  7. Let’s start thinking about what this means for a big data processing system. We could start simple by building a batch system with low complexity. It reads directly from a master dataset that contains all of the data so far. This batch layer, as it’s called, will virtually freeze the data at the time the job begins and start running computations. The problem is that once the batch layer finishes computing a query, the data is already out-of-date: new changes have come in and were not accounted for. This is the gap that the lambda architecture is trying to solve. We can rely on a speed layer that will compensate for the batch layer’s lack of timeliness. But the speed layer, generally speaking, cannot rely on the same algorithms that the batch layer did. In the example before, we would want our batch layer to calculate a correct and highly accurate quantile, but the speed layer should rely on approximation to be more nimble. In this way, at any given moment, we could have two different views: one view from the batch layer that is accurate but not so timely and one view from the speed layer that is less accurate but timely. Reconciling these two views, we can answer a query in a relatively accurate and timely fashion.
  8. At this point, we’re going to explore a few of the layers of the Lambda Architecture and see how we implement each layer for the Zestimate itself. To begin, we start with the data. As I mentioned before, most of our data in 2015 was only stored on premises in relational databases. Our first goal, then, was to move this data to the cloud and have new data-generating processes write directly to the cloud store. At Zillow, we use AWS S3 for our data lake / master dataset. It is optimized to handle a large, constantly growing set of data. In our case, we have a bucket specifically designated for raw data. In this design, we don’t want to actually modify or update the raw data, and I’ll talk about why we don’t want to do this in a moment. As such, we set permissions on the bucket itself to prevent data deletes and updates. Any generic data-generating process is responsible for only appending new records to this object store, never deleting. Most data-generating processes are writing JSON data. We do mandate a schema contract between the producers and consumers of the data, to ensure data types conform.
  9. Data is immutable. Let’s understand what this means by working through this example. We have a sample home and how it has evolved over time. In 2010, it was constructed with 2 bedrooms and 1 bathroom. Five years later, the homeowner added a full-bath, therefore increasing the square footage. This was done right before selling the home a few months later in 2015. A new owner purchased the home, and nearly a year later, decided to add another bedroom and half-bath. With mutable data, this story is lost. One way of storing these attributes in a relational database would be to update records with the new attributes.
  10. Data is eternally true. Now let’s introduce the transaction that I referred to. It occurred before the number of bathrooms changed again. In our mutable data view, this transaction would have been tied with a bathroom upgrade from the future. Once we attach a timestamp to data, we ensure it is eternally true. It is eternally true that in 2015, this home had 2 bathrooms, but in 2016, a half bath was added. This story is extremely important for data scientists. And while this example may be trivial, you can imagine tying a sale value to a larger set of home facts that weren’t actually true at that point in time. Immutability of data allows us to retain this story. We’re no longer updating data, and as a benefit, we are less prone to human mistakes, especially when it comes to what all data scientists hold dear: the raw data itself.
  11. After migrating our data to the AWS S3, we began work on the batch layer for the Zestimate pipeline. From a high-level, the Zestimate batch layer has a few components: first, we need to make available the raw, master dataset. Apache Spark allows us to read directly from S3, but some of our raw data sources suffer from the painful small-files problem in Hadoop. Simply put, big data systems expect to consume fewer large files rather than a lot of small files. Apache Spark suffers from this same problem. We rely heavily on vacuuming applications, such as Hadoop’s distcp, to aggregate data into larger files, by pulling from S3 and storing the aggregates on HDFS. From there, our jobs read directly from HDFS: we begin with an ETL layer, responsible for producing training and scoring sets for our various models. Then, training and scoring takes place for about 100 M homes in the nation. Models, training and scoring sets, and performance metrics are all stored in a different bucket in S3, one for transformed data. This ensures that we’re distinguishing between the raw data (our master dataset) and the data derived from the raw data.
  12. The ETL layer is responsible for interfacing with the master dataset and transforming it in order to arrive at cleaner, standardized datasets that are consumable by our Zestimate models. We have a wide variety of data sources that we deal with and so need to pull appropriate features from each to build a rich feature set. We invest a lot of time into ensuring our data is clean. As we know, garbage in, garbage out, and this holds true for the Zestimate algorithm. One example we always talk about is the case of fat-fingers. You can imagine that typing 500 square feet instead of 5000 square feet could drastically change how we perceive that home’s value. This cleaning process, in addition to the partitioning required, can be very expensive computationally. This is one area where a speed layer would need to be more nimble, as it won’t be able to look at historical data to make inferences about the quality of new data. After the ETL step, we can begin training models. Training, in our case, requires large amounts of memory to support caching of training sets for various models. We train models on various geographies, making tradeoffs between data skew and volume of data available. Scoring is then done in parallel, using data partitioned in uniform chunks. At this point, we have a view created (the Zestimates for about 100M homes in the nation) as well as pre-trained models for the speed layer. But at this point, some of the facts that went into our model training and scoring could be out of date.
  13. The number one source of Zestimate error is the facts that flow into it, like bedroom counts, bathroom counts, and square footage. We provide homeowners with a means for proactively making adjustments to their Zestimate. They can update a bathroom count or square footage and immediately see a change in their Zestimate. Beyond that, we want to recalculate Zestimates when homes are listed on the market, because in these cases an off-the-market home is updated with all of the latest facts so that it is represented accurately on the market.
  14. In lambda architecture, we want our speed layer to read from the same generic data-generating processes that our batch layer does. Amazon Kinesis (Firehose and Streams) makes it easy to both write to S3 as well as have consumers read directly from the stream. At this stage, you have the choice of which consumer to use. Spark Streaming can be used directly to enable code sharing (specifically, code relying on the Spark API) between the batch layer and the speed layer, but if Spark-specific code sharing is not a requirement, Amazon’s Kinesis Client Library (which Spark Streaming relies on) is a good solution. In our case, we built our Kinesis Consumer with just the Kinesis Client Library, for three reasons: (1) simplicity, (2) lack of Spark processing, and (3) Elastic MapReduce would be more expensive than a small Elastic Compute Cloud (EC2) instance.
  15. Steven