At Zillow, we calculate a Zestimate® home value for about 100 million homes nationwide daily. But between batch runs, users could update their home facts or even list their home on the market. Housing markets move fast, and we want Zestimates to reflect the latest state of our housing data. In this talk, I will present the architecture of the Zestimate and the infrastructure powering it. Inspired by Lambda Architecture, the Zestimate relies on both a near real-time and a batch component. I will highlight how the design allows us to be nimble in the face of data changes, while not sacrificing algorithmic accuracy during daily batch runs.
1. ZESTIMATE + LAMBDA ARCHITECTURE
Steven Hoelscher, Machine Learning Engineer
How we produce low-latency, high-quality home estimates
2. Goals of the Zestimate
• Independent
• Transparent
• High Accuracy
• Low Bias
• Stable over time
• Respond quickly to data updates
• High coverage (about 100M homes)
www.zillow.com/zestimate
3. In early 2015, we shared the original architecture of the Zestimate…
…but a lot has changed
4. Then (2015)
• Languages: R and Python
• Data Storage: on-prem RDBMSs
• Compute: on-prem hosts
• Framework: in-house parallelization library (ZPL)
• People: Data Analysts and Scientists
Now (2017)
• Languages: Python and R
• Data Storage: AWS Simple Storage Service (S3), Redis
• Compute: AWS Elastic MapReduce (EMR)
• Framework: Apache Spark
• People: Data Analysts, Scientists, and Engineers
So, what’s changed?
5. Lambda Architecture
• Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015)
• An architecture for scalable, fault-tolerant, low-latency big data systems
[Diagram: latency vs. accuracy tradeoff — low latency with approximate accuracy vs. high latency with full accuracy]
7. High-level Lambda Architecture
• We can process new data with only a batch layer, but for computationally expensive queries the results will be out-of-date
• The speed layer compensates for this lack of timeliness by computing, generally, approximate views
9. Data is immutable
PropertyId Bedrooms Bathrooms SquareFootage UpdateDate
1 2.0 1.0 1450 2010-03-13
1 2.0 2.0 1500 2015-05-15
1 3.0 2.5 1800 2016-06-24
Above, we see the evolution of a home over time:
• Constructed in 2010 with 2 bedrooms and 1 bath
• A full bath added five years later, increasing the square footage
• Finally, another bedroom is added, as well as a half-bath
10. Data is eternally true
PropertyId Bathrooms UpdateTime
1 2.0 2015-05-15
1 2.5 2016-06-24
PropertyId SaleValue SaleTime
1 450000 2015-08-19
The first bathroom value (2.0) would have been overwritten in our mutable data view, so this transaction in our training data would erroneously use a bathroom upgrade from the future.
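The point-in-time lookup this slide motivates can be sketched in plain Python. The records, field names, and `facts_as_of` helper below are illustrative stand-ins, not Zillow's actual schema or code:

```python
from datetime import date

# Hypothetical fact and sale records, mirroring the slide's tables.
bathroom_facts = [
    {"property_id": 1, "bathrooms": 2.0, "update_time": date(2015, 5, 15)},
    {"property_id": 1, "bathrooms": 2.5, "update_time": date(2016, 6, 24)},
]
sale = {"property_id": 1, "sale_value": 450000, "sale_time": date(2015, 8, 19)}

def facts_as_of(facts, property_id, as_of):
    """Return the latest fact recorded on or before `as_of` (a point-in-time join)."""
    candidates = [
        f for f in facts
        if f["property_id"] == property_id and f["update_time"] <= as_of
    ]
    return max(candidates, key=lambda f: f["update_time"], default=None)

# The 2015 sale is joined to the 2.0-bath record, not the future 2.5-bath upgrade.
fact = facts_as_of(bathroom_facts, sale["property_id"], sale["sale_time"])
print(fact["bathrooms"])  # 2.0
```

Joining each sale to the latest fact recorded on or before the sale date is what keeps "upgrades from the future" out of the training data.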
12. Batch Layer Highlights
ETL
• Ingests master data
• Standardizes data across many sources
• Dedupes, cleanses, and performs sanity checks on data
• Stores partitioned training and scoring sets in Parquet format
Train
• Large memory requirements (caching training sets for various models)
Score
• Scoring set partitioned in uniform chunks for parallelization
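The "uniform chunks" idea can be illustrated with a toy sketch. In production this runs as Apache Spark on EMR; the `uniform_chunks` and `score_chunk` helpers and the thread pool below are stand-ins, not the real pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def uniform_chunks(items, chunk_size):
    """Partition the scoring set into uniform chunks so workers get balanced loads."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def score_chunk(chunk):
    """Toy stand-in for scoring one partition of homes with a trained model."""
    return [(pid, 100_000 + pid) for pid in chunk]

property_ids = list(range(8))  # stand-in for ~100M homes
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [row
               for rows in pool.map(score_chunk, uniform_chunks(property_ids, 3))
               for row in rows]
print(len(results))  # 8
```

Because the chunks are uniform, no single worker is stuck with a disproportionately large partition, which is the point of the partitioning scheme on this slide.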
13. Responding to data changes quickly
• The number one source of Zestimate error is the facts that flow into it – about bedrooms, bathrooms, and square footage
• To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate
• Beyond that, we want to recalculate Zestimates when homes are listed on the market
14. Speed Layer Architecture: Kinesis Consumer
• The Kinesis consumer is responsible for low-latency transformations to the data
• Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so we cannot afford these computations in the speed layer
• It looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API
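A minimal sketch of the consumer's per-record decision, with a dict standing in for the Redis view and a stub for the API client (`redis_view`, `handle_record`, and `call_zestimate_api` are illustrative names, not the production interfaces):

```python
# Stand-in for the Redis cache of per-property facts.
redis_view = {1: {"bathrooms": 2.0, "square_footage": 1500}}

def call_zestimate_api(property_id, facts):
    """Stub for the real Zestimate API client."""
    return {"property_id": property_id, "status": "rescored"}

def handle_record(record):
    """Lightweight transform: merge the incoming fact update into the cached
    property view and call the API only when something actually changed."""
    cached = redis_view.get(record["property_id"], {})
    changed = {k: v for k, v in record["facts"].items() if cached.get(k) != v}
    if not changed:
        return None            # nothing new; skip the API call
    cached.update(changed)     # keep the cached view current
    return call_zestimate_api(record["property_id"], cached)

result = handle_record({"property_id": 1, "facts": {"bathrooms": 2.5}})
print(result["status"])  # rescored
```

Replaying the same record a second time would find no changed facts and skip the API call, which keeps the consumer cheap under duplicate delivery.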
15. Speed Layer Architecture: Zestimate API
• Uses the latest pre-trained models from the batch layer to avoid costly retraining
• All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer
• Relies on sharding of pre-trained region models due to individual model memory requirements
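One simple way to shard region models, shown purely as an illustration (the modulo routing, shard count, and model registry below are assumptions, not the production scheme):

```python
NUM_SHARDS = 4  # hypothetical number of API hosts

def shard_for_region(region_id, num_shards=NUM_SHARDS):
    """Deterministically map a region's model to one shard, so each host only
    loads the subset of large pre-trained models it is responsible for."""
    return region_id % num_shards

# Each shard holds only its own region models, keeping per-host memory bounded.
models_by_shard = {s: {} for s in range(NUM_SHARDS)}
for region_id in range(10):  # pretend pre-trained region models
    models_by_shard[shard_for_region(region_id)][region_id] = f"model-{region_id}"

print(shard_for_region(6))  # 2
```

Any deterministic mapping works; the requirement is only that a scoring request for a given region can be routed to the one host that has that region's model in memory.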
16. Remember: Eventual Accuracy
• The speed layer is not meant to be perfect; it’s meant to be lightning fast. Your batch layer will correct mistakes, eventually.
• As a result, we can think of the speed layer view as ephemeral
Toy Example: Square Feet or Acres?
PropertyId LotSize
0 21
1 16
2 5
Imagine a GIS model for validating lot size by looking at a given property’s parcel and its neighboring parcels. But what happens if that model is slow to compute?
17. Serving Layer Architecture
• We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com
• Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer’s calculation
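The reconciliation rule is small enough to express in a few lines. This is a sketch under the stated assumption that we track when the batch run started and when each home's facts last changed (names and timestamps are illustrative):

```python
from datetime import datetime

BATCH_START = datetime(2017, 6, 1, 2, 0)  # when the nightly batch run began (illustrative)

def serve_zestimate(batch_view, speed_view, last_fact_update):
    """Serve the speed layer's estimate only when the underlying fact update
    arrived after the batch layer started, i.e. the batch view cannot have seen it."""
    if speed_view is not None and last_fact_update > BATCH_START:
        return speed_view
    return batch_view

# A fact updated after the batch run began -> the speed layer's estimate wins.
print(serve_zestimate(452_000, 455_000, datetime(2017, 6, 1, 9, 30)))  # 455000
```

If the fact update predates the batch start, the batch layer already accounted for it, so its (more accurate) view is served instead.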
18. The Big Picture
(1) Data is immutable and human-fault tolerant
(2) Performs heavy-lifting cleaning and training
(3) Reduces latency and improves timeliness
(4) Reconciles views to ensure the better estimate is chosen
19. SO DID YOU FIX MY ZESTIMATE?
Andrew Martin, Zestimate Research Manager
20. Accuracy Metrics for Real-Estate Valuation
• Median Absolute Percent Error (MAPE)
• Measures the “average” amount of error in prediction, in terms of percentage off the correct answer in either direction
• Measuring error in percentages is more natural for home prices, since they are heteroscedastic
• Percent Error Within 5%, 10%, 20%
• Measure of how many predictions fell within +/-X% of the true value
MAPE = Median( Abs( (Saleprice − Zestimate) / Saleprice ) )
Within X% = (1 / #Sales) · Σᵢ [ Abs( (Salepriceᵢ − Zestimateᵢ) / Salepriceᵢ ) < X% ]
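The two metrics above translate directly into code. A minimal implementation, using toy sale/prediction pairs rather than real data:

```python
from statistics import median

def mape(sale_prices, zestimates):
    """Median Absolute Percent Error."""
    return median(abs(s - z) / s for s, z in zip(sale_prices, zestimates))

def within_pct(sale_prices, zestimates, x):
    """Share of predictions within +/- x (as a fraction) of the sale price."""
    hits = sum(abs(s - z) / s < x for s, z in zip(sale_prices, zestimates))
    return hits / len(sale_prices)

sales = [300_000, 450_000, 200_000, 500_000]
preds = [310_000, 440_000, 260_000, 505_000]
print(round(mape(sales, preds), 4))    # 0.0278
print(within_pct(sales, preds, 0.10))  # 0.75
```

Note the median (rather than mean) makes MAPE robust to the occasional wildly mispriced home, such as the 30%-off prediction in this toy set.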
21. Did you know we keep a public scorecard?
www.zillow.com/zestimate/
22. Comparing Accuracy at 10,000 Feet
• Let’s focus on King County, WA, since the new architecture has been live there since January 2017
• We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction
• i.e., if a home sold in Kent for $300,000 on April 10th, we’d use the Zestimate from March 31st
• We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016
• We compare architectures by looking at error on the same set of sales
Architecture MAPE Within 5% Within 10% Within 20%
2015 (Z5.4) 5.1% 49.0% 75.0% 92.5%
2017 (Z6) 4.5% 54.1% 81.0% 94.9%
24. Breaking Accuracy Out by Home Type
Architecture Home Type MAPE Within 5% Within 10% Within 20%
2015 (Z5.4) SFR 5.1% 49.2% 74.8% 92.4%
2015 (Z5.4) Condo 5.1% 49.5% 76.8% 93.7%
2017 (Z6) SFR 4.5% 54.6% 81.1% 94.6%
2017 (Z6) Condo 4.6% 53.4% 81.6% 96.0%
25. Think you might have an idea for how to improve the Zestimate? We’re all ears…
www.zillow.com/promo/zillow-prize
26. We are hiring!
• Data Scientist
• Machine Learning Engineer
• Data Scientist, Computer Vision and Deep Learning
• Software Development Engineer, Computer Vision
• Economist
• Data Analyst
www.zillow.com/jobs
Editor’s notes
Hi everyone, thanks for joining me here at Zillow for today’s meet up. My name is Steven Hoelscher, and I’m a machine learning engineer on the data science and engineering team.
I’ve been with Zillow for 2.5 years now and had the opportunity to work on the team responsible for building and rearchitecting a new Zestimate pipeline, largely inspired by Lambda Architecture.
It’s my hope that you’ll walk away from this presentation with a better understanding of what Lambda Architecture means and will have seen an in-production example of how to actually realize it.
Without further ado, let’s start with the Zestimate itself and its goals. For those who aren’t familiar, the Zestimate is simply our estimated market value for individual homes nationwide. We strive to put a Zestimate on every rooftop, just as we see in this screenshot.
Every day, the Zestimate team thinks about how we can improve our algorithm, and from a data science perspective, improvement is based on whether we achieve these goals. To talk about a few: obviously, we would like our Zestimates to have high accuracy; when a home sells, it’s our goal for the Zestimate to be near that sale price. The Zestimate, as an algorithm, should also be stable over time and not exhibit erratic behavior day-to-day. The Zestimate should also be able to respond quickly to data updates. Users can supply us with more accurate data to improve our estimates, and their Zestimate should immediately reflect fact updates.
In a sense, these are the goals that our pipeline must support and we’re going to spend some more time talking about how to balance these goals in a big data system.
In early 2015, right around the time I started at Zillow, a few of my colleagues presented on the Zestimate architecture…as it was then. But a lot has changed since that presentation, only just 2 years ago.
At the core, the Zestimate in 2015 was largely written in R. Our team was composed of R-language experts, and we even built an in-house R framework for parallelization a la MapReduce. We were a smaller team back then, mostly data scientists who also had a knack for engineering. We relied on collaboration with other teams, especially our database administrators, to interface with on-premises relational databases.
Two years later, we’ve made a hiring push across all skill sets and invited engineers to join the fray. Python has become the new language of choice, thanks mostly to its long history of support in Apache Spark. We started leveraging more and more cloud-based services, such as Amazon’s Simple Storage Service for storing our data and Elastic MapReduce for compute. No longer are we bottlenecked by the size of a single machine.
With all of these changes, we had the opportunity to start afresh and design a system that would handle large amounts of data in the cloud, that would rely on horizontal scaling, and most importantly would meet the goals of the Zestimate.
Enter Lambda Architecture. The idea of Lambda Architecture was introduced by Nathan Marz, the creator of Apache Storm. I highly recommend the book he published in 2015 with the title *Big Data*. For the uninitiated, this book provided the foundations for Lambda Architecture, with great case studies for understanding how to achieve this architecture.
Simply put, Lambda Architecture is a generic data processing architecture that is horizontally scalable, fault-tolerant (in the face of both human and hardware failures), and capable of low latency responses.
Shortly, we’ll see what a high-level Lambda Architecture looks like. But before we dive into that, I want to talk about making a tradeoff between latency and accuracy. In some cases, we cannot expect to have low-latency responses when dealing with enormous amounts of data. As such, we have to trade off some degree of accuracy to reduce our latency. This idea underpins Lambda Architecture.
Let’s look at an example, highlighted by the Databricks team. Apache Spark implements an algorithm for calculating approximate percentiles of numerical data, exposed through a function called approxQuantile. The algorithm requires the user to specify a target error bound, and the result is guaranteed to be within this bound. It can be tuned to trade accuracy against computation time and memory.
In the example here, the Databricks team studies the length of the text in each Amazon review. On the x-axis, we have the targeted residual. As we would guess, the higher the residual, the less computationally expensive our calculation becomes, but the tradeoff is accuracy.
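The same accuracy-versus-cost trade-off can be demonstrated without Spark by estimating a median from a random sample. This is not Spark's actual algorithm behind approxQuantile, just a sampling sketch of the same idea, on synthetic lognormal "review lengths":

```python
import random

random.seed(0)
data = [random.lognormvariate(6, 1) for _ in range(100_000)]  # synthetic review lengths

def approx_median(values, sample_frac):
    """Estimate the median from a random sample: a smaller fraction is cheaper
    to compute but carries a larger expected error."""
    sample = random.sample(values, max(1, int(len(values) * sample_frac)))
    sample.sort()
    return sample[len(sample) // 2]

exact = sorted(data)[len(data) // 2]
for frac in (0.001, 0.01, 0.1):
    est = approx_median(data, frac)
    # Relative error tends to shrink as the sampled fraction (and cost) grows.
    print(frac, round(abs(est - exact) / exact, 3))
```

In Spark itself the equivalent knob is the relativeError argument to DataFrame.approxQuantile: a larger allowed error means a cheaper computation, which is exactly the curve the Databricks plot shows.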
Let’s start thinking about what this means for a big data processing system. We could start simple by building a batch system with low complexity. It reads directly from a master dataset that contains all of the data so far. This batch layer, as it’s called, will virtually freeze the data at the time the job begins and start running computations. The problem is that once the batch layer finishes computing a query, the data is already out-of-date: new changes have come in and were not accounted for.
This is the gap that the lambda architecture is trying to solve. We can rely on a speed layer that will compensate for the batch layer’s lack of timeliness. But the speed layer, generally speaking, cannot rely on the same algorithms that the batch layer did. In the example before, we would want our batch layer to calculate a correct and highly accurate quantile, but the speed layer should rely on approximation to be more nimble.
In this way, at any given moment, we could have two different views: one view from the batch layer that is accurate but not so timely and one view from the speed layer that is less accurate but timely. Reconciling these two views, we can answer a query in a relatively accurate and timely fashion.
At this point, we’re going to explore a few of the layers of the Lambda Architecture and see how we implement each layer for the Zestimate itself. To begin, we start with the data. As I mentioned before, most of our data in 2015 was only stored on premises in relational databases. Our first goal, then, was to move this data to the cloud and have new data-generating processes write directly to the cloud store. At Zillow, we use AWS S3 for our data lake / master dataset. It is optimized to handle a large, constantly growing set of data. In our case, we have a bucket specifically designated for raw data. In this design, we don’t want to actually modify or update the raw data, and I’ll talk about why in a second. As such, we set permissions on the bucket itself to prevent data deletes and updates. Any generic data-generating process is responsible for only appending new records to this object store, never deleting.
Most data-generating processes are writing JSON data. We do mandate a schema contract between the producers and consumers of the data, to ensure data types are conformed to.
Data is immutable. Let’s understand what this means by working through this example. We have a sample home and how it has evolved over time. In 2010, it was constructed with 2 bedrooms and 1 bathroom. Five years later, the homeowner added a full bath, thereby increasing the square footage. This was done a few months before selling the home later in 2015. A new owner purchased the home and, nearly a year later, decided to add another bedroom and half-bath. With mutable data, this story is lost. One way of storing these attributes in a relational database would be to update records with the new attributes.
Data is eternally true. Now let’s introduce the transaction that I referred to. It occurred before the number of bathrooms changed again. In our mutable data view, this transaction would have been tied to a bathroom upgrade from the future. Once we attach a timestamp to data, we ensure it is eternally true. It is eternally true that in 2015 this home had 2 bathrooms, and that in 2016 a half bath was added. This story is extremely important for data scientists. And while this example may be trivial, you can imagine tying a sale value to a larger set of home facts that weren’t actually true at that point in time.
Immutability of data allows us to retain this story. We’re no longer updating data, and as a benefit, we are less prone to human mistakes, especially when it comes to what all data scientists hold dear: the raw data itself.
After migrating our data to the AWS S3, we began work on the batch layer for the Zestimate pipeline. From a high-level, the Zestimate batch layer has a few components: first, we need to make available the raw, master dataset. Apache Spark allows us to read directly from S3, but some of our raw data sources suffer from the painful small-files problem in Hadoop. Simply put, big data systems expect to consume fewer large files rather than a lot of small files. Apache Spark suffers from this same problem. We rely heavily on vacuuming applications, such as Hadoop’s distcp, to aggregate data into larger files, by pulling from S3 and storing the aggregates on HDFS.
From there, our jobs read directly from HDFS: we begin with an ETL layer, responsible for producing training and scoring sets for our various models. Then, training and scoring takes place for about 100 M homes in the nation. Models, training and scoring sets, and performance metrics are all stored in a different bucket in S3, one for transformed data. This ensures that we’re distinguishing between the raw data (our master dataset) and the data derived from the raw data.
The ETL layer is responsible for interfacing with the master dataset and transforming it in order to arrive at cleaner, standardized datasets that are consumable by our Zestimate models. We deal with a wide variety of data sources and so need to pull appropriate features from each to build a rich feature set. We invest a lot of time in ensuring our data is clean. As we know: garbage in, garbage out, and this holds true for the Zestimate algorithm. One example we always talk about is the case of fat fingers. You can imagine that typing 500 square feet instead of 5,000 square feet could drastically change how we perceive a home’s value. This cleaning process, in addition to the partitioning required, can be computationally very expensive. This is one area where a speed layer needs to be more nimble, as it won’t be able to look at historical data to make inferences about the quality of new data.

After the ETL step, we can begin training models. Training, in our case, requires large amounts of memory to support caching of training sets for various models. We train models on various geographies, making tradeoffs between data skew and the volume of data available. Scoring is then done in parallel, using data partitioned in uniform chunks. At this point, we have a view created (the Zestimates for about 100M homes in the nation) as well as pre-trained models for the speed layer. But some of the facts that went into our model training and scoring could already be out of date.
The number one source of Zestimate error is the facts that flow into it, like bedroom count, bathroom counts, and square footage.
We provide homeowners with a means for proactively making adjustments to their Zestimate. They can update a bathroom count or square footage and immediately see a change in their Zestimate.
Beyond that, we want to recalculate Zestimates when homes are listed on the market, because in these cases an off the market home is updated with all of the latest facts so that it is represented accurately on the market.
In Lambda Architecture, we want our speed layer to read from the same generic data-generating processes that our batch layer does. Amazon Kinesis (Firehose and Streams) makes it easy both to write to S3 and to have consumers read directly from the stream. At this stage, you have a choice of which consumer to use. Spark Streaming can be used directly to enable code sharing (specifically, code relying on the Spark API) between the batch layer and the speed layer; but if Spark-specific code sharing is not a requirement, Amazon’s Kinesis Client Library (which Spark Streaming relies on) is a good solution.
In our case, we built our Kinesis consumer with just the Kinesis Client Library, for three reasons: (1) simplicity, (2) lack of Spark processing, and (3) Elastic MapReduce would be more expensive than a small Elastic Compute Cloud (EC2) instance.