Data Science At Zillow

DATA SCIENCE AT ZILLOW
The Zestimate® and Beyond

Machine Learning vs. Statistics: Glossary (Rob Tibshirani)
Machine learning                   Statistics
network, graphs                    model
weights                            parameters
learning                           fitting
generalization                     test set performance
supervised learning                regression/classification
unsupervised learning              density estimation, clustering
large grant = $1,000,000           large grant = $50,000
nice place to have a meeting:      nice place to have a meeting:
Snowbird, Utah, French Alps        Las Vegas in August

Decision Tree: Machine Learning vs. Statistics
Ross Quinlan (1993): Programs for Machine Learning (C4.5)
Breiman et al. (1984): Classification and Regression Trees (CART)

Zillow Traffic & Usage
• More than 73 million unique users visited Zillow’s mobile apps and websites.
– Source: Internal tracking via Google Analytics, December 2014
• The Yahoo!-Zillow Real Estate Network is the largest real estate network on the Web.
– Source: comScore Media Metrix Real Estate Category Ranking by Unique Visitors, November 2014, US Data
• Zillow.com is the largest rental site on the Web.
– Source: comScore Media Metrix Real Estate Category Ranking by Unique Visitors, November 2014, US Data
• 4 out of 5 U.S. homes have been viewed on Zillow.
– Source: Zillow Internal, December 2014
• Zillow has data on more than 110 million U.S. homes.
• Zestimates and Rent Zestimates on more than 100 million U.S. homes.

How Big Are Our Data?
Zestimate scoring data     Size
Homes on Zillow            110 million
Home attributes            103
Double precision           8 bytes
Time series                220 months
Total                      ~20 TB (110M × 103 × 220 × 8 bytes)
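The total can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures on the slide; the variable names are illustrative, and a terabyte is assumed to mean 10^12 bytes.

```python
# Figures taken from the slide; names are illustrative.
homes = 110_000_000        # homes on Zillow
attributes = 103           # attributes per home
months = 220               # length of the monthly time series
bytes_per_value = 8        # one double-precision value

total_bytes = homes * attributes * months * bytes_per_value
total_tb = total_bytes / 1e12  # assumes 1 TB = 10**12 bytes

print(f"{total_tb:.1f} TB")  # prints "19.9 TB"
```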

How Frequently Do Our Data Change?

Agenda
Name                                Topics
Yeng Bun, Sr. Data Scientist        Zestimates and Zillow Home Value Index
Mike Babb, GIS Analyst              Automated Waterfront Determination and Home Street Feature Discovery
Nick McClure, Sr. Data Scientist    Data Cleaning, Fraud Detection, and Address Matching

How Do We Do It?
[Diagram: one Database queried by three stages]
Prototype (interactive mode): Query, Analysis, Modeling, Visualization
Production (batch mode): Query, Train and Score, Combine Data, Models
Production (real-time service): Query, Scoring Engine
Early Day : Java, R & C
Present Day : R & C/C++
Early Day : R
Present Day : R
Future : R & Python Future : TBD

Prototype vs. Production
Prototype: Turn idea into software quickly
Flexible
• Interactive mode
• Creative
• Experimental
Incomplete software versions
• Proof of concept
• Run on small sample dataset
Production: Run the software automatically
Rigid
• Batch and real-time modes
• Repeatability
• Maintainability
Complete software versions
• Error free
• Run on full dataset

Production Deployment
Prototype: Sample dataset
Development: Full dataset
Staging: Test site
Production: Live site

Software Hierarchy
App
• Build on top of a framework.
• Deal with specific details and complexities of the application.
Framework
• Define a standard structure for apps.
• Provide a generic app.
Infrastructure
• Basic services, communication, storage management, version control, etc.
Examples
• Rterm, system(), .Call(), .Fortran(), library(), load(), save(), …
• Rserve, SQL, Git
• EconBot
• ZPL
• One Pagers, CaseShiller Forecast, …
• Zhvi, Zhvi Forecast, Zri, Price/Rent ratio, Export MarketReport Data, …
• Zestimate, ZestimateForecast, RentZestimate, Diagnostics, …
• ZillowRserve

MapReduce vs. ZPL
[Diagram: two parallel-processing layouts side by side]
MapReduce (Hadoop): Input, Head Node, Worker Nodes, Map, Reduce, Output
ZPL (RDBMS): Input, Head Node, Worker Nodes, zplOnCompute(), zplOnUpdate(), Update Node, Output

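Data Partitioning and Parallel Computing
[Diagram: data partitioned by state and processed in parallel]
Partitions: AK, AL, AR, AZ, CA, CO, CT, …
Components: Input Database; Head Node with TaskQueue; Parallel R Worker Nodes (shown working on AL, AR, AZ); Update Node (Combine Data); Output Database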
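The partition-and-combine pattern in the diagram can be sketched in a few lines. This is not ZPL itself: the slide names zplOnCompute() and zplOnUpdate() but does not show their signatures, so on_compute and on_update below are hypothetical stand-ins, a Python thread pool stands in for the parallel R worker nodes, and the per-state median is an arbitrary example computation.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def on_compute(state, homes):
    """Worker step (hypothetical stand-in for zplOnCompute()):
    process one state's partition independently of the others."""
    return state, median(home["value"] for home in homes)


def on_update(results):
    """Update-node step (hypothetical stand-in for zplOnUpdate()):
    combine the per-state results into a single output."""
    return dict(results)


def run(homes_by_state, workers=4):
    """Head-node step: queue one task per state, then combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(on_compute, state, homes)
                   for state, homes in sorted(homes_by_state.items())]
        return on_update(future.result() for future in futures)


# Made-up input data, partitioned by state.
sample = {
    "AK": [{"value": 300}, {"value": 500}],
    "AL": [{"value": 200}],
}
combined = run(sample)
```

Partitioning by state keeps each task independent, so workers never need to talk to each other; only the combine step sees all results.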

Zillow RServe
• Binary TCP/IP Servers
– Expose R functions for client apps to call
– Auto-load models generated by batch jobs
– Auto cleanup
– Redundancy: multiple ports and multiple boxes
• Clients
– C/C++, Python, R, Java, C#, etc.
– SQL Server
– Web server
– Mobile devices

Performance
Machine: 2.5 GHz, 16 cores, 128 GB RAM
Job                         Mode / input                   Result
Real-Time Zestimates        Throughput per connection      12/sec
Real-Time RentZestimates    Throughput per connection      20/sec
Zestimates                  Batch mode, 3 times/week       13 hours
RentZestimates              Batch mode, weekly             3 hours
Historical Zestimates       220 monthly data points        ~5 days on 24 boxes (220 × 13 / 24 / 24)
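The historical run time follows from the batch figures above. A short sketch of the arithmetic, assuming each monthly data point costs one 13-hour batch run and the work spreads evenly across the 24 boxes (variable names are illustrative):

```python
monthly_points = 220   # historical months to score
hours_per_run = 13     # one Zestimate batch run
boxes = 24             # machines working in parallel

total_hours = monthly_points * hours_per_run   # 2860 machine-hours
hours_per_box = total_hours / boxes            # wall-clock hours
days = hours_per_box / 24                      # wall-clock days

print(f"{days:.2f} days")  # prints "4.97 days"
```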

A QUICK LOOK
Rent Zestimate®

Rent Zestimate
[Flowchart: Rent Zestimate batch job and scoring engine]
Tables (SQL Server): PropertyDimensionEditedFacts, PropertyDimension, PropertyUserDimension, PropertyTaxAssessment, RegionDimension, PropertyDimensionImputedFacts, <ZestimateDate>_ForRentPosting, <ZestimateDate>_ZestSmooth, <ZestimateDate>_RZest
Data sets: Training Data, Scoring Data
Steps shown: Query, pre-process, Impute, Dynamic Filter, Train County Model, Train State Model, Reconcile Edited Facts, Score County Model, Good? (Yes/No), Score State Model
Models: Model (B), Model (C)

Measuring Accuracy
Hold out 30% of the data.
If rz are the Rent Zestimates for homes in the hold-out dataset, then the percent errors of the estimates are
e = 100 * (rz - r) / r
where r are the actual rental listing prices.
Two key metrics:
• median(abs(e))
• percent of estimates within 10% of rent price: 100 * count(abs(e) < 10) / count(e)
Metrics are grouped by county, metro, state, and nation.
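The two metrics translate directly into code. The sketch below uses only the Python standard library; the function names and the four sample rents are made up for illustration, and grouping by region is left out.

```python
from statistics import median


def percent_errors(estimates, actuals):
    """e = 100 * (rz - r) / r for each hold-out home."""
    return [100.0 * (rz - r) / r for rz, r in zip(estimates, actuals)]


def accuracy_metrics(estimates, actuals, tolerance=10.0):
    """Return median(abs(e)) and the percent of estimates with abs(e) < tolerance."""
    abs_e = [abs(e) for e in percent_errors(estimates, actuals)]
    within = sum(1 for e in abs_e if e < tolerance)
    return {
        "median_abs_pct_error": median(abs_e),
        "pct_within_tolerance": 100.0 * within / len(abs_e),
    }


# Made-up hold-out sample: Rent Zestimates vs. actual listing rents.
rz = [1050, 1800, 2600, 900]
r = [1000, 2000, 2500, 1000]
metrics = accuracy_metrics(rz, r)
```

For this sample the percent errors are 5, -10, 4, and -10, so the median absolute error is 7.5 and half of the estimates fall strictly within 10%.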

Rent Zestimate Accuracy: National

Rent Zestimate Accuracy: National

HOUSING MARKET
Zillow Rent Index (ZRI)

Zillow Rent Index (ZRI): Methodology
• Calculate Raw Median Rent Zestimates (ZRI raw)
• Apply Smoothing Filter
• Apply Seasonal Adjustment
• Quality Control
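The first two steps can be sketched as follows. The slide does not say which smoothing filter is used, so the trailing moving average here is purely a stand-in, and seasonal adjustment and quality control are omitted. Function names and sample values are hypothetical.

```python
from statistics import mean, median


def raw_index(values_by_month):
    """Raw index: the median of all estimates in each month."""
    return [median(values) for values in values_by_month]


def smooth(series, window=3):
    """Trailing moving average (a stand-in; the actual filter is not specified)."""
    return [mean(series[max(0, i - window + 1): i + 1])
            for i in range(len(series))]


# Made-up Rent Zestimates for three consecutive months.
months = [
    [1000, 1200, 1100],
    [1300, 1250, 1400],
    [1500, 1350, 1450],
]
zri_raw = raw_index(months)
zri_smoothed = smooth(zri_raw, window=2)
```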

HOUSING MARKET
Zillow Home Value Index (ZHVI)

Zillow Home Value Index (ZHVI): Methodology
• Calculate Raw Median Zestimates (ZHVI raw)
• Apply Systematic Error Correction
• Apply Smoothing Filter
• Apply Seasonal Adjustment
• Quality Control

Zestimate Accuracy: National

The Good, The Bad And The Ugly
Ad-hoc
Prototype
Interactive mode
Batch mode
Real-time service
ZHVI, Forecast, ZRI, Price/Rent, Diagnostics, …

Zillow, Python, R, and GIS
Mike Babb, GIS Analyst

Overview
• GIS@Zillow
– Who we are, how we function within the larger organization
• Technology Stack
– How we do what we do
• Several examples
– Automated Waterfront Determination
– Home Street Feature Discovery

GIS@Zillow
• Three-person team, somewhat like an in-house GIS consulting shop
– Michalis Avraam, Ph.D.: Lead GIS Analyst
– Mike Babb, Ph.C.: GIS Analyst
– Andrew Smyth: GIS Analyst
• What we do:
– Automating the incorporation of spatially explicit data (spatial ETL).
– Adjusting boundary geometry (cities, school districts, Zip Codes, etc.).
– Conflating geospatial data from different vendors into a congruent product.
– The discovery, creation, and formalization of spatial relationships into machine-comprehensible data for input into the Zestimation algorithm.

How we do it
• Most development is done in Windows.
• Highly available SQL Server DBs store current and historical property data.
• 75% Python, 15% R, 5% SQL Server, 5% bash and shell.
• But…
• A proprietary, Linux-only, in-house database is used for blazingly fast in-memory and HTTP lookup.
• Crawl – walk – run.

Tools and libraries
• PYTHON LIBRARIES
• Data Management, Analysis, and Storage
– multiprocessing
– pandas
– numpy
– sqlite3
• Spatial Analysis
– ArcPy
– gdal/ogr/osr
– Rtree
– shapely
• R LIBRARIES
• Data Management, Analysis, and Storage
– data.table
– doSNOW
– RSQLite
• Spatial Analysis
– gpclib
– maptools
– rgdal
– rgeos
– sp

AUTOMATED WATERFRONT DETERMINATION

Automated Waterfront Determination
• Motivation
– Homes on the waterfront are valued differently than homes not on the waterfront.
– Incorporating measures of waterfront access into our Zestimation algorithm helps increase the accuracy of our models.
• Needs
– Distinguish between proximity and access.
– Identify properties that are near the waterfront but have intervening properties and intervening streets.
• Tools
– Most processing done using R and the following geospatial libraries: sp, maptools, rgdal, and rgeos. Native R objects and data.table objects are used for storage and data management.
– Multiprocessing techniques were used where possible.
• Technique
– Identify parcels within 250 meters of the shore.
– Use ray tracing to identify intervening features.
– Visualize and check results using ArcMap.
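The technique can be illustrated with a minimal geometric sketch. The slides describe an R implementation built on sp and rgeos; the code below is instead plain Python with a hand-written segment-crossing test, and it makes several simplifying assumptions: coordinates are planar and in meters, each parcel is reduced to a single point, the nearest shore point is already known, and intervening features are given as line segments. All function names are hypothetical.

```python
from math import hypot


def _orientation(a, b, c):
    """Sign (+1, 0, -1) of the cross product (b - a) x (c - a)."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def segments_cross(p1, p2, q1, q2):
    """True when segment p1-p2 strictly crosses segment q1-q2."""
    return (_orientation(p1, p2, q1) * _orientation(p1, p2, q2) < 0
            and _orientation(q1, q2, p1) * _orientation(q1, q2, p2) < 0)


def classify_parcel(parcel, shore_point, obstacles, max_distance=250.0):
    """Cast a ray from the parcel to the shore and look for blockers.

    obstacles is a list of (kind, start, end) segments, for example a
    street centerline or one edge of a neighboring parcel.
    """
    distance = hypot(parcel[0] - shore_point[0], parcel[1] - shore_point[1])
    if distance > max_distance:
        return "beyond 250 m"
    for kind, start, end in obstacles:
        if segments_cross(parcel, shore_point, start, end):
            return f"near water, intervening {kind}"
    return "waterfront"


shore = (0.0, 0.0)
street = ("street", (-100.0, 100.0), (100.0, 100.0))
results = [
    classify_parcel((0.0, 50.0), shore, [street]),
    classify_parcel((0.0, 150.0), shore, [street]),
    classify_parcel((0.0, 400.0), shore, [street]),
]
```

The first parcel sits between the street and the shore, the second is within 250 meters but behind the street, and the third is too far away to be considered. This is the distinction between proximity and access that the slide calls out.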

Parcels within 250 meters

Identify intervening parcels

Identify intervening streets

Final waterfront determination

HOME STREET FEATURE DISCOVERY

Home Street Feature Discovery
• Motivation
– Homes are on a street network.
– What information about a home can we gather from a home’s relationship to its street?
• Needs
– The orientation of the home to the street.
– The sequence of homes along a street.
– Various other needs.
• Tools
– Proprietary database used to fuzzy-match a home to a street segment. Database accepts both same-machine in-memory lookup and HTTP requests.
– Pandas for IO, batching, analysis, and storage.
– ArcPy for prototyping.
– Rtree, shapely, and gdal/ogr/osr for production.
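Two of the needs listed above, which side of the street a home is on and the order of homes along it, reduce to simple vector arithmetic once a home has been matched to a street segment. The sketch below is plain Python rather than the Rtree/shapely production stack named on the slide. It assumes planar coordinates, a straight street segment, and homes represented as points, and every name in it is hypothetical.

```python
from math import hypot


def position_along(start, end, home):
    """Distance from start to the home's projection onto the street line."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = hypot(dx, dy)
    return ((home[0] - start[0]) * dx + (home[1] - start[1]) * dy) / length


def side_of_street(start, end, home):
    """Side of the street, as seen when travelling from start to end."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    cross = dx * (home[1] - start[1]) - dy * (home[0] - start[0])
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "on street"


def sequence_homes(start, end, homes):
    """Names of the homes ordered by their position along the street."""
    return sorted(homes, key=lambda name: position_along(start, end, homes[name]))


# Made-up street running east from (0, 0) to (100, 0), with three homes.
street_start, street_end = (0.0, 0.0), (100.0, 0.0)
homes = {"h1": (80.0, 10.0), "h2": (20.0, -10.0), "h3": (50.0, 10.0)}
order = sequence_homes(street_start, street_end, homes)
sides = {name: side_of_street(street_start, street_end, pt)
         for name, pt in homes.items()}
```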

The Laurelhurst Neighborhood

Sequence of houses along a street

DATA SCIENCE AT ZILLOW
PART 3:
PYTHON AND GRAPHLAB
Nick McClure, Senior Data Scientist
Zillow

Why Data Modeling?
• Find outliers
• Find bad data
• Database cleaning
• Imputation of missing data

The Role of Python in Data Science
• Increase in computation!
• New algorithms and complex old ones can be implemented now, easier than ever.
• The demand for analytic talent is insatiable.
• So a versatile, fast, and easy-to-learn language is invaluable.
• That language must be usable by both developers and analysts.

Heavily Used Python Tools in Zillow Data Science
• NumPy – Speeds up computations with pre-allocated, fixed-type arrays.
• Pandas – Creates the familiar ‘data frame’ object and relevant tools.
• scikit-learn – Easy-to-use machine learning tools.
• Textmining – Allows analysis of unstructured text fields.
• Pymssql/Pyodbc – Connections to SQL Server.
• SQLite3 – Creation of local databases.
• GraphLab Create – Strikingly fast machine learning, easy and scalable application creation by Dato (previously GraphLab).

• Dato maintains ‘GraphLab Create’, an open source Python package that allows very easy and scalable applications to be built in simple code. (Functions should have intelligent defaults!)

Dato Example: Finding Quantiles of MoM Change in ALL Zestimates
• Example:
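The GraphLab Create code shown on this slide did not survive in the transcript. As a rough illustration of the computation only, the sketch below derives month-over-month (MoM) percent changes and their quantiles with the Python standard library. It is not the Dato code, it makes no attempt at GraphLab's scale, and the function names and sample values are made up.

```python
from statistics import quantiles


def mom_changes(values):
    """Percent change between each pair of consecutive monthly values."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(values, values[1:])]


def mom_quantiles(series_by_home, n=4):
    """Cut points dividing all MoM changes, pooled across homes, into n groups."""
    pooled = [change
              for series in series_by_home.values()
              for change in mom_changes(series)]
    return quantiles(pooled, n=n)


# Made-up monthly Zestimates for two homes.
zestimates = {
    "home_a": [300_000, 303_000, 306_000],
    "home_b": [200_000, 198_000, 202_000],
}
cut_points = mom_quantiles(zestimates)
```

A home whose MoM change falls far outside the outer cut points would be a candidate for closer inspection.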

Dato: MoM Quantile Results!

Dato and Zestimate MoM Analysis Takeaways
• Dato’s GraphLab Create tool is immensely powerful and easy to use.
• Integration into your AWS is as easy as setting up an environment and writing a function.
• We now have a tool that can slice and dice the Zestimate and look at all of our data by any number of factors.
• Fast in comparison to alternatives (~1-3 hours total).
• Currently setting up this tool to catch problematic Zestimates every month.

Using Scikit Learn for Fraudulent Listing Detection
[Screenshots: Make Me Move Fraud | Commercial Listing]

Finding Fraud: Methodology
• Every property has lots of information: attributes (beds, baths, …), address, pricing data, transactional data, account information, and unstructured text descriptions.
• We create features based on this information.
• Train a gradient-boosted random forest on these features, using known fraudulent and non-fraudulent listings.
• Each listing is scored and labeled as fraud when P(fraud) > 0.5.
• Scored data is fed back into the fraud model weekly for retraining.
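The train-then-threshold steps above might look roughly like this in scikit-learn. The slide does not name the estimator class, so GradientBoostingClassifier is an assumption here. The two features, the data, and the injected signal are synthetic and exist only to make the sketch runnable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for engineered features (hypothetical names).
price_ratio = rng.normal(1.0, 0.2, n)   # list price / local median price
desc_length = rng.normal(300, 80, n)    # characters in the description
is_fraud = (rng.random(n) < 0.1).astype(int)
price_ratio[is_fraud == 1] -= 0.5       # inject an artificial fraud signal

X = np.column_stack([price_ratio, desc_length])

# Assumed estimator; the slides only say "gradient-boosted".
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, is_fraud)

# Column 1 of predict_proba is P(fraud) because the classes are [0, 1].
p_fraud = model.predict_proba(X)[:, 1]
flagged = p_fraud > 0.5
```

In practice the model would be scored on listings it was not trained on. Scoring the training data here only keeps the example short.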

Finding Fraud: Results
• Fraudulent Listing Model
• The current iteration of fraud detection is 96.9%* accurate.
• The null prediction benchmark is 96.1% accurate.
* Note that high % accuracy when predicting rare events is expected.
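The footnote is easier to see with numbers. The counts below are invented so that they reproduce the two percentages on the slide; they are not Zillow's data. In particular, the split between missed fraud and false alarms is an arbitrary assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of listings labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def null_accuracy(y_true):
    """Accuracy of always predicting the majority class."""
    fraud_rate = sum(y_true) / len(y_true)
    return max(fraud_rate, 1.0 - fraud_rate)


def recall(y_true, y_pred):
    """Fraction of actual fraud that was caught."""
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return caught / sum(y_true)


# Invented example: 1,000 listings, 39 of them fraudulent (3.9%).
# The hypothetical model catches 15, misses 24, and raises 7 false alarms.
y_true = [1] * 39 + [0] * 961
y_pred = [1] * 15 + [0] * 24 + [1] * 7 + [0] * 954

model_acc = accuracy(y_true, y_pred)     # 0.969
baseline_acc = null_accuracy(y_true)     # 0.961
model_recall = recall(y_true, y_pred)    # about 0.385
```

With fraud this rare, predicting "not fraud" for every listing already scores 96.1%, so a measure such as recall says more about the model than accuracy does.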

Record Linkage: Property Matching
[Diagram: record linkage across data sources]
Sources: County Records, User Records, MLS-1, MLS-2
Stages: Matching, Data Cleaning, Zestimate
Example: 123 Main St., Bellevue, WA 89555 vs. 123 Main St., Seattle, WA 89555

Property Matching: The Problem
[Figure: Property 1 vs. Property 2]

Property Matching Methodology
[Diagram: matching pipeline]
New Data: split by zip code (Zip code 1, Zip code 2, Zip code 3, …)
Candidates: superset of current properties in zip code 1
Matching Algorithm
- knn algorithm (Dato)
- text & numeric features
- feature weights
- distance metrics
Results
• Outputs all new data with most probable matches
• P(match) > constant
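The overall shape of this pipeline (restrict candidates to the same zip code, find the nearest neighbor under a weighted text-plus-numeric distance, keep only confident matches) can be sketched without GraphLab. The code below is not Dato's knn implementation. It uses difflib from the standard library for text similarity, a distance cutoff in place of P(match) > constant, and made-up weights, field names, and records.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def distance(a, b, w_text=0.7, w_num=0.3):
    """Weighted distance over one text and one numeric feature (illustrative weights)."""
    text_d = 1.0 - SequenceMatcher(None, a["address"].lower(),
                                   b["address"].lower()).ratio()
    num_d = abs(a["sqft"] - b["sqft"]) / max(a["sqft"], b["sqft"])
    return w_text * text_d + w_num * num_d


def match(new_records, existing, max_distance=0.2):
    """Map each new record id to its nearest existing id, or None if too far."""
    by_zip = defaultdict(list)
    for record in existing:
        by_zip[record["zip"]].append(record)   # block candidates by zip code

    results = {}
    for record in new_records:
        candidates = by_zip.get(record["zip"], [])
        best = min(candidates, key=lambda c: distance(record, c), default=None)
        if best is not None and distance(record, best) <= max_distance:
            results[record["id"]] = best["id"]
        else:
            results[record["id"]] = None
    return results


existing = [
    {"id": "E1", "zip": "98105", "address": "123 Main St", "sqft": 1500},
    {"id": "E2", "zip": "98105", "address": "456 Oak Ave", "sqft": 2200},
    {"id": "E3", "zip": "98004", "address": "123 Main St", "sqft": 1500},
]
new_records = [
    {"id": "N1", "zip": "98105", "address": "123 Main Street", "sqft": 1490},
    {"id": "N2", "zip": "98105", "address": "789 Pine Rd", "sqft": 900},
    {"id": "N3", "zip": "99999", "address": "1 Elm Ct", "sqft": 1000},
]
matches = match(new_records, existing)
```

Blocking by zip code is what keeps the comparison count manageable: each new record is compared only against properties in its own zip code rather than against the full database.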

Property Matching: Results
• Speed:
– Unmatched LA County records: ~150k
– LA database records: ~2 million
– Naive approach at 100k comparisons per second: ~33 days of computing time.
– GraphLab’s shortcuts reduce the time to 20-40 minutes (8-core personal desktop).
• Previous match rates ranged from roughly 65% to 95%.
• Test cases so far have resulted in new match rates of >97%.
• What are we missing? New lots, construction, incomplete addresses.

Current Openings at Zillow!
(www.Zillow.com/jobs)
• Software Development Engineer, Machine Learning
• Senior Business Intelligence Developer
• Manager, Business Analytics
• Program Manager, Enterprise Data Warehouse
• Data Analyst, Listing and Data Quality
• Reporting Analyst
• Data Quality Control Specialist
• Senior Software Development Engineer
• Software Development Engineer
• Hardware/Datacenter Technician
• Cloud Architect
• Associate Software Test Engineer
• Software Development Manager
• Full Stack Software Dev. Engineer
• Search Infrastructure Engineer
• Senior Database Developer
• SOC Engineer
• UX Developer
Jobs within Zillow Analytics
Other SDE/IT Jobs within Zillow

Fun Benefits of Working at Zillow
• Great benefits (gym access, matched 401k, medical, dental, vision, …)
• Free Fitbit!
• Monthly zSpeakers.
– Previous speakers: Arianna Huffington, former Mexican President Vicente Fox, Seahawks Defensive End Cliff Avril, Joel Spolsky, …
• Free snacks, drinks, candy, …
• Treadmill rooms, ping pong, shuffleboard, game room, …
• Free ORCA card.
• Bi-annual Hackweek.
• Quarterly Group Outings (clam bake, kayaking, ice skating, …)
• Smart, fun, and helpful colleagues in a relaxed atmosphere!
