2.
Machine Learning vs. Statistics: Glossary (Rob Tibshirani)
Machine learning | Statistics
network, graphs | model
weights | parameters
learning | fitting
generalization | test set performance
supervised learning | regression/classification
unsupervised learning | density estimation, clustering
large grant = $1,000,000 | large grant = $50,000
nice place to have a meeting: Snowbird, Utah, French Alps | nice place to have a meeting: Las Vegas in August
3.
Decision Tree: Machine Learning vs. Statistics
Machine learning: Ross Quinlan (1993), C4.5: Programs for Machine Learning
Statistics: Breiman et al. (1984), Classification and Regression Trees (CART)
4.
Zillow Traffic & Usage
• More than 73 million unique users visited Zillow’s mobile apps and websites.
– Source: Internal tracking via Google Analytics, December 2014
• The Yahoo!-Zillow Real Estate Network is the largest real estate network on the
Web.
– Source: comScore Media Metrix Real Estate Category Ranking by Unique Visitors, November
2014, US Data
• Zillow.com is the largest rental site on the Web.
– Source: comScore Media Metrix Real Estate category ranking by Unique Visitors, November
2014, US Data
• 4 out of 5 U.S. homes have been viewed on Zillow.
– Source: Zillow Internal, December 2014
• Zillow has data on more than 110 million U.S. homes.
– Source: Zillow Internal, December 2014
• Zestimates and Rent Zestimates on more than 100 million U.S. homes.
– Source: Zillow Internal, December 2014
5.
How Big are Our Data?
Zestimate Scoring Data | Size
Homes on Zillow | 110 million
Home attributes | 103
Double precision | 8 bytes
Time series | 220 months
Total | ~20 TB (110M * 103 * 220 * 8 bytes)
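The total follows from multiplying the four factors in the table. A quick check in Python (variable names are illustrative, and terabytes are taken as decimal, 10^12 bytes):

```python
# Back-of-the-envelope check of the Zestimate scoring data size.
homes = 110_000_000        # homes on Zillow
attributes = 103           # attributes per home
months = 220               # monthly time-series points
bytes_per_value = 8        # one double-precision number

total_bytes = homes * attributes * months * bytes_per_value
total_tb = total_bytes / 1e12   # decimal terabytes

print(f"{total_tb:.1f} TB")     # prints "19.9 TB"
```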
7.
Agenda
Name | Topics
Yeng Bun, Sr. Data Scientist | Zestimates and Zillow Home Value Index
Mike Babb, GIS Analyst | Automated Waterfront determination and Home Street feature discovery
Nick McClure, Sr. Data Scientist | Data Cleaning, Fraud Detection, and Address Matching
8.
How Do We Do It?
[Diagram labels]
Database (three Query arrows)
Prototype (Interactive mode): Query, Analysis, Modeling, Visualization
Production (Batch mode): Train and Score, Combine Data, Models
Production (Real-time service): Scoring Engine
Early Day: Java, R & C
Present Day: R & C/C++
Early Day: R
Present Day: R
Future: R & Python
Future: TBD
9.
Prototype vs. Production
Prototype: Turn idea into software quickly
Flexible
• Interactive mode
• Creative
• Experimental
Incomplete software versions
• Proof of concept
• Run on small sample dataset
Production: Run the software automatically
Rigid
• Batch and Real-time modes
• Repeatability
• Maintainability
Complete software versions
• Error free
• Run on full dataset
11.
Software Hierarchy
App
• Build on top of a framework.
• Deal with specific details and complexities of the application.
• One Pagers, CaseShiller Forecast, …
• Zhvi, Zhvi Forecast, Zri, Price/Rent ratio, Export MarketReport Data, …
• Zestimate, ZestimateForecast, RentZestimate, Diagnostics, …
• ZillowRserve
Framework
• Define a standard structure for apps.
• Provide a generic app.
• EconBot
• ZPL
Infrastructure
• Basic services, communication, storage management, version control, etc.
• Rterm, system(), .Call(), .Fortran(), library(), load(), save(), …
• Rserve, SQL, GIT
12.
MapReduce vs. ZPL
[Diagram labels]
MapReduce (Hadoop): Input, Head Node, Worker Nodes (Map, Reduce), Output
ZPL (RDBMS): Input, Head Node, Worker Nodes (zplOnCompute()), Update Node (zplOnUpdate()), Output
13.
Data Partitioning and Parallel Computing
[Diagram labels]
Partitions by state: AK, AL, AR, AZ, CA, CO, CT, …
Head Node: TaskQueue
Worker Nodes (Parallel R): AL, AR, AZ
Input Database
Update Node: Combine Data
Output Database
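The partition, compute, and combine pattern in this diagram can be sketched in a few lines of Python. This is a toy illustration, not ZPL: the function names (`partition_by_state`, `compute`, `update`, `run`) and the averaging task are hypothetical, and a thread pool stands in for the head node's task queue and the worker nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_state(rows):
    """Group input rows into one task per state."""
    tasks = defaultdict(list)
    for row in rows:
        tasks[row["state"]].append(row)
    return dict(tasks)


def compute(state, rows):
    """Worker step: a stand-in for per-partition model training or scoring."""
    prices = [r["price"] for r in rows]
    return state, sum(prices) / len(prices)


def update(results):
    """Update step: combine per-state results into one output table."""
    return dict(sorted(results))


def run(rows, workers=3):
    """Queue one task per state, run them in parallel, then combine."""
    tasks = partition_by_state(rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compute, s, part) for s, part in tasks.items()]
        results = [f.result() for f in futures]
    return update(results)


rows = [
    {"state": "AL", "price": 100.0},
    {"state": "AL", "price": 200.0},
    {"state": "AR", "price": 150.0},
    {"state": "AZ", "price": 300.0},
]
output = run(rows)
print(output)
```

Because each state is an independent task, adding workers shortens the run without changing the combine step, which is the property the slide's design relies on.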
14.
Zillow RServe
• Binary TCP/IP Servers
Expose R functions for client apps to call
Auto load models generated by batch jobs
Auto cleanup
Redundancy : multiple ports and multiple boxes
• Clients
– C/C++, Python, R, Java, C#, etc…
– SQL Server
– Web server
– Mobile devices
15.
Performance
Machine: 2.5 GHz, 16 cores, 128 GB RAM
Real-Time Zestimates: throughput per connection 12/sec
Real-Time RentZestimates: throughput per connection 20/sec
Zestimates: run in batch mode 3 times/week, 13 hours
RentZestimates: weekly run in batch mode, 3 hours
Historical Zestimates: 220 monthly data points, 5 days ~ 220*13/24/24 (24 boxes)
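The historical figure is the batch run time scaled by the number of monthly data points and divided across machines. A short check, assuming the work splits evenly over the 24 boxes (variable names are illustrative):

```python
# Estimate the wall-clock time to rebuild historical Zestimates.
monthly_points = 220   # historical data points to compute
hours_per_run = 13     # one batch Zestimate run
boxes = 24             # machines working in parallel

total_hours = monthly_points * hours_per_run   # 2860 machine-hours
days = total_hours / boxes / 24                # hours per box, then days

print(f"{days:.2f} days")   # prints "4.97 days"
```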
17.
Rent Zestimate
[Flowchart labels]
Database: SQL Server
Tables: PropertyDimensionEditedFacts, PropertyDimension, PropertyUserDimension, PropertyTaxAssessment, RegionDimension, PropertyDimensionImputedFacts, <ZestimateDate>_ForRentPosting, <ZestimateDate>_RZest, <ZestimateDate>_ZestSmooth
Batch job: Query, pre-process, Impute, Dynamic Filter, Reconcile Edited Facts, Training Data, Scoring Data, Train County Model (Model (B)), Train State Model (Model (C)), Models
Scoring engine: Score County Model (Model (B)), Good? (Yes/No), Score State Model (Model (C))
18.
Measuring Accuracy
Hold out 30% of the data.
If rz are the Rent Zestimates for homes in the hold-out dataset, then the
percent errors of the estimates are
e = 100 * (rz - r) / r
where r are the actual rental listing prices.
Two key metrics:
• median(abs(e))
• percent of estimates within 10% of rent price: 100 * count(abs(e) < 10) / count(e)
Both metrics are grouped by county, metro, state, and nation.
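The two metrics can be computed directly from the definitions above. The sketch below uses only the standard library; the function names, the random hold-out split, and the four toy rents are illustrative assumptions, not Zillow's code or data.

```python
import random
from statistics import median


def percent_errors(rz, r):
    """e = 100 * (rz - r) / r for each estimate/actual pair."""
    return [100.0 * (est - actual) / actual for est, actual in zip(rz, r)]


def accuracy_metrics(rz, r):
    """Median absolute percent error and share of estimates within 10%."""
    abs_e = [abs(e) for e in percent_errors(rz, r)]
    return {
        "median_abs_pct_error": median(abs_e),
        "pct_within_10": 100.0 * sum(a < 10 for a in abs_e) / len(abs_e),
    }


def holdout_split(items, holdout_frac=0.30, seed=0):
    """Shuffle and split into (training, hold-out) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Toy hold-out data: actual listing rents and their estimates.
rent = [1000.0, 1500.0, 2000.0, 2500.0]
rent_zestimate = [1050.0, 1400.0, 2000.0, 3000.0]
metrics = accuracy_metrics(rent_zestimate, rent)
print(metrics)
```

Grouping by county, metro, or state amounts to calling `accuracy_metrics` once per group of hold-out homes.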
30.
Overview
• GIS@Zillow
– Who we are, how we function within the larger
organization
• Technology Stack
– How we do what we do
• Several examples
– Automated Waterfront Determination
– Home Street Feature Discovery
31.
GIS@Zillow
• Three person team somewhat like an in-house GIS consulting
shop
– Michalis Avraam, Ph.D.: Lead GIS Analyst
– Mike Babb, Ph.C.: GIS Analyst
– Andrew Smyth: GIS Analyst
• What we do:
– Automating the incorporation of spatially explicit data (spatial
ETL).
– Adjusting boundary geometry (cities, school districts, Zip Codes,
etc.).
– Conflating geospatial data from different vendors into a congruent
product.
– The discovery, creation, and formalization of spatial relationships
into machine-comprehensible data for input into the Zestimation
algorithm.
32.
How we do it
• Most development done in Windows.
• Highly available SQL Server DBs store current and historical
property data.
• 75% Python, 15% R, 5% SQL Server, 5% bash and shell.
• But…
• Proprietary Linux-only in-house database used for blazingly
fast in-memory and http look up.
• Crawl – walk – run.
35.
Automated Waterfront Determination
• Motivation
– Homes on the waterfront are valued differently than homes not on the waterfront.
– Incorporating measures of waterfront access into our Zestimation algorithm helps
increase the accuracy of our models.
• Needs
– Distinguish between proximity and access.
– Identify properties that are near the waterfront but have intervening properties and
intervening streets.
• Tools
– Most processing done using R and the following geospatial libraries: sp, maptools,
rgdal, and rgeos. Native R objects and data.table objects are used for storage and
data management.
– Multiprocessing techniques were used where possible.
• Technique
– Identify parcels within 250 meters of the shore.
– Use ray tracing to identify intervening features.
– Visualize and check results using ArcMap.
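The technique above (a distance filter, then rays from the parcel to the shore) can be illustrated with a small pure-Python sketch. It is not the R implementation described on the slide: parcels are reduced to a single point, the shore to sample points, and intervening properties or streets to simple polygons. All names and coordinates are hypothetical.

```python
from math import hypot


def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): -1, 0, or 1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 crosses or touches segment q1-q2."""
    return (_orient(p1, p2, q1) != _orient(p1, p2, q2)
            and _orient(q1, q2, p1) != _orient(q1, q2, p2))


def ray_blocked(origin, target, obstacles):
    """True if the ray origin -> target hits an edge of any obstacle polygon."""
    for poly in obstacles:
        edges = zip(poly, poly[1:] + poly[:1])
        if any(segments_cross(origin, target, a, b) for a, b in edges):
            return True
    return False


def classify(parcel, shore_points, obstacles, max_dist=250.0):
    """Separate proximity (within max_dist) from access (an unblocked ray)."""
    near = [s for s in shore_points
            if hypot(s[0] - parcel[0], s[1] - parcel[1]) <= max_dist]
    if not near:
        return "not near water"
    if any(not ray_blocked(parcel, s, obstacles) for s in near):
        return "waterfront access"
    return "near water, blocked"


# Toy scene in meters: shore along y = 0, a street strip at y = 60..80.
shore = [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)]
street = [(-100.0, 60.0), (200.0, 60.0), (200.0, 80.0), (-100.0, 80.0)]

front_parcel = classify((50.0, 30.0), shore, [street])
back_parcel = classify((50.0, 120.0), shore, [street])
far_parcel = classify((50.0, 400.0), shore, [street])
print(front_parcel, back_parcel, far_parcel, sep=" | ")
```

The back parcel is within 250 meters of the shore but every ray to the water crosses the street, which is the proximity-versus-access distinction listed under Needs.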
42.
Home Street Feature Discovery
• Motivation
– Homes are on a street network.
– What information about a home can we gather from a home’s relationship to its
street?
• Needs
– The orientation of the home to the street.
– The sequence of homes along a street.
– Various other needs.
• Tools
– Proprietary database used to fuzzy-match a home to a street segment. Database
accepts both same-machine in-memory lookup and http requests.
– Pandas for IO, batching, analysis, and storage.
– ArcPy for prototyping.
– Rtree, shapely, and gdal/ogr/osr for production.
46.
DATA SCIENCE AT ZILLOW
PART 3:
PYTHON AND GRAPHLAB
Nick McClure, Senior Data Scientist
Zillow
47.
Why Data Modeling?
• Find outliers
• Find bad data
• Database cleaning
• Imputation of missing data
48.
The Role of Python in Data Science
• Increase in computation!
• New algorithms and
complex old ones can be
implemented now, easier
than ever.
• The demand for analytic
talent is insatiable.
• So a versatile, fast, and
easy-to-learn language is
invaluable.
• That language must
double as usable by
developers and by
analysts.
49.
Heavily Used Python Tools in Zillow Data Science
• NumPy – Speeds up computations with pre-allocated, fixed-type arrays and vectorized operations.
• Pandas – Creates the familiar ‘data frame’ object and relevant tools.
• Scikit Learn – Easy to use machine learning tools.
• Textmining – Allows analysis of unstructured text fields.
• Pymssql/Pyodbc – Connections to SQL Server.
• SQLite3 – Creation of local databases.
• Graphlab Create – Strikingly fast machine learning, easy and scalable
application creation by Dato (previously Graphlab).
50.
• Dato maintains ‘Graphlab Create’, an open source python package that
allows very easy and scalable applications to be built in simple code.
(Functions should have intelligent defaults!)
54.
Dato and Zestimate MoM Analysis Takeaways
• Dato’s Graphlab Create tool is immensely powerful and easy to use.
• Integration into your AWS is as easy as setting up an environment and
writing a function.
• We now have a tool that can slice and dice the Zestimate and look at all
of our data by any number of factors.
• Fast in comparison to alternatives. (~1-3 hours total)
• Currently setting up this tool to catch problematic Zestimates every
month.
55.
Using Scikit Learn for Fraudulent Listing Detection
Make Me Move Fraud Commercial Listing
56.
Finding Fraud: Methodology
• Every property has lots of information: attributes (bds, bths, …),
address, pricing data, transactional data, account information, and
unstructured text descriptions.
• We create features based on this information.
• Train a gradient-boosted random forest with features on known
fraudulent and non-fraudulent listings.
• Each listing is scored and flagged as fraud when P(fraud) > 0.5.
• Scored data is fed back into the fraud model weekly for retraining.
57.
Finding Fraud: Results
• Fraudulent Listing Model
• The current iteration of Fraud detection is 96.9%*
accurate.
• The null prediction benchmark is 96.1% accurate.
* Note that high % accuracy when predicting rare events is expected.
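The footnote is easier to see with numbers. The sketch below assumes the null prediction labels every listing as non-fraudulent, so a 96.1% benchmark implies roughly 3.9% of listings are fraudulent; the toy labels are built to match that rate and are not Zillow data. The helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_fraud(probabilities, threshold=0.5):
    """Flag a listing as fraud (1) when P(fraud) exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]


# Toy labels: 39 fraudulent listings out of 1,000 (assumed 3.9% rate).
y_true = [1] * 39 + [0] * 961

# Null prediction: call everything non-fraudulent.
null_accuracy = accuracy(y_true, [0] * len(y_true))

# Accuracy reported on the slide for the current model.
model_accuracy = 0.969

# Share of the null model's errors that the model removes.
error_reduction = (model_accuracy - null_accuracy) / (1 - null_accuracy)

print(f"null accuracy: {null_accuracy:.3f}")
print(f"error reduction vs. null: {error_reduction:.1%}")
```

Read this way, the 0.8-point gain over the benchmark corresponds to removing about a fifth of the null model's errors, which is why accuracy alone understates or overstates performance on rare events.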
58.
Record Linkage: Property Matching
Zestimate
Data
Cleaning
County
Records
User
Records
MLS-1
MLS-2
Matching
Vs.
123 Main St.
Bellevue, WA 89555
123 Main St.
Seattle, WA 89555
60.
Property Matching Methodology
New Data
Zip code 1
Zip code 2
Zip code 3
.
.
.
Superset of
current
properties in
zip code 1
Matching Algorithm
-knn algorithm (Dato)
-text & numeric features
-feature weights
-distance metrics
Results
• Outputs all new data with most probable matches
• P(match) > constant
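A minimal sketch of this workflow in plain Python: block candidates by zip code, score each candidate with a weighted distance over a text feature and a numeric feature, and keep the nearest neighbor only if it is close enough. It does not use Dato's nearest-neighbor API. The feature choice, weights, and the distance cutoff (standing in for the slide's P(match) > constant) are hypothetical.

```python
def tokens(address):
    """Lower-case address tokens with commas and periods removed."""
    return set(address.lower().replace(",", " ").replace(".", " ").split())


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the two addresses' token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 1.0
    return 1.0 - len(ta & tb) / len(union)


def record_distance(new, old, w_text=0.7, w_sqft=0.3):
    """Weighted sum of a text distance and a relative numeric difference."""
    text_d = jaccard_distance(new["address"], old["address"])
    sqft_d = abs(new["sqft"] - old["sqft"]) / max(new["sqft"], old["sqft"])
    return w_text * text_d + w_sqft * sqft_d


def match(new_records, current_by_zip, max_distance=0.25):
    """Nearest neighbor within the same zip code, or None if too far."""
    matches = {}
    for rec in new_records:
        candidates = current_by_zip.get(rec["zip"], [])
        best = min(candidates, key=lambda c: record_distance(rec, c),
                   default=None)
        if best is not None and record_distance(rec, best) <= max_distance:
            matches[rec["id"]] = best["id"]
        else:
            matches[rec["id"]] = None
    return matches


current_by_zip = {
    "98004": [
        {"id": "p1", "address": "123 Main St, Bellevue, WA", "sqft": 1800},
        {"id": "p2", "address": "125 Main St, Bellevue, WA", "sqft": 2400},
    ],
}
new_records = [
    {"id": "n1", "zip": "98004", "address": "123 MAIN ST. BELLEVUE WA", "sqft": 1795},
    {"id": "n2", "zip": "98004", "address": "900 Lake Hills Blvd, Bellevue, WA", "sqft": 1200},
    {"id": "n3", "zip": "98109", "address": "1 Example Ave, Seattle, WA", "sqft": 900},
]
result = match(new_records, current_by_zip)
print(result)
```

Blocking by zip code is what keeps the comparison count manageable: each new record is compared only with the properties in its own zip code rather than with the whole database.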
61.
Property Matching: Results
• Speed:
– Unmatched LA County Records: ~150k
– LA Database Records: ~2 Million
– Naively at 100k comparisons a second = ~33 days of computing time.
– Graphlab’s shortcuts reduce time to 20-40 minutes!!! (8 core personal desktop)
• Previous match rates ranged from about 65% to 95%.
• Test cases so far have resulted in new match rates of >97%.
• What are we missing? New lots, construction, incomplete addresses.
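The naive cost quoted under Speed comes from comparing every unmatched record with every database record. Using the slide's rounded counts (variable names are illustrative):

```python
# All-pairs comparison cost for the LA County example.
unmatched = 150_000                 # unmatched county records (approximate)
database = 2_000_000                # database records (approximate)
comparisons_per_second = 100_000    # naive comparison rate

comparisons = unmatched * database              # 3e11 pairs
seconds = comparisons / comparisons_per_second
days = seconds / 86_400

print(f"{days:.1f} days")   # prints "34.7 days"
```

With these rounded inputs the result is about 35 days, in line with the slide's ~33-day figure, which was presumably computed from unrounded counts.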
62.
Current Openings at Zillow!
(www.Zillow.com/jobs)
• Software Development Engineer, Machine Learning
• Senior Business Intelligence Developer
• Manager, Business Analytics
• Program Manager, Enterprise Data Warehouse
• Data Analyst, Listing and Data Quality
• Reporting Analyst
• Data Quality Control Specialist
• Senior Software Development Engineer
• Software Development Engineer
• Hardware/Datacenter Technician
• Cloud Architect
• Associate Software Test Engineer
• Software Development Manager
• Full Stack Software Dev. Engineer
• Search Infrastructure Engineer
• Senior Database Developer
• SOC Engineer
• UX Developer
Jobs within Zillow
Analytics
Other SDE/IT Jobs
within Zillow
63.
Fun Benefits of Working at Zillow
•Great benefits (gym access, matched 401k, medical, dental, vision,…)
•Free fitbit!
•Monthly zSpeakers.
–Previous speakers: Arianna Huffington, Mexican Pres. Vicente Fox, Seahawks Defensive
End Cliff Avril, Joel Spolsky, …
•Free snacks, drinks, candy, …
•Treadmill rooms, ping pong, shuffleboard, game room,…
•Free Orca card.
•Bi-annual Hackweek.
•Quarterly Group Outings (Clam-bake, kayaking, ice skating,…)
•Smart, fun, and helpful colleagues in a relaxed atmosphere!