11
DATA SCIENCE AT ZILLOW
The Zestimate® and Beyond
22
Machine Learning vs. Statistics:
Glossary (Rob Tibshirani)
Machine learning                    Statistics
network, graphs                     model
weights                             parameters
learning                            fitting
generalization                      test set performance
supervised learning                 regression/classification
unsupervised learning               density estimation, clustering
large grant = $1,000,000            large grant = $50,000
nice place to have a meeting:       nice place to have a meeting:
Snowbird, Utah, French Alps         Las Vegas in August
33
Decision Tree: Machine Learning vs. Statistics
Machine Learning: Ross Quinlan (1993), Programs for Machine Learning (C4.5)
Statistics: Breiman et al. (1984), Classification and Regression Trees (CART)
44
Zillow Traffic & Usage
• More than 73 million unique users visited Zillow’s mobile apps and websites.
– Source: Internal tracking via Google Analytics, December 2014
• The Yahoo!-Zillow Real Estate Network is the largest real estate network on the
Web.
– Source: comScore Media Metrix Real Estate Category Ranking by Unique Visitors, November
2014, US Data
• Zillow.com is the largest rental site on the Web.
– Source: comScore Media Metrix Real Estate category ranking by Unique Visitors, November
2014, US Data
• 4 out of 5 U.S. homes have been viewed on Zillow.
– Source: Zillow Internal, December 2014
• Zillow has data on more than 110 million U.S. homes.
– Source: Zillow Internal, December 2014
• Zestimates and Rent Zestimates on more than 100 million U.S. homes.
– Source: Zillow Internal, December 2014
55
How Big are Our Data?
Zestimate Scoring Data Size
Homes on Zillow     110 million
Home attributes     103
Double precision    8 bytes
Time series         220 months
Total               ~20 TB (110M × 103 × 220 × 8 bytes)
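The total can be checked with a few lines of arithmetic. This is only a sketch of the slide's own calculation; the variable names are illustrative and a decimal terabyte (10^12 bytes) is assumed.

```python
# Back-of-the-envelope size of the Zestimate scoring data (figures from the slide).
homes = 110_000_000        # homes on Zillow
attributes = 103           # attributes per home
months = 220               # length of the time series
bytes_per_value = 8        # one double-precision value

total_bytes = homes * attributes * months * bytes_per_value
total_tb = total_bytes / 1e12  # assumption: decimal terabytes

print(f"{total_bytes:,} bytes = {total_tb:.1f} TB")
# prints: 19,940,800,000,000 bytes = 19.9 TB
```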
66
How Frequently Do Our Data Change?
77
Agenda
Name                                Topics
Yeng Bun, Sr. Data Scientist        Zestimates and Zillow Home Value Index
Mike Babb, GIS Analyst              Automated Waterfront Determination and Home Street Feature Discovery
Nick McClure, Sr. Data Scientist    Data Cleaning, Fraud Detection, and Address Matching
88
How Do We Do It?
Prototype (interactive mode): Query, Analysis, Modeling, Visualization
Production (batch mode): Database, Query (×3), Train and Score, Combine Data, Models
Production (real-time service): Scoring Engine
Early Day : Java, R & C
Present Day : R & C/C++
Early Day : R
Present Day : R
Future : R & Python Future : TBD
99
Prototype vs. Production
Prototype: turn idea into software quickly
Flexible
• Interactive mode
• Creative
• Experimental
Incomplete software versions
• Proof of concept
• Run on small sample dataset
Production: run the software automatically
Rigid
• Batch and real-time modes
• Repeatability
• Maintainability
Complete software versions
• Error free
• Run on full dataset
1010
Production Deployment
Prototype (sample dataset) → Development (full dataset) → Staging (test site) → Production (live site)
1111
Software Hierarchy
App
• Build on top of a framework.
• Deal with specific details and complexities of the application.
Framework
• Define a standard structure for apps.
• Provide a generic app.
Infrastructure
• Basic services, communication, storage management, version control, etc.
Examples
• Rterm, system(), .Call(), .Fortran(), library(), load(), save(), …
• Rserve, SQL, Git
• EconBot
• ZPL
• One Pagers, CaseShiller Forecast, …
• Zhvi, Zhvi Forecast, Zri, Price/Rent ratio, Export MarketReport Data, …
• Zestimate, ZestimateForecast, RentZestimate, Diagnostics, …
• ZillowRserve
1212
MapReduce vs. ZPL
MapReduce (Hadoop): Input → Head Node → Worker Nodes (Map, Reduce) → Output
ZPL (RDBMS): Input → Head Node → Worker Nodes (zplOnCompute()) → Update Node (zplOnUpdate()) → Output
1313
Data Partitioning and Parallel Computing
Input Database
→ Head Node: TaskQueue of state partitions (AK, AL, AR, AZ, CA, CO, CT, …)
→ Parallel R Worker Nodes: each works on a state (e.g., AL, AR, AZ)
→ Update Node: Combine Data
→ Output Database
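The partition-and-combine pattern can be sketched in a few lines. This is not ZPL or the Parallel R setup from the slide: it is a minimal Python illustration in which a thread pool stands in for the worker nodes, and `partition_by_state`, `score_partition`, `run_batch`, and the flat price-per-square-foot "model" are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition_by_state(homes):
    """Head-node step: group home records by state to form the task queue."""
    partitions = defaultdict(list)
    for home in homes:
        partitions[home["state"]].append(home)
    return dict(partitions)

def score_partition(state, homes):
    """Worker step. Hypothetical model: a flat 200 per square foot."""
    return [(h["id"], state, h["sqft"] * 200.0) for h in homes]

def run_batch(homes, max_workers=4):
    """Fan the state partitions out to workers, then combine the results."""
    partitions = partition_by_state(homes)
    combined = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_partition, state, rows)
                   for state, rows in partitions.items()]
        for future in futures:
            combined.extend(future.result())  # update-node step: combine data
    return sorted(combined)

homes = [
    {"id": 1, "state": "AK", "sqft": 1000},
    {"id": 2, "state": "AL", "sqft": 1500},
    {"id": 3, "state": "AL", "sqft": 2000},
    {"id": 4, "state": "AZ", "sqft": 1200},
]
results = run_batch(homes)
print(results)
```

Partitioning by state keeps each task independent, so only the final combine step needs to write to the output database.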
1414
Zillow RServe
• Binary TCP/IP Servers
– Expose R functions for client apps to call
– Auto load models generated by batch jobs
– Auto cleanup
– Redundancy: multiple ports and multiple boxes
• Clients
– C/C++, Python, R, Java, C#, etc.
– SQL Server
– Web server
– Mobile devices
1515
Performance
Machine: 2.5 GHz, 16 cores, 128 GB RAM
Real-time Zestimates: throughput per connection 12/sec
Real-time RentZestimates: throughput per connection 20/sec
Zestimates: batch mode, 3 times/week, 13 hours per run
RentZestimates: batch mode, weekly, 3 hours per run
Historical Zestimates: 220 monthly data points, 5 days ~ 220*13/24/24 (24 boxes)
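The historical run time follows from the batch figures above. A small sketch of that arithmetic, assuming the 220 monthly runs divide evenly across the 24 boxes (variable names are illustrative):

```python
monthly_points = 220   # historical data points to score
hours_per_run = 13     # one Zestimate batch run
boxes = 24             # machines running in parallel

total_hours = monthly_points * hours_per_run   # 2,860 hours on a single box
days = total_hours / boxes / 24                # spread over boxes, then hours -> days

print(f"{days:.2f} days")
# prints: 4.97 days
```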
1616
A PEEK AT THE
Rent Zestimate®
1717
Rent Zestimate
[Flow diagram of the batch job and scoring engine]
Data (SQL Server): PropertyDimensionEditedFacts, PropertyDimension, PropertyUserDimension, PropertyTaxAssessment, RegionDimension, PropertyDimensionImputedFacts, <ZestimateDate>_ForRentPosting, <ZestimateDate>_ZestSmooth, <ZestimateDate>_RZest
Steps shown: Query, pre-process, Impute, Dynamic Filter, Train County Model (Model B), Train State Model (Model C), Reconcile Edited Facts, Score County Model, Good? (Yes/No), Score State Model
Also shown: Training Data, Scoring Data, Models
1818
Measuring Accuracy
Hold out 30% of the data.
If rz are the Rent Zestimates for homes in the hold-out dataset, then the
percent errors of the estimates are
e = 100*(rz - r)/r
where r are the actual rental listing prices.
Two key metrics:
• median(abs(e))
• percent of estimates within 10% of the rent price: 100*count(abs(e) < 10)/count(e)
Both are grouped by county, metro, state, and nation.
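The two metrics can be computed directly from the definitions. A minimal sketch on made-up hold-out data; the function names and the sample numbers are illustrative, not Zillow's.

```python
from statistics import median

def percent_errors(estimates, actuals):
    """e = 100 * (rz - r) / r for each hold-out home."""
    return [100.0 * (rz - r) / r for rz, r in zip(estimates, actuals)]

def accuracy_metrics(estimates, actuals):
    """Median absolute percent error and share of estimates within 10%."""
    abs_e = [abs(e) for e in percent_errors(estimates, actuals)]
    return {
        "median_abs_pct_error": median(abs_e),
        "pct_within_10": 100.0 * sum(a < 10 for a in abs_e) / len(abs_e),
    }

# Hypothetical hold-out sample: Rent Zestimates vs. actual listing rents.
rz = [1050, 1800, 2600, 900]
r = [1000, 2000, 2500, 1000]

metrics = accuracy_metrics(rz, r)
print(metrics)
# prints: {'median_abs_pct_error': 7.5, 'pct_within_10': 50.0}
```

To report by county, metro, or state, the same function would be applied to each group's hold-out homes separately.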
1919
Rent Zestimate Accuracy: National
2020
Rent Zestimate Accuracy: National
2121
HOUSING MARKET
Zillow Rent Index (ZRI)
2222
Zillow Rent Index (ZRI): Methodology
• Calculate Raw Median Rent Zestimates (ZRI raw)
• Apply Smoothing Filter
• Apply Seasonal Adjustment
• Quality Control
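A rough sketch of the first two steps. The slide does not say which smoothing filter is used, so a trailing moving average stands in for it here; the function names and the data are hypothetical, and seasonal adjustment and quality control are left out.

```python
from statistics import median

def raw_index(estimates_by_month):
    """Raw index: median estimate for each month, in chronological order."""
    return [median(values) for _, values in sorted(estimates_by_month.items())]

def moving_average(series, window=3):
    """Trailing moving average; early points average whatever history exists."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical Rent Zestimates for one region, keyed by month.
rent_zestimates = {
    "2014-10": [1400, 1500, 1600],
    "2014-11": [1450, 1550, 1700],
    "2014-12": [1500, 1600, 1650],
}

zri_raw = raw_index(rent_zestimates)
zri_smooth = moving_average(zri_raw, window=3)
print(zri_raw, zri_smooth)
```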
2323
National
2424
HOUSING MARKET
Zillow Home Value Index (ZHVI)
2525
Zillow Home Value Index (ZHVI): Methodology
• Calculate Raw Median Zestimates (ZHVI raw)
• Apply Systematic Error Correction
• Apply Smoothing Filter
• Apply Seasonal Adjustment
• Quality Control
2626
Zestimate Accuracy : National
2727
National
2828
The Good, The Bad And The Ugly
Ad-hoc, Prototype (interactive mode), Batch mode, Real-time service
ZHVI, Forecast, ZRI, Price/Rent, Diagnostics, …
29
Zillow, Python, R, and GIS
Mike Babb, GIS Analyst
3030
Overview
• GIS@Zillow
– Who we are, how we function within the larger
organization
• Technology Stack
– How we do what we do
• Several examples
– Automated Waterfront Determination
– Home Street Feature Discovery
3131
GIS@Zillow
• Three-person team, somewhat like an in-house GIS consulting shop
– Michalis Avraam, Ph.D.: Lead GIS Analyst
– Mike Babb, Ph.C.: GIS Analyst
– Andrew Smyth: GIS Analyst
• What we do:
– Automating the incorporation of spatially explicit data (spatial
ETL).
– Adjusting boundary geometry (cities, school districts, Zip Codes,
etc.).
– Conflating geospatial data from different vendors into a congruent
product.
– The discovery, creation, and formalization of spatial relationships
into machine-comprehensible data for input into the Zestimation
algorithm.
3232
How we do it
• Most development done in Windows.
• Highly available SQL Server DBs store current and historical
property data.
• 75% Python, 15% R, 5% SQL Server, 5% bash and shell.
• But…
• Proprietary, Linux-only, in-house database used for blazingly
fast in-memory and HTTP lookups.
• Crawl – walk – run.
3333
Tools and libraries
• PYTHON LIBRARIES
• Data Management, Analysis,
and Storage
– multiprocessing
– Pandas
– numpy
– sqlite3
• Spatial Analysis
– ArcPy
– gdal/ogr/osr
– Rtree
– shapely
• R LIBRARIES
• Data Management, Analysis,
and Storage
– data.table
– doSNOW
– RSQLite
• Spatial Analysis
– gpclib
– maptools
– rgdal
– rgeos
– sp
3434
AUTOMATED WATERFRONT
DETERMINATION
3535
Automated Waterfront Determination
• Motivation
– Homes on the waterfront are valued differently than homes not on the waterfront.
– Incorporating measures of waterfront access into our Zestimation algorithm helps
increase the accuracy of our models.
• Needs
– Distinguish between proximity and access.
– Identify properties that are near the waterfront but have intervening properties and
intervening streets.
• Tools
– Most processing done using R and the following geospatial libraries: sp, maptools,
rgdal, and rgeos. Native R objects and data.table objects are used for storage and
data management.
– Multiprocessing techniques were used where possible.
• Technique
– Identify parcels within 250 meters of the shore.
– Use ray tracing to identify intervening features.
– Visualize and check results using ArcMap.
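The ray-tracing step can be illustrated with plain planar geometry. The original work was done in R with sp and rgeos; the sketch below is a simplified Python stand-in in which the shore point for each home is assumed to be known already, and all names and coordinates are hypothetical.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)

def edges(points, closed):
    """Consecutive point pairs; a closed shape also gets the last-to-first edge."""
    pairs = list(zip(points, points[1:]))
    if closed:
        pairs.append((points[-1], points[0]))
    return pairs

def intervening_features(origin, shore_point, features, skip=()):
    """Names of features crossed by the ray from origin to shore_point."""
    hits = []
    for name, (points, closed) in features.items():
        if name in skip:
            continue  # ignore the home's own parcel
        if any(segments_cross(origin, shore_point, a, b)
               for a, b in edges(points, closed)):
            hits.append(name)
    return sorted(hits)

# Hypothetical layout: the shore runs along y = 0.
features = {
    "parcel_A": ([(0, 25), (10, 25), (10, 35), (0, 35)], True),
    "parcel_B": ([(0, 5), (10, 5), (10, 15), (0, 15)], True),
    "parcel_C": ([(20, 5), (30, 5), (30, 15), (20, 15)], True),
    "main_st": ([(-20, 20), (30, 20)], False),
}

# Ray from the centroid of each parcel to the nearest shore point.
blocked_a = intervening_features((5, 30), (5, 0), features, skip={"parcel_A"})
blocked_b = intervening_features((5, 10), (5, 0), features, skip={"parcel_B"})
print(blocked_a, blocked_b)
# prints: ['main_st', 'parcel_B'] []
```

Parcel A is near the water but has a street and another parcel in the way, while parcel B has an unobstructed ray to the shore; that difference is the proximity-versus-access distinction the slide asks for.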
3636
Parcel situation
3737
Parcels within 250 meters
3838
Identify intervening parcels
3939
Identify intervening streets
4040
Final waterfront determination
4141
HOME STREET FEATURE
DISCOVERY
4242
Home Street Feature Discovery
• Motivation
– Homes are on a street network.
– What information about a home can we gather from a home’s relationship to its
street?
• Needs
– The orientation of the home to the street.
– The sequence of homes along a street.
– Various other needs.
• Tools
– Proprietary database used to fuzzy-match a home to a street segment. The database
accepts both same-machine in-memory lookups and HTTP requests.
– Pandas for IO, batching, analysis, and storage.
– ArcPy for prototyping.
– Rtree, shapely, and gdal/ogr/osr for production.
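Once a home is matched to a street segment, two of the needs above reduce to a projection. The sketch below reads "orientation" as which side of the street a home sits on, which is only one possible interpretation; it uses planar coordinates and hypothetical names rather than the Rtree/shapely production code.

```python
import math

def project_onto_segment(point, start, end):
    """Return (distance along the segment, signed offset from it).

    A positive offset means the point lies to the left of the direction
    of travel from start to end.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    px, py = point[0] - start[0], point[1] - start[1]
    along = (px * dx + py * dy) / length
    offset = (dx * py - dy * px) / length
    return along, offset

def describe_homes(homes, start, end):
    """Homes in order along the street, each with its side of the street."""
    rows = []
    for name, point in homes.items():
        along, offset = project_onto_segment(point, start, end)
        rows.append((along, name, "left" if offset > 0 else "right"))
    rows.sort()
    return [(name, side) for _, name, side in rows]

# Hypothetical street running east from (0, 0) to (100, 0).
homes = {"A": (10, 8), "B": (30, -8), "C": (20, 8)}
sequence = describe_homes(homes, (0, 0), (100, 0))
print(sequence)
# prints: [('A', 'left'), ('C', 'left'), ('B', 'right')]
```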
4343
The Laurelhurst Neighborhood
4444
House orientation
4545
Sequence of houses along a street
4646
DATA SCIENCE AT ZILLOW
PART 3:
PYTHON AND GRAPHLAB
Nick McClure, Senior Data Scientist
Zillow
4747
Why Data Modeling?
• Find outliers
• Find bad data
• Database cleaning
• Imputation of missing data
4848
The Role of Python in Data Science
• Increase in computation!
• New algorithms and
complex old ones can be
implemented now, easier
than ever.
• The demand for analytic
talent is insatiable.
• So a versatile, fast, and
easy-to-learn language is
invaluable.
• That language must
double as usable by
developers and by
analysts.
4949
Heavily Used Python Tools in Zillow Data Science
• NumPy – Speeds up computations by pre-allocating sizes of objects.
• Pandas – Creates the familiar ‘data frame’ object and relevant tools.
• Scikit Learn – Easy to use machine learning tools.
• Textmining – Allows analysis of unstructured text fields.
• Pymssql/Pyodbc – Connections to SQL Server.
• SQLite3 – Creation of local databases.
• Graphlab Create – Strikingly fast machine learning, easy and scalable
application creation by Dato (previously Graphlab).
5050
• Dato maintains ‘Graphlab Create’, an open source python package that
allows very easy and scalable applications to be built in simple code.
(Functions should have intelligent defaults!)
5151
Dato Example: Finding Quantiles of Month-over-Month (MoM) Change in
ALL Zestimates
• Example:
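The slide's own GraphLab Create code is not preserved in this transcript. As a stand-in, the idea can be sketched with the Python standard library: compute each home's month-over-month percent change, then take quantiles of the result. The function name and the sample values are hypothetical.

```python
from statistics import quantiles

def mom_changes(previous, current):
    """Percent month-over-month change for homes present in both months."""
    return [100.0 * (current[h] - previous[h]) / previous[h]
            for h in previous if h in current]

# Hypothetical Zestimates keyed by home id for two consecutive months.
prev_month = {1: 200000, 2: 300000, 3: 250000, 4: 400000, 5: 150000}
curr_month = {1: 202000, 2: 297000, 3: 250000, 4: 440000, 5: 151500}

changes = mom_changes(prev_month, curr_month)
quartiles = quantiles(changes, n=4)  # three cut points: Q1, median, Q3
print(changes)
print(quartiles)
```

Homes whose change falls far outside the outer quantiles, such as the +10% jump here, are the ones worth a closer look.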
5252
Dato in action!
5353
Dato: MoM Quantile Results!
5454
Dato and Zestimate MoM Analysis Takeaways
• Dato’s Graphlab Create tool is immensely powerful and easy to use.
• Integration into your AWS is as easy as setting up an environment and
writing a function.
• We now have a tool that can slice and dice the Zestimate and look at all
of our data by any number of factors.
• Fast in comparison to alternatives. (~1-3 hours total)
• Currently setting up this tool to catch problematic Zestimates every
month.
5555
Using Scikit Learn for Fraudulent Listing Detection
Make Me Move Fraud Commercial Listing
5656
Finding Fraud: Methodology
• Every property has lots of information: attributes (bds, bths, …),
address, pricing data, transactional data, account information, and
unstructured text descriptions.
• We create features based on this information.
• Train a gradient-boosted random forest with features on known
fraudulent and non-fraudulent listings.
• Each listing is scored and flagged as fraud when P(fraud) > 0.5.
• Scored data is fed back into the fraud model weekly for retraining.
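A minimal scikit-learn sketch of the train-and-score loop. The slide does not name the exact estimator, so `GradientBoostingClassifier` is an assumption here, and the features are synthetic stand-ins for the listing attributes described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for listing features; roughly 5% of rows are fraud (class 1).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Assumed model choice: gradient-boosted trees with default settings.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Score held-out listings and apply the 0.5 threshold from the slide.
p_fraud = model.predict_proba(X_test)[:, 1]
is_fraud = p_fraud > 0.5
print(f"flagged {is_fraud.sum()} of {len(is_fraud)} listings")
```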
5757
Finding Fraud: Results
• Fraudulent Listing Model
• The current iteration of Fraud detection is 96.9%*
accurate.
• The null prediction benchmark is 96.1% accurate.
* Note that high % accuracy when predicting rare events is expected.
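The null benchmark is the accuracy of always predicting "not fraud", so a 96.1% benchmark implies that about 3.9% of labeled listings are fraudulent. A small sketch with a made-up sample of that composition:

```python
def accuracy(predicted, actual):
    """Percent of predictions that match the actual labels."""
    return 100.0 * sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical labels: 39 fraudulent listings (1) out of 1,000.
actual = [1] * 39 + [0] * 961
null_prediction = [0] * len(actual)   # always predict "not fraud"

null_accuracy = accuracy(null_prediction, actual)
print(f"{null_accuracy:.1f}%")
# prints: 96.1%
```

Against that baseline, 96.9% accuracy is a gain of 0.8 percentage points, which is why accuracy alone says little for rare events.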
5858
Record Linkage: Property Matching
Sources: County Records, User Records, MLS-1, MLS-2
Stages shown: Matching, Data Cleaning, Zestimate
Example: 123 Main St., Bellevue, WA 89555 vs. 123 Main St., Seattle, WA 89555
5959
Property Matching: The Problem
Property 1 vs. Property 2
6060
Property Matching Methodology
New Data
• Split by zip code (Zip code 1, Zip code 2, Zip code 3, …)
• Each zip code is compared against a superset of current properties in that zip code
Matching Algorithm
• knn algorithm (Dato)
• text & numeric features
• feature weights
• distance metrics
Results
• Outputs all new data with most probable matches
• P(match) > constant
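The approach can be sketched as blocking by zip code followed by a nearest-neighbor search over a weighted distance. This is not the Dato knn implementation: the distance function, the weights, the `max_distance` cutoff (standing in for the "P(match) > constant" rule), and the records are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def distance(a, b, w_text=0.7, w_num=0.3):
    """Weighted distance: address dissimilarity plus relative sqft difference."""
    text_d = 1.0 - SequenceMatcher(
        None, a["address"].lower(), b["address"].lower()).ratio()
    num_d = abs(a["sqft"] - b["sqft"]) / max(a["sqft"], b["sqft"])
    return w_text * text_d + w_num * num_d

def best_matches(new_records, current_records, max_distance=0.2):
    """For each new record, the nearest current property in the same zip code."""
    by_zip = {}
    for rec in current_records:
        by_zip.setdefault(rec["zip"], []).append(rec)  # blocking step

    matches = {}
    for new in new_records:
        candidates = by_zip.get(new["zip"], [])
        if not candidates:
            matches[new["id"]] = None
            continue
        nearest = min(candidates, key=lambda c: distance(new, c))
        close_enough = distance(new, nearest) <= max_distance
        matches[new["id"]] = nearest["id"] if close_enough else None
    return matches

# Hypothetical records.
current = [
    {"id": "P1", "zip": "98105", "address": "123 Main St", "sqft": 1500},
    {"id": "P2", "zip": "98105", "address": "456 Oak Ave", "sqft": 2200},
    {"id": "P3", "zip": "98004", "address": "123 Main St", "sqft": 1500},
]
new = [
    {"id": "N1", "zip": "98105", "address": "123 Main Street", "sqft": 1510},
    {"id": "N2", "zip": "98105", "address": "9 Pine Ct", "sqft": 900},
    {"id": "N3", "zip": "99999", "address": "1 Elm Rd", "sqft": 1000},
]

matches = best_matches(new, current)
print(matches)
# prints: {'N1': 'P1', 'N2': None, 'N3': None}
```

Restricting comparisons to one zip code at a time is what keeps the number of candidate pairs manageable; unmatched records such as N2 and N3 are the new lots and incomplete addresses left for review.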
6161
Property Matching: Results
• Speed:
– Unmatched LA County Records: ~150k
– LA Database Records: ~2 Million
– Naively, at 100k comparisons per second: ~33 days of computing time.
– Graphlab’s shortcuts reduce the time to 20-40 minutes (8-core personal desktop).
• Previous match rates were around 65-95%.
• Test cases so far have resulted in new match rates of >97%.
• What are we missing? New lots, construction, incomplete addresses.
6262
Current Openings at Zillow!
(www.Zillow.com/jobs)
• Software Development Engineer, Machine Learning
• Senior Business Intelligence Developer
• Manager, Business Analytics
• Program Manager, Enterprise Data Warehouse
• Data Analyst, Listing and Data Quality
• Reporting Analyst
• Data Quality Control Specialist
• Senior Software Development Engineer
• Software Development Engineer
• Hardware/Datacenter Technician
• Cloud Architect
• Associate Software Test Engineer
• Software Development Manager
• Full Stack Software Dev. Engineer
• Search Infrastructure Engineer
• Senior Database Developer
• SOC Engineer
• UX Developer
Jobs within Zillow
Analytics
Other SDE/IT Jobs
within Zillow
6363
Fun Benefits of Working at Zillow
• Great benefits (gym access, matched 401k, medical, dental, vision, …)
• Free Fitbit!
• Monthly zSpeakers.
– Previous speakers: Arianna Huffington, Mexican Pres. Vicente Fox, Seahawks Defensive End Cliff Avril, Joel Spolsky, …
• Free snacks, drinks, candy, …
• Treadmill rooms, ping pong, shuffleboard, game room, …
• Free Orca card.
• Bi-annual Hackweek.
• Quarterly Group Outings (Clam-bake, kayaking, ice skating, …)
• Smart, fun, and helpful colleagues in a relaxed atmosphere!
6464
Thank You! Questions?

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Último (20)

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Data Science At Zillow

  • 1. DATA SCIENCE AT ZILLOW: The Zestimate® and Beyond
  • 2. Machine Learning vs. Statistics: Glossary (Rob Tibshirani). Machine learning | Statistics: network, graphs | model; weights | parameters; learning | fitting; generalization | test set performance; supervised learning | regression/classification; unsupervised learning | density estimation, clustering; large grant = $1,000,000 | large grant = $50,000; nice place to have a meeting: Snowbird, Utah, French Alps | nice place to have a meeting: Las Vegas in August
  • 3. Decision Tree: Machine Learning vs. Statistics. Ross Quinlan (1993): Programs for Machine Learning (C4.5); Breiman et al. (1984): Classification and Regression Trees (CART)
  • 4. Zillow Traffic & Usage • More than 73 million unique users visited Zillow’s mobile apps and websites. – Source: Internal tracking via Google Analytics, December 2014 • The Yahoo!-Zillow Real Estate Network is the largest real estate network on the Web. – Source: comScore Media Metrix Real Estate Category Ranking by Unique Visitors, November 2014, US Data • Zillow.com is the largest rental site on the Web. – Source: comScore Media Metrix Real Estate category ranking by Unique Visitors, November 2014, US Data • 4 out of 5 U.S. homes have been viewed on Zillow. – Source: Zillow Internal, December 2014 • Zillow has data on more than 110 million U.S. homes. – Source: Zillow Internal, December 2014 • Zestimates and Rent Zestimates on more than 100 million U.S. homes. – Source: Zillow Internal, December 2014
  • 5. How Big Are Our Data? Zestimate scoring data size. Homes on Zillow: 110 million; Home attributes: 103; Double precision: 8 bytes; Time series: 220 months; Total: ~20 TB (110M*103*220*8 bytes)
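The total on the slide above is plain multiplication, and a short check makes the units explicit. This is only a sketch of that arithmetic; the variable names are illustrative, and it assumes decimal terabytes (1 TB = 10**12 bytes) with every attribute stored as an 8-byte double.

```python
# Back-of-the-envelope size of the Zestimate scoring data (figures from the slide).
homes = 110_000_000      # homes on Zillow
attributes = 103         # attributes per home
months = 220             # monthly time-series points
bytes_per_value = 8      # one double-precision float

total_bytes = homes * attributes * months * bytes_per_value
total_tb = total_bytes / 1e12  # assumption: decimal terabytes

print(f"{total_bytes:,} bytes is about {total_tb:.1f} TB")
```

The product comes to just under 20 TB, which matches the rounded figure on the slide.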
  • 6. How Frequently Do Our Data Change?
  • 7. Agenda (name: topics). Yeng Bun, Sr. Data Scientist: Zestimates and Zillow Home Value Index; Mike Babb, GIS Analyst: Automated Waterfront Determination and Home Street Feature Discovery; Nick McClure, Sr. Data Scientist: Data Cleaning, Fraud Detection, and Address Matching
  • 8. How Do We Do It? Prototype (Interactive mode): Query, Analysis, Modeling, Visualization; Database; Query, Query, Query; Train and Score; Combine Data; Production (Batch mode); Models; Production (Real-time service); Scoring Engine; Early Day: Java, R & C; Present Day: R & C/C++; Early Day: R; Present Day: R; Future: R & Python; Future: TBD
  • 9. Prototype vs. Production. Prototype: turn idea into software quickly. Flexible • Interactive mode • Creative • Experimental. Incomplete software versions • Proof of concept • Run on small sample dataset. Production: run the software automatically. Rigid • Batch and real-time modes • Repeatability • Maintainability. Complete software versions • Error free • Run on full dataset
  • 11. Software Hierarchy: App, Framework, Infrastructure • Define a standard structure for apps. • Provide a generic app. • Build on top of a framework. • Deal with specific details and complexities of the application. • Basic services, communication, storage management, version control, etc. • Rterm, system(), .Call(), .Fortran(), library(), load(), save(), … • Rserve, SQL, Git • EconBot • ZPL • One Pagers, CaseShiller Forecast, … • Zhvi, Zhvi Forecast, Zri, Price/Rent ratio, Export MarketReport Data, … • Zestimate, ZestimateForecast, RentZestimate, Diagnostics, … • ZillowRserve
  • 12. MapReduce vs. ZPL. MapReduce: Input, Head Node, Output, Worker Node 2, Map, Reduce. ZPL: Input, Head Node, Output, Worker Node 2, zplOnCompute(), zplOnUpdate(), Update Node. Hadoop; RDBMS
  • 13. Data Partitioning and Parallel Computing: AK AL AR AZ CA CO CT …; Head Node; TaskQueue; Parallel R; Worker Nodes (AL, AR, AZ); Input Database; Update Node; Combine Data; Output Database
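The partition-by-state pattern on the slide above can be sketched in a few lines. The deck describes Parallel R worker nodes fed from a task queue; the Python below is only an illustrative stand-in. The sample rows, the per-state median used as the "work", and the thread pool (used so the sketch runs anywhere, where real workers would be separate processes or machines) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

# Hypothetical input rows (state, value), standing in for the input database.
rows = [("AK", 250.0), ("AL", 150.0), ("AL", 170.0), ("AZ", 300.0),
        ("AK", 270.0), ("AZ", 310.0), ("AZ", 290.0)]

def partition_by_state(rows):
    """Head node step: split the input into one task per state."""
    parts = {}
    for state, value in rows:
        parts.setdefault(state, []).append(value)
    return parts

def score_partition(task):
    """Worker step: process one state's data independently of the others."""
    state, values = task
    return state, median(values)

def run(rows, workers=3):
    """Queue one task per state, run them in parallel, and combine the results."""
    tasks = sorted(partition_by_state(rows).items())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score_partition, tasks))

results = run(rows)
print(results)
```

Because each state is processed without reference to any other, adding workers shortens the run until the largest single state becomes the bottleneck.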
  • 14. Zillow RServe • Binary TCP/IP Servers – Expose R functions for client apps to call – Auto load models generated by batch jobs – Auto cleanup – Redundancy: multiple ports and multiple boxes • Clients – C/C++, Python, R, Java, C#, etc. – SQL Server – Web server – Mobile devices
  • 15. Performance. Machine: 2.5 GHz, 16 cores, 128 GB RAM; Real-Time Zestimates: throughput/connection 12/sec; Real-Time RentZestimates: throughput/connection 20/sec; Zestimates (3 times/week, run in batch mode): 13 hours; RentZestimates (weekly, run in batch mode): 3 hours; Historical Zestimates (220 monthly data points): 5 days ~ 220*13/24/24 (24 boxes)
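The historical-run estimate on the slide above is the expression 220*13/24/24. The sketch below spells out one reading of it: 220 monthly runs at 13 hours each, spread evenly over 24 boxes, then converted from hours to days. The even split across boxes is an assumption, not something the slide states.

```python
# Figures from the slide; variable names are illustrative.
monthly_points = 220   # historical monthly data points to recompute
hours_per_run = 13     # one full Zestimate batch run
boxes = 24             # machines working in parallel

total_hours = monthly_points * hours_per_run   # serial compute time
wall_clock_days = total_hours / boxes / 24     # assumes an even split across boxes

print(f"{total_hours} machine-hours, about {wall_clock_days:.2f} days on {boxes} boxes")
```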
  • 16. A PEEK: Rent Zestimate®
  • 17. Rent Zestimate SCORING ENGINE: Yes, No; Dynamic Filter; Train County Model, Model (B); Train State Model, Model (C); Reconcile Edited Facts; Score County Model; Model (B), Model (C); Good?; Score State Model; Models; PropertyDimensionEditedFacts, PropertyDimension, PropertyUserDimension, PropertyTaxAssessment, RegionDimension; <ZestimateDate>_ForRentPosting; <ZestimateDate>_RZest; Query, Query; pre-process; SQL Server; Query, Query; Impute; PropertyDimensionImputedFacts; Scoring Data; Training Data; <ZestimateDate>_ZestSmooth; Batch job
  • 18. Measuring Accuracy. Hold out 30% of the data. If rz are the Rent Zestimates for homes in the hold-out dataset, then the percent errors are e = 100*(rz – r)/r, where r are the actual rental listing prices. Two key metrics: • median(abs(e)) • percent of estimates within 10% of rent price: 100*count(abs(e)<10)/count(e). Both are grouped by county, metro, state, and nation.
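The two accuracy metrics on the slide above translate directly into code. This is a minimal sketch using the slide's formulas; the function name and the four sample rents are made up for illustration, and the per-county, metro, and state grouping is left out.

```python
from statistics import median

def rent_accuracy(rz, r, tolerance=10.0):
    """Return (median absolute percent error, percent of estimates within tolerance)."""
    # e = 100 * (rz - r) / r for each home in the hold-out set
    errors = [100.0 * (est - actual) / actual for est, actual in zip(rz, r)]
    abs_errors = [abs(e) for e in errors]
    pct_within = 100.0 * sum(a < tolerance for a in abs_errors) / len(abs_errors)
    return median(abs_errors), pct_within

# Hypothetical hold-out data: estimated rents and actual listing prices.
rz = [1050.0, 1900.0, 1500.0, 2600.0]
r = [1000.0, 2000.0, 1500.0, 2000.0]

median_abs_error, pct_within_10 = rent_accuracy(rz, r)
print(median_abs_error, pct_within_10)
```

The median is robust to the one large miss in the sample (30%), while the within-10% share shows how often an estimate lands close to the listing price.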
  • 22. Zillow Rent Index (ZRI): Methodology • Calculate Raw Median Rent Zestimates (ZRI raw) • Apply Smoothing Filter • Apply Seasonal Adjustment • Quality Control
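The first two steps of the methodology above can be illustrated with a toy series. The slide does not say which smoothing filter is used, so the trailing three-month moving average below is purely an assumption chosen for simplicity; seasonal adjustment and quality control are omitted, and the monthly rent samples are invented.

```python
from statistics import mean, median

def raw_median_index(values_by_month):
    """Step 1: the raw index is the median estimate in each month."""
    return [median(values) for values in values_by_month]

def moving_average(series, window=3):
    """Step 2 (assumed filter): trailing moving average, shorter at the start."""
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        smoothed.append(mean(series[start:i + 1]))
    return smoothed

# Hypothetical Rent Zestimates for four consecutive months in one region.
monthly = [[1000, 1200, 1100], [1150, 1250, 1050],
           [1300, 1200, 1250], [1100, 1400, 1300]]

raw = raw_median_index(monthly)
smooth = moving_average(raw)
print(raw, smooth)
```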
  • 24. HOUSING MARKET: Zillow Home Value Index (ZHVI)
  • 25. Zillow Home Value Index (ZHVI): Methodology • Calculate Raw Median Zestimates (ZHVI raw) • Apply Systematic Error Correction • Apply Smoothing Filter • Apply Seasonal Adjustment • Quality Control
  • 28. The Good, The Bad And The Ugly: Ad-hoc; Prototype; Interactive mode; Batch mode; Real-time service; ZHVI, Forecast, ZRI, Price/Rent, Diagnostics, …
  • 29. Zillow, Python, R, and GIS. Mike Babb, GIS Analyst
  • 30. Overview • GIS@Zillow – Who we are, how we function within the larger organization • Technology Stack – How we do what we do • Several examples – Automated Waterfront Determination – Home Street Feature Discovery
  • 31. GIS@Zillow • Three person team somewhat like an in-house GIS consulting shop – Michalis Avraam, Ph.D.: Lead GIS Analyst – Mike Babb, Ph.C.: GIS Analyst – Andrew Smyth: GIS Analyst • What we do: – Automating the incorporation of spatially explicit data (spatial ETL). – Adjusting boundary geometry (cities, school districts, Zip Codes, etc.). – Conflating geospatial data from different vendors into a congruent product. – The discovery, creation, and formalization of spatial relationships into machine-comprehensible data for input into the Zestimation algorithm.
  • 32. How we do it • Most development done in Windows. • Highly available SQL Server DBs store current and historical property data. • 75% Python, 15% R, 5% SQL Server, 5% bash and shell. • But… • Proprietary Linux-only in-house database used for blazingly fast in-memory and http look up. • Crawl – walk – run.
  • 33. Tools and libraries • PYTHON LIBRARIES • Data Management, Analysis, and Storage – multiprocessing – Pandas – numpy – sqlite3 • Spatial Analysis – ArcPy – gdal/ogr/osr – Rtree – shapely • R LIBRARIES • Data Management, Analysis, and Storage – data.table – doSNOW – RSQLite • Spatial Analysis – gpclib – maptools – rgdal – rgeos – sp
  • 35. Automated Waterfront Determination • Motivation – Homes on the waterfront are valued differently than homes not on the waterfront. – Incorporating measures of waterfront access into our Zestimation algorithm helps increase the accuracy of our models. • Needs – Distinguish between proximity and access. – Identify properties that are near the waterfront but have intervening properties and intervening streets. • Tools – Most processing done using R and the following geospatial libraries: sp, maptools, rgdal, and rgeos. Native R objects and data.table objects are used for storage and data management. – Multiprocessing techniques were used where possible. • Technique – Identify parcels within 250 meters of the shore. – Use ray tracing to identify intervening features. – Visualize and check results using ArcMap.
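The distinction between proximity and access in the slide above can be shown with a toy geometry example. The deck says the real work was done in R with sp and rgeos; the pure-Python sketch below is only an illustration of the idea. It assumes a straight shoreline along y = 0, rectangular parcels, distance measured from the parcel centroid, and a single ray cast from each centroid straight to the shore, where a production version would use real geometries and many rays.

```python
MAX_SHORE_DISTANCE_M = 250.0  # proximity threshold from the slide

def orient(a, b, c):
    """Signed area test: positive if c lies left of a->b, negative if right."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    """True when segment p1-p2 strictly crosses segment q1-q2."""
    return (orient(q1, q2, p1) * orient(q1, q2, p2) < 0
            and orient(p1, p2, q1) * orient(p1, p2, q2) < 0)

def centroid(polygon):
    """Vertex average; adequate for the simple rectangles used here."""
    xs, ys = zip(*polygon)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def classify(parcels):
    """Label each parcel, assuming the shoreline is the line y = 0."""
    labels = {}
    for name, polygon in parcels.items():
        cx, cy = centroid(polygon)
        if cy > MAX_SHORE_DISTANCE_M:
            labels[name] = "not near shore"
            continue
        ray_start, ray_end = (cx, cy), (cx, 0.0)
        # The ray is blocked if it crosses any edge of any other parcel.
        blocked = any(
            segments_cross(ray_start, ray_end, other[i], other[(i + 1) % len(other)])
            for other_name, other in parcels.items() if other_name != name
            for i in range(len(other))
        )
        labels[name] = "near shore, blocked" if blocked else "waterfront access"
    return labels

def rect(x0, y0, x1, y1):
    """Axis-aligned rectangle as a list of corner points."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

# Hypothetical parcels: A fronts the water, B sits behind A, C is far inland.
parcels = {"A": rect(0, 1, 10, 11), "B": rect(0, 20, 10, 30), "C": rect(0, 300, 10, 310)}
labels = classify(parcels)
print(labels)
```

Parcel B is well within 250 meters of the shore, yet its ray passes through parcel A, which is exactly the near-but-no-access case the slide calls out.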
  • 42. Home Street Feature Discovery • Motivation – Homes are on a street network. – What information about a home can we gather from a home’s relationship to its street? • Needs – The orientation of the home to the street. – The sequence of homes along a street. – Various other needs. • Tools – Proprietary database used to fuzzy-match a home to a street segment. Database accepts both same-machine in-memory lookup and http requests. – Pandas for IO, batching, analysis, and storage. – ArcPy for prototyping. – Rtree, shapely, and gdal/ogr/osr for production.
  • 45. Sequence of houses along a street
  • 46. DATA SCIENCE AT ZILLOW PART 3: PYTHON AND GRAPHLAB. Nick McClure, Senior Data Scientist, Zillow
  • 47. Why Data Modeling? • Find outliers • Find bad data • Database cleaning • Imputation of missing data
  • 48. The Role of Python in Data Science • Increase in computation! • New algorithms and complex old ones can be implemented now, easier than ever. • The demand for analytic talent is insatiable. • So a versatile, fast, and easy-to-learn language is invaluable. • That language must double as usable by developers and by analysts.
  • 49. Heavily Used Python Tools in Zillow Data Science • NumPy – Speeds up computations by pre-allocating sizes of objects. • Pandas – Creates the familiar ‘data frame’ object and relevant tools. • Scikit Learn – Easy to use machine learning tools. • Textmining – Allows analysis of unstructured text fields. • Pymssql/Pyodbc – Connections to SQL Server. • SQLite3 – Creation of local databases. • Graphlab Create – Strikingly fast machine learning, easy and scalable application creation by Dato (previously Graphlab).
  • 50. Dato maintains ‘Graphlab Create’, an open source python package that allows very easy and scalable applications to be built in simple code. (Functions should have intelligent defaults!)
  • 51. Dato Example: Finding Quantiles of MOM Change in ALL Zestimates • Example:
  • 54. Dato and Zestimate MoM Analysis Takeaways • Dato’s Graphlab Create tool is immensely powerful and easy to use. • Integration into your AWS is as easy as setting up an environment and writing a function. • We now have a tool that can slice and dice the Zestimate and look at all of our data by any number of factors. • Fast in comparison to alternatives. (~1-3 hours total) • Currently setting up this tool to catch problematic Zestimates every month.
  • 55. Using Scikit Learn for Fraudulent Listing Detection: Make Me Move Fraud; Commercial Listing
  • 56. Finding Fraud: Methodology • Every property has lots of information: attributes (bds, bths, …), address, pricing data, transactional data, account information, and unstructured text descriptions. • We create features based on this information. • Train a gradient-boosted random forest with features on known fraudulent and non-fraudulent listings. • Output is scored as actual fraud or not (fraud = P(fraud) > 0.5). • Scored data is fed back into the fraud model weekly for training.
  • 57. Finding Fraud: Results • Fraudulent Listing Model • The current iteration of fraud detection is 96.9%* accurate. • The null prediction benchmark is 96.1% accurate. * Note that high % accuracy when predicting rare events is expected.
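The footnote on the slide above, that high accuracy is expected for rare events, is easy to see with a small example. The sketch below builds a hypothetical set of 1,000 listings in which 3.9% are fraudulent, so that always predicting "not fraud" scores the slide's 96.1%. The model's individual predictions (10 frauds caught, 2 false alarms) are invented solely to reproduce the 96.9% figure and say nothing about Zillow's actual confusion matrix.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 39 fraudulent (1) and 961 legitimate (0) listings.
y_true = [1] * 39 + [0] * 961

# Null prediction: call every listing legitimate.
null_pred = [0] * len(y_true)

# Invented model output: catches 10 of 39 frauds and raises 2 false alarms.
model_pred = [1] * 10 + [0] * 29 + [1] * 2 + [0] * 959

null_acc = accuracy(y_true, null_pred)
model_acc = accuracy(y_true, model_pred)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, model_pred)) / sum(y_true)

print(f"null {null_acc:.3f}, model {model_acc:.3f}, recall {recall:.3f}")
```

In this invented case the model beats the null benchmark by under one point of accuracy while still missing most frauds, which is why accuracy alone says little about a rare-event classifier.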
  • 58. Record Linkage: Property Matching. Zestimate Data Cleaning; sources: County Records, User Records, MLS-1, MLS-2; Matching: 123 Main St. Bellevue, WA 89555 vs. 123 Main St. Seattle, WA 89555
  • 59. Property Matching: The Problem (Property 1, Property 2)
  • 60. Property Matching Methodology. New Data: Zip code 1, Zip code 2, Zip code 3, … Superset of current properties in zip code 1. Matching Algorithm – knn algorithm (Dato) – text & numeric features – feature weights – distance metrics. Results • Outputs all new data with most probable matches • P(match) > constant
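The matching flow on the slide above, splitting by zip code, finding the most probable match, and keeping it only above a cutoff, can be sketched with the standard library. The deck uses Dato's kNN with weighted text and numeric features; the version below swaps in `difflib.SequenceMatcher` on the address string as a stand-in similarity, and the records, property IDs, and 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in text similarity in [0, 1]; not the distance metric used in the deck."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_records(new_records, existing_by_zip, threshold=0.8):
    """For each new record, return the best candidate in the same zip code, or None."""
    matches = {}
    for rec_id, (zip_code, address) in new_records.items():
        # Blocking: only compare against properties in the same zip code.
        candidates = existing_by_zip.get(zip_code, [])
        best = max(candidates, key=lambda c: similarity(address, c[1]), default=None)
        if best is not None and similarity(address, best[1]) >= threshold:
            matches[rec_id] = best[0]
        else:
            matches[rec_id] = None  # below the cutoff: treat as unmatched
    return matches

# Hypothetical existing properties, keyed by zip code: (property id, address).
existing_by_zip = {
    "98101": [("p1", "123 Main St"), ("p2", "456 Oak Ave")],
    "98004": [("p3", "123 Main St")],
}
# Hypothetical incoming records: record id -> (zip code, address).
new_records = {
    "n1": ("98101", "123 Main Street"),
    "n2": ("98004", "123 Main St."),
    "n3": ("98101", "9 Elm Ct"),
}

matches = match_records(new_records, existing_by_zip)
print(matches)
```

Restricting comparisons to one zip code at a time is what keeps the number of pairwise comparisons manageable; the two identical "123 Main St" addresses are kept apart because they sit in different blocks.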
  • 61. Property Matching: Results • Speed: – Unmatched LA County Records: ~150k – LA Database Records: ~2 Million – Naively at 100k comparisons a second = ~33 days of computing time. – Graphlab’s shortcuts reduce time to 20-40 minutes!!! (8 core personal desktop) • Previous match rates were around 65-95%. • Test cases so far have resulted in new match rates of >97%. • What are we missing? New lots, construction, incomplete addresses.
  • 62. Current Openings at Zillow! (www.Zillow.com/jobs) • Software Development Engineer, Machine Learning • Senior Business Intelligence Developer • Manager, Business Analytics • Program Manager, Enterprise Data Warehouse • Data Analyst, Listing and Data Quality • Reporting Analyst • Data Quality Control Specialist • Senior Software Development Engineer • Software Development Engineer • Hardware/Datacenter Technician • Cloud Architect • Associate Software Test Engineer • Software Development Manager • Full Stack Software Dev. Engineer • Search Infrastructure Engineer • Senior Database Developer • SOC Engineer • UX Developer. Groupings: Jobs within Zillow Analytics; Other SDE/IT Jobs within Zillow
  • 63. Fun Benefits of Working at Zillow • Great benefits (gym access, matched 401k, medical, dental, vision, …) • Free Fitbit! • Monthly zSpeakers. – Previous speakers: Arianna Huffington, Mexican Pres. Vicente Fox, Seahawks Defensive End Cliff Avril, Joel Spolsky, … • Free snacks, drinks, candy, … • Treadmill rooms, ping pong, shuffleboard, game room, … • Free Orca card. • Bi-annual Hackweek. • Quarterly Group Outings (clam-bake, kayaking, ice skating, …) • Smart, fun, and helpful colleagues in a relaxed atmosphere!