14. Getting the content
The
Six
Degrees
Hypothesis
Experienced
It
Is
When
You
Travel
15. Building a Word Matrix
Words: The, Six, Degrees, Hypothesis, Experienced, It, Is, When, You, Travel
Word         Count
Six          3
Degrees      3
Hypothesis   1
Experienced  5
Travel       6
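A minimal sketch of how a word matrix like this might be built, assuming the text of each blog is already available as a plain string. The blog names, texts, and vocabulary below are hypothetical placeholders, not data from the talk.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_word_matrix(docs, vocabulary):
    """Return one row of counts per document, one column per vocabulary word."""
    matrix = {}
    for name, text in docs.items():
        counts = word_counts(text)
        matrix[name] = [counts[word] for word in vocabulary]
    return matrix


# Hypothetical input: blog name -> text of its posts.
docs = {
    "blog_a": "Six degrees of travel. Travel is the six degrees hypothesis.",
    "blog_b": "Experienced travel writers travel when you travel.",
}
vocabulary = ["six", "degrees", "hypothesis", "experienced", "travel"]
matrix = build_word_matrix(docs, vocabulary)
```

Each row of `matrix` is then a vector that can be compared against the rows of other blogs.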
30. K-Means Clustering
Divides data into distinct clusters
User determines how many
Algorithm
Start with arbitrary centroids
Assign points to centroids
Move the centroids
Repeat
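The four steps above can be sketched roughly as follows. This is an illustrative version, not the code used in the talk: it assumes Euclidean distance, picks k random data points as the arbitrary starting centroids, and stops when assignments no longer change.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Start with arbitrary centroids (here: k randomly chosen data points).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clusters are stable
        assignments = new_assignments
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical 2D data with two obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
assignments, centroids = kmeans(points, k=2)
```

The user still chooses `k`; the algorithm only decides which points end up together.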
36. K-Means Results
Cluster 1                    Cluster 2
The Viral Garden             Wonkette
Copyblogger                  Gawker
Creating Passionate Users    Gothamist
Oilman                       Huffington Post
ProBlogger Blog Tips
Seth's Blog
37. 2D Visualizations
Instead of Clusters, a 2D Map
Goals
Preserve distances as much as possible
Draw in two dimensions
Dimension Reduction
Principal Components Analysis
Multidimensional Scaling
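One way to picture multidimensional scaling is as a search for 2D positions whose pairwise distances match the original distances. The sketch below is a simple gradient-descent variant under assumed settings (learning rate, iteration count, random start); the function names are hypothetical and it is not necessarily the method used for the talk's maps.

```python
import math
import random


def pairwise_distances(points):
    """Distance matrix for points in any number of dimensions."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def stress(target, coords):
    """Sum of squared differences between target and 2D distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )


def scale_down(target, rate=0.01, iterations=1000, seed=0):
    """Return (2D coordinates, stress) for the best layout found."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.random(), rng.random()] for _ in range(n)]
    best, best_err = [c[:] for c in coords], stress(target, coords)
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-9
                err = d - target[i][j]  # positive: too far apart in 2D
                grad[i][0] += err * (coords[i][0] - coords[j][0]) / d
                grad[i][1] += err * (coords[i][1] - coords[j][1]) / d
        for i in range(n):
            coords[i][0] -= rate * grad[i][0]
            coords[i][1] -= rate * grad[i][1]
        err = stress(target, coords)
        if err < best_err:  # keep the lowest-stress layout seen so far
            best, best_err = [c[:] for c in coords], err
    return best, best_err


# Hypothetical 3D points to flatten into two dimensions.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
target = pairwise_distances(points)
coords, err = scale_down(target)
```

The remaining stress indicates how much distance information was lost by drawing in two dimensions.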
47. The Zillow API
Allows querying by address
Returns information about the property
Bedrooms
Bathrooms
Zip Code
Price Estimate
Last Sale Price
48. A home price dataset
House  Zip    Bathrooms  Bedrooms  Built  Type     Price
A      02138  1.5        2         1847   Single   505296
B      02139  3.5        9         1916   Triplex  776378
C      02140  3.5        4         1894   Duplex   595027
D      02139  2.5        4         1854   Duplex   552213
E      02138  3.5        5         1909   Duplex   947528
F      02138  3.5        4         1930   Single   2107871
etc.
49. What can we learn?
The price of a made-up house
How important is Zip Code?
What are the important attributes?
Can we do better than averages?
50. Introducing Regression Trees
A B Value
10 Circle 20
11 Square 22
22 Square 8
18 Circle 6
52. Minimizing deviation
Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most
Initially
A   B       Value
10  Circle  20
11  Square  22
22  Square  8
18  Circle  6
Average = 14
Standard Deviation = 8.2
53. Minimizing deviation
B = Circle
A   B       Value
10  Circle  20
18  Circle  6
Average = 13
Standard Deviation = 9.9
B = Square
A   B       Value
11  Square  22
22  Square  8
Average = 15
Standard Deviation = 9.9
54. Minimizing deviation
A > 18
A   B       Value
22  Square  8
Average = 8
Standard Deviation = 0
A <= 18
A   B       Value
10  Circle  20
11  Square  22
18  Circle  6
Average = 16
Standard Deviation = 8.7
55. Minimizing deviation
A > 11
A   B       Value
22  Square  8
18  Circle  6
Average = 7
Standard Deviation = 1.4
A <= 11
A   B       Value
10  Circle  20
11  Square  22
Average = 21
Standard Deviation = 1.4
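The search over divisions can be sketched as below, using the four-row example table. Two details are assumptions rather than something the slides state: the figures shown match the sample standard deviation (n - 1), and the two sides of a division are combined here as a size-weighted average.

```python
from statistics import mean, stdev

ROWS = [  # (A, B, Value) from the example table
    (10, "Circle", 20),
    (11, "Square", 22),
    (22, "Square", 8),
    (18, "Circle", 6),
]


def spread(values):
    """Sample standard deviation; a single value has no spread."""
    return stdev(values) if len(values) > 1 else 0.0


def candidate_splits(rows):
    """Yield (label, test): A > threshold for numbers, B = category."""
    for a in sorted({r[0] for r in rows})[:-1]:
        yield f"A > {a}", (lambda r, a=a: r[0] > a)
    for b in sorted({r[1] for r in rows}):
        yield f"B = {b}", (lambda r, b=b: r[1] == b)


def best_split(rows):
    """Return (score, label, yes_values, no_values) with the lowest score."""
    best = None
    for label, test in candidate_splits(rows):
        yes = [r[2] for r in rows if test(r)]
        no = [r[2] for r in rows if not test(r)]
        if not yes or not no:
            continue  # not a real division
        # Assumed scoring: size-weighted average of the two spreads.
        score = (len(yes) * spread(yes) + len(no) * spread(no)) / len(rows)
        if best is None or score < best[0]:
            best = (score, label, yes, no)
    return best


overall = spread([r[2] for r in ROWS])
score, label, yes, no = best_split(ROWS)
print(f"before: {overall:.1f}  best: {label}  after: {score:.1f}")
```

On this table the division A > 11 leaves the least spread, which agrees with the figures above.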
56. CART Algorithm
A B Value
10 Circle 20
11 Square 22
22 Square 8
18 Circle 6
58. CART Algorithm
A <= 11          A > 11
10 Circle 20     22 Square 8
11 Square 22     18 Circle 6
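Repeating the same division on each resulting group gives a tree. The sketch below is an illustrative regression-tree builder, not a full CART implementation: the stopping rule (`min_rows`), the weighted-spread score, and all function names are assumptions made for the example.

```python
from statistics import mean, stdev

ROWS = [  # (A, B, Value) from the example table
    (10, "Circle", 20),
    (11, "Square", 22),
    (22, "Square", 8),
    (18, "Circle", 6),
]


def spread(values):
    """Sample standard deviation; a single value has no spread."""
    return stdev(values) if len(values) > 1 else 0.0


def splits(rows):
    """Candidate tests on numeric column A and categorical column B."""
    for a in sorted({r[0] for r in rows})[:-1]:
        yield f"A > {a}", (lambda r, a=a: r[0] > a)
    for b in sorted({r[1] for r in rows}):
        yield f"B = {b}", (lambda r, b=b: r[1] == b)


def build_tree(rows, min_rows=2):
    """Recursively divide rows; each leaf predicts the mean of its values."""
    values = [r[2] for r in rows]
    node = {"prediction": mean(values), "n": len(rows)}
    if len(rows) <= min_rows:
        return node  # assumed stopping rule: small groups become leaves
    best = None
    for label, test in splits(rows):
        yes = [r for r in rows if test(r)]
        no = [r for r in rows if not test(r)]
        if not yes or not no:
            continue
        score = (len(yes) * spread([r[2] for r in yes])
                 + len(no) * spread([r[2] for r in no])) / len(rows)
        if best is None or score < best[0]:
            best = (score, label, test, yes, no)
    if best is not None and best[0] < spread(values):
        _, label, test, yes, no = best
        node.update(split=label, test=test,
                    yes=build_tree(yes, min_rows), no=build_tree(no, min_rows))
    return node


def predict(node, row):
    """Follow the tests down to a leaf and return its mean value."""
    while "split" in node:
        node = node["yes"] if node["test"](row) else node["no"]
    return node["prediction"]


tree = build_tree(ROWS)
```

With `min_rows=2` the tree stops after the first division, so a new row is predicted as the average of whichever two-row group it falls into.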
64. Supervised and Unsupervised
Clustering methods are unsupervised
There are no answers
Methods just characterize the data
Show interesting patterns
Regression Trees are supervised
“answers” are in the dataset
Tree models predict answers
67. Bayesian filter
If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me.
Please email me back.
I'm a professional with a grad school degree who has a sense of humor and loves the Sox.
Boston
Sox           0.4
Red           0.35
Grad          0.2
Professional  0.1
Humor         0.1
68. Bayesian filter
P(C | W) = P(C & W) / P(W)
How often do the word and the city appear together?
How often does the word appear overall?
Rank these, and you have a list of the words most particular to a given city
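The ratio above can be estimated directly from counts. The sketch below assumes the input is a list of (city, text) pairs and counts each word at most once per ad; the ads are invented toy data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def city_given_word(ads):
    """Estimate P(city | word) from (city, text) pairs."""
    word_total = Counter()            # ads containing the word
    word_city = defaultdict(Counter)  # word -> city -> ads containing the word
    for city, text in ads:
        for word in set(text.lower().split()):
            word_total[word] += 1
            word_city[word][city] += 1
    # P(C & W) / P(W) reduces to a ratio of the two counts.
    return {
        word: {city: n / word_total[word] for city, n in cities.items()}
        for word, cities in word_city.items()
    }


def top_words(probs, city, limit=3):
    """Rank words by P(city | word), breaking ties alphabetically."""
    ranked = sorted(probs, key=lambda w: (-probs[w].get(city, 0.0), w))
    return ranked[:limit]


# Hypothetical toy ads, invented for illustration.
ads = [
    ("Boston", "love the sox and npr"),
    ("Boston", "sox fan seeks reader"),
    ("Chicago", "cubs fan seeks reader"),
    ("Chicago", "love the cubs and the bears"),
]
probs = city_given_word(ads)
boston_words = top_words(probs, "Boston", limit=2)
```

On real data, rare words would dominate a ranking like this, so a minimum count per word is a common extra filter.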
69. Results
New York     Boston          Chicago
Mets         Pink            Cubs
Lounges      Sox             Burbs
Offense      Poetry          Bears
Desires      Intellectually  Girlie
Musical      Punk            Insecure
Submissive   Appreciation    Cheat
Create       Exercise        Importance
Song         Winter          Blunt
Oral         Education       Mouth
70. Results
Los Angeles    San Francisco
Excellent      Tee
Vegas          Employment
Meaningful     Picnic
Star           STD
Lame           Tasting
Industry       Hikes
Heat           French
Fitness        .com
Entertainment  Kayaking
Latino         Cycling
83. Other ideas
Finance
Analysts already drowning in info
Stories sometimes broken on blogs
Message boards show sentiment
Extremely low signal-to-noise ratio
84. Other ideas
Product problems/ideas
Use support message boards
Extract themes
Understand recurring issues
Learn what features people want
85. Other ideas
Entertainment
How much buzz is a movie generating?
What psychographic profiles like this type of movie?
Of interest to studios and media investors