Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
1. Flux Capacitor AI: Bringing AI Back to the Future!
advancedspark.com
2. Who Am I?
Streaming Data Engineer
Netflix OSS Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced .
Due 2016
3. Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
Top 10 Most-active Spark Meetup!
3200+ Members in just 9 mos!!
3700+ Docker downloads (demos)
Meetup Mission
Code deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance
4. Live, Interactive Demo!
Audience Participation Required!!
Cell Phone Compatible!!!
demo.advancedspark.com
5. http://demo.advancedspark.com
[Diagram: demo architecture. Left-hand labels: End User, ElasticSearch, Spark ML, Data Scientist. Right-hand labels: Kafka, Spark Streaming, Cassandra, Redis, Zeppelin, iPython]
6. Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
7. Scaling with Parallelism
[Diagram labels: Peter, O(log n), O(log n), Worker Nodes]
8. Parallelism with Composability
[Diagram labels: Worker 1, Worker 2, Collect at Driver]
Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiply: (a * b * c * d) == (a * b) * (c * d)
What about Division and Average?
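To make the composability idea concrete, here is a minimal Python sketch. The helper name `two_worker_reduce` and the sample values are illustrative assumptions, not code from the talk. It reduces each half of a list separately, as two workers might, and then combines the two partial results.

```python
from functools import reduce
import operator

def two_worker_reduce(op, values):
    """Reduce each half on its own 'worker', then combine the two partial results."""
    mid = len(values) // 2
    left = reduce(op, values[:mid])    # partial result on worker 1
    right = reduce(op, values[mid:])   # partial result on worker 2
    return op(left, right)             # final combine step

values = [3, 4, 7, 8]

# Associative operations give the same answer whether split or not.
for name, op in [("max", max), ("add", operator.add), ("multiply", operator.mul)]:
    assert two_worker_reduce(op, values) == reduce(op, values), name

sets = [{1}, {2}, {1, 3}, {4}]
assert two_worker_reduce(operator.or_, sets) == reduce(operator.or_, sets)

# Division is not associative, so the split result differs from the sequential one.
sequential = reduce(operator.truediv, values)        # ((3 / 4) / 7) / 8
split = two_worker_reduce(operator.truediv, values)  # (3 / 4) / (7 / 8)
print(round(sequential, 4), round(split, 4))
```

The split and sequential results agree only when the operation is associative, which is the property a distributed reduce relies on.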
9. What about Division?
Division (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
What were the Egyptians thinking?!
Not Composable
“Divide like an Egyptian”
10. What about Average?
Overall AVG (values, counts)
(3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
Add, Add, Add? Composable!
Single-Node Divide at the End? Doesn’t need to be Composable!
AVG (3, 5, 5, 7) == 5
Pairwise AVG
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide? Not Composable
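The "add, add, add, then divide once" approach can be sketched in Python as follows. The function names `to_pair` and `combine` are hypothetical helpers chosen for this illustration. Each value becomes a (value, count) pair, pairs are added on each worker, and the division happens only once at the end.

```python
from functools import reduce

def to_pair(value):
    """Lift a value into a (sum, count) pair."""
    return (value, 1)

def combine(a, b):
    """Add two (sum, count) pairs element-wise; this step is associative."""
    return (a[0] + b[0], a[1] + b[1])

values = [3, 5, 5, 7]

worker1 = reduce(combine, map(to_pair, values[:2]))  # partial (sum, count) on worker 1
worker2 = reduce(combine, map(to_pair, values[2:]))  # partial (sum, count) on worker 2

total, count = combine(worker1, worker2)  # merge the partial pairs
average = total / count                   # single divide at the end
print(average)
```

Only the pair addition has to be composable; the final division runs on one node over two small numbers.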
11. Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
12. Similarities
13. Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude
14. Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalize to unit vectors in all dimensions
Used with real-valued vectors (versus binary)
org.jblas.DoubleMatrix
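The difference between the Euclidean and cosine measures can be illustrated with a small, dependency-free Python sketch. It is a stand-in for illustration only and does not use the org.jblas API named on the slide; the two sample vectors are made up.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

light_rater = [1.0, 2.0, 3.0]
heavy_rater = [2.0, 4.0, 6.0]   # same direction, twice the magnitude

print(euclidean_distance(light_rater, heavy_rater))  # non-zero: magnitudes differ
print(cosine_similarity(light_rater, heavy_rater))   # approximately 1: same direction
```

Two users with the same taste but different rating intensity look far apart under Euclidean distance and nearly identical under cosine similarity.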
15. Jaccard Similarity
Set similarity measurement
Set intersection / set union
Bias towards popularity
Works with binary vectors
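As a minimal sketch of "set intersection / set union", the following Python function computes Jaccard similarity over two tag sets. The movie tags are invented for the example, and returning 0.0 for two empty sets is a convention assumed here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

# Hypothetical tag sets
top_gun = {"action", "military", "aviation", "80s"}
a_few_good_men = {"military", "courtroom", "drama"}

print(jaccard_similarity(top_gun, a_few_good_men))  # 1 shared tag out of 6 distinct tags
```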
16. Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem
17. Word Similarity
Edit Distance
Misspellings and autocorrect
Word2Vec
Similar words are defined by similar contexts in vector space
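For the edit-distance bullet, a common formulation is Levenshtein distance (insertions, deletions, substitutions). The sketch below is one standard dynamic-programming version in Python; the slide does not say which variant the speaker had in mind, so treat the choice as an assumption.

```python
def edit_distance(source, target):
    """Levenshtein distance using a rolling row of the DP table."""
    previous = list(range(len(target) + 1))  # distance from "" to each target prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]                        # distance from source prefix to ""
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(edit_distance("recieve", "receive"))
```

An autocorrect feature could rank dictionary words by this distance from the typed word and suggest the closest ones.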
[Figure labels: English, Spanish]
18. Demo!
Find Synonyms with Word2Vec
19. Find Synonyms using Word2Vec
20. Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines
Doc2Vec
Similar documents are determined by similar contexts
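A textbook TF/IDF weighting can be sketched in a few lines of Python. This uses the plain, unsmoothed form tf * log(N / df) with whitespace tokenization; production libraries typically apply smoothing and better tokenizers, so the exact numbers will differ. The three documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_count = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(doc_count / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    "top gun is a navy pilot movie",
    "a few good men is a navy courtroom movie",
    "taps is a military school movie",
]
weights = tf_idf(docs)
print(weights[0])
```

Terms that appear in every document ("movie", "is", "a") get weight zero, while terms unique to one document score highest.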
21. Bonus! Text Rank Document Summary
Text Rank (aka Sentence Rank)
Surface summary sentences
TF/IDF + Similarity Graph + PageRank
Most similar sentence to all other sentences
TF/IDF + Similarity Graph
Most influential sentences
PageRank
22. Similarity Pathways (Recommendations)
Best recommendations for 2 (or more) people
“You like Mad Max. I like Message in a Bottle.
We might like a movie similar to both.”
Item-to-Item Similarity Graph + Dijkstra Heaviest Path
23. Demo!
Similarity Pathway for Movie Recommendations
24. Load Movies with Tags into DataFrame
[Screenshot callouts: My Choice, Their Choice]
25. Item-to-Item Tag Jaccard Similarity
Based on Tags
Calculate Jaccard Similarity (Tag Set Similarity)
Must be Above the Given Jaccard Similarity Threshold
26. Item-to-Item Tag Similarity Graph
Edge Weights == Jaccard Similarity (Based on Tag Sets)
27. TODO: Use Dijkstra to Find Heaviest Pathway
28. Calculating Exact Similarity
Brute-Force Similarity
Cartesian Product
O(n^2) shuffle and compute
aka All-pairs, Pair-wise, Similarity Join
29. Calculating Approximate Similarity
Goal: Reduce Shuffle
Approximate Similarity
Sampling
Bucketing or Clustering
Ignore pairs with a low probability of similarity
Locality Sensitive Hashing
Twitter Algebird MinHash
[Figure callout: Bucket By Genre]
30. Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
31. Recommendations
32. Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: user knows they are rating or liking, can choose to dislike
Implicit User Feedback: user not explicitly aware, cannot dislike (click, hover, etc)
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
33. Features
Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription
34. Feature Engineering
Dimension Reduction
Reduce number of features in feature space
Principal Component Analysis (PCA)
Find principal features that best describe data variance
Peel dimensional layers back
One-Hot Encoding
Convert nominal categorical feature values into 0’s and 1’s
Remove any numerical relationship between categories
Bears    -> 1   -->   Bears    -> [1.0, 0.0, 0.0]
49’ers   -> 2   -->   49’ers   -> [0.0, 1.0, 0.0]
Steelers -> 3   -->   Steelers -> [0.0, 0.0, 1.0]
Convert Each Item to Binary Vector with Single 1.0 Column
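The team example above can be reproduced with a short Python sketch. The function `one_hot_encode` is a hypothetical helper written for this illustration (it is not a Spark API), and it assigns column positions in first-seen order so the output matches the slide.

```python
def one_hot_encode(values):
    """Map each distinct value to a binary vector with a single 1.0 column."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))  # first-seen order defines the column
    vectors = []
    for value in values:
        vector = [0.0] * len(index)
        vector[index[value]] = 1.0
        vectors.append(vector)
    return index, vectors

teams = ["Bears", "49ers", "Steelers"]
index, vectors = one_hot_encode(teams)
for team, vector in zip(teams, vectors):
    print(team, "->", vector)
```

Unlike the integer codes 1, 2, 3, the vectors carry no implied ordering or distance between teams.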
35. Feature Normalization & Standardization
Goal
Scale features to standard size
Required by many ML algos
Normalize Features
Calculate L1 (or L2, etc.) norm, then divide each element by it
org.apache.spark.ml.feature.Normalizer
Standardize Features
Apply standard normal transformation (mean->0, stddev->1)
org.apache.spark.ml.feature.StandardScaler
http://www.mathsisfun.com/data/standard-normal-distribution.html
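The two transformations can be sketched without Spark as follows. This is a plain-Python illustration, not the `Normalizer` or `StandardScaler` implementation. It assumes the L2 norm for normalization and the sample standard deviation (n - 1 denominator) for standardization, and it assumes the column has at least two values that are not all equal.

```python
import math

def l2_normalize(vector):
    """Divide each element of one row by the row's L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def standardize(column):
    """Shift one feature column to mean 0 and scale it to standard deviation 1."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
    stddev = math.sqrt(variance)
    return [(x - mean) / stddev for x in column]

print(l2_normalize([3.0, 4.0]))
print(standardize([1.0, 2.0, 3.0]))
```

Note the different axes: normalization rescales each row (one instance), while standardization rescales each column (one feature across instances).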
36. Non-Personalized Recommendations
37. Cold Start Problem
“Cold Start” problem
New user, don’t know their preferences, must show something!
Movies with highest-rated actors
Top K aggregations
Facebook social graph
Friend-based recommendations
Most desirable singles
PageRank of likes and dislikes
38. Demo!
GraphFrame PageRank
39. Example: Dating Site “Like” Graph
40. PageRank of Top Influencers
41. Personalized Recommendations
42. Demo!
Personalized PageRank
43. Personalized PageRank: Outbound Links
0.15 = 1 - 0.85 (“Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random
[Figure callout: 85% Among Outbound Network]
44. Personalized PageRank: No Outbound
0.15 = 1 - 0.85 (“Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random
[Figure callout: 85% Among No Outbound Network!!]
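The damping-factor idea can be sketched as a small power iteration in Python. This is an illustrative implementation, not the GraphFrames or GraphX code used in the demo. Two assumptions are made here: every link target also appears as a key in the graph, and a node with no outbound links sends its 85% share through the same teleport distribution used for the 15% jump. The "like" graph is invented.

```python
def pagerank(graph, damping=0.85, iterations=100, source=None):
    """Power-iteration PageRank; pass `source` for the personalized variant."""
    nodes = sorted(graph)
    if source is None:
        teleport = {n: 1.0 / len(nodes) for n in nodes}   # jump anywhere at random
    else:
        teleport = {n: 1.0 if n == source else 0.0 for n in nodes}  # jump back to self
    rank = dict(teleport)
    for _ in range(iterations):
        # 15%: teleport step
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            outbound = graph[node]
            if outbound:
                # 85%: split this node's rank evenly among its outbound links
                share = damping * rank[node] / len(outbound)
                for target in outbound:
                    nxt[target] += share
            else:
                # No outbound links: assumed to follow the teleport distribution
                for n in nodes:
                    nxt[n] += damping * rank[node] * teleport[n]
        rank = nxt
    return rank

# Hypothetical dating-site "like" graph: who likes whom
likes = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"], "dave": []}

ranks = pagerank(likes)
from_alice = pagerank(likes, source="alice")
from_dave = pagerank(likes, source="dave")
print(max(ranks, key=ranks.get))
```

In the personalized run from "dave", who has no outbound links, all of the rank stays on "dave", which mirrors the "No Outbound" slide.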
45. User-to-User Clustering
User Similarity
Time-based
Pattern of viewing (binge or casual)
Time of viewing (am or pm)
Ratings-based
Content ratings or number of views
Average rating relative to others (critical or lenient)
Search-based
Search terms
46. Item-to-Item Clustering
Item Similarity
Profile text (TF/IDF, Word2Vec, NLP)
Categories, tags, interests (Jaccard Similarity, LSH)
Images, facial structures (Neural Nets, Eigenfaces)
Dating Site Example…
[Figure panels: Cluster Similar Eigenfaces, Cluster Similar Profiles, Cluster Similar Categories]
47. Bonus: NLP Conversation Starter Bot
“If your responses to my generic opening
lines are positive, I may read your profile.”
Spark ML, Stanford CoreNLP,
TF/IDF, DecisionTrees, Sentiment
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
48. Bonus: Demo!
Spark + Stanford CoreNLP Sentiment Analysis
49. Bonus: Top 100 Country Song Sentiment
50. Bonus: Surprising Results…?!
51. Item-to-Item Based Recommendations
Based on Metadata: Genre, Description, Cast, City
52. Demo!
Item-to-Item-based Recommendations
One-Hot Encoding + K-Means Clustering
53. One-Hot Encode Tag Feature Vectors
54. Cluster Movie Tag Feature Vectors
[Screenshot callout: Hyperparameter Tuning (K Clusters?)]
55. Analyze Movie Tag Clusters
56. User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions
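Steps ② and ③ can be illustrated with a tiny Python sketch. The factor values below are hand-picked, hypothetical rank-2 vectors, not the output of ALS or any fitted model; the point is only to show how multiplying the two small matrices produces a prediction for every user-item cell, including ones with no observed rating.

```python
def predict_rating(user_vector, item_vector):
    """Dot product of a user's and an item's latent-feature vectors."""
    return sum(u * v for u, v in zip(user_vector, item_vector))

# Hypothetical factors with k = 2 latent features (not fitted)
user_factors = {"alice": [1.0, 0.0], "bob": [0.0, 1.0], "carol": [0.5, 0.5]}
item_factors = {
    "Top Gun": [5.0, 1.0],
    "Taps": [4.0, 2.0],
    "A Few Good Men": [1.0, 5.0],
}

# Multiplying the factor matrices fills in every cell of the large matrix
approx_matrix = {
    user: {item: predict_rating(u, v) for item, v in item_factors.items()}
    for user, u in user_factors.items()
}
print(approx_matrix["carol"])
```

Three users and three items need 9 cells in the full matrix but only 12 factor values here; the saving grows quickly as the matrix gets larger and k stays small.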
57. Item-to-Item Collaborative Filtering
Famous Amazon Paper circa 2003
Problem
As users grew, user-to-item collaborative filtering didn’t scale
Solution
Item-to-item similarity, nearest neighbors
Offline (Batch)
Generate itemId->List[userId] vectors
Online (Real-time)
From cart, recommend nearest-neighbors in vector space
58. Demo!
Collaborative Filtering-based Recommendations
59. Fitting the Matrix Factorization Model
60. Show ItemFactors Matrix from ALS
61. Show UserFactors Matrix from ALS
62. Generating Individual Recommendations
63. Generating Batch Recommendations
64. Clustering + Collaborative Filtering Recs
Cluster matrix output from Matrix Factorization
Latent features derived from user-item interaction
Item-to-Item Similarity: Cluster item-factor matrix
User-to-User Similarity: Cluster user-factor matrix
65. Demo!
Clustering + Collaborative Filtering-based Recommendations
66. Show ItemFactors Matrix from ALS
67. Convert Item Factors to mllib.Vector
Required by K-Means Clustering Algorithm
68. Fit and Evaluate K-Means Cluster Model
[Screenshot callouts: K = 5 Clusters; Measures Closeness Of Points Within Clusters]
69. Netflix Genres and Clusters
Typical Genres
Documentary, Romance, Comedy, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy
70. Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
71. When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (approx # errors after a release)
Using machine learning or graph algos
Inherently probabilistic and approximate
Streaming aggregations
Inherently sloppy collection (exactly once?)
Approximate as much as you can get away with!
Ask for forgiveness later!!
72. When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…at the office.
73. A Few Good Algorithms
You can’t handle the approximate!
74. Common to These Algos & Data Structs
Low, fixed size in memory
Store large amount of data
Known error bounds
Tunable tradeoff between size and error
Less memory than Java/Scala collections
Rely on multiple hash functions or operations
Size of hash range defines error
75. Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
76. Bloom Filter
Approximate Set.contains(key)
No means No, Yes means Maybe
Elements can only be added
Never updated or removed
77. Bloom Filter in Action
[Figure panels: set(key) and contains(key): Boolean. Images by @avibryant]
Set.contains(key): TRUE -> maybe contains (other key hashes may overlap)
Set.contains(key): FALSE -> definitely does not contain (no key flipped all bits)
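"Hash multiple times and flip the bits wherever you land" can be sketched in Python as below. The class is a minimal illustration, not the Algebird implementation. Deriving the hash functions from seeded SHA-256 digests and the default sizes (1024 bits, 3 hashes) are assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per seeded hash function
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means definitely absent; True means maybe present
        return all(self.bits[position] for position in self._positions(key))

viewers = BloomFilter()
for user in ["user1001", "user2009", "user3005"]:
    viewers.add(user)
print(viewers.might_contain("user1001"))
```

There is no remove method because clearing a bit could erase evidence of other keys that hashed to the same position.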
78. CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
79. CountMin Sketch (CMS)
Approximate frequency count and TopK for key
i.e. “Heavy Hitters” on Twitter
[Pictured: Matei Zaharia, Martin Odersky, Donald Trump]
80. CountMin Sketch In Action (TopK Count)
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch rows holding Top Gun (x 2), A Few Good Men, and Taps, with some cells where other keys overlap Top Gun or A Few Good Men. 2 occurrences of “Top Gun” are used for slightly additional complexity. Images derived from @avibryant]
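The add and getCount operations above can be sketched in Python as follows. This is a minimal illustration rather than the Algebird implementation. The seeded SHA-256 hashing and the default table size (4 rows by 256 columns) are assumptions.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts: may overestimate, never underestimates."""

    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        # Add to one counter in every row
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum is the best estimate
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))
```

Taking the minimum across rows works because a counter can only be too high (another key landed on it), never too low.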
81. HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
82. HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution
Hash subsets of data with single, special hash func
Error estimate
14 bits for size of range
m = 2^14 = 16,384 hash slots
error = 1.04 / sqrt(16,384) = 0.81%
[Figure callout: Not many of these]
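The error estimate is simple enough to check with a few lines of Python. The function names are hypothetical. The memory calculation assumes 6 bits per hash slot, which is an assumption of this sketch and not something stated on the slide.

```python
import math

def hll_relative_error(index_bits):
    """Standard error 1.04 / sqrt(m), where m = 2^index_bits hash slots."""
    slots = 2 ** index_bits
    return 1.04 / math.sqrt(slots)

def hll_memory_bytes(index_bits, bits_per_slot=6):
    """Approximate size of the slot array; bits_per_slot=6 is an assumption."""
    return (2 ** index_bits) * bits_per_slot // 8

for bits in (10, 12, 14):
    print(bits, f"{hll_relative_error(bits):.2%}", hll_memory_bytes(bits), "bytes")
```

Each extra pair of index bits quadruples the memory and halves the error, which is the tunable size-versus-error tradeoff described earlier.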
83. HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales
[Figure: user IDs hashed onto a range of 0 to 16 (Top Gun: Hour 1) and 0 to 32 (Top Gun: Hour 2), then merged onto 0 to 32 (Top Gun: Hour 1 + 2). Users shown: 1001, 1005, 2001, 2009, 3001, 3002, 3003, 3005, 4009, 6001, 7002, 7009, 8001, 8002]
84. Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
85. Locality Sensitive Hashing (LSH)
Approximate set similarity
Pre-process m rows into b buckets
b << m; b = buckets, m = rows
Hash items multiple times
** Similar items hash to overlapping buckets
** Hash designed to cluster similar items
Compare just contents of buckets
Much smaller cartesian compare
** Compare in parallel !!
Avoids huge cartesian all-pairs compare
Chapter 3: LSH
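One common way to realize "pre-process items into buckets, compare within buckets" for sets is MinHash signatures with banding. The Python sketch below follows that approach as an assumption; the slides do not show the demo's actual code. The function names, the seeded SHA-256 hashing, the parameters (20 hashes, 5 bands), and the tag sets are all illustrative. Input sets must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(items, num_hashes=20):
    """For each seeded hash function, keep the minimum hash over the set."""
    def seeded_hash(seed, item):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]

def lsh_candidate_pairs(named_sets, num_hashes=20, bands=5):
    """Bucket each signature band; only names sharing a bucket become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, items in named_sets.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))  # compare within bucket only
    return pairs

# Hypothetical tag sets
movies = {
    "Top Gun": {"action", "military", "aviation", "80s"},
    "Top Gun (re-release)": {"action", "military", "aviation", "80s"},
    "Taps": {"military", "school", "drama"},
}
print(lsh_candidate_pairs(movies))
```

Only the candidate pairs then need an exact Jaccard comparison, and each bucket can be processed in parallel.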
86. DIMSUM
Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”
87. DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
40% efficiency gain over brute-force Cosine Sim
88. Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing
89. Twitter Algebird
Algebraic Fundamentals
Parallel
Associative
Composable
Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)
90. Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error
Add user views for given movie
PFADD TopGun_Hour1_HLL user1001 user2009 user3005
PFADD TopGun_Hour1_HLL user3003 user1001
Get distinct count (cardinality) of set
PFCOUNT TopGun_Hour1_HLL
Returns: 4 (distinct users viewed this movie)
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL
[Slide callouts: ignore duplicates; Tunable]
91. Approximations in Spark Libraries
Spark Core
countByKeyApprox(timeout: Long, confidence: Double)
PartialResult
Spark SQL
approxCountDistinct(column: Column)
HyperLogLogPlus
Spark ML
Stratified sampling
sampleByKey(fractions: Map[K, Double])
DIMSUM sampling
Probabilistic sampling reduces amount of shuffle
RowMatrix.columnSimilarities(threshold: Double)
92. Demo!
Exact Count vs. Approximate HLL and CMS Count
93. HashSet vs. HyperLogLog (Memory)
94. HashSet vs. CountMin Sketch (Memory)
95. Demo!
Exact Similarity vs. Approximate LSH Similarity
96. Brute Force Cartesian All Pair Similarity
47 seconds
97. Locality Sensitive Hash All Pair Similarity
6 seconds
98. Many More Demos!
or
Download Docker Clone on Github
http://advancedspark.com
99. Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
100. Netflix Recommendations
From Ratings to Real-time
101. Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.
The UK doesn’t have White Castle.
Renamed my favorite movie to: “Harold and Kumar Get the Munchies”
My favorite movie: “Harold and Kumar Go to White Castle”
Summary: Buy NFLX Stock!
This broke my unit tests!
102. Netflix Data Pipeline - Then
[Diagrams: pipeline v1.0 and v2.0]
103. Netflix Data Pipeline - Now (Keystone)
[Diagram: pipeline v3.0]
9 million events per second
22 GB per second!!
EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority
104. Netflix Recommendation Data Pipeline
Throw away batch user factors (U)
Keep batch video factors (V)
105. Netflix Trending Now (Time-based Recs)
Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)
[Diagram labels: “VHS”; Number of Plays; Number of Impressions; Calculate Take Rate]
106. Bonus: Pandora Time-based Recs
Work Days
Play familiar music
User is less likely to accept new music
Evenings and Weekends
Play new music
User is more likely to accept new music
107. $1 Million Netflix Prize (2006-2009)
Goal
Improve movie predictions by 10% (Root Mean Sq Error)
Test data withheld to calculate RMSE upon submission
5-star Ratings Dataset
(userId, movieId, rating, timestamp)
Winning algorithm(s)
10.06% improvement (RMSE)
Ensemble of 500+ ML models combined with GBDTs
Computationally impractical
108. Secrets to the Winning Algorithms
Adjust for the following human biases…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather
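The first few adjustments are often expressed as a baseline predictor: overall mean plus a per-movie offset plus a per-user offset, scored with RMSE. The Python sketch below is one simple way to compute such a baseline. It is not the prize-winning method, and the ratings and function names are made up for the illustration.

```python
import math
from collections import defaultdict

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def fit_baseline(ratings):
    """Return predict(user, movie) = overall mean + user bias + movie bias."""
    mean = sum(r for _, _, r in ratings) / len(ratings)

    by_movie = defaultdict(list)
    for _, movie, r in ratings:
        by_movie[movie].append(r - mean)          # how far the movie sits from the mean
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}

    by_user = defaultdict(list)
    for user, movie, r in ratings:
        by_user[user].append(r - mean - movie_bias[movie])  # what is left is the user's habit
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}

    def predict(user, movie):
        return mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return predict

# Hypothetical (userId, movieId, rating) rows
ratings = [("alice", "inception", 4), ("alice", "taps", 2),
           ("bob", "inception", 5), ("bob", "taps", 3)]
predict = fit_baseline(ratings)
predictions = [predict(u, m) for u, m, _ in ratings]
print(rmse(predictions, [r for _, _, r in ratings]))
```

Here "alice" rates half a star below average and "inception" is rated one star above average, so both offsets are removed before any personalization is attempted.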
109. Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
Ensembles!
110. Netflix Genres and Clusters
Typical Genres
Documentaries, Romance, Comedies, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy
111. Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend to new users based on friends
Helps with Cold Start problem
112. Netflix Search
No results? No problem… Show similar results!
Utilize extensive DVD Catalog
Metadata search (ElasticSearch)
Named entity recognition (NLP)
Empty searches are opportunity!
Explicit feedback for future recommendations
Content to buy and produce!
113. Netflix A/B Tests
Users tend to click on images featuring…
Faces with strong emotional expressions
Villains over heroes
Small number of cast members
114. Netflix Recommendation Serving Layer
Use Case: Recommendation service depends on EVCache
Problem: EVCache cluster goes down or becomes latent!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States
Closed: Service OK
Open: Service DOWN
Fallback to Static
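The closed/open/fallback behavior can be sketched in Python as below. This is a simplified illustration of the circuit breaker pattern, not the Hystrix API. The class, its parameters, and the two stand-in functions (`flaky_cache_lookup`, `static_recommendations`) are hypothetical.

```python
import time

class CircuitBreaker:
    """Closed: call the service. Open: skip the service and use the fallback."""

    def __init__(self, failure_threshold=3, reset_timeout_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    @property
    def is_open(self):
        if self.opened_at is None:
            return False
        # After the timeout, let one trial call through again
        return self.clock() - self.opened_at < self.reset_timeout_seconds

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip (or re-trip) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result

attempts = []

def flaky_cache_lookup():
    attempts.append(1)
    raise ConnectionError("cache cluster unavailable")

def static_recommendations():
    return ["Top Gun", "Taps"]

breaker = CircuitBreaker(failure_threshold=2, reset_timeout_seconds=60.0)
results = [breaker.call(flaky_cache_lookup, static_recommendations) for _ in range(5)]
print(len(attempts), results[-1])
```

Once the circuit is open, callers get the static fallback immediately instead of waiting on a failing dependency.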
115. Why Higher Average Ratings 2004+?
In 2004, Netflix noticed higher ratings on average
Some possible reasons why…
① Significant UI improvements deployed
② New recommendation engine deployed
③
116. Thank You, Everyone!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, California, USA
http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me on LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/
117. Flux Capacitor AI: Bringing AI Back to the Future!
http://advancedspark.com
@cfregly