1. The document discusses various techniques for generating high-quality recommendations using Apache Spark including parallelism, performance optimizations, real-time streaming, and machine learning algorithms.
2. It demonstrates Spark's high-level libraries like Spark Streaming, Spark SQL, GraphX, and MLlib for tasks such as generating recommendations, computing page rank, and training word embedding models.
3. The goals of the talk are to show how to build a recommendation engine in Spark that can perform personalized recommendations using techniques like collaborative filtering, content-based filtering, and similarity joins.
6. What is Databricks?
Founded by the creators of Apache Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring
7. Goals of After Dark?
①Generate high-quality recommendations
②Demonstrate Spark high-level libraries:
Spark Streaming -> Kafka, Approximates
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com. Not affiliated with Tinder in any way!
16. Daytona Gray Sort Contest
On-disk only
250,000 partitions
No in-memory caching
17. Improved Shuffle and Network Layer
①“Sort-based shuffle”
②Minimize OS resources
③Switched to async Netty
④Keep CPUs hot
⑤Reuse byte buffers to minimize GC
⑥Use epoll for I/O to stay in kernel space
18. Project Tungsten: CPU and Memory
①More JVM bytecode generation, JIT optimize
②CPU-cache-aware data structs and algos
③Custom memory management
Serializers HashMap
21. Parquet File Format
①Based on Google Dremel Paper
②Implemented by Twitter and Cloudera
③Columnar storage format
④Optimized for fast columnar aggregations
⑤Tight compression
⑥Supports pushdowns
⑦Nested, self-describing, evolving schema
22. Types of Compression
①Run Length Encoding
Repeated data
②Dictionary Encoding
Fixed set of values
③Delta, Prefix Encoding
Sorted dataset
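The three encodings above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not Parquet's actual on-disk encoders; the function names are made up for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code from a lookup table."""
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in values]
    # list(codes) is the dictionary: position i holds the value for code i.
    return list(codes), encoded


def delta_encode(sorted_values):
    """Keep the first value, then only the gaps between neighbours."""
    if not sorted_values:
        return []
    gaps = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + gaps


print(run_length_encode(["a", "a", "a", "b", "b", "a"]))
print(dictionary_encode(["US", "UK", "US", "US"]))
print(delta_encode([100, 101, 103, 110]))
```

Each scheme pays off on the kind of data listed next to it on the slide: long runs, a small set of distinct values, or sorted values with small gaps.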
25. Direct Kafka Streaming (KafkaRDD)
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
At-least-once delivery guarantee
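The recovery idea, rebuilding a lost partition from its stored offsets, can be illustrated without Kafka or Spark. In this sketch a Python list stands in for one Kafka topic partition, and the `OffsetRange` class is a hypothetical stand-in written for this example, not Spark's own class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical description of the records one partition covers."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def read_range(log, offset_range):
    """Re-read exactly the records that an offset range describes."""
    return log[offset_range.from_offset:offset_range.until_offset]


# Made-up in-memory stand-in for a single Kafka topic partition.
partition_log = ["like:1->2", "like:3->2", "like:2->1", "like:4->2"]
batch = OffsetRange(topic="likes", partition=0, from_offset=1, until_offset=3)

first_attempt = read_range(partition_log, batch)
# After a simulated worker failure, the same range yields the same records,
# so nothing has to be copied into a write-ahead log beforehand.
retry = read_range(partition_log, batch)
print(first_attempt == retry)
```

Because a retry re-reads the same records, any output that is not idempotent may see them twice, which is why the guarantee is at-least-once.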
27. Count Min Sketch
① Approximate counters
② Better than HashMap
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
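A minimal Count-Min Sketch in plain Python shows why memory stays fixed: every item is folded into the same small table, whatever the number of distinct items. This is a sketch under simple assumptions (SHA-256 as the hash family, arbitrary default sizes), not Algebird's implementation.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in a fixed depth x width table."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum across rows
        # is the tightest estimate and is never below the true count.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for liked_user in ["alice"] * 5 + ["bob"] * 2:
    sketch.add(liked_user)
print(sketch.estimate("alice"), sketch.estimate("bob"))
```

Widening the table lowers the overestimate and adding rows lowers the chance of a bad estimate, which is where the known error bounds come from.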
28. HyperLogLog
① Measures set cardinality
Approx count distinct
② Low memory
1.5KB @ 2% error
10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
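The following is a compact HyperLogLog sketch in plain Python, included to show how a fixed array of small registers can estimate a distinct count. It assumes SHA-256 as the hash and 2^10 registers, and it is not the Algebird or Spark implementation. With m registers the standard error is roughly 1.04/sqrt(m), so about 3% here.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counting with 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for visit in range(60000):
    hll.add(f"user-{visit % 20000}")   # 20,000 distinct users, each seen 3 times
print(round(hll.count()))
```

Duplicates never change a register, so repeated items cost nothing beyond the hash.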
30. Types of Recommendations
①Non-personalized (2 out of 10)
Cold Start
No preference or behavior data for user, yet
②Personalized (8 out of 10)
User-Item Similarity
Items that others with similar prefs have liked
Item-Item Similarity
34. Summary Statistics and Aggregations
①Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
SparkSQL + DataFrame: Aggregations
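The aggregation itself is a group-by-and-count followed by a sort. A plain-Python sketch on made-up like events (the `likes` data is hypothetical) makes the shape of the query concrete; on a cluster the same logic would be expressed as a DataFrame aggregation.

```python
from collections import Counter

# Hypothetical (from_user, to_user) like events.
likes = [(1, 2), (3, 2), (4, 2), (1, 3), (2, 3), (2, 1)]

# Group by the liked user and count, then keep the two most-liked users.
like_counts = Counter(to_user for _, to_user in likes)
top_users = like_counts.most_common(2)
print(top_users)  # [(2, 3), (3, 2)]
```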
35. Like Graph Analysis
②Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
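The random-walk intuition in the quote is what PageRank computes. Below is a small power-iteration sketch in plain Python on a made-up like graph. The damping factor and iteration count are conventional defaults chosen for illustration, and this is not GraphX's implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node gets the "random jump" share first.
        new_rank = {node: (1 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node that likes nobody spreads its rank over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical like graph: an edge (a, b) means user a liked user b.
like_edges = [(1, 2), (3, 2), (4, 2), (2, 1)]
ranks = pagerank(like_edges)
print(max(ranks, key=ranks.get))  # user 2 collects the most rank
```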
38. Types of Similarity
①Euclidean: linear measure
Magnitude bias
②Cosine: angle measure
Adjust for magnitude bias
③Jaccard: Set intersection divided by union
Popularity bias
④Log Likelihood
Adjust for pop. bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
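Three of the four measures are short enough to sketch directly in plain Python (log likelihood is left out here). The vectors and like sets in the example are made up to show the magnitude bias mentioned above; they are not taken from the table.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Same direction, different magnitude: far apart by Euclidean distance,
# yet (up to rounding) identical by cosine similarity.
print(euclidean_distance([1, 1, 0], [2, 2, 0]))
print(cosine_similarity([1, 1, 0], [2, 2, 0]))

# Hypothetical like sets for two users.
print(jaccard_similarity({"Ali", "Matei"}, {"Matei", "Reynold"}))
```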
39. All-pairs Similarity Measure
①Compare everything to everything
②aka. “pair-wise similarity” or “similarity join”
③Naïve shuffle: O(m*n^2); m=rows, n=cols
④Minimize shuffle: reduce data size & approx
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (0?)
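To make the O(m*n^2) cost concrete, here is a naive all-pairs column similarity in plain Python: every one of the n*(n-1)/2 column pairs needs a pass over all m rows. The function name and the tiny matrix are made up for this illustration.

```python
import math
from itertools import combinations


def all_pairs_cosine(rows):
    """Cosine similarity for every pair of columns of a dense matrix."""
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]

    sims = {}
    for j, k in combinations(range(n_cols), 2):        # n*(n-1)/2 pairs
        dot = sum(x * y for x, y in zip(columns[j], columns[k]))  # m multiplies
        if norms[j] and norms[k]:
            sims[(j, k)] = dot / (norms[j] * norms[k])
    return sims


# Hypothetical 3 x 3 like matrix (rows = users, columns = liked users).
matrix = [[1, 1, 0],
          [1, 0, 1],
          [1, 1, 0]]
exact = all_pairs_cosine(matrix)
print(sorted(exact))
```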
40. Sampling Algo: DIMSUM
①“Dimension Independent Matrix Square Using MapReduce”
②Remove rows with low similarity probability
③MLlib: RowMatrix.columnSimilarities(…)
④Twitter: 40% efficiency gain over Cosine
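The sampling idea can be sketched in plain Python as follows. This is a simplified reading of the DIMSUM scheme, not MLlib's code: entries from columns with a large norm are kept with a probability that shrinks as the norm grows, and the kept products are rescaled so the result remains an unbiased estimate of cosine similarity. The function name, the `gamma` parameter name and the input layout are assumptions made for this example.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def dimsum_column_similarities(sparse_rows, gamma, seed=0):
    """Estimate column cosine similarities from rows given as {col: value}."""
    rng = random.Random(seed)

    # Column norms, computed from the nonzero entries only.
    squared = defaultdict(float)
    for row in sparse_rows:
        for j, v in row.items():
            if v:
                squared[j] += v * v
    norm = {j: math.sqrt(s) for j, s in squared.items()}

    sg = math.sqrt(gamma)
    keep_prob = {j: min(1.0, sg / norm[j]) for j in norm}   # heavy columns sampled less
    scale = {j: min(sg, norm[j]) for j in norm}             # rescaling keeps the estimate unbiased

    sims = defaultdict(float)
    for row in sparse_rows:
        kept = [(j, v / scale[j]) for j, v in sorted(row.items())
                if v and rng.random() < keep_prob[j]]
        for (j, vj), (k, vk) in combinations(kept, 2):
            sims[(j, k)] += vj * vk
    return dict(sims)


# Hypothetical sparse like matrix; an infinite gamma disables sampling,
# so the result is the exact cosine similarity.
rows = [{0: 1, 1: 1}, {0: 1, 2: 1}, {0: 1, 1: 1}]
print(dimsum_column_similarities(rows, gamma=float("inf")))
```

A smaller `gamma` emits fewer pairs and therefore shuffles less data, at the price of a noisier estimate for weakly similar columns.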
41. Bucket Algo: Locality Sensitive Hashing
① Split into b buckets using similarity hash algo
Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix
O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
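One common similarity hash for sets is MinHash with banding, sketched below in plain Python. It is an illustration of the bucketing idea only and is not taken from the spark-hash project; the function names, the hash choice (SHA-256) and the band sizes are assumptions. Only items that land in the same bucket are compared, which is what cuts the number of pairs.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes=20):
    """One minimum hash value per salted hash function (items must be non-empty)."""
    return tuple(
        min(int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items)
        for seed in range(num_hashes)
    )


def candidate_pairs(sets_by_id, num_hashes=20, bands=10):
    """Bucket signatures band by band; only bucket-mates become candidates."""
    rows_per_band = num_hashes // bands
    buckets = defaultdict(set)
    for key, items in sets_by_id.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = signature[band * rows_per_band:(band + 1) * rows_per_band]
            buckets[(band, chunk)].add(key)

    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical like sets: a and b are identical, c shares nothing with them.
like_sets = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {7, 8, 9}}
print(candidate_pairs(like_sets))
```

Each bucket can be processed independently, which is what makes the comparison step parallel.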
42. MLlib: SparseVector vs. DenseVector
① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Tip: Choose most frequent value … may not be 0
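The saving from sparse vectors comes from touching only the stored entries. A plain-Python sketch with dicts as sparse vectors (not MLlib's `SparseVector` class) shows the idea; the example data is made up.

```python
def to_sparse(dense, most_frequent=0):
    """Keep only the entries that differ from the most frequent value."""
    return {i: v for i, v in enumerate(dense) if v != most_frequent}


def sparse_dot(a, b):
    """Dot product of two {index: value} vectors whose implicit value is 0."""
    if len(a) > len(b):
        a, b = b, a                      # iterate over the shorter vector
    return sum(v * b[i] for i, v in a.items() if i in b)


dense = [0, 0, 3, 0, 0, 0, 1, 0]
sparse = to_sparse(dense)
print(sparse)                               # {2: 3, 6: 1}
print(sparse_dot(sparse, {6: 2, 7: 5}))     # 2
```

The work per pair now depends on the number of stored entries (nnz) rather than the full column count n.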
44. Personalized Recommendation Terms
①User
User seeking likeable recommendations
②Item
User who has been liked
*Also a user seeking likeable recommendations!
③Types of Feedback
Explicit: rating, like
Implicit: search, click, hover, view, scroll
45. Collaborative Filtering Personalized Recs
③Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
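The question in the quote can be answered with a simple neighbourhood method, sketched below in plain Python. Note the swap: this uses Jaccard similarity between users' like sets rather than the matrix factorization named on the slide, because it fits in a few lines. The `likes` data is made up.

```python
def recommend(likes, user, top_n=3):
    """Suggest people liked by users whose likes overlap with `user`'s."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        union = mine | theirs
        similarity = len(mine & theirs) / len(union) if union else 0.0
        if similarity == 0:
            continue                      # ignore users with nothing in common
        # Candidates: people they liked that I have not seen (and not myself).
        for candidate in theirs - mine - {user}:
            scores[candidate] = scores.get(candidate, 0.0) + similarity
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


# Hypothetical like sets, keyed by the user who did the liking.
likes = {
    "Kimberly": {"Ali", "Matei", "Reynold"},
    "Leslie": {"Ali", "Matei", "Patrick"},
    "Meredith": {"Andy"},
}
print(recommend(likes, "Kimberly"))  # ['Patrick']
```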
46. Text-based Personalized Recs
④Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams. We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
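A plain-Python TF/IDF plus cosine similarity sketch shows how profile text turns into comparable vectors. It uses whitespace tokenization and a smoothed IDF of log((n + 1) / (df + 1)); both are simplifying assumptions, the profile texts are made up, and Word2Vec is not covered here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a {term: tf * idf} sparse vector."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Terms that appear in every document get weight log(1) = 0.
        vectors.append({term: count * math.log((n + 1) / (doc_freq[term] + 1))
                        for term, count in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical profile texts.
profiles = ["i love hiking and spark",
            "i love hiking and cooking",
            "i collect stamps"]
vectors = tfidf_vectors(profiles)
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))  # True
```

The word "i" appears in every profile, so it carries no weight and the first and third profiles end up with zero similarity.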
47. More Text-based Personalized Recs
⑤Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
48. More Text-based Personalized Recs
⑥Relevant, High-Value Emails
“Your initial email has similar named entities to my profile. I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
(Diagram: Her Email vs. My Profile)
50. Facial Recognition
⑦Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
51. Conversation Starter Bot
⑧NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I might actually read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis
(Diagram: positive response vs. negative response)
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
53. Compromise Recommendations (Couples)
⑨Pathway of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
(Diagram: similar plots -> … <- similar actors)
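The "something in between" idea can be sketched as a shortest path over an item-item similarity graph. The plain-Python breadth-first search below is only an illustration: the graph is hypothetical, and "Movie A" and "Movie B" are placeholder titles, not real recommendations.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None                           # the two items are not connected


# Hypothetical similarity graph: an edge joins two sufficiently similar titles.
similar_to = {
    "Mad Max": ["Movie A"],
    "Movie A": ["Mad Max", "Movie B"],
    "Movie B": ["Movie A", "Message In a Bottle"],
    "Message In a Bottle": ["Movie B"],
}
path = shortest_path(similar_to, "Mad Max", "Message In a Bottle")
compromises = path[1:-1]                  # everything strictly in between
print(compromises)  # ['Movie A', 'Movie B']
```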
55. ⑩ Get Off The Computer and Meet People!
linkedin.com/in/cfregly
github.com/cfregly
chris@fregly.com
@cfregly
Thank you!
Image courtesy of http://www.duchess-france.org/
Free trial at databricks.com
Try Databricks!!