Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
chris@fregly.com

Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Spark Author
Consultant, Trainer
advancedspark.com

Why After Dark?
Playboy After Dark
Late 1960s TV Show
Progressive Show For Its Time
And it rhymes!!

What is Spark?
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
…
BlinkDB: approx queries

What is Databricks?
Founded by the creators of Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring

Goals of After Dark?
① Generate high-quality recommendations
② Demonstrate Spark high-level libraries:
Spark Streaming -> Kafka, Approximations
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com. Not affiliated with Tinder in any way!

Focus of This Talk
①Parallelism
②Performance
③Real-time Streaming
④Approximations
⑤Similarity Measures
Spark and…

Brady Bunch circa 1980
Season 5, Episode 18: “Two Petes in a Pod”

Parallel Algorithm : O(log n)

Non-parallel Algorithm : O(n)
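
One way to read the O(log n) versus O(n) contrast is as a reduction over n values. The sketch below is an illustration only: it simulates the rounds of a pairwise tree reduction in a single process rather than running real workers, and the function names are made up for this example.

```python
def sequential_sum(values):
    """Fold left to right: n - 1 dependent additions, so O(n) steps."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise tree reduction: additions within a round are independent,
    so with enough workers the depth is ceil(log2(n)) rounds."""
    level, rounds = list(values), 0
    while len(level) > 1:
        # Every pair in this round could be added by a different worker.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element is carried to the next round
            nxt.append(level[-1])
        level, rounds = nxt, rounds + 1
    return level[0], rounds


data = list(range(1, 17))  # 16 values
seq_total, seq_steps = sequential_sum(data)  # 136 after 15 steps
par_total, par_rounds = tree_sum(data)       # 136 after 4 rounds
```

Both functions do the same total work; only the length of the dependency chain differs, which is what parallel hardware can exploit.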

Daytona Gray Sort Contest
On-disk only
250,000 partitions
No in-memory caching
(2014)(2013) (2014)

Improved Shuffle and Network Layer
①“Sort-based shuffle”
②Minimize OS resources
③Switched to async Netty
④Keep CPUs hot
⑤Reuse byte buffers to minimize GC
⑥Use epoll for I/O to stay in kernel space

Project Tungsten: CPU and Memory
① More JVM bytecode generation, JIT optimize
② CPU-cache-aware data structs and algos
③ Custom memory management
Serializers, HashMap

DataFrames and Catalyst
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!!
--> JVM bytecode generation

Columnar Storage Format
*Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
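
A minimal sketch of min-max chunk skipping, assuming a single sorted integer column split into fixed-size chunks. The Chunk class and function names are hypothetical and do not reflect any real file format's API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    values: list
    lo: int  # minimum value, stored alongside the chunk
    hi: int  # maximum value, stored alongside the chunk


def build_chunks(sorted_values, chunk_size):
    """Split a sorted column into chunks and record min/max per chunk."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        part = sorted_values[i:i + chunk_size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return matching values and the number of chunks actually read."""
    hits, chunks_read = [], 0
    for c in chunks:
        if c.hi <= threshold:  # no value in this chunk can match: skip it
            continue
        chunks_read += 1
        hits.extend(v for v in c.values if v > threshold)
    return hits, chunks_read


ages = sorted([18, 21, 22, 25, 29, 31, 34, 38, 41, 45, 52, 60])
chunks = build_chunks(ages, 4)
hits, chunks_read = scan_greater_than(chunks, 40)  # reads 1 of 3 chunks
```

On unsorted data the min/max ranges of the chunks overlap heavily, so few chunks can be skipped, which is why the slide limits the trick to sorted data.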

Parquet File Format
①Based on Google Dremel Paper
②Implemented by Twitter and Cloudera
③Columnar storage format
④Optimized for fast columnar aggregations
⑤Tight compression
⑥Supports pushdowns
⑦Nested, self-describing, evolving schema

Types of Compression
①Run Length Encoding
Repeated data
②Dictionary Encoding
Fixed set of values
③Delta, Prefix Encoding
Sorted dataset
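
The encodings above can be sketched in a few lines of plain Python. This is an illustration of the ideas only, not Parquet's on-disk layout; prefix encoding is omitted, and all function names are made up for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store each run as (value, run_length)."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store a dictionary plus small integer codes."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def delta_encode(sorted_values):
    """Sorted data: store the first value, then the gaps between neighbours."""
    if not sorted_values:
        return []
    gaps = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + gaps


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

For example, delta_encode([1000, 1003, 1004, 1010]) yields [1000, 3, 1, 6]: the gaps are much smaller than the original values and so need fewer bits.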

Types of Pushdowns
①Column, Partition Pruning
②Row, Predicate Filtering
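
A rough sketch of what the two kinds of pushdown save, assuming a toy in-memory table laid out as partition -> column -> values. The layout, the data, and the scan function are all hypothetical; a real data source applies the same ideas at the file and block level.

```python
# Hypothetical layout: partition key -> column name -> list of values.
table = {
    "country=US": {"user": ["ann", "bob"], "age": [31, 45], "bio": ["hiker", "chef"]},
    "country=FR": {"user": ["luc"], "age": [28], "bio": ["poet"]},
}


def scan(table, partition, columns, predicate_col, predicate):
    """Read one partition, only the needed columns, and only matching rows."""
    columns_read = set(columns) | {predicate_col}
    part = table[partition]                     # partition pruning
    cols = {c: part[c] for c in columns_read}   # column pruning
    keep = [i for i, v in enumerate(cols[predicate_col]) if predicate(v)]  # predicate filtering
    rows = [tuple(cols[c][i] for c in columns) for i in keep]
    return rows, sorted(columns_read)


rows, columns_read = scan(table, "country=US", ["user"], "age", lambda a: a > 40)
# rows is [("bob",)]; the "bio" column and the FR partition are never touched
```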

Direct Kafka Streaming (KafkaRDD)
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
At-least-once delivery guarantee
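
The offset-based recovery described above can be illustrated without Kafka or Spark. In this sketch a dict stands in for a topic, and the OffsetRange tuple and process function are hypothetical names; the point is only that a lost partition can be recomputed from its stored offsets, and that replaying it may write some messages twice.

```python
from collections import namedtuple

OffsetRange = namedtuple("OffsetRange", "partition from_offset until_offset")

# Hypothetical in-memory stand-in for a Kafka topic: partition -> ordered messages.
log = {0: ["a0", "a1", "a2", "a3"], 1: ["b0", "b1", "b2"]}


def read(offset_range):
    """Re-read exactly the messages that define one batch partition."""
    msgs = log[offset_range.partition]
    return msgs[offset_range.from_offset:offset_range.until_offset]


def process(offset_range, sink, fail_after=None):
    """Write each message to the sink; optionally crash part-way through."""
    for i, msg in enumerate(read(offset_range)):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("worker lost")
        sink.append(msg)


sink = []
batch = [OffsetRange(0, 0, 4), OffsetRange(1, 0, 3)]
for rng in batch:
    try:
        # Simulate a worker failure while handling partition 0.
        process(rng, sink, fail_after=2 if rng.partition == 0 else None)
    except RuntimeError:
        process(rng, sink)  # rebuild the lost partition from its offsets
```

After the run every message appears in the sink at least once, and "a0" and "a1" appear twice. Avoiding such duplicates would need an idempotent or transactional sink, which this sketch does not attempt.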

Count Min Sketch
① Approximate counters
② Better than HashMap
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
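
A minimal Count-Min Sketch in plain Python, to show where the fixed memory and the error bound come from. This is not Algebird's implementation; the class, the SHA-256 based hashing, and the default sizes are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """depth rows of width counters. Memory is fixed at width * depth,
    regardless of how many distinct items are counted.

    With width = ceil(e / eps) and depth = ceil(ln(1 / delta)), an estimate
    exceeds the true count by at most eps * N with probability 1 - delta,
    where N is the total of all counts added. The defaults correspond to
    eps = 0.01 and delta = 0.01.
    """

    def __init__(self, width=272, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
for _ in range(100):
    cms.add("alice")
cms.add("bob", 3)
```

Estimates never undercount: cms.estimate("alice") is at least 100 and at most 103, the total of everything added.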

HyperLogLog
① Measures set cardinality
Approx count distinct
② Low memory
1.5KB @ 2% error
10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
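
A compact HyperLogLog sketch, assuming 2^11 = 2048 registers and a SHA-256 based 64-bit hash. It is an illustration, not the Algebird or Spark implementation: the class name is made up, and it applies only the standard small-range correction.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        # 2**p registers; the alpha constant used in count() assumes p >= 7.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(20000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the registers
estimate = hll.count()
```

The typical relative error is about 1.04 / sqrt(m), roughly 2.3% for 2048 registers, and the registers are all the state that is kept no matter how many items are added.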

Types of Recommendations
①Non-personalized (2 out of 10)
Cold Start
No preference or behavior data for user, yet
②Personalized (8 out of 10)
User-Item Similarity
Items that others with similar prefs have liked
Item-Item Similarity

Audience Participation Needed!
①Navigate to sparkafterdark.com
②Click 3 actors and 3 actresses
You are here

Non-personalized Recommendations

Summary Statistics and Aggregations
① Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
SparkSQL + DataFrame: Aggregations
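
The aggregation itself is a group-by and count. The sketch below shows it in plain Python on hypothetical (from_user, to_user) like events, standing in for the DataFrame aggregation the slide refers to.

```python
from collections import Counter

# Hypothetical like events: (from_user, to_user).
likes = [
    ("ann", "bob"), ("cat", "bob"), ("dan", "bob"),
    ("ann", "cat"), ("bob", "cat"),
    ("ann", "dan"),
]


def top_liked(likes, k):
    """Count likes received per user and return the k most liked."""
    counts = Counter(to_user for _, to_user in likes)
    return counts.most_common(k)


top_two = top_liked(likes, 2)  # [("bob", 3), ("cat", 2)]
```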

Like Graph Analysis
② Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
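
The random-walk intuition in the quote can be sketched as a small power iteration. This is a plain-Python illustration on a hypothetical like graph, using a normalized variant where ranks sum to 1; it is not a reproduction of the GraphX implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration over a directed like graph (edge = liker -> liked)."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by users who like nobody is spread evenly over everyone.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        nxt = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += damping * rank[src] / len(out[src])
        rank = nxt
    return rank


# Hypothetical like graph: three users like "cat", and "cat" likes "ann".
edges = [("ann", "cat"), ("bob", "cat"), ("dan", "cat"), ("cat", "ann")]
ranks = pagerank(edges)
top_influencer = max(ranks, key=ranks.get)  # "cat"
```

Note that "ann" ranks well above "bob" and "dan" despite receiving a single like, because that like comes from the most influential user.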

Demo!
Spark SQL + DataFrames + GraphX

Types of Similarity
①Euclidean: linear measure
Magnitude bias
②Cosine: angle measure
Adjust for magnitude bias
③Jaccard: Set intersection divided by union
Popularity bias
④Log Likelihood
Adjust for pop. bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
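
Three of the measures above, sketched on like vectors. The vectors here are hypothetical (only the all-ones row matches the matrix above, whose column positions are otherwise not recoverable), and log likelihood is left out.

```python
import math


def euclidean(a, b):
    """Straight-line distance: sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Angle between vectors: ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """Size of the intersection divided by the size of the union of liked items."""
    liked_a = {i for i, x in enumerate(a) if x}
    liked_b = {i for i, y in enumerate(b) if y}
    return len(liked_a & liked_b) / len(liked_a | liked_b)


u = [1, 1, 0, 1, 0]  # hypothetical user who liked 3 of 5 people
w = [2, 2, 0, 2, 0]  # same taste as u, but twice the weight per like
v = [1, 1, 1, 1, 1]  # likes everyone
```

Here euclidean(u, w) is sqrt(3) even though u and w have identical taste, while cosine(u, w) is 1.0: that is the magnitude bias cosine adjusts for. jaccard(u, v) is 0.6 simply because v likes everyone, which illustrates the popularity bias.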

All-pairs Similarity Measure
①Compare everything to everything
② Also known as “pair-wise similarity” or “similarity join”
③Naïve shuffle: O(m*n^2); m=rows, n=cols
④Minimize shuffle: reduce data size & approx
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (0?)

Sampling Algo: DIMSUM
① “Dimension Independent Matrix Square Using MR”
②Remove rows with low similarity probability
③MLlib: RowMatrix.columnSimilarities(…)
④Twitter: 40% efficiency gain over Cosine

Bucket Algo: Locality Sensitive Hashing
① Split into b buckets using similarity hash algo
Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix
O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
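
A sketch of the bucketing idea, using MinHash signatures with banding as the similarity hash. That choice is an assumption for illustration (the slide does not name a hash family), and the function names and parameters are made up; this is not the spark-hash API.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes=20):
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items
        ))
    return tuple(signature)


def candidate_pairs(sets_by_key, num_hashes=20, bands=10):
    """Bucket keys by each band of their signature; only keys that share
    a bucket are compared, instead of comparing everything to everything."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for key, items in sets_by_key.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = signature[band * rows:(band + 1) * rows]
            buckets[(band, chunk)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical like sets: "a" and "b" are identical, "c" shares nothing.
liked = {
    "a": {"p1", "p2", "p3", "p4"},
    "b": {"p1", "p2", "p3", "p4"},
    "c": {"q1", "q2", "q3", "q4"},
}
pairs = candidate_pairs(liked)  # {("a", "b")}
```

Signature computation is the pre-processing step the slide mentions. More bands with fewer rows each catch less similar pairs at the cost of more candidates to verify.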

MLlib: SparseVector vs. DenseVector
① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Tip: Choose most frequent value … may not be 0
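
Why sparse vectors cut the cost: a dot product only needs to visit the stored entries. The sketch below uses plain dicts rather than MLlib's SparseVector, and the helper names are hypothetical.

```python
def to_sparse(dense, default=0):
    """Keep only entries that differ from the most frequent (default) value."""
    return {i: v for i, v in enumerate(dense) if v != default}


def sparse_dot(a, b):
    """Dot product over stored entries only; assumes the default value is 0.
    Cost is proportional to the smaller number of nonzeros, not the length."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)


x = to_sparse([0, 0, 3, 0, 0, 0, 2, 0])  # {2: 3, 6: 2}
y = to_sparse([1, 0, 4, 0, 0, 0, 0, 5])  # {0: 1, 2: 4, 7: 5}
dot = sparse_dot(x, y)                   # 12
```

If the most frequent value is not 0, storing offsets from that value keeps the representation small, but the arithmetic then has to account for the shifted default.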

Personalized Recommendations

Personalized Recommendation Terms
①User
User seeking likeable recommendations
②Item
User who has been liked
*Also a user seeking likeable recommendations!
③Types of Feedback
Explicit: rating, like
Implicit: search, click, hover, view, scroll

Collaborative Filtering Personalized Recs
③Like behavior of similar users
“I like the same people that you like.
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
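
The quoted idea can be sketched as a neighborhood-based recommender: weight each other user by how much their likes overlap with mine, then score the people they liked that I have not. This is a deliberately simple stand-in for illustration, not MLlib's matrix factorization; the data and function name are hypothetical.

```python
def recommend(likes, me, k=3):
    """likes maps user -> set of users they liked."""
    mine = likes[me]
    scores = {}
    for other, theirs in likes.items():
        if other == me:
            continue
        overlap = len(mine & theirs)
        if not overlap:
            continue  # no shared likes: this user says nothing about my taste
        similarity = overlap / len(mine | theirs)  # Jaccard similarity
        for candidate in theirs - mine - {me}:
            scores[candidate] = scores.get(candidate, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


likes = {
    "me": {"ann", "bea"},
    "u1": {"ann", "bea", "cyd"},  # shares both of my likes
    "u2": {"ann", "dee"},         # shares one of my likes
    "u3": {"eve"},                # shares none
}
suggestions = recommend(likes, "me")  # ["cyd", "dee"]
```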

Text-based Personalized Recs
④Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams.
We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
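
A small generator for k-skip n-grams, assuming the common definition in which an n-gram may skip at most k tokens in total. The function name and the sample profile text are hypothetical.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """All n-token subsequences (in order) that skip at most k tokens in total."""
    grams = []
    for start in range(len(tokens) - n + 1):
        # The last token of a gram may sit at most n - 1 + k positions after start.
        window = range(start + 1, min(start + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append((tokens[start],) + tuple(tokens[i] for i in rest))
    return grams


profile = "i love hiking and jazz".split()
bigrams = skip_grams(profile, n=2, k=2)   # 9 two-token grams
trigrams = skip_grams(profile, n=3, k=2)  # 10 three-token grams
```

With k = 0 this reduces to ordinary n-grams. Two profiles can then be compared by the overlap of their skip-gram sets.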

More Text-based Personalized Recs
⑤Similar profiles from my past likes
“Your profile shares a similar feature vector space to
others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
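
A sketch of TF/IDF document similarity on three hypothetical profiles, in plain Python rather than MLlib. It uses the smoothed IDF log((N + 1) / (df + 1)), under which words that appear in every profile get weight 0, and compares the resulting vectors by cosine.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        term_freq = Counter(doc)
        vectors.append({
            term: count * math.log((n + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


profiles = [
    "loves hiking and jazz",
    "loves jazz and hiking trips",
    "loves chess and coding",
]
vectors = tfidf_vectors(profiles)
close = cosine(vectors[0], vectors[1])  # share "hiking" and "jazz"
far = cosine(vectors[0], vectors[2])    # share only words common to all: 0.0
```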

More Text-based Personalized Recs
⑥Relevant, High-Value Emails
“Your initial email has similar named entities to my profile.
I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
(Figure labels: Her Email, My Profile)

Personalized Recommendations: The Future

Facial Recognition
⑦Eigenfaces
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Conversation Starter Bot
⑧NLP and DecisionTrees
“If your responses to my trite opening lines are positive,
I might actually read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis
Positive response ->
<- Negative response
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Compromise Recommendations (Couples)
⑨Pathway of Similarity
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
similar plots -> <- similar actors
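
The “something in between” idea can be sketched as a shortest path through an item-item similarity graph, taking the midpoint as the compromise. The graph below is hypothetical (the intermediate titles are placeholders), and plain breadth-first search stands in for the GraphX shortest-path step.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted similarity graph."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def compromise(graph, mine, yours):
    """Pick the item halfway along the shortest similarity path."""
    path = shortest_path(graph, mine, yours)
    return path[len(path) // 2] if path else None


# Hypothetical graph: an edge means "similar plot or shared actors".
similar = {
    "Mad Max": ["Title A"],
    "Title A": ["Mad Max", "Title B"],
    "Title B": ["Title A", "Title C"],
    "Title C": ["Title B", "Message In a Bottle"],
    "Message In a Bottle": ["Title C"],
}
pick = compromise(similar, "Mad Max", "Message In a Bottle")  # "Title B"
```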

⑩ The Final Recommendation

⑩ Get Off The Computer and Meet People!
linkedin.com/in/cfregly
github.com/cfregly
chris@fregly.com
@cfregly
Thank you!
Image courtesy of http://www.duchess-france.org/
Free trial at databricks.com
Try !!
