Dublin Ireland Spark Meetup October 15, 2015
1. Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
chris@fregly.com, IBM Spark Technology Center (spark.tc)
2. Who am I?
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced Spark (2016)
3. Advanced Apache Spark Meetup
Total Spark Experts: ~1300 in 3 mos!
Top 5 most active Spark Meetup globally!
Main Goals
Dig deep into the Spark & extended-Spark codebase
Study integrations such as Cassandra, ElasticSearch,
Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
Surface and share the patterns and idioms of these
well-designed, distributed, big data components
4. Why “Spark After Dark”?
“Playboy After Dark”
Late 1960s TV Show
Progressive Show For Its Time
And it rhymes!!
7. Tools of this Talk
① Redis
② Docker
③ Cassandra
④ MLlib, GraphX
⑤ Parquet, JSON
⑥ Apache Zeppelin
⑦ Spark Streaming, Kafka
⑧ Spark SQL, DataFrames
⑨ Spark JDBC/ODBC Hive ThriftServer
⑩ ElasticSearch, Logstash, Kibana (ELK)
and…
8. SMACK Stack!
① Spark (Data Processing)
② Mesos (Cluster Manager)
③ Akka (Actors)
④ Cassandra (NoSQL)
⑤ Kafka (Streaming)
9. Themes of This Talk
①Parallelism
②Performance
③Streaming
④Approximations
⑤Similarity Measures
⑥Recommendations
and…
10. Goals of Spark After Dark?
①Generate high-quality recommendations
②Demonstrate high-level libraries:
  Spark Streaming -> Kafka, Approximations
  Spark SQL -> DataFrames, Cassandra
  GraphX -> PageRank, Shortest Path
  MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com, however not affiliated with Tinder in any way.
18. Daytona Gray Sort Contest
① On-disk only
② 28,000 partitions
③ No in-memory caching
19. Improved Shuffle and Network Layer
①“Sort-based shuffle”
②Minimize OS resources
③Switched to async Netty
④Keep CPUs hot
⑤Reuse byte buffers to minimize GC
⑥Use epoll for I/O to stay in kernel space
20. Project Tungsten: CPU and Memory
①More JVM bytecode generation, JIT optimize
②CPU-cache-aware data structs and algos
③Custom memory management
Figure labels: Serializers, Performance, HashMap
21. DataFrames and Catalyst Optimizer
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!
JVM bytecode generation
23. Parquet File Format
①Based on Google Dremel Paper
②Implemented by Twitter and Cloudera
③Columnar storage format
④Optimized for fast columnar aggregations
⑤Tight compression
⑥Supports pushdowns
⑦Nested, self-describing, evolving schema
24. Types of Compression
①Run Length Encoding
Repeated data
②Dictionary Encoding
Fixed set of values
③Delta, Prefix Encoding
Sorted dataset
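The three encodings above can be illustrated with a minimal Python sketch. This is not Parquet's actual on-disk implementation; the function names are hypothetical and the code only shows the idea behind each encoding.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code from a lookup table."""
    codes = {}
    encoded = []
    for value in values:
        # A new value gets the next free code; a known value reuses its code.
        encoded.append(codes.setdefault(value, len(codes)))
    dictionary = list(codes)  # position i holds the value with code i
    return dictionary, encoded


def delta_encode(sorted_values):
    """Store the first value, then each difference from the previous value."""
    if not sorted_values:
        return []
    deltas = [sorted_values[0]]
    for prev, cur in zip(sorted_values, sorted_values[1:]):
        deltas.append(cur - prev)
    return deltas
```

Run-length encoding pays off on repeated data, dictionary encoding on a fixed set of values, and delta encoding on sorted data where neighboring values are close together.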
25. Types of Query Optimizations
①Column, Partition Pruning
②Row, Predicate Pushdown
SELECT b FROM table WHERE a IN (a2, a3)
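A rough sketch of how pruning and pushdown cut the work for a query like the one above, assuming a toy columnar layout where each row group is a dict of column name to list of values. The layout and the function name are hypothetical; a real Parquet reader keeps min/max statistics in file metadata rather than recomputing them as this sketch does.

```python
def select_b_where_a_in(row_groups, wanted):
    """Return (values of column b, number of row groups scanned)."""
    wanted = set(wanted)
    if not wanted:
        return [], 0
    lo, hi = min(wanted), max(wanted)
    results, groups_scanned = [], 0
    for group in row_groups:
        a_values = group["a"]
        # Predicate pushdown: skip the whole group when the min/max range of
        # column a cannot contain any wanted value.
        if max(a_values) < lo or min(a_values) > hi:
            continue
        groups_scanned += 1
        # Column pruning: only columns a and b are read; any other column
        # in the group is never touched.
        b_values = group["b"]
        results.extend(b for a, b in zip(a_values, b_values) if a in wanted)
    return results, groups_scanned
```

With three row groups whose `a` ranges are 1-3, 10-12, and 20-21, a filter on `a IN (2, 3)` scans only the first group.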
27. Direct Kafka Streaming - KafkaRDD
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
At-least-once delivery guarantee
29. Count Min Sketch
①Approximate counters
②More memory-efficient than a HashMap of exact counts
③Low, fixed memory
④Known error bounds
⑤Large num of counters
⑥From Twitter’s Algebird
⑦Streaming example in codebase
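To make the idea concrete, here is a minimal count-min sketch in plain Python. It is not Algebird's implementation; the class name, the default `width` and `depth`, and the use of SHA-256 as the hash family are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Memory stays at `width * depth` counters no matter how many distinct items are added, which is the trade-off against an exact HashMap: bounded overestimates in exchange for low, fixed memory.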
31. Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials
Converge on expected value
SparkPi example in codebase
Pi ~ 4 * (# red dots / # total dots)
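The Pi estimate can be sketched in a few lines of single-machine Python. This is not the SparkPi source; `estimate_pi` and its `seed` parameter are hypothetical names, and "red dots" is taken to mean the points that land inside the quarter circle.

```python
import random


def estimate_pi(num_trials, seed=42):
    """Throw random points at the unit square; count hits inside the circle."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(num_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / num_trials
```

By the Law of Large Numbers, the estimate tightens as `num_trials` grows. Each trial is independent, which is why the computation parallelizes so easily.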
35. Types of Recommendations
Non-personalized
Cold Start
No preference or behavior data for user, yet
Personalized
User-Item Similarity
Items that others with similar prefs have liked
Item-Item Similarity
Items similar to your previously-liked items
37. Summary Statistics and Aggregations
①Top Users by Like Count
“I might like users with the highest sum aggregation of
likes overall.”
SparkSQL + DataFrame: Aggregations
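As a plain-Python stand-in for the aggregation described above, the sketch below counts likes received per user. The `(from_user, to_user)` pair layout and the function name are assumptions; in Spark SQL this would roughly correspond to a group-by count sorted in descending order.

```python
from collections import Counter


def top_users_by_likes(likes, n=3):
    """likes: iterable of (from_user, to_user) pairs. Returns the most-liked users."""
    counts = Counter(to_user for _, to_user in likes)
    return counts.most_common(n)
```

This is the non-personalized baseline: every user sees the same ranking, so it works even when nothing is known about the person asking.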
38. Like Graph Analysis
②Top Influencers by Like Graph
“I might like users who have the highest probability of
me liking them randomly while walking the like graph.”
GraphX: PageRank
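The random-walk intuition above can be sketched with a generic power-iteration PageRank in plain Python. This is not GraphX's implementation (its normalization differs); the `damping` and `iterations` defaults and the edge-list input are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (liker, liked) pairs. Returns node -> rank, summing to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node gets a baseline share from random jumps.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing likes spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

A user liked by many others, or by a few highly ranked others, ends up with a high rank, which matches the "top influencer" reading of the like graph.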
41. Types of Similarity
Euclidean: linear measure
Magnitude bias
Cosine: angle measure
Adjust for magnitude bias
Jaccard: (intersection / union)
Popularity bias
Log Likelihood
Adjust for popularity bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
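A minimal sketch of the first three measures in plain Python, using hypothetical function names. Log likelihood is left out. The inputs are assumed to be equal-length numeric vectors (Euclidean, cosine) or sets of liked users (Jaccard).

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

The magnitude bias shows up with `[1, 2]` and `[2, 4]`: they point in the same direction, so their cosine similarity is 1, yet their Euclidean distance is greater than zero.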
42. All-Pairs Similarity Comparison
Compare everything to everything
a.k.a. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols
Must minimize shuffle through approximations
Reduce m (rows)
Sampling and bucketing
Reduce n (cols): Remove most frequent value (i.e., 0)
43. Reduce m: DIMSUM Sampling
Dimension Independent Matrix Square Using MapReduce
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)
Twitter: 40% efficiency gain over Cosine
44. Reduce m: LSH Bucketing
Locality Sensitive Hashing
Split m into b buckets
Use similarity hash algo
Requires pre-processing of data
Compare bucket contents in parallel
Converts O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
500k x 500k matrix
O(1.25E17) -> O(1.25E13); b=50
github.com/mrsqueeze/spark-hash
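The bucketing idea can be sketched with MinHash signatures split into bands, in plain Python. This is not the spark-hash library's API; the function names, the SHA-256 hash family, and the `num_hashes` and `bands` defaults are assumptions. Each input set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, item):
    """Deterministic hash of an item, salted by seed to get a hash family."""
    return int(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest(), 16)


def minhash_signature(items, num_hashes):
    """One minimum hash per seed; similar sets tend to share many of them."""
    return tuple(min(_hash(seed, item) for item in items) for seed in range(num_hashes))


def lsh_candidate_pairs(sets_by_key, num_hashes=12, bands=4):
    """Bucket keys by signature band; only keys sharing a bucket get compared."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for key, items in sets_by_key.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = signature[band * rows:(band + 1) * rows]
            buckets[(band, chunk)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Computing the signatures is the pre-processing step. After that, each bucket can be compared independently and in parallel, and pairs that never share a bucket are never compared at all.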
45. Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Choose most frequent value – may not be zero!
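A small sketch of the (index, value) representation in plain Python, with hypothetical function names. It assumes a non-empty row and picks the most frequent value, whatever it is, as the one to drop.

```python
from collections import Counter


def to_sparse(row):
    """Drop the most frequent value; keep the rest as (index, value) pairs."""
    default, _ = Counter(row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(row) if v != default]
    return default, len(row), pairs


def to_dense(default, length, pairs):
    """Rebuild the original row from its sparse form."""
    row = [default] * length
    for i, v in pairs:
        row[i] = v
    return row
```

For the row `[1, 1, 0, 1, 5, 1]` the most frequent value is 1, not 0, so only two pairs are stored: `(2, 0)` and `(4, 5)`.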
47. Terminology of Recommendations
User
User seeking recommendations
Item
Item that has been liked or rated
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
48. Collaborative Filtering Personalized Recs
③Like behavior of similar users
“I like the same people that you like.
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
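The quote above can be sketched as a neighborhood-based recommender in plain Python. Note that this swaps in user-user Jaccard similarity in place of MLlib's matrix factorization, which is a different technique; the function name and the `likes` dict layout are assumptions.

```python
def recommend(likes, user, top_n=3):
    """likes: dict of user -> set of liked items. Returns items the user has not seen."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        union = mine | theirs
        similarity = len(mine & theirs) / len(union) if union else 0.0
        if similarity == 0.0:
            continue  # no shared likes, so this user's taste tells us nothing
        for item in theirs - mine:
            # Weight each unseen item by how similar its liker is to the user.
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

Users with overlapping likes contribute their other likes as candidates, so the result answers "what other people did you like that I haven't seen?" directly.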
50. Text-based Personalized Recs
④Similar profiles to me
“Our profiles have similar, unique k-skip n-grams.
We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
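A minimal TF/IDF plus cosine sketch of profile similarity in plain Python. It covers only the TF/IDF part, not Word2Vec or k-skip n-grams. The whitespace tokenizer, the `log(N / df)` weighting, and the function names are assumptions rather than MLlib's behavior.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Terms that appear in every document get weight log(1) = 0.
        vectors.append(
            {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0
```

Common words contribute little or nothing, so two profiles score as similar only when they share terms that are rare across the whole set of profiles.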
51. More Text-based Personalized Recs
⑤Similar profiles from my past likes
“Your profile shares a similar feature vector space to
others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
52. More Text-based Personalized Recs
⑥Relevant, High-Value Emails
“Your initial email has similar named entities to my profile.
I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
Figure labels: Her Email, My Profile
54. Facial Recognition
⑦Eigenfaces
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
55. Conversation Bot
⑧NLP and DecisionTrees
“If your responses to my trite opening
lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTree,
Sentiment Analysis
Positive Negative
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
57. Couples’ Recommendations
⑨Pathways of Similarity
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
similar plots -> <- similar actors
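The "something in between" idea can be sketched as a breadth-first shortest path over an item-similarity graph, in plain Python. This is not GraphX's shortest-path API; the adjacency-dict layout and the function name are assumptions, and any intermediate titles in a real graph would come from the similarity data.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """graph: dict of item -> iterable of similar items. Returns a path or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None
```

The items in the middle of the returned path are the compromise candidates: each one is similar to its neighbor on either side, so the path steps gradually from one person's pick to the other's.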
59. ⑩ Get Off The Computer and Meet People!
chris@fregly.com
@cfregly
IBM Spark Technology Center (spark.tc)
advancedspark.com
github.com/fluxcapacitor/pipeline
hub.docker.com/r/fluxcapacitor/pipeline/
Thank you!!
Image courtesy of http://www.duchess-france.org/