This document is a summary of a talk on using linear algebra and Apache Spark to build a text similarity engine. The talk discusses using cosine similarity to calculate pairwise similarity scores between text vectors in Spark. Several attempts are made to perform the pairwise matrix multiplication with Spark's distributed matrix APIs, but memory and serialization issues force a move to the lower-level Breeze linear algebra APIs. Key lessons include the limited set of operations available on Spark's distributed matrix APIs and the inability to serialize Spark's concrete vector types directly.
2. Anna Holschuh, Target
Lessons In Linear Algebra At Scale With Apache Spark
Let’s make the sparse details a bit more dense
3. What This Talk is About
• A journey in building a text-based similarity engine
• Brief Basic Linear Algebra Refresher
• Linear Algebra at Scale in Spark
• Focus on Scala with code examples
4. Who am I
• Lead Data Engineer/Scientist at Target since 2016
• Deep love of all things Target
• Other Spark Summit talks:
o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler
o 2019: Parallelizing With Apache Spark In Unexpected Ways
7. Motivation
• For a core object with rich text-based attributes, we
wanted to create a system that would return the N
most similar objects for a given input object at scale.
• The goal was to first produce the raw data and then
build a configurable component that could be pulled
into other computation engines.
• The question of similarity is foundational and is
addressed in a variety of ways across many
disciplines: Text Mining, Information Retrieval, Entity
Resolution, Recommendation Engines, etc.
• scikit-learn and pyspark were first used to implement
KNN in an MVP. This first pass struggled to scale,
which is where Spark and Scala were introduced.
[Diagram: for the input "The quick brown fox", the engine returns the neighbors "The slow brown fox" (score 0.75), "The quick orange cat" (score 0.50), and "A slow brown dog" (score 0.25)]
9. A Similarity Engine
GOALS
• For an input object, the system should return the
N most similar objects in the system
• It should work off of a corpus of 40k total objects
• It should support a vocabulary that has 15k
tokens
• It should be able to compute pairwise scores
across the entire corpus
• It should perform in a reasonable amount of time
(on the order of minutes)
10. A Similarity Engine
K-Nearest Neighbors (KNN)
Brute Force
• The most naïve approach
• Unsupervised learning method
• Data is represented in a Vector/Matrix
format
• Computes pairwise scores between all
pairs of points in the dataset. For N
samples and D dimensions, this method
scales as O(D·N²)
Cosine Similarity
• A pairwise scoring approach to
compute one number between 0 and 1
representing the similarity of two
vectors
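As a hedged illustration of the brute-force idea (plain Scala; the function name and shapes are mine, not the talk's code), once a full pairwise score matrix exists, returning the N most similar objects for a given row is just a sort-and-take:

```scala
// Minimal sketch: given a precomputed pairwise score matrix, return the
// indices and scores of the N best matches for one row (skipping itself).
def topN(scores: Array[Array[Double]], row: Int, n: Int): Seq[(Int, Double)] =
  scores(row).zipWithIndex
    .filter { case (_, idx) => idx != row } // the diagonal is a perfect self-match
    .sortBy { case (score, _) => -score }   // highest cosine score first
    .take(n)
    .map { case (score, idx) => (idx, score) }
    .toSeq
```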
11. A Similarity Engine
Bag of Words Representation
How do we represent our features in such a system?

Step 1: Collect raw features
The quick brown fox
The slow brown fox
The quick orange cat
A slow brown dog

Step 2: Build a feature vocab
the, quick, brown, fox, slow, orange, cat, a, dog

Step 3: Represent features in a matrix (one row per document, one column per vocab token)
1 1 1 1 0 0 0 0 0
1 0 1 1 1 0 0 0 0
1 1 0 0 0 1 1 0 0
0 0 1 0 1 0 0 1 1
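A hedged sketch of the same three steps in plain Scala (toy-sized; all names are mine):

```scala
// Minimal bag-of-words sketch over the toy corpus above.
val docs = Seq(
  "The quick brown fox",
  "The slow brown fox",
  "The quick orange cat",
  "A slow brown dog"
)
val tokens = docs.map(_.toLowerCase.split("\\s+").toSeq) // Step 1: collect raw features
val vocab  = tokens.flatten.distinct                     // Step 2: build a feature vocab
val matrix = tokens.map(ts => vocab.map(v => if (ts.contains(v)) 1 else 0)) // Step 3
// vocab:  Seq(the, quick, brown, fox, slow, orange, cat, a, dog)
// matrix: Seq(Seq(1,1,1,1,0,0,0,0,0), Seq(1,0,1,1,1,0,0,0,0), ...)
```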
12. A Similarity Engine
Cosine Similarity
• A measure of similarity between two vectors that
measures the cosine of the angle between them
• The cosine function is periodic and ranges from
-1 to 1
• Vectors that are relatively close to one another
will have a score that approaches 1. Vectors that
are orthogonal will have a score of 0. Vectors
that are diametrically opposed will have a score
of -1
• Cosine similarity is often used to generate scores
in the positive space from 0 to 1.
• This measurement handles sparse data well as
only non-zero dimensions are considered.
[Figure: plot of cos(x), with two example vectors x1 and x2 separated by a small angle]
References: Wolfram Alpha, Wikipedia
13. A Similarity Engine
Cosine Similarity – In Practice
The quick brown fox
The slow brown fox
1 1 1 1 0 0 0 0 0
1 0 1 1 1 0 0 0 0
A · B = 1×1 + 1×0 + 1×1 + 1×1 + 0×1 + 0×0 + 0×0 + 0×0 + 0×0 = 3
‖A‖ = sqrt(4) = 2
‖B‖ = sqrt(4) = 2
cosine similarity = (A · B) / (‖A‖ × ‖B‖) = 3 / (2 × 2) = 3/4 = 0.75
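Reproducing the arithmetic above in plain Scala (a toy check, not the talk's code):

```scala
// A = "The quick brown fox", B = "The slow brown fox" over the vocab
// (the, quick, brown, fox, slow, orange, cat, a, dog)
val A = Array(1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val B = Array(1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0)

val dot   = A.zip(B).map { case (x, y) => x * y }.sum // 3.0
val normA = math.sqrt(A.map(x => x * x).sum)          // 2.0
val normB = math.sqrt(B.map(x => x * x).sum)          // 2.0
println(dot / (normA * normB))                        // 0.75
```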
14. A Similarity Engine
Cosine Similarity – On Matrices
Step 1: Normalize the matrix
• Divide each element in a row vector by the magnitude of the row
• This takes care of the denominator in the cosine similarity calculation

F =
1 1 1 1 0 0 0 0 0
1 0 1 1 1 0 0 0 0
1 1 0 0 0 1 1 0 0
0 0 1 0 1 0 0 1 1

Fnorm =
1/2 1/2 1/2 1/2 0   0   0   0   0
1/2 0   1/2 1/2 1/2 0   0   0   0
1/2 1/2 0   0   0   1/2 1/2 0   0
0   0   1/2 0   1/2 0   0   1/2 1/2
15. A Similarity Engine
Cosine Similarity – On Matrices
Step 2: Multiply the normalized matrix by its transpose
• This takes care of the dot product part (numerator) of the cosine similarity calculation

Fnorm × FnormT:

Fnorm =
1/2 1/2 1/2 1/2 0   0   0   0   0
1/2 0   1/2 1/2 1/2 0   0   0   0
1/2 1/2 0   0   0   1/2 1/2 0   0
0   0   1/2 0   1/2 0   0   1/2 1/2

FnormT =
1/2 1/2 1/2 0
1/2 0   1/2 0
1/2 1/2 0   1/2
1/2 1/2 0   0
0   1/2 0   1/2
0   0   1/2 0
0   0   1/2 0
0   0   0   1/2
0   0   0   1/2
16. A Similarity Engine
Pairwise Scoring
• Produces a square, symmetric matrix
• The diagonal is always ones, representing that a feature is perfectly similar to itself
• Rows can be read to find the indices of objects that are the best match

Fnorm (4×9) × FnormT (9×4) = Scores (4×4):

1   3/4 1/2 1/4   ← The quick brown fox
3/4 1   1/4 1/2   ← The slow brown fox
1/2 1/4 1   0     ← The quick orange cat
1/4 1/2 0   1     ← A slow brown dog
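The whole two-step pipeline (Step 1: normalize, Step 2: multiply by the transpose) can be checked locally with Breeze, the same linear algebra library Spark builds on. This is a toy sketch, not the talk's code:

```scala
import breeze.linalg.{DenseMatrix, norm}

// F: the bag-of-words matrix from the earlier slide, one row per document.
val F = DenseMatrix(
  (1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0), // The quick brown fox
  (1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0), // The slow brown fox
  (1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0), // The quick orange cat
  (0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0)  // A slow brown dog
)
// Step 1: divide every element by its row's L2 norm.
val fNorm = DenseMatrix.tabulate(F.rows, F.cols) { (i, j) => F(i, j) / norm(F(i, ::).t) }
// Step 2: multiply by the transpose. Yields the 4x4 symmetric score matrix
// with ones on the diagonal, matching the slide above.
val scores = fNorm * fNorm.t
```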
17. A Similarity Engine
Putting it all together
Moving to Spark
• We want to leverage this same exact
approach at scale
• Instead of dealing with 4 features with 9
vocabulary words, we want to deal with
upwards of 40k features with upwards
of 15k vocabulary words
• We want to be able to distribute this
large scale computation across a cluster
• We want this to be performant and
reliable
19. Linear Algebra in Spark
Getting Started
• Completed this work using Spark 2.2
• There is no Spark MLlib KNN
implementation available
• That’s ok, because we know how to
carry out this computation at a low
level with Linear Algebra concepts
• The next step is to dig into Spark’s
APIs for Linear Algebra
20. Linear Algebra in Spark
Local Vector
• Integer, 0-based indices and double-typed values
• Comes in Sparse and Dense flavors
• Able to be stored on a single machine
• Building block for local and distributed matrices in Spark
References: https://spark.apache.org/docs/2.2.0/mllib-data-types.html
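For example (the mllib flavor, which the distributed matrices on the next slide are built from):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense stores every value; sparse stores the size plus (index, value) pairs.
val dense: Vector  = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Both represent [1.0, 0.0, 3.0]
```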
21. Linear Algebra in Spark
Distributed Matrix APIs
References: https://spark.apache.org/docs/2.2.0/mllib-data-types.html
RowMatrix
• Row-oriented matrix represented by an
RDD[Vector]
• No meaningful indices
IndexedRowMatrix
• Row-oriented matrix represented by an
RDD[IndexedRow]
CoordinateMatrix
• Each entry is a tuple of (i: Long, j: Long,
value: Double)
• Should only be used when both
dimensions of the matrix are huge and the
matrix is sparse
BlockMatrix
• A distributed matrix backed by an RDD of
MatrixBlocks
• A MatrixBlock is a tuple of ((Int, Int), Matrix)
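A hedged sketch of constructing these types and converting between them (assumes a SparkContext `sc`; the toy vectors are mine):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed._

val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.sparse(9, Array(0, 1, 2, 3), Array(1.0, 1.0, 1.0, 1.0))),
  IndexedRow(1L, Vectors.sparse(9, Array(0, 2, 3, 4), Array(1.0, 1.0, 1.0, 1.0)))
))
val indexed = new IndexedRowMatrix(rows)
val rowMat  = indexed.toRowMatrix()             // drops the row indices
val coord   = indexed.toCoordinateMatrix()      // (i, j, value) entries
val blocks  = indexed.toBlockMatrix(1024, 1024) // rowsPerBlock x colsPerBlock
```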
22. Linear Algebra in Spark
Feature Generation
• We need to convert our Dataset of Articles into
useful features on which to carry out similarity
calculations
• We first need to tokenize the text contained in the
article and can use Spark’s RegexTokenizer
• We then need to turn each article’s collection of
tokens into a bag-of-words vector representation
across the entire vocabulary corpus
• We use the CountVectorizer, although there are
other options available
• This works great! No problems here.
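A hedged sketch of this stage (column names, the regex, and the vocab cap are assumptions, not the talk's exact code):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}

// Tokenize article text, then build bag-of-words count vectors over the corpus vocab.
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\W+")            // split on non-word characters

val vectorizer = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .setVocabSize(15000)           // cap the vocabulary at ~15k tokens

val tokenized = tokenizer.transform(articles) // articles: Dataset with a "text" column
val features  = vectorizer.fit(tokenized).transform(tokenized)
```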
23. Linear Algebra in Spark
Feature Generation, continued
• We also need to normalize our features before we
carry out matrix multiplication to generate scores
• We can use Spark’s Normalizer to carry this out
• Again, this works great! No problems here.
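A minimal sketch of this step, continuing the hypothetical column names from above:

```scala
import org.apache.spark.ml.feature.Normalizer

// L2-normalize each feature vector so the cosine denominator is already
// baked in before the matrix multiply.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)                     // L2 norm

val normalized = normalizer.transform(features)
```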
24. Linear Algebra in Spark
Pairwise Scoring – Attempt 1
• THIS CODE DOES NOT WORK
• We need to generate a feature set of bag
of words vectors and multiply this matrix
by itself to generate cosine similarity
scores
• Working in the IndexedRowMatrix API
seems most intuitive for what we’re trying
to accomplish
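The slide's code isn't in this transcript; a hedged reconstruction of what one would naturally try is below, and it does not compile:

```scala
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

// Hedged reconstruction (assumes `fNorm: IndexedRowMatrix` of normalized
// features). This is what one would naturally write, and it does not
// compile: IndexedRowMatrix has no .transpose, and its .multiply only
// accepts a *local* Matrix, not another distributed matrix.
//
// val scores = fNorm.multiply(fNorm.transpose)
```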
25. Linear Algebra in Spark
Lessons Learned
• Transpose is only available on the
BlockMatrix and CoordinateMatrix APIs
• Multiply is only available when both
matrices are distributed on the BlockMatrix
API
• (Multiplying by a local Matrix is available
on the RowMatrix and IndexedRowMatrix
APIs)
• BlockMatrix API it is…
26. Linear Algebra in Spark
Pairwise Scoring – Attempt 2
• THIS CODE DOES NOT WORK
• We attempt to work in the BlockMatrix API
instead to make use of transpose and
multiply.
• Converting back and forth between
different Distributed matrix APIs can be
expensive, so if this works, we’d go back
and start out in that API.
• This code compiles
• This code blows up on a relatively hefty
cluster with OOM errors
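A hedged sketch of Attempt 2 (assumes a SparkContext `sc`; the toy data and the block size are my guesses):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Convert to BlockMatrix to gain transpose and a distributed multiply.
val fNorm = new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(0L, Vectors.sparse(9, Array(0, 1, 2, 3), Array(0.5, 0.5, 0.5, 0.5))),
  IndexedRow(1L, Vectors.sparse(9, Array(0, 2, 3, 4), Array(0.5, 0.5, 0.5, 0.5)))
))).toBlockMatrix(1024, 1024).cache()

// Compiles and runs on toy data, but at 40k x 15k scale the multiply
// densifies the sparse blocks and the job dies with OOM errors.
val scores = fNorm.multiply(fNorm.transpose)
```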
27. Linear Algebra in Spark
Lessons Learned
• BlockMatrix is the only Distributed Matrix
API that supports multiplying two
distributed matrices
• It accomplishes its multiply on
SparseMatrices by converting them to
DenseMatrices
• One can configure the number of rows and
columns contained in a block, trading off memory
pressure during the toDense operation against the
larger number of block-level operations involved in
the multiply when blocks are smaller
28. Linear Algebra in Spark
Back to the drawing board
• It would be ideal to keep things in a Sparse
representation throughout the multiply operation
• Idea 1: Use the CoordinateMatrix API to manually
manage the transpose and multiplication based on
coordinates
o This seems like it would generate a lot of shuffle
activity
• Idea 2: Go back to IndexedRowMatrix and broadcast
smaller chunks of the matrix to be used in local
multiplication on executors.
o Digging through the source code also shows
these matrices are converted to dense
• Idea 3: Wrap SparseVectors in a Dataset/RDD to be
distributed and broadcast smaller chunks of Vectors
to be locally assembled into matrices for
multiplication.
29. Linear Algebra in Spark
Pairwise Scoring – Attempt 3
• Going with Idea 3
• THIS CODE DOES NOT WORK
• We would like to wrap a SparseVector
to pass around in a Dataset for manually
managing the multiplication
30. Linear Algebra in Spark
Lessons Learned
• Spark cannot serialize/deserialize
SparseVectors due to the implementation
of the VectorUDT
• We will have to wrap a Vector instead and
know in our hearts that it is indeed a
SparseVector as the VectorUDT abstracts
away all of these details
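A minimal sketch of the workaround (the case class is hypothetical):

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector}

// Typing the field as SparseVector fails to round-trip through a Dataset:
// case class FeatureRow(id: Long, features: SparseVector)
// Typing it as the Vector supertype works; VectorUDT handles the rest, and
// we know in our hearts the payload is sparse (CountVectorizer output).
case class FeatureRow(id: Long, features: Vector)
```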
31. Linear Algebra in Spark
Quick note about VectorUDT
• A custom UserDefinedType
• Allows easier interaction with SQL
• Abstracts away the serialization and
deserialization of subtypes but still
preserves typing
Reference: https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
32. Linear Algebra in Spark
Pairwise Scoring – Attempt 4
Finally, success! What worked?
• Broadcasting smaller chunks of feature
vectors to executors for matrix multiplications
• Dropping into lower level Breeze APIs to do
the actual linear algebra calculations locally in
tasks, instead of attempting to do math at the
Distributed Matrix level
• Working with whole partitions of
SparseVectors with .mapPartitions() to
convert collections of vectors into matrices
• Tuning the right number of rows to include in
each matrix scoring calculation
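What worked, as a hedged sketch. Every name, the chunking scheme, and the surrounding plumbing here are assumptions rather than the talk's exact code; it assumes `chunkRows: Seq[(Long, Vector)]` (one broadcast slice of the normalized features), `featureDs: Dataset[(Long, Vector)]` (all of them), and `import spark.implicits._` for the output encoder:

```scala
import breeze.linalg.CSCMatrix
import org.apache.spark.ml.linalg.{SparseVector, Vector}

// Assemble a collection of Spark vectors into one Breeze sparse matrix.
def toCSC(rows: Seq[(Long, Vector)], numCols: Int): CSCMatrix[Double] = {
  val builder = new CSCMatrix.Builder[Double](rows.size, numCols)
  for (((_, vec), i) <- rows.zipWithIndex) {
    val sv = vec.asInstanceOf[SparseVector] // sparse by construction (CountVectorizer)
    for (k <- sv.indices.indices) builder.add(i, sv.indices(k), sv.values(k))
  }
  builder.result()
}

val numCols = 15000 // vocabulary size
val chunk   = spark.sparkContext.broadcast(chunkRows)

val scores = featureDs.mapPartitions { it =>
  val part = it.toIndexedSeq
  if (part.isEmpty) Iterator.empty
  else {
    val left  = toCSC(chunk.value, numCols) // broadcast rows, local to the task
    val right = toCSC(part, numCols)
    val block = left * right.t              // stays sparse the whole way through
    block.activeIterator.map { case ((i, j), s) =>
      (chunk.value(i)._1, part(j)._1, s)    // (leftId, rightId, cosine score)
    }
  }
}
```

Under this scheme the driver loops over broadcast chunks of the normalized feature matrix, and each task multiplies its partition against the current chunk with Breeze, so no distributed matrix API (and no densification) is involved.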
35. Lessons Learned
• Some core matrix operations (transpose, multiply) are not available
on all of the Distributed Matrix APIs
• Under the hood, matrix multiply on the Distributed Matrix APIs
always converts underlying matrices to dense form regardless of
whether they contain dense or sparse data
• Spark uses Breeze APIs extensively to back various Linear Algebra
operations
• Spark Sparse and Dense Vector representations cannot be
serialized as anything other than the higher level Vector due to the
VectorUDT
• Matrix and vector multiplication in Breeze is a heavy operation with
regard to both memory usage and processing
37. Conclusion
• A brute force algorithm was leveraged to carry out 80/20 Data
Science to get a prototype going to address the problem of
generating text-based similarity metrics
• There are potentially better algorithms and scaling mechanisms to
use to address this problem
• This was an opportunity to explore what was available through
Spark to carry out Linear Algebra at scale
• This exercise exposed limitations and constraints in Spark’s
distributed Linear Algebra APIs
• Learnings from this journey can be viewed as a Call To Action to
developers who care about this space to get involved in shaping
the APIs and contributing
38. Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems in domains ranging from
supply chain logistics to smart stores to
personalization and more
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
work somewhere you ♥
jobs.target.com
39. Target @ Spark+AI Summit
Check out our other talks…
2018
• Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler (Anna Holschuh)
2019
• Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri)
• Parallelizing With Apache Spark In Unexpected Ways (Anna Holschuh)