WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Anna Holschuh, Target
Lessons In Linear Algebra
At Scale With Apache Spark
Let’s make the sparse details a bit more dense
#UnifiedAnalytics #SparkAISummit
What This Talk is About
• A journey in building a text-based similarity engine
• Brief Basic Linear Algebra Refresher
• Linear Algebra at Scale in Spark
• Focus on Scala with code examples
3#UnifiedAnalytics #SparkAISummit
4#UnifiedAnalytics #SparkAISummit
Who am I
• Lead Data Engineer/Scientist at Target since 2016
• Deep love of all things Target
• Other Spark Summit talks:
o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A
Compiler
o 2019: Parallelizing With Apache Spark In Unexpected Ways
5#UnifiedAnalytics #SparkAISummit
Agenda
• Motivation
• A Similarity Engine
• Linear Algebra in Spark
• Lessons Learned
6#UnifiedAnalytics #SparkAISummit
Agenda
• Motivation
• A Similarity Engine
• Linear Algebra in Spark
• Lessons Learned
7#UnifiedAnalytics #SparkAISummit
Motivation
• For a core object with rich text-based attributes, we
wanted to create a system that would return the N
most similar objects for a given input object at scale.
• The goal was to first produce the raw data and then
build a configurable component that could be pulled
into other computation engines.
• The question of similarity is foundational and is
addressed in a variety of ways across many
disciplines: Text Mining, Information Retrieval, Entity
Resolution, Recommendation Engines, etc.
• scikit-learn and PySpark were first used to implement
KNN in an MVP. This first pass struggled to scale,
which is where Spark and Scala were introduced.
[Diagram] Input: "The quick brown fox" → Neighbors: "The slow brown fox" (Score: 0.75), "The quick orange cat" (Score: 0.50), "A slow brown dog" (Score: 0.25)
8#UnifiedAnalytics #SparkAISummit
Agenda
• Motivation
• A Similarity Engine
• Linear Algebra in Spark
• Lessons Learned
9#UnifiedAnalytics #SparkAISummit
A Similarity Engine
GOALS
• For an input object, the system should return the
N most similar objects in the system
• It should work off of a corpus of 40k total objects
• It should support a vocabulary that has 15k
tokens
• It should be able to compute pairwise scores
across the entire corpus
• It should perform in a reasonable amount of time
(on the order of minutes)
[Diagram] Input: "The quick brown fox" → Neighbors: "The slow brown fox" (Score: 0.75), "The quick orange cat" (Score: 0.50), "A slow brown dog" (Score: 0.25)
10#UnifiedAnalytics #SparkAISummit
A Similarity Engine
K-Nearest Neighbors (KNN)
Brute Force
• The most naïve approach
• Unsupervised learning method
• Data is represented in a Vector/Matrix
format
• Computes pairwise scores between all pairs of points
in the Dataset. For N samples and D dimensions, the
scale of this method is O(D·N²)
Cosine Similarity
• A pairwise scoring approach to
compute one number between 0 and 1
representing the similarity of two
vectors
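The slide describes the approach rather than showing code for it; as a rough, non-authoritative sketch in plain Scala (all names here are invented for illustration), a brute-force pairwise cosine scorer over an in-memory collection could look like this:

object BruteForceKnn {

  // Cosine similarity between two equal-length arrays.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Brute-force KNN: score every pair of rows, keep the top k per row.
  // For N rows of dimension D this is O(D * N^2).
  def neighbors(rows: Array[Array[Double]], k: Int): Map[Int, Seq[(Int, Double)]] =
    rows.indices.map { i =>
      val topK = rows.indices
        .filter(_ != i)
        .map(j => j -> cosine(rows(i), rows(j)))
        .sortBy(-_._2)
        .take(k)
      i -> topK
    }.toMap
}

On the toy corpus used on the following slides, neighbors(matrix, 3) for the "The quick brown fox" row should yield the 0.75 / 0.50 / 0.25 scores shown in the diagram above.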
11#UnifiedAnalytics #SparkAISummit
A Similarity Engine
Bag of Words Representation
How do we represent our features in such a system?
The quick brown fox
The quick orange cat
The slow brown fox
A slow brown dog
Vocabulary: the, quick, brown, fox, slow, orange, cat, a, dog

The quick brown fox  → 1 1 1 1 0 0 0 0 0
The slow brown fox   → 1 0 1 1 1 0 0 0 0
The quick orange cat → 1 1 0 0 0 1 1 0 0
A slow brown dog     → 0 0 1 0 1 0 0 1 1

Collect Raw Features → Build a Feature Vocab → Represent Features in a Matrix
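As a small illustrative sketch of this step on the toy corpus (plain Scala; the real pipeline later uses Spark's RegexTokenizer and CountVectorizer, and every name here is made up):

val docs: Seq[Seq[String]] = Seq(
  "The quick brown fox",
  "The slow brown fox",
  "The quick orange cat",
  "A slow brown dog"
).map(_.toLowerCase.split("\\s+").toSeq)

// Build the feature vocabulary from all tokens seen in the corpus.
val vocab: Seq[String] = docs.flatten.distinct

// Represent each document as a term-count vector over that vocabulary.
val matrix: Seq[Array[Double]] =
  docs.map(tokens => vocab.map(term => tokens.count(_ == term).toDouble).toArray)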
12#UnifiedAnalytics #SparkAISummit
A Similarity Engine
Cosine Similarity
• A measure of similarity between two vectors that
measures the cosine of the angle between them
• The cosine function is periodic and ranges from
-1 to 1
• Vectors that are relatively close to one another
will have a score that approaches 1. Vectors that
are orthogonal will have a score of 0. Vectors
that are diametrically opposed will have a score
of -1
• Cosine similarity is often used to generate scores
in the positive space from 0 to 1.
• This measurement handles sparse data well as
only non-zero dimensions are considered.
[Figure: plot of cos(x); two example vectors x1 and x2 with the angle between them]
References: Wolfram Alpha, Wikipedia
13#UnifiedAnalytics #SparkAISummit
A Similarity Engine
Cosine Similarity – In Practice
A = 1 1 1 1 0 0 0 0 0   ("The quick brown fox")
B = 1 0 1 1 1 0 0 0 0   ("The slow brown fox")

A · B = 1×1 + 1×0 + 1×1 + 1×1 + 0×1 + 0×0 + 0×0 + 0×0 + 0×0 = 3
||A|| = sqrt(4) = 2
||B|| = sqrt(4) = 2

cosine similarity = (A · B) / (||A|| × ||B||) = 3/4 = 0.75
14#UnifiedAnalytics #SparkAISummit
A Similarity Engine
Cosine Similarity – On Matrices
Step 1: Normalize Matrix
• Divide each element in a row vector by the magnitude of the row
• This takes care of the denominator in the cosine similarity calculation

F (raw counts, 4x9):
1 1 1 1 0 0 0 0 0
1 0 1 1 1 0 0 0 0
1 1 0 0 0 1 1 0 0
0 0 1 0 1 0 0 1 1

Fnorm (each row divided by its magnitude, which is 2 for every row here):
1/2 1/2 1/2 1/2  0   0   0   0   0
1/2  0  1/2 1/2 1/2  0   0   0   0
1/2 1/2  0   0   0  1/2 1/2  0   0
 0   0  1/2  0  1/2  0   0  1/2 1/2
15#UnifiedAnalytics #SparkAISummit
A Similarity Engine
Cosine Similarity – On Matrices
Step 2: Multiply the normalized matrix by its transpose
• This takes care of the dot product part (numerator) of the cosine similarity calculation

Fnorm (4x9):
1/2 1/2 1/2 1/2  0   0   0   0   0
1/2  0  1/2 1/2 1/2  0   0   0   0
1/2 1/2  0   0   0  1/2 1/2  0   0
 0   0  1/2  0  1/2  0   0  1/2 1/2

X

Fnorm^T (9x4):
1/2 1/2 1/2  0
1/2  0  1/2  0
1/2 1/2  0  1/2
1/2 1/2  0   0
 0  1/2  0  1/2
 0   0  1/2  0
 0   0  1/2  0
 0   0   0  1/2
 0   0   0  1/2
16#UnifiedAnalytics #SparkAISummit
A Similarity Engine
Pairwise Scoring
Pairwise Scoring
• Produces a square,
symmetric matrix
• The diagonal is always ones, representing that an
object is perfectly similar to itself
• Rows can be read to find
the indices of objects that
are the best match
Fnorm (4x9) × Fnorm^T (9x4) = Scores (4x4)

Rows/columns, in order: "The quick brown fox", "The slow brown fox", "The quick orange cat", "A slow brown dog"

 1   3/4  1/2  1/4
3/4   1   1/4  1/2
1/2  1/4   1    0
1/4  1/2   0    1
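A small Breeze sketch of the same normalize-then-multiply-by-transpose scoring on the toy matrix (Breeze is the library the talk later drops into; this is only an illustration, not the talk's code, and assumes Breeze is on the classpath):

import breeze.linalg.{DenseMatrix, norm}

// Toy feature matrix F: 4 documents x 9 vocabulary terms.
val f = DenseMatrix(
  (1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0),
  (1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0),
  (1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0),
  (0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0)
)

// Step 1: divide every row by its L2 norm (the cosine denominator).
val rowNorms = (0 until f.rows).map(i => norm(f(i, ::).t))
val fNorm = DenseMatrix.tabulate(f.rows, f.cols) { (i, j) =>
  if (rowNorms(i) == 0.0) 0.0 else f(i, j) / rowNorms(i)
}

// Step 2: multiply by the transpose (all pairwise dot products, the cosine numerator).
val scores = fNorm * fNorm.t   // 4x4 symmetric matrix with ones on the diagonal

Evaluating scores should reproduce the 4x4 matrix above.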
17#UnifiedAnalytics #SparkAISummit
A Similarity Engine
Putting it all together
Moving to Spark
• We want to leverage this same exact
approach at scale
• Instead of dealing with 4 features with 9
vocabulary words, we want to deal with
upwards of 40k features with upwards
of 15k vocabulary words
• We want to be able to distribute this
large scale computation across a cluster
• We want this to be performant and
reliable
18#UnifiedAnalytics #SparkAISummit
Agenda
• Motivation
• A Similarity Engine
• Linear Algebra in Spark
• Lessons Learned
19#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Getting Started
• Completed this work using Spark 2.2
• There is no Spark MLlib KNN
implementation available
• That’s ok, because we know how to
carry out this computation at a low
level with Linear Algebra concepts
• The next step is to dig into Spark’s
APIs for Linear Algebra
20#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Local Vector
• Int, 0-based indices, double-typed
values
• Sparse and Dense
References: https://spark.apache.org/docs/2.2.0/mllib-data-types.html
• Able to be stored on a single
machine
• Building block for local and
distributed matrices in Spark
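A minimal sketch of creating local vectors with the RDD-based spark.mllib API referenced above (a parallel API also exists under org.apache.spark.ml.linalg):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense: store every value explicitly.
val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse: size, the 0-based indices of the non-zero entries, and their values.
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Both represent the same vector: [1.0, 0.0, 3.0]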
21#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Distributed Matrix APIs
References: https://spark.apache.org/docs/2.2.0/mllib-data-types.html
RowMatrix
• Row-oriented matrix represented by an
RDD[Vector]
• No meaningful indices
IndexedRowMatrix
• Row-oriented matrix represented by an
RDD[IndexedRow]
CoordinateMatrix
• Each entry is a tuple of (i: Long, j: Long,
value: Double)
• Should only be used when both
dimensions of the matrix are huge and the
matrix is sparse
BlockMatrix
• A distributed matrix backed by an RDD of
MatrixBlocks
• A MatrixBlock is a tuple of ((Int, Int), Matrix)
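A hedged sketch of constructing each of these types, assuming a SparkContext named sc is in scope (the toy data and names are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{
  RowMatrix, IndexedRow, IndexedRowMatrix, MatrixEntry, CoordinateMatrix, BlockMatrix}

val rows = sc.parallelize(Seq(
  Vectors.sparse(9, Array(0, 1, 2, 3), Array(1.0, 1.0, 1.0, 1.0)),
  Vectors.sparse(9, Array(0, 2, 3, 4), Array(1.0, 1.0, 1.0, 1.0))
))

// RowMatrix: rows without meaningful indices.
val rowMatrix = new RowMatrix(rows)

// IndexedRowMatrix: rows keyed by a Long index.
val indexedRowMatrix = new IndexedRowMatrix(rows.zipWithIndex.map {
  case (vec, idx) => IndexedRow(idx, vec)
})

// CoordinateMatrix: one (i, j, value) entry per non-zero element.
val coordinateMatrix = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 4, 1.0))))

// BlockMatrix: an RDD of ((blockRow, blockCol), localMatrix) blocks,
// most easily obtained by converting from one of the other types.
val blockMatrix: BlockMatrix =
  indexedRowMatrix.toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = 1024)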
22#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Feature Generation
• We need to convert our Dataset of Articles into
useful features to carry out similarity calculations
on
• We first need to tokenize the text contained in the
article and can use Spark’s RegexTokenizer
• We then need to turn a collection of tokens per
article into vector bag of words representations
across the entire vocabulary corpus
• We use the CountVectorizer, although there are
other options available
• This works great! No problems here.
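A sketch of what this feature-generation step might look like, assuming a DataFrame named articles with a "text" column (column names and parameter values here are illustrative, not the talk's actual code):

import org.apache.spark.ml.feature.{RegexTokenizer, CountVectorizer}

// articles: DataFrame with at least a "text" column (hypothetical schema).
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\W+")          // split on non-word characters

val tokenized = tokenizer.transform(articles)

// Fit a vocabulary over the whole corpus and produce bag-of-words vectors.
val vectorizer = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("rawFeatures")
  .setVocabSize(15000)         // roughly the talk's ~15k-token vocabulary

val cvModel  = vectorizer.fit(tokenized)
val features = cvModel.transform(tokenized)   // "rawFeatures" holds sparse count vectors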
23#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Feature Generation, continued
• We also need to normalize our features before we
carry out matrix multiplication to generate scores
• We can use Spark’s Normalizer to carry this out
• Again, this works great! No problems here.
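Continuing the hypothetical pipeline above, a minimal Normalizer sketch (L2 norm, so that the later multiply yields cosine scores directly):

import org.apache.spark.ml.feature.Normalizer

// L2-normalize each bag-of-words vector (p = 2.0 is the default).
val normalizer = new Normalizer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setP(2.0)

val normalized = normalizer.transform(features)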
24#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Pairwise Scoring – Attempt 1
• THIS CODE DOES NOT WORK
• We need to generate a feature set of bag
of words vectors and multiply this matrix
by itself to generate cosine similarity
scores
• Working in the IndexedRowMatrix API
seems most intuitive for what we’re trying
to accomplish
25#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Lessons Learned
• Transpose is only available on the
BlockMatrix and CoordinateMatrix APIs
• Multiply is only available when both
matrices are distributed on the BlockMatrix
API
• (Multiplying by a local Matrix is available
on the RowMatrix and IndexedRowMatrix
APIs)
• BlockMatrix API it is…
26#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Pairwise Scoring – Attempt 2
• THIS CODE DOES NOT WORK
• We attempt to work in the BlockMatrix API
instead to make use of transpose and
multiply.
• Converting back and forth between
different Distributed matrix APIs can be
expensive, so if this works, we’d go back
and start out in that API.
• This code compiles
• This code blows up on a relatively hefty
cluster with OOM errors
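The slide's code isn't captured in this transcript; the following is only a hedged sketch of what a BlockMatrix-based attempt looks like, reusing the hypothetical "id"/"features" columns from the earlier feature-generation sketches. It compiles, and at this scale it runs out of memory for the reasons on the next slide.

import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// normalized: hypothetical DataFrame with a Long "id" column and
// L2-normalized "features" vectors; ml vectors must be converted to mllib vectors.
val indexedRows = normalized.select("id", "features").rdd.map { row =>
  IndexedRow(row.getAs[Long]("id"), Vectors.fromML(row.getAs[MLVector]("features")))
}

val fNorm = new IndexedRowMatrix(indexedRows)
  .toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = 1024)

// Compiles, and is exactly the Fnorm x Fnorm^T product we want...
val scores = fNorm.multiply(fNorm.transpose)

// ...but BlockMatrix.multiply densifies the sparse blocks,
// which is what blew up the cluster with OOM errors.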
27#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Lessons Learned
• BlockMatrix is the only Distributed Matrix
API that supports multiplying two
distributed matrices
• It accomplishes its multiply on
SparseMatrices by converting them to
DenseMatrices
• One can configure the number of rows and
columns contained in a block, so the
tradeoff can be made between alleviating
memory pressure during the toDense
operation and increasing the number of
operations involved in the multiply with
more blocks
28#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Back to the drawing board
• It would be ideal to keep things in a Sparse
representation throughout the multiply operation
• Idea 1: Use the CoordinateMatrix API to manually
manage the transpose and multiplication based on
coordinates
o This seems like it would generate a lot of shuffle
activity
• Idea 2: Go back to IndexedRowMatrix and broadcast
smaller chunks of the matrix to be used in local
multiplication on executors.
o Digging through the source code also shows
these matrices are converted to dense
• Idea 3: Wrap SparseVectors in a Dataset/RDD to be
distributed and broadcast smaller chunks of Vectors
to be locally assembled into matrices for
multiplication.
29#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Pairwise Scoring – Attempt 3
• Going with Idea 3
• THIS CODE DOES NOT WORK
• We would like to wrap a SparseVector
to pass around in a Dataset for manually
managing the multiplication
30#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Lessons Learned
• Spark cannot serialize/deserialize
SparseVectors due to the implementation
of the VectorUDT
• We will have to wrap a Vector instead and
know in our hearts that it is indeed a
SparseVector as the VectorUDT abstracts
away all of these details
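A minimal sketch of that workaround, reusing the hypothetical column names from the earlier sketches and assuming a SparkSession named spark: type the case-class field as the parent Vector, which has a registered UDT, rather than SparseVector.

import org.apache.spark.ml.linalg.Vector

// The field is typed as Vector, not SparseVector: the VectorUDT is registered for
// the parent type, so this encodes cleanly in a Dataset, while at runtime the values
// are still the sparse vectors produced upstream.
case class FeatureRow(id: Long, features: Vector)

// Hypothetical usage, with the normalized DataFrame from the earlier sketches:
import spark.implicits._
val featureDs = normalized.select("id", "features").as[FeatureRow]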
31#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Quick note about VectorUDT
• A custom UserDefinedType
• Allows easier interaction with SQL
• Abstracts away the serialization and
deserialization of subtypes but still
preserves typing
Reference: https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
32#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Pairwise Scoring – Attempt 4
Finally, success! What worked?
• Broadcasting smaller chunks of feature
vectors to executors for matrix multiplications
• Dropping into lower level Breeze APIs to do
the actual linear algebra calculations locally in
tasks, instead of attempting to do math at the
Distributed Matrix level
• Working with whole partitions of
SparseVectors with .mapPartitions() to
convert collections of vectors into matrices
• Efforts in tuning the right number of rows to
include in matrix scoring calculations
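The actual implementation isn't reproduced on this page; the following is only a rough sketch of the shape of this approach, under stated assumptions. It reuses the hypothetical FeatureRow(id, features) wrapper from the earlier sketch, and every other name here is invented: broadcast one modest chunk of normalized feature vectors, then use mapPartitions to assemble local Breeze sparse matrices and multiply them against that chunk.

import breeze.linalg._
import org.apache.spark.sql.Dataset

// featureDs: Dataset[FeatureRow] of L2-normalized sparse vectors;
// chunk: one slice of those rows, small enough to broadcast;
// vocabSize: dimensionality of the vectors. All names are hypothetical.
def scoreAgainstChunk(featureDs: Dataset[FeatureRow],
                      chunk: Array[FeatureRow],
                      vocabSize: Int): Dataset[(Long, Long, Double)] = {
  val spark = featureDs.sparkSession
  import spark.implicits._

  // Ship the chunk of rows to every executor once.
  val chunkBc = spark.sparkContext.broadcast(chunk)

  featureDs.mapPartitions { rows =>
    val partitionRows = rows.toArray
    if (partitionRows.isEmpty) Iterator.empty
    else {
      // Assemble this partition's vectors into one sparse matrix (rows x vocab).
      val leftBuilder = new CSCMatrix.Builder[Double](partitionRows.length, vocabSize)
      partitionRows.zipWithIndex.foreach { case (r, i) =>
        r.features.foreachActive((j, v) => leftBuilder.add(i, j, v))
      }

      // Assemble the broadcast chunk directly in transposed form (vocab x chunkSize).
      val chunkRows = chunkBc.value
      val rightBuilder = new CSCMatrix.Builder[Double](vocabSize, chunkRows.length)
      chunkRows.zipWithIndex.foreach { case (r, j) =>
        r.features.foreachActive((i, v) => rightBuilder.add(i, j, v))
      }

      // Sparse x sparse multiply stays sparse; each entry is already a cosine score
      // because the vectors were L2-normalized upstream.
      val scores = leftBuilder.result() * rightBuilder.result()
      scores.activeIterator.map { case ((i, j), score) =>
        (partitionRows(i).id, chunkRows(j).id, score)
      }
    }
  }
}

The chunk size (how many rows are broadcast at once) and the partition sizes are the knobs referred to above for keeping the local matrix multiplications within executor memory.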
33#UnifiedAnalytics #SparkAISummit
Linear Algebra in Spark
Pairwise Scoring – Attempt 4
Lessons Learned
• Matrix multiplication is heavy and causes
spikes in memory
34#UnifiedAnalytics #SparkAISummit
Agenda
• Motivation
• A Similarity Engine
• Linear Algebra in Spark
• Lessons Learned
35#UnifiedAnalytics #SparkAISummit
Lessons Learned
• Some core matrix operations (transpose, multiply) are not available
on all of the Distributed Matrix APIs
• Under the hood, matrix multiply on the Distributed Matrix APIs
always converts underlying matrices to dense form regardless of
whether they contain dense or sparse data
• Spark uses Breeze APIs extensively to back various Linear Algebra
operations
• Spark Sparse and Dense Vector representations cannot be
serialized as anything other than the higher level Vector due to the
VectorUDT
• Matrix and vector multiplication in Breeze is a heavy operation in
regards to both memory usage and processing
36#UnifiedAnalytics #SparkAISummit
Agenda
• Motivation
• A Similarity Engine
• Linear Algebra in Spark
• Lessons Learned
37#UnifiedAnalytics #SparkAISummit
Conclusion
• A brute force algorithm was leveraged to carry out 80/20 Data
Science to get a prototype going to address the problem of
generating text-based similarity metrics
• There are potentially better algorithms and scaling mechanisms to
use to address this problem
• This was an opportunity to explore what was available through
Spark to carry out Linear Algebra at scale
• This exercise exposed limitations and constraints in Spark’s
distributed Linear Algebra APIs
• Learnings from this journey can be viewed as a Call To Action to
developers who care about this space to get involved in shaping
the APIs and contributing
38#UnifiedAnalytics #SparkAISummit
Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems in domains ranging from
supply chain logistics to smart stores to
personalization and more
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
work somewhere you
jobs.target.com
39#UnifiedAnalytics #SparkAISummit
Target @ Spark+AI Summit
Check out our other talks…
2018
• Extending Apache Spark APIs Without Going Near Spark Source Or
A Compiler (Anna Holschuh)
2019
• Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri)
• Parallelizing With Apache Spark In Unexpected Ways (Anna
Holschuh)
40#UnifiedAnalytics #SparkAISummit
Acknowledgements
• Thank you Spark Summit
• Thank you Target
• Thank you wonderful team members at Target
• Thank you vibrant Spark and Scala communities
41#UnifiedAnalytics #SparkAISummit
QUESTIONS
anna.holschuh@target.com
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Más contenido relacionado

La actualidad más candente

Java MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & OptimizationJava MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & OptimizationKenny Gryp
 
Scaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 Keynote
Scaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 KeynoteScaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 Keynote
Scaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 KeynoteStreamNative
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
MongoDB Internals
MongoDB InternalsMongoDB Internals
MongoDB InternalsSiraj Memon
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use CasesDATAVERSITY
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Spark Summit
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Rootless Kubernetes
Rootless KubernetesRootless Kubernetes
Rootless KubernetesAkihiro Suda
 
Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...
Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...
Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...Lucas Jellema
 
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at UberDisaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uberconfluent
 
Infinispan, a distributed in-memory key/value data grid and cache
 Infinispan, a distributed in-memory key/value data grid and cache Infinispan, a distributed in-memory key/value data grid and cache
Infinispan, a distributed in-memory key/value data grid and cacheSebastian Andrasoni
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Event-sourced systems with Kafka, Clojure, and Jackdaw
Event-sourced systems with Kafka, Clojure, and JackdawEvent-sourced systems with Kafka, Clojure, and Jackdaw
Event-sourced systems with Kafka, Clojure, and JackdawBryce Covert
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...HostedbyConfluent
 
PostgreSQL 11 New Features With Examples (English)
PostgreSQL 11 New Features With Examples (English)PostgreSQL 11 New Features With Examples (English)
PostgreSQL 11 New Features With Examples (English)Noriyoshi Shinoda
 

La actualidad más candente (20)

Java MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & OptimizationJava MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & Optimization
 
Scaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 Keynote
Scaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 KeynoteScaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 Keynote
Scaling Apache Pulsar to 10 Petabytes/Day - Pulsar Summit NA 2021 Keynote
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
MongoDB Internals
MongoDB InternalsMongoDB Internals
MongoDB Internals
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Rootless Kubernetes
Rootless KubernetesRootless Kubernetes
Rootless Kubernetes
 
Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...
Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...
Microservices, Apache Kafka, Node, Dapr and more - Part Two (Fontys Hogeschoo...
 
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at UberDisaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
 
Infinispan, a distributed in-memory key/value data grid and cache
 Infinispan, a distributed in-memory key/value data grid and cache Infinispan, a distributed in-memory key/value data grid and cache
Infinispan, a distributed in-memory key/value data grid and cache
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Event-sourced systems with Kafka, Clojure, and Jackdaw
Event-sourced systems with Kafka, Clojure, and JackdawEvent-sourced systems with Kafka, Clojure, and Jackdaw
Event-sourced systems with Kafka, Clojure, and Jackdaw
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 
PostgreSQL 11 New Features With Examples (English)
PostgreSQL 11 New Features With Examples (English)PostgreSQL 11 New Features With Examples (English)
PostgreSQL 11 New Features With Examples (English)
 

Similar a Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse Details a Bit More Dense

Neural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep LearningNeural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep Learningcomifa7406
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesLuiz Henrique Zambom Santana
 
Intro to Reinforcement learning - part III
Intro to Reinforcement learning - part IIIIntro to Reinforcement learning - part III
Intro to Reinforcement learning - part IIIMikko Mäkipää
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache SparkQuantUniversity
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningPraveen Devarao
 
.NET UY Meetup 7 - CLR Memory by Fabian Alves
.NET UY Meetup 7 - CLR Memory by Fabian Alves.NET UY Meetup 7 - CLR Memory by Fabian Alves
.NET UY Meetup 7 - CLR Memory by Fabian Alves.NET UY Meetup
 
Predicting SPARQL query execution time and suggesting SPARQL queries based on...
Predicting SPARQL query execution time and suggesting SPARQL queries based on...Predicting SPARQL query execution time and suggesting SPARQL queries based on...
Predicting SPARQL query execution time and suggesting SPARQL queries based on...Rakebul Hasan
 
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Databricks
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelNikhil Sharma
 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowDatabricks
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
Assembler design options
Assembler design optionsAssembler design options
Assembler design optionsMohd Arif
 
Real-Time Voice Actuation
Real-Time Voice ActuationReal-Time Voice Actuation
Real-Time Voice ActuationPragya Agrawal
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMark Lefevre, CQF
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
Graph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph framesGraph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph framesRon Barabash
 
Coffee beans variants recommendation using clustering
Coffee beans variants recommendation using clusteringCoffee beans variants recommendation using clustering
Coffee beans variants recommendation using clusteringMihirKadam3
 
Drilling the Async Library
Drilling the Async LibraryDrilling the Async Library
Drilling the Async LibraryKnoldus Inc.
 

Similar a Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse Details a Bit More Dense (20)

Neural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep LearningNeural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep Learning
 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queries
 
Intro to Reinforcement learning - part III
Intro to Reinforcement learning - part IIIIntro to Reinforcement learning - part III
Intro to Reinforcement learning - part III
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
.NET UY Meetup 7 - CLR Memory by Fabian Alves
.NET UY Meetup 7 - CLR Memory by Fabian Alves.NET UY Meetup 7 - CLR Memory by Fabian Alves
.NET UY Meetup 7 - CLR Memory by Fabian Alves
 
Predicting SPARQL query execution time and suggesting SPARQL queries based on...
Predicting SPARQL query execution time and suggesting SPARQL queries based on...Predicting SPARQL query execution time and suggesting SPARQL queries based on...
Predicting SPARQL query execution time and suggesting SPARQL queries based on...
 
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented Model
 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflow
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
Assembler design options
Assembler design optionsAssembler design options
Assembler design options
 
Real-Time Voice Actuation
Real-Time Voice ActuationReal-Time Voice Actuation
Real-Time Voice Actuation
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
 
Nbvtalkataitamimageprocessingconf
NbvtalkataitamimageprocessingconfNbvtalkataitamimageprocessingconf
Nbvtalkataitamimageprocessingconf
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Graph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph framesGraph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph frames
 
Coffee beans variants recommendation using clustering
Coffee beans variants recommendation using clusteringCoffee beans variants recommendation using clustering
Coffee beans variants recommendation using clustering
 
Drilling the Async Library
Drilling the Async LibraryDrilling the Async Library
Drilling the Async Library
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Último (20)

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse Details a Bit More Dense

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Anna Holschuh, Target Lessons In Linear Algebra At Scale With Apache Spark Let’s make the sparse details a bit more dense #UnifiedAnalytics #SparkAISummit
  • 3. What This Talk is About • A journey in building a text-based similarity engine • Brief Basic Linear Algebra Refresher • Linear Algebra at Scale in Spark • Focus on Scala with code examples 3#UnifiedAnalytics #SparkAISummit
  • 4. 4#UnifiedAnalytics #SparkAISummit Who am I • Lead Data Engineer/Scientist at Target since 2016 • Deep love of all things Target • Other Spark Summit talks: o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler o 2019: Parallelizing With Apache Spark In Unexpected Ways
  • 5. 5#UnifiedAnalytics #SparkAISummit Agenda • Motivation • A Similarity Engine • Linear Algebra in Spark • Lessons Learned
  • 6. 6#UnifiedAnalytics #SparkAISummit Agenda • Motivation • A Similarity Engine • Linear Algebra in Spark • Lessons Learned
  • 7. 7#UnifiedAnalytics #SparkAISummit Motivation • For a core object with rich text-based attributes, we wanted to create a system that would return the N most similar objects for a given input object at scale. • The goal was to first produce the raw data and then build a configurable component that could be pulled into other computation engines. • The question of similarity is foundational and is addressed in a variety of ways across many disciplines: Text Mining, Information Retrieval, Entity Resolution, Recommendation Engines, etc. • scikit-learn and pyspark were first used to implement KNN in an MVP. This first pass struggled to scale and this is where Spark and Scala were introduced. The quick brown fox The slow brown fox The quick orange cat A slow brown dog Input Neighbors Score: 0.75 Score: 0.50 Score: 0.25
  • 8. 8#UnifiedAnalytics #SparkAISummit Agenda • Motivation • A Similarity Engine • Linear Algebra in Spark • Lessons Learned
  • 9. 9#UnifiedAnalytics #SparkAISummit A Similarity Engine GOALS • For an input object, the system should return the N most similar objects in the system • It should work off of a corpus of 40k total objects • It should support a vocabulary that has 15k tokens • It should be able to compute pairwise scores across the entire corpus • It should perform in a reasonable amount of time (on the order of minutes) The quick brown fox The slow brown fox The quick orange cat A slow brown dog Input Neighbors Score: 0.75 Score: 0.50 Score: 0.25
  • 10. 10#UnifiedAnalytics #SparkAISummit A Similarity Engine K-Nearest Neighbors (KNN) Brute Force • The most naïve approach • Unsupervised learning method • Data is represented in a Vector/Matrix format • Computes pairwise scores between all pairs of points in the Dataset. For N samples and D dimensions, the scale of this method is O[DN2] Cosine Similarity • A pairwise scoring approach to compute one number between 0 and 1 representing the similarity of two vectors
  • 11. 11#UnifiedAnalytics #SparkAISummit A Similarity Engine Bag of Words Representation How do we represent our features in such a system? The quick brown fox The quick orange cat The slow brown fox A slow brown dog the,quick,brown,fox,slow,orange,cat,a,dog 1 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 Collect Raw Features Build a Feature Vocab Represent Features in a Matrix
  • 12. 12#UnifiedAnalytics #SparkAISummit A Similarity Engine Cosine Similarity • A measure of similarity between two vectors that measures the cosine of the angle between them • The cosine function is periodic and ranges from -1 to 1 • Vectors that are relatively close to one another will have a score that approaches 1. Vectors that are orthogonal will have a score of 0. Vectors that are diametrically opposed will have a score of -1 • Cosine similarity is often used to generate scores in the positive space from 0 to 1. • This measurement handles sparse data well as only non-zero dimensions are considered. cos(x) x2 x1 References: Wolfram Alpha, Wikipedia
  • 13. 13#UnifiedAnalytics #SparkAISummit A Similarity Engine Cosine Similarity – In Practice The quick brown fox The slow brown fox 1 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 A * B = 1x1 + 1x0 + 1x1 + 1x1 + 0x1 + 0x0 + 0x0 + 0x0 + 0x0 = 3 || A || = sqrt(4) || B || = sqrt(4) cos(x) = cosine similarity = = ¾ = 0.75
  • 14. 14#UnifiedAnalytics #SparkAISummit A Similarity Engine Cosine Similarity – On Matrices 1 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 1/2 1/2 1/2 1/2 0 0 0 0 0 1/2 0 1/2 1/2 1/2 0 0 0 0 1/2 1/2 0 0 0 1/2 1/2 0 0 0 0 1/2 0 1/2 0 0 1/2 1/2 Step 1: Normalize Matrix • Divide each element in a row vector by the magnitude of the row • This takes care of the denominator in the cosine similarity calculation F Fnorm
  • 15. 1/2 1/2 1/2 0 1/2 0 1/2 0 1/2 1/2 0 1/2 1/2 1/2 0 0 0 1/2 0 1/2 0 0 1/2 0 0 0 1/2 0 0 0 0 1/2 0 0 0 1/2 15#UnifiedAnalytics #SparkAISummit A Similarity Engine Cosine Similarity – On Matrices Step 2: Multiply the normalized matrix by its transpose • This takes care of the dot product part (numerator) of the cosine similarity calculation Fnorm Fnorm T 1/2 1/2 1/2 1/2 0 0 0 0 0 1/2 0 1/2 1/2 1/2 0 0 0 0 1/2 1/2 0 0 0 1/2 1/2 0 0 0 0 1/2 0 1/2 0 0 1/2 1/2 X
  • 16. 16#UnifiedAnalytics #SparkAISummit A Similarity Engine Pairwise Scoring Pairwise Scoring • Produces a square, symmetric matrix • The diagonal is always ones, representing that a feature is perfectly similar with itself • Rows can be read to find the indices of objects that are the best match Fnorm Fnorm T X = Scores 4x9 9x4 4x4 1 3/4 1/2 1/4 3/4 1 1/4 1/2 1/2 1/4 1 0 1/4 1/2 0 1 The quick brown fox The quick orange cat The slow brown fox A slow brown dog
  • 17. 17#UnifiedAnalytics #SparkAISummit A Similarity Engine Putting it all together Moving to Spark • We want to leverage this same exact approach at scale • Instead of dealing with 4 features with 9 vocabulary words, we want to deal with upwards of 40k features with upwards of 15k vocabulary words • We want to be able to distribute this large scale computation across a cluster • We want this to be performant and reliable
  • 18. 18#UnifiedAnalytics #SparkAISummit Agenda • Motivation • A Similarity Engine • Linear Algebra in Spark • Lessons Learned
  • 19. 19#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Getting Started • Completed this work using Spark 2.2 • There is no Spark MLlib KNN implementation available • That’s ok, because we know how to carry out this computation at a low level with Linear Algebra concepts • The next step is to dig into Spark’s APIs for Linear Algebra
  • 20. 20#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Local Vector • Int, 0-based indices, double-typed values • Sparse and Dense References: https://spark.apache.org/docs/2.2.0/mllib-data-types.html • Able to be stored on a single machine • Building block for local and distributed matrices in Spark
  • 21. 21#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Distributed Matrix APIs References: https://spark.apache.org/docs/2.2.0/mllib-data-types.html RowMatrix • Row-oriented matrix represented by an RDD[Vector] • No meaningful indices IndexedRowMatrix • Row-oriented matrix represented by an RDD[IndexedRow] CoordinateMatrix • Each entry is a tuple of (i: Long, j: Long, value: Double) • Should only be used when both dimensions of the matrix are huge and the matrix is sparse BlockMatrix • A distributed matrix backed by an RDD of MatrixBlocks • A MatrixBlock is a tuple of [(Int,Int),Matrix]
  • 22. 22#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Feature Generation • We need to convert our Dataset of articles into useful features to run similarity calculations on • We first tokenize the text contained in each article, which Spark's RegexTokenizer handles • We then turn each article's collection of tokens into a bag-of-words vector over the entire vocabulary • We use the CountVectorizer for this, although other options are available • This works great! No problems here.
  • 23. 23#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Feature Generation, continued • We also need to normalize our features before we carry out matrix multiplication to generate scores • We can use Spark’s Normalizer to carry this out • Again, this works great! No problems here.
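  A hedged sketch of what this feature-generation chain can look like with the spark.ml transformers named above; the `articles` DataFrame and its `text` column are illustrative names, and the tokenizer pattern and vocabulary cap are assumptions rather than the talk's actual settings:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, Normalizer, RegexTokenizer}

// Assumes a DataFrame `articles` with a String column named "text".
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\W+") // split on runs of non-word characters

val vectorizer = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("rawFeatures")
  .setVocabSize(15000) // cap the vocabulary, in line with the engine's goals

val normalizer = new Normalizer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setP(2.0) // L2 norm, matching the cosine similarity denominator

val tokenized  = tokenizer.transform(articles)
val cvModel    = vectorizer.fit(tokenized)
val vectorized = cvModel.transform(tokenized)
val normalized = normalizer.transform(vectorized)
```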
  • 24. 24#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Pairwise Scoring – Attempt 1 • THIS CODE DOES NOT WORK • We need to generate a feature set of bag of words vectors and multiply this matrix by itself to generate cosine similarity scores • Working in the IndexedRowMatrix API seems most intuitive for what we’re trying to accomplish
  • 25. 25#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Lessons Learned • Transpose is only available on the BlockMatrix and CoordinateMatrix APIs • Multiply is only available when both matrices are distributed on the BlockMatrix API • (Multiplying by a local Matrix is available on the RowMatrix and IndexedRowMatrix APIs) • BlockMatrix API it is…
  • 26. 26#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Pairwise Scoring – Attempt 2 • THIS CODE DOES NOT WORK • We attempt to work in the BlockMatrix API instead to make use of transpose and multiply. • Converting back and forth between different Distributed matrix APIs can be expensive, so if this works, we’d go back and start out in that API. • This code compiles • This code blows up on a relatively hefty cluster with OOM errors
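  The slide's code isn't carried in this transcript, but a sketch of the BlockMatrix route described here might look like the following, assuming `features` is an IndexedRowMatrix of L2-normalized rows and the block sizes are illustrative:

```scala
// Convert the row-oriented matrix to BlockMatrix so transpose and multiply are available.
val block = features.toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = 1024).cache()

// This compiles: BlockMatrix supports both transpose and distributed multiply...
val scores = block.multiply(block.transpose)

// ...but the multiply densifies each block under the hood, which is what drove the
// OOM errors described above for a 40k x 15k sparse feature matrix.
```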
  • 27. 27#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Lessons Learned • BlockMatrix is the only Distributed Matrix API that supports multiplying two distributed matrices • It accomplishes its multiply on SparseMatrices by converting them to DenseMatrices • The number of rows and columns per block is configurable, so there is a tradeoff to tune: smaller blocks ease memory pressure during the toDense conversion, while more (and smaller) blocks increase the number of operations involved in the multiply
  • 28. 28#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Back to the drawing board • It would be ideal to keep things in a Sparse representation throughout the multiply operation • Idea 1: Use the CoordinateMatrix API to manually manage the transpose and multiplication based on coordinates o This seems like it would generate a lot of shuffle activity • Idea 2: Go back to IndexedRowMatrix and broadcast smaller chunks of the matrix to be used in local multiplication on executors. o Digging through the source code also shows these matrices are converted to dense • Idea 3: Wrap SparseVectors in a Dataset/RDD to be distributed and broadcast smaller chunks of Vectors to be locally assembled into matrices for multiplication.
  • 29. 29#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Pairwise Scoring – Attempt 3 • Going with Idea 3 • THIS CODE DOES NOT WORK • We would like to wrap a SparseVector to pass around in a Dataset for manually managing the multiplication
  • 30. 30#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Lessons Learned • Spark cannot serialize/deserialize SparseVectors due to the implementation of the VectorUDT • We will have to wrap a Vector instead and know in our hearts that it is indeed a SparseVector as the VectorUDT abstracts away all of these details
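  A minimal sketch of such a wrapper, assuming the spark.ml Vector type that the pipeline transformers emit and an illustrative `id` column on the `normalized` DataFrame from earlier:

```scala
import org.apache.spark.ml.linalg.Vector

// The field is typed as Vector (backed by VectorUDT); at runtime we trust that it
// actually holds a SparseVector, since the UDT hides the concrete subtype.
case class FeatureRow(id: Long, features: Vector)

import spark.implicits._
val featureRows = normalized.select("id", "features").as[FeatureRow]
```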
  • 31. 31#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Quick note about VectorUDT • A custom UserDefinedType • Allows easier interaction with SQL • Abstracts away the serialization and deserialization of subtypes but still preserves typing Reference: https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/mlli b/linalg/Vectors.scala
  • 32. 32#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Pairwise Scoring – Attempt 4 Finally, success! What worked? • Broadcasting smaller chunks of feature vectors to executors for matrix multiplications • Dropping into lower level Breeze APIs to do the actual linear algebra calculations locally in tasks, instead of attempting to do math at the Distributed Matrix level • Working with whole partitions of SparseVectors with .mapPartitions() to convert collections of vectors into matrices • Efforts in tuning the right number of rows to include in matrix scoring calculations
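  The actual slide code isn't in this transcript, but one way the pieces described above could fit together is sketched below; `featureRows` is the Dataset[FeatureRow] from the earlier sketch, `vocabSize`, the chunk size, and the top-N value are all illustrative, and a full run would loop this over successive broadcast chunks rather than a single one:

```scala
import breeze.linalg.CSCMatrix
import org.apache.spark.ml.linalg.SparseVector

// Assumes in scope: spark: SparkSession, featureRows: Dataset[FeatureRow], vocabSize: Int.

// Build a Breeze CSC (sparse) matrix from a group of Spark sparse vectors, so the
// local multiply can stay sparse end to end.
def toCsc(rows: Seq[(Long, SparseVector)], vocabSize: Int): CSCMatrix[Double] = {
  val builder = new CSCMatrix.Builder[Double](rows = rows.size, cols = vocabSize)
  rows.zipWithIndex.foreach { case ((_, vec), localRow) =>
    vec.indices.zip(vec.values).foreach { case (col, value) =>
      builder.add(localRow, col, value)
    }
  }
  builder.result()
}

import spark.implicits._

// Collect one manageable chunk of normalized feature vectors and broadcast it to executors.
val chunk = featureRows.limit(2000).collect().map(r => (r.id, r.features.toSparse))
val chunkMatrix = spark.sparkContext.broadcast(toCsc(chunk, vocabSize))
val chunkIds    = spark.sparkContext.broadcast(chunk.map(_._1))

// Score each partition of the full Dataset against the broadcast chunk locally,
// emitting (id, neighborId, score) triples for the top matches.
val scored = featureRows.mapPartitions { partition =>
  val rows = partition.map(r => (r.id, r.features.toSparse)).toVector
  if (rows.isEmpty) Iterator.empty
  else {
    // (partitionRows x vocab) * (vocab x chunkRows) => partitionRows x chunkRows scores
    val scores = toCsc(rows, vocabSize) * chunkMatrix.value.t
    rows.iterator.zipWithIndex.flatMap { case ((id, _), i) =>
      chunkIds.value.indices
        .map(j => (chunkIds.value(j), scores(i, j)))
        .filter { case (otherId, _) => otherId != id }
        .sortBy { case (_, score) => -score }
        .take(10) // keep the top N neighbors per row
        .map { case (otherId, score) => (id, otherId, score) }
    }
  }
}
```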
  • 33. 33#UnifiedAnalytics #SparkAISummit Linear Algebra in Spark Pairwise Scoring – Attempt 4 Lessons Learned • Matrix multiplication is heavy and causes spikes in memory
  • 34. 34#UnifiedAnalytics #SparkAISummit Agenda • Motivation • A Similarity Engine • Linear Algebra in Spark • Lessons Learned
  • 35. 35#UnifiedAnalytics #SparkAISummit Lessons Learned • Some core matrix operations (transpose, multiply) are not available on all of the Distributed Matrix APIs • Under the hood, matrix multiply on the Distributed Matrix APIs always converts the underlying matrices to dense form, regardless of whether they contain dense or sparse data • Spark uses Breeze APIs extensively to back various Linear Algebra operations • Spark Sparse and Dense Vector representations cannot be serialized as anything other than the higher-level Vector due to the VectorUDT • Matrix and vector multiplication in Breeze is a heavy operation with regard to both memory usage and processing
  • 36. 36#UnifiedAnalytics #SparkAISummit Agenda • Motivation • A Similarity Engine • Linear Algebra in Spark • Lessons Learned
  • 37. 37#UnifiedAnalytics #SparkAISummit Conclusion • A brute force algorithm was leveraged to carry out 80/20 Data Science to get a prototype going to address the problem of generating text-based similarity metrics • There are potentially better algorithms and scaling mechanisms to use to address this problem • This was an opportunity to explore what was available through Spark to carry out Linear Algebra at scale • This exercise exposed limitations and constraints in Spark’s distributed Linear Algebra APIs • Learnings from this journey can be viewed as a Call To Action to developers who care about this space to get involved in shaping the APIs and contributing
  • 38. 38#UnifiedAnalytics #SparkAISummit Come Work At Target • We are hiring in Data Science and Data Engineering • Solve real-world problems in domains ranging from supply chain logistics to smart stores to personalization and more • Offices in… o Sunnyvale, CA o Minneapolis, MN o Pittsburgh, PA o Bangalore, India work somewhere you jobs.target.com
  • 39. 39#UnifiedAnalytics #SparkAISummit Target @ Spark+AI Summit Check out our other talks… 2018 • Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler (Anna Holschuh) 2019 • Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri) • Parallelizing With Apache Spark In Unexpected Ways (Anna Holschuh)
  • 40. 40#UnifiedAnalytics #SparkAISummit Acknowledgements • Thank you Spark Summit • Thank you Target • Thank you wonderful team members at Target • Thank you vibrant Spark and Scala communities
  • 42. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT