Recommendation and graph
algorithms in Hadoop and SQL
Code 
github.com/dgleich/matrix-hadoop-tutorial

@dgleich
dgleich@purdue.edu

DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE
PURDUE UNIVERSITY

David Gleich · Purdue

Ancestry.com

Matrix computations

$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
$$

Operations: $Ax$
Linear systems: $Ax = b$
Least squares: $\min \|Ax - b\|$
Eigenvalues: $Ax = \lambda x$
Outcomes

Recognize relationships between matrix methods and things you’ve already been doing.
Example: SQL queries as matrix computations

See how to work with big graphs as large edge lists in Hadoop and SQL.
Example: Connected components

Understand how to use Hadoop to compute these matrix methods at scale for BigData.
Example: Recommenders with social network info

matrix computations ≠ linear algebra

World’s simplest recommendation system.

Suggest the average rating.
A SQL statement as a matrix computation

http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql

How do I find the
average rating for
each product?

SELECT
    p.product_id,
    p.name,
    AVG(pr.rating) AS rating_average
FROM products p
INNER JOIN product_ratings pr
    ON pr.product_id = p.product_id
GROUP BY p.product_id
ORDER BY rating_average DESC

Image from rockysprings, deviantart, CC share-alike

This SQL statement is a matrix computation!

SELECT
    ...
    AVG(pr.rating)
    ...
GROUP BY p.product_id

product_ratings is a matrix! Its rows are the products pid1, ..., pid9, and each stored entry is a (product, user, rating) triple:

pid8 uid2 4
pid9 uid9 1
pid2 uid9 5
pid9 uid5 5
pid6 uid8 4
pid1 uid2 4
pid3 uid4 4
pid5 uid9 2
pid9 uid8 4
pid9 uid9 1
But it’s a weird matrix: missing entries!
Average of ratings: SELECT AVG(r) ... GROUP BY pid maps the product_ratings matrix to a vector with one average rating per product.
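But it’s a weird matrix and not a linear operator

$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\qquad
\mathrm{avg}(A) = \begin{bmatrix}
\sum_j A_{1,j} \,/\, \sum_j [A_{1,j} \ne 0] \\
\sum_j A_{2,j} \,/\, \sum_j [A_{2,j} \ne 0] \\
\vdots \\
\sum_j A_{m,j} \,/\, \sum_j [A_{m,j} \ne 0]
\end{bmatrix}
$$

Here $[A_{i,j} \ne 0]$ is 1 when the entry is present and 0 when it is missing.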
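A minimal sketch of the avg(A) operation, assuming the matrix is stored as (product, user, rating) triples like product_ratings. The function name `avg_nonzero` and the sample triples are illustrative only. The last lines show why this is not a linear operator: avg(A) + avg(B) differs from avg(A + B).

```python
from collections import defaultdict

# Hypothetical (product, user, rating) triples in the style of product_ratings.
ratings = [
    ("pid8", "uid2", 4), ("pid9", "uid9", 1), ("pid2", "uid9", 5),
    ("pid9", "uid5", 5), ("pid6", "uid8", 4), ("pid1", "uid2", 4),
    ("pid3", "uid4", 4), ("pid5", "uid9", 2), ("pid9", "uid8", 4),
]

def avg_nonzero(triples):
    """Row-wise average over stored entries: sum_j A[i,j] / #{j : A[i,j] != 0}."""
    total = defaultdict(float)
    count = defaultdict(int)
    for pid, _uid, rating in triples:
        if rating != 0:
            total[pid] += rating
            count[pid] += 1
    return {pid: total[pid] / count[pid] for pid in total}

averages = avg_nonzero(ratings)

# Non-linearity check: A and B each hold one rating for the same product,
# from different users, so A + B is just the union of their entries.
A = [("pid1", "uid1", 4)]
B = [("pid1", "uid2", 2)]
sum_of_avgs = avg_nonzero(A)["pid1"] + avg_nonzero(B)["pid1"]   # 4 + 2
avg_of_sum = avg_nonzero(A + B)["pid1"]                          # (4 + 2) / 2
```

The missing entries are what break linearity: the denominator counts stored entries, so it changes whenever the sparsity pattern changes.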


matrix computations ≠ linear algebra

Hadoop, MapReduce,
and Matrix Methods
MapReduce

[Figure: input data splits feed Map tasks, which emit (key, value) pairs; Shuffle groups the values by key; Reduce tasks turn the values for each key into output data.]
The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.

Data scalable
Express algorithms in “data-local operations”.
Implement one type of communication: shuffle.
Shuffle moves all data with the same key to the same reducer.

Fault-tolerance by design
Input stored in triplicate
Map output persisted to disk before shuffle
Reduce input/output on disk
wordcount is a matrix computation too

map(document) :
    for word in document
        emit (word, 1)

reduce(word, counts) :
    emit (word, sum(counts))

[Figure: documents D1 to D5 are mapped to pairs such as (matrix, 1), (hadoop, 1), (bigdata, 1), which the reducers sum per word.]
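The map, shuffle, and reduce steps of wordcount can be simulated in memory. This is a rough sketch rather than how Hadoop runs a job: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and a real shuffle moves data between machines instead of filling a dictionary.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round entirely in memory."""
    groups = defaultdict(list)
    for record in records:                    # map phase
        for key, value in mapper(record):
            groups[key].append(value)         # shuffle: same key, same group
    output = {}
    for key in sorted(groups):                # reduce phase, one call per key
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

def wc_map(document):
    for word in document.split():
        yield (word, 1)

def wc_reduce(word, counts):
    yield (word, sum(counts))

# Three tiny illustrative documents.
docs = ["matrix hadoop bigdata", "hadoop bigdata", "matrix hadoop"]
word_counts = run_mapreduce(docs, wc_map, wc_reduce)
```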
wordcount is a matrix computation too

With one row per document (doc1, ..., docm):

$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
$$

$$
\text{word count} = \mathrm{colsum}(A) = A^T e
$$

e is the vector of all ones
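As a small illustration of word count as $A^T e$, the sketch below builds a dense document-by-word count matrix for three made-up documents and sums its columns. The variable names are illustrative, and a dense list of lists is only practical for toy inputs.

```python
docs = ["matrix hadoop bigdata", "hadoop bigdata", "matrix hadoop"]
vocab = sorted({w for d in docs for w in d.split()})   # column labels

# A[i][j] = number of times word j occurs in document i
A = [[d.split().count(w) for w in vocab] for d in docs]

e = [1] * len(docs)   # the vector of all ones, one entry per document

# (A^T e)_j = sum_i A[i][j] * e[i]
colsum = [sum(A[i][j] * e[i] for i in range(len(docs)))
          for j in range(len(vocab))]
counts = dict(zip(vocab, colsum))
```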
inverted index is a matrix computation too

Rows of A are documents doc1, ..., docm:

$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
$$

Rows of $A^T$ are terms term1, ..., termn:

$$
A^T = \begin{bmatrix}
A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\
A_{1,2} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m,n-1} \\
A_{1,n} & \cdots & A_{m-1,n} & A_{m,n}
\end{bmatrix}
$$
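A short sketch of the inverted index as a transpose, assuming A is kept as a sparse dictionary of rows. The `transpose` helper and the three sample documents are hypothetical.

```python
from collections import defaultdict

docs = {"doc1": "matrix hadoop", "doc2": "hadoop bigdata", "doc3": "matrix"}

# Sparse A: A[doc][term] = number of occurrences of term in doc
A = {}
for doc, text in docs.items():
    row = defaultdict(int)
    for term in text.split():
        row[term] += 1
    A[doc] = dict(row)

def transpose(sparse):
    """Swap row and column keys of a dict-of-dicts sparse matrix."""
    t = defaultdict(dict)
    for i, row in sparse.items():
        for j, val in row.items():
            t[j][i] = val
    return dict(t)

# Row "term" of A^T lists the documents that contain the term.
index = transpose(A)
```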
A recommender system with social info

product_ratings
pid8 uid2 4
pid9 uid9 1
pid2 uid9 5
pid9 uid5 5
pid6 uid8 4
pid1 uid2 4
pid3 uid4 4
pid5 uid9 2
pid9 uid8 4
pid9 uid9 1

friends_links
uid6 uid1
uid8 uid9
uid7 uid7
uid7 uid4
uid6 uid2
uid7 uid1
uid3 uid1
uid1 uid8
uid7 uid3
uid9 uid1
Both tables are matrices: product_ratings is R, with rows pid1, pid2, ..., and friends_links is S, with rows uid1, uid2, ...
A recommender system with social info

Recommend each item based on the average rating of all trusted users

$$
X_{uid,pid} = \left( \sum_{uid_2} S_{uid,uid_2} \, (R^T)_{uid_2,pid} \right) \cdot \left( \sum_{uid_2} [\, S_{uid,uid_2} \ne 0 \text{ and } (R^T)_{uid_2,pid} \ne 0 \,] \right)^{-1}
$$

“$X = S R^T$” with something that is almost a matrix-matrix product
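The averaging rule can be sketched directly on the two edge lists. Assumptions in this sketch: every stored entry of S counts as a trust weight of 1, a pair (uid, uid2) means uid trusts uid2, and the ids and ratings are made up. The function name `recommend` is hypothetical.

```python
from collections import defaultdict

ratings = [("p1", "u2", 4), ("p1", "u3", 2), ("p2", "u3", 5)]   # (pid, uid, rating)
trust = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]               # (uid, trusted uid)

def recommend(ratings, trust):
    """X[uid, pid] = average rating of pid over the users that uid trusts."""
    rated_by = defaultdict(list)               # uid2 -> [(pid, rating)]
    for pid, uid, r in ratings:
        rated_by[uid].append((pid, r))
    total = defaultdict(float)
    count = defaultdict(int)
    for uid, uid2 in trust:                    # align S and R on the shared uid2
        for pid, r in rated_by[uid2]:
            total[(uid, pid)] += r             # numerator: sum of S * R^T terms
            count[(uid, pid)] += 1             # denominator: entries present in both
    return {key: total[key] / count[key] for key in total}

X = recommend(ratings, trust)
```

Pairs (uid, pid) with no trusted rater never appear in `X`, which mirrors the missing entries of the ratings matrix.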
Tools I like

hadoop streaming
dumbo
mrjob
hadoopy
C++

Tools I don’t use but other people seem to like …

pig
java
hbase
mahout
Eclipse
Cassandra

Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there.

I’m a low-level guy
hadoop streaming

the map function is a program
(key,value) pairs are sent via stdin
output (key,value) pairs go to stdout

the reduce function is a program
(key,value) pairs are sent via stdin
keys are grouped
output (key,value) pairs go to stdout
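A rough sketch of what streaming-style map and reduce programs do, written as functions over lines of text so it can be tried without a cluster. The tab-separated `key<TAB>value` layout is assumed here, and `map_lines` and `reduce_lines` are hypothetical names. In a real streaming job each function would loop over `sys.stdin` and print its results.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: one 'word<TAB>1' output line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(value) for _, value in group)}"

# Local stand-in for the framework: sort the mapper output, then reduce.
mapped = sorted(map_lines(["hadoop matrix", "hadoop"]))
reduced = list(reduce_lines(mapped))
```

Grouping with `groupby` only works because the input is sorted by key, which is the "keys are grouped" guarantee the reduce program relies on.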
mrjob from Yelp
a wrapper around hadoop streaming for map and reduce functions in python

from mrjob.job import MRJob

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # counts iterates over every 1 emitted for this word
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()

Connected components in
SQL and Hadoop
Connected components

3 “components” in this graph

How can we find them algorithmically … on a huge network?
Connected components algorithm
Assign each node a random component id.
For each node, take the minimum component id of itself and all neighbors.

DEMO
Computing Connected Components in SQL

Graph
edges : id | head | tail

“Vector”
v : id | comp
initialized to random component

CREATE TABLE v2 AS (
    SELECT
        e.tail AS id,
        MIN(v.comp) AS comp
    FROM edges e
    INNER JOIN v
        ON e.head = v.id
    GROUP BY e.tail
);

DROP TABLE v;
ALTER TABLE v2
    RENAME TO v;

... Repeat ...
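The repeat loop can be sketched with Python's built-in sqlite3 module. This is a variation on the query above under a few stated assumptions: each node starts with its own id as its component id (so ids are distinct), edges are stored in both directions, and a UNION ALL keeps each node's own id in the minimum. The function name `connected_components` is hypothetical.

```python
import sqlite3

def connected_components(edge_list, nodes):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (head INTEGER, tail INTEGER)")
    # Store both directions so ids flow across undirected edges.
    both = [(h, t) for h, t in edge_list] + [(t, h) for h, t in edge_list]
    con.executemany("INSERT INTO edges VALUES (?, ?)", both)
    con.execute("CREATE TABLE v (id INTEGER, comp INTEGER)")
    # Assumption: the node id doubles as the initial component id.
    con.executemany("INSERT INTO v VALUES (?, ?)", [(n, n) for n in nodes])
    while True:
        con.execute("""
            CREATE TABLE v2 AS
            SELECT id, MIN(comp) AS comp FROM (
                SELECT e.tail AS id, v.comp AS comp
                FROM edges e INNER JOIN v ON e.head = v.id
                UNION ALL
                SELECT id, comp FROM v
            ) GROUP BY id
        """)
        changed = con.execute("""
            SELECT COUNT(*) FROM v JOIN v2 ON v.id = v2.id
            WHERE v.comp <> v2.comp
        """).fetchall()[0][0]
        con.execute("DROP TABLE v")
        con.execute("ALTER TABLE v2 RENAME TO v")
        if changed == 0:          # no id changed, so the iteration has converged
            break
    return dict(con.execute("SELECT id, comp FROM v").fetchall())

components = connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5, 6])
```

The number of rounds grows with the longest shortest path inside a component, since the smallest id moves one hop per iteration.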
Matrix-vector product and connected components in Hadoop
See example: matrix-hadoop/codes/smatvec.py

$$
Ax = y, \qquad y_i = \sum_k A_{ik} x_k
$$

Google’s PageRank
Word count, average rating

$$
\text{“}A^T x = y\text{”}, \qquad y_i = \min\left(x_i, \min_k A_{ki} x_k\right)
$$

Connected components
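Both formulas share one loop shape and differ only in how terms are combined. The sketch below makes that explicit with a hypothetical helper, `general_matvec`, over a list of stored entries. It is an illustration of the idea, not code from the tutorial repository.

```python
import operator

def general_matvec(entries, x, combine, reduce_fn, init=None):
    """y[i] = reduce_fn over stored entries (i, k, a) of combine(a, x[k]).

    init optionally seeds y, e.g. with x itself for the min(x_i, ...) form.
    """
    y = dict(init or {})
    for i, k, a in entries:
        term = combine(a, x[k])
        y[i] = reduce_fn(y[i], term) if i in y else term
    return y

# Ordinary product: y_i = sum_k A_ik x_k
A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]          # (i, k, A_ik)
x = {0: 1.0, 1: 2.0}
y_sum = general_matvec(A, x, operator.mul, operator.add)

# Connected-components step: y_i = min(x_i, min_k A_ki x_k)
# Entries are (i, k, A_ki) for the path 0 - 1 - 2 plus an isolated node 3.
At = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1)]
labels = {0: 5, 1: 3, 2: 7, 3: 4}
y_min = general_matvec(At, labels, operator.mul, min, init=labels)
```

In the sum case, rows with no stored entries are simply absent from the result.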

Matrix-vector product
Follow along: matrix-hadoop/codes/smatvec.py

A is stored by “node”

$ head samples/smat_5_5.txt
0 0 0.125 3 1.024 4 0.121
1 0 0.597
2 2 1.247
3 4 -1.45
4 2 0.061

v initially random

$ head samples/vec_5.txt
0 0.241
1 -0.98
2 0.237
3 -0.32
4 0.080
Matrix-vector product (in pictures)

$$
Ax = y, \qquad y_i = \sum_k A_{ik} x_k
$$

Input: A and x
Map 1: align on columns
Reduce 1: output $A_{ik} x_k$ keyed on row i
Reduce 2: output sum($A_{ik} x_k$), which is y

Matrix-vector product, Map 1: align on columns

def joinmap(self, key, line):
    vals = line.split()
    if len(vals) == 2:
        # the vector: "i xi"
        yield (vals[0],              # row
               (float(vals[1]),))    # xi
    else:
        # the matrix: "i j Aij j Aij ..."
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (vals[i],          # column
                   (row,             # i, Aij
                    float(vals[i+1])))

“Matrix-vector” for connected components, “$A^T x = y$”

$$
y_i = \min\left(x_i, \min_k A_{ki} x_k\right)
$$

def joinmap(self, key, line):
    vals = line.split()
    if len(vals) == 2:
        # the vector
        yield (vals[0],              # row
               (float(vals[1]),))    # vi
    else:
        # the matrix
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (row,              # head
                   (vals[i],         # tail
                    # edge value, kept so matrix records stay 2-tuples and
                    # joinred can tell them from the 1-tuple vector records
                    float(vals[i+1])))
Matrix-vector product, Reduce 1: output $A_{ik} x_k$ keyed on row i

def joinred(self, key, vals):
    vecval = 0.
    matvals = []
    for val in vals:
        if len(val) == 1:
            # the vector entry xk for this column
            vecval += val[0]
        else:
            # a matrix entry (i, Aik)
            matvals.append(val)
    for val in matvals:
        yield (val[0], val[1]*vecval)

Note that you should use a secondary sort to avoid reading both in memory.

“Matrix-vector” for connected components, “$A^T x = y$”, with $y_i = \min(x_i, \min_k A_{ki} x_k)$

def joinred(self, key, vals):
    vecval = 0.
    matvals = []
    for val in vals:
        if len(val) == 1:
            # the component id of node key
            vecval += val[0]
        else:
            matvals.append(val)
    for val in matvals:
        # send this node's component id to each neighbor
        yield (val[0], vecval)

Matrix-vector product, Reduce 2: output sum($A_{ik} x_k$)

def sumred(self, key, vals):
    yield (key, sum(vals))
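The two rounds (join, then sum) can be tried locally on the sample files shown earlier. This sketch rewrites the three functions as plain functions without the mrjob `self` argument, and the `shuffle` and `matvec` helpers are hypothetical stand-ins for what Hadoop does between steps.

```python
from collections import defaultdict

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                       # vector entry: "i x_i"
        yield (vals[0], (float(vals[1]),))
    else:                                    # matrix row: "i j A_ij j A_ij ..."
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (vals[i], (row, float(vals[i + 1])))

def joinred(key, vals):
    vecval, matvals = 0.0, []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]                 # x_k for this column k
        else:
            matvals.append(val)              # (i, A_ik)
    for row, a in matvals:
        yield (row, a * vecval)              # A_ik * x_k keyed on row i

def sumred(key, vals):
    yield (key, sum(vals))

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def matvec(lines):
    stage1 = shuffle(kv for line in lines for kv in joinmap(line))
    partial = [kv for key, vals in stage1.items() for kv in joinred(key, vals)]
    stage2 = shuffle(partial)
    return {k: v for key, vals in stage2.items() for k, v in sumred(key, vals)}

matrix_lines = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
                "3 4 -1.45", "4 2 0.061"]
vector_lines = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]
y = matvec(matrix_lines + vector_lines)
```

Keys stay strings throughout, so row "0" of the result is `y["0"]`.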
Our social recommender
Follow along: matrix-hadoop/recsys/recsys.py

“$X = S R^T$”

R is stored entry-wise

$ gunzip -c data/rating.txt.gz
Object ID    User ID       Rating
139431556    591156        5
139431556    1312460676    5
139431556    204358        4
139431556    368725        5

S is stored entry-wise

$ gunzip -c data/rating.txt.gz
My ID         Other ID    Trust
3287060356    232085      -1
3288305540    709420      1
3290337156    204418      -1
3294138244    269243      -1
Matrix-matrix product
Follow along: matrix-hadoop/codes/matmat.py

$$
AB = C, \qquad C_{ij} = \sum_k A_{ik} B_{kj}
$$

Conceptually, the first step is the same as the matrix-vector product with a block of vectors.

Matrix-matrix product (in pictures)
Map 1: align on columns
Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: output sum($A_{ik} B_{kj}$)
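A compact in-memory sketch of the same two steps for entry-wise storage. The function name `matmat` and the 2-by-2 example are illustrative; this is not the code in matmat.py. The dictionary `by_k` plays the role of the first shuffle, and the accumulation into `C` plays the role of the second reduce.

```python
from collections import defaultdict

def matmat(A_entries, B_entries):
    """C = A B for entry lists (i, k, A_ik) and (k, j, B_kj)."""
    # Map 1 and shuffle: align the columns of A with the rows of B on k.
    by_k = defaultdict(lambda: ([], []))
    for i, k, a in A_entries:
        by_k[k][0].append((i, a))
    for k, j, b in B_entries:
        by_k[k][1].append((j, b))
    # Reduce 1 emits A_ik * B_kj keyed on (i, j); Reduce 2 sums per key.
    C = defaultdict(float)
    for a_col, b_row in by_k.values():
        for i, a in a_col:
            for j, b in b_row:
                C[(i, j)] += a * b
    return dict(C)

# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], zeros not stored.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
C = matmat(A, B)
```

Because keys are only compared for equality, the indices could just as well be strings such as user or product ids.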

Social recommender (in code)

def joinmap(self, key, line):
    parts = line.split('\t')
    if len(parts) == 8:       # ratings
        objid = parts[0].strip()
        uid = parts[1].strip()
        rat = int(parts[2])
        yield (uid, (objid, rat))
    elif len(parts) == 4:     # trust
        myid = parts[0].strip()
        otherid = parts[1].strip()
        value = int(parts[2])
        if value > 0:
            # only positive trust links are kept
            yield (otherid, (myid,))

def joinred(self, key, vals):
    tusers = []   # uids that trust key
    ratobjs = []  # objs rated by uid=key
    for val in vals:
        if len(val) == 1:
            tusers.append(val[0])
        else:
            ratobjs.append(val)

    for (objid, rat) in ratobjs:
        for uid in tusers:
            yield ((uid, objid), rat)

Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.

def avgred(self, key, vals):
    s = 0.
    n = 0
    for val in vals:
        s += val
        n += 1
    # the smoothed average of ratings
    yield key, (s+self.options.avg)/float(n+1)

Matrix-matrix product (in pictures)

No need for “integer” keys that fall between 1 and n!

Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.

Better ways to store 
matrices in Hadoop
Tall-and-Skinny matrices (m ≫ n)

Many rows (like a billion)
A few columns (under 10,000)

Used in
regression and general linear models with many samples
block iterative methods
panel factorizations
simulation data analysis
big-data SVD/PCA

From tinyimages collection

Questions?

More Related Content

Viewers also liked

Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Renato Bonomini
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsDavid Gleich
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...David Gleich
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignmentDavid Gleich
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignmentDavid Gleich
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveDavid Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksDavid Gleich
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisDavid Gleich
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...David Gleich
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential David Gleich
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationDavid Gleich
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...David Gleich
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisDavid Gleich
 

Viewers also liked (20)

Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulants
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architectures
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignment
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignment
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspective
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architectures
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysis
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportation
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysis
 

Similar to Recommendation and graph algorithms in Hadoop and SQL

Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for HadoopDavid Gleich
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreDavid Gleich
 
Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Spencer Fox
 
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsMassive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsDavid Gleich
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksMLReview
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresDavid Gleich
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph miningDavid Gleich
 
Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .Felicita Florence
 
Intelligent Ruby + Machine Learning
Intelligent Ruby + Machine LearningIntelligent Ruby + Machine Learning
Intelligent Ruby + Machine LearningIlya Grigorik
 
Open Problems in the Universal Graph Theory
Open Problems in the Universal Graph TheoryOpen Problems in the Universal Graph Theory
Open Problems in the Universal Graph TheoryMarko Rodriguez
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api TrainingSpark Summit
 
Neo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExpNeo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExpAdrian Ziegler
 
How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...Viach Kakovskyi
 
The SynergyScreen Package
The SynergyScreen PackageThe SynergyScreen Package
The SynergyScreen PackageYury V. Bukhman
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesAnne-Marie Tousch
 

Similar to Recommendation and graph algorithms in Hadoop and SQL (20)

Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for Hadoop
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016
 
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsMassive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
 
MongoDB 3.2 - Analytics
MongoDB 3.2  - AnalyticsMongoDB 3.2  - Analytics
MongoDB 3.2 - Analytics
 
R studio
R studio R studio
R studio
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .
 
Intelligent Ruby + Machine Learning
Intelligent Ruby + Machine LearningIntelligent Ruby + Machine Learning
Intelligent Ruby + Machine Learning
 
15 unionfind
15 unionfind15 unionfind
15 unionfind
 
Algorithms, Union Find
Algorithms, Union FindAlgorithms, Union Find
Algorithms, Union Find
 
Open Problems in the Universal Graph Theory
Open Problems in the Universal Graph TheoryOpen Problems in the Universal Graph Theory
Open Problems in the Universal Graph Theory
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api Training
 
Neo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExpNeo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExp
 
How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...
 
The SynergyScreen Package
The SynergyScreen PackageThe SynergyScreen Package
The SynergyScreen Package
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the Trenches
 
Basics of R
Basics of RBasics of R
Basics of R
 

Recommendation and graph algorithms in Hadoop and SQL

  • 1. Recommendation and graph algorithms in Hadoop and SQL. Code: github.com/dgleich/matrix-hadoop-tutorial. @dgleich, dgleich@purdue.edu. DAVID F. GLEICH, ASSISTANT PROFESSOR, COMPUTER SCIENCE, PURDUE UNIVERSITY. David Gleich · Purdue · Ancestry.com
  • 2. Matrix computations. $A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$. Operations: $Ax$. Linear systems: $Ax = b$. Least squares: $\min \|Ax - b\|$. Eigenvalues: $Ax = \lambda x$.
  • 3. Outcomes. Recognize relationships between matrix methods and things you’ve already been doing. Example: SQL queries as matrix computations. See how to work with big graphs as large edge lists in Hadoop and SQL. Example: Connected components. Understand how to use Hadoop to compute these matrix methods at scale for BigData. Example: Recommenders with social network info.
  • 4. matrix computations ≠ linear algebra
  • 5. World’s simplest recommendation system: suggest the average rating.
  • 6. A SQL statement as a matrix computation. How do I find the average rating for each product? http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
  • 7. A SQL statement as a matrix computation. How do I find the average rating for each product? http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
      SELECT
        p.product_id,
        p.name,
        AVG(pr.rating) AS rating_average
      FROM products p
      INNER JOIN product_ratings pr
        ON pr.product_id = p.product_id
      GROUP BY p.product_id
      ORDER BY rating_average DESC
  • 8. This SQL statement is a matrix computation! Image from rockysprings, deviantart, CC share-alike.
  • 9. SELECT ... AVG(pr.rating) ... GROUP BY p.product_id. The product_ratings table is a matrix! Its rows are pid1 through pid9 and its columns are users. The table stores (product, user, rating) triples:
      pid8 uid2 4
      pid9 uid9 1
      pid2 uid9 5
      pid9 uid5 5
      pid6 uid8 4
      pid1 uid2 4
      pid3 uid4 4
      pid5 uid9 2
      pid9 uid8 4
      pid9 uid9 1
  • 10. But it’s a weird matrix: missing entries! (Same product_ratings table, with rows pid1 through pid9.)
  • 11. But it’s a weird matrix. Average of ratings: the product_ratings matrix goes through SELECT AVG(r) ... GROUP BY pid and comes out as a vector with one average per product. [Figure: matrix on the left, vector of per-product averages on the right.]
  • 12. But it’s a weird matrix and not a linear operator. Treat product_ratings as an $m \times n$ matrix $A$ with entries $A_{i,j}$. Then $\operatorname{avg}(A) = \begin{bmatrix} \sum_j A_{1,j} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} \\ \sum_j A_{2,j} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} \\ \vdots \\ \sum_j A_{m,j} \,/\, \sum_j \text{“}A_{m,j} \neq 0\text{”} \end{bmatrix}$, where each quoted condition counts 1 when it holds and 0 otherwise.
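The definition of avg(A) above can be sketched in a few lines of Python. This is only an illustration, assuming the ratings are held as (product, user, rating) triples as on the earlier slides (the repeated `pid9 uid9 1` row is listed once); the name `row_average` is hypothetical and not from the tutorial code. The last lines show why the operation is not linear: two matrices with entries in different positions add up to a matrix whose average is not the sum of the averages.

```python
# Ratings as (product, user, rating) triples; missing entries are simply absent.
ratings = [
    ("pid8", "uid2", 4), ("pid9", "uid9", 1), ("pid2", "uid9", 5),
    ("pid9", "uid5", 5), ("pid6", "uid8", 4), ("pid1", "uid2", 4),
    ("pid3", "uid4", 4), ("pid5", "uid9", 2), ("pid9", "uid8", 4),
]

def row_average(triples):
    """Return {row: sum of stored entries / number of stored entries}."""
    sums, counts = {}, {}
    for row, _col, value in triples:
        sums[row] = sums.get(row, 0.0) + value
        counts[row] = counts.get(row, 0) + 1
    return {row: sums[row] / counts[row] for row in sums}

avg = row_average(ratings)

# Not linear: a and b store entries in different columns of the same row,
# so concatenating the triples is the same as adding the two matrices.
a = [("p", "u1", 4)]
b = [("p", "u2", 2)]
avg_of_sum = row_average(a + b)["p"]                       # average of [4, 2]
sum_of_avgs = row_average(a)["p"] + row_average(b)["p"]    # 4 + 2
```

Products with no ratings (pid4 and pid7 here) never appear in the result, which mirrors what the GROUP BY query returns.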
  • 13. matrix computations ≠ linear algebra
  • 14. Hadoop, MapReduce, and Matrix Methods
  • 16. The MapReduce Framework. Originated at Google for indexing web pages and computing PageRank. Data scalable: express algorithms in “data-local operations” and implement one type of communication, the shuffle. The shuffle moves all data with the same key to the same reducer. Fault-tolerance by design: input is stored in triplicate, map output is persisted to disk before the shuffle, and reduce input/output is on disk. [Diagram: Maps (M) → Shuffle → Reduce (R).]
  • 17. wordcount is a matrix computation too.
      map(document):
          for word in document:
              emit (word, 1)
      reduce(word, counts):
          emit (word, sum(counts))
    [Diagram: documents D emit pairs such as (matrix, 1), (hadoop, 1), and (bigdata, 1), which are then grouped by word.]
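  • 18. wordcount is a matrix computation too. Let $A$ be the document-by-word matrix with rows doc$_1$, doc$_2$, ..., doc$_m$: $A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$. Then word count $= \operatorname{colsum}(A) = A^T e$, where $e$ is the vector of all ones.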
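As a small check of the identity colsum(A) = Aᵀe, the sketch below counts words two ways: once in the map/reduce style of the previous slide and once by building the document-by-word matrix explicitly. The three documents are made up for illustration, and a dense list-of-lists matrix is used only because the example is tiny.

```python
from collections import Counter

docs = ["matrix hadoop bigdata", "hadoop bigdata", "matrix matrix hadoop"]

# MapReduce view: map emits (word, 1), reduce sums the counts per word.
mapreduce_counts = Counter(word for doc in docs for word in doc.split())

# Matrix view: A[i][j] = number of times word j occurs in document i.
vocab = sorted({word for doc in docs for word in doc.split()})
A = [[doc.split().count(word) for word in vocab] for doc in docs]

# colsum(A) = A^T e, where e is the all-ones vector (one entry per document).
e = [1] * len(docs)
colsum = [sum(A[i][j] * e[i] for i in range(len(docs))) for j in range(len(vocab))]
matrix_counts = dict(zip(vocab, colsum))
```

Both views give the same counts; the MapReduce version simply never materializes the zero entries of A.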
  • 19. inverted index is a matrix computation too. Start from the document-by-term matrix $A$, with rows doc$_1$, doc$_2$, ..., doc$_m$ and entries $A_{i,j}$.
  • 20. inverted index is a matrix computation too. The inverted index is the transpose, with rows term$_1$, term$_2$, ..., term$_n$: $A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & A_{m-1,n} & A_{m,n} \end{bmatrix}$.
  • 21. A recommender system with social info. Two tables: product_ratings, with the same (product, user, rating) triples as on slide 9, and friends_links, which stores (user, user) pairs:
      uid6 uid1
      uid8 uid9
      uid7 uid7
      uid7 uid4
      uid6 uid2
      uid7 uid1
      uid3 uid1
      uid1 uid8
      uid7 uid3
      uid9 uid1
  • 22. A recommender system with social info. Both tables are matrices: product_ratings has rows pid1, pid2, ... and friends_links has rows uid1, uid2, ... [Figure: each table drawn next to its matrix.]
  • 23. A recommender system with social info. Call the product_ratings matrix R and the friends_links matrix S.
  • 24. A recommender system with social info. Recommend each item based on the average rating of all trusted users: “X = S Rᵀ”, with something that is almost a matrix-matrix product. R has rows pid1, pid2, ... and S has rows uid1, uid2, ... $X_{\text{uid},\text{pid}} = \sum_{\text{uid2}} S_{\text{uid},\text{uid2}} R_{\text{uid2},\text{pid}} \cdot \Bigl( \sum_{\text{uid2}} \text{“}S_{\text{uid},\text{uid2}} \text{ and } R_{\text{uid2},\text{pid}} \neq 0\text{”} \Bigr)^{-1}$
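The "almost a matrix-matrix product" above can be sketched in memory before moving to Hadoop. The example below is a toy illustration under stated assumptions: trust links are (uid, uid2) pairs meaning uid trusts uid2, every stored link has weight 1, and the user ids, product ids, and the name `social_average` are all made up.

```python
from collections import defaultdict

# S as (uid, trusted uid2) pairs and R as (pid, uid2, rating) triples.
trust = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]
ratings = [("p1", "u2", 4), ("p1", "u3", 2), ("p2", "u3", 5)]

def social_average(trust, ratings):
    """X[(uid, pid)] = average rating of pid over the users that uid trusts."""
    rated_by = defaultdict(list)
    for pid, uid2, rating in ratings:
        rated_by[uid2].append((pid, rating))
    sums = defaultdict(float)   # numerator: sum over uid2 of S * R
    counts = defaultdict(int)   # denominator: number of uid2 with both entries
    for uid, uid2 in trust:
        for pid, rating in rated_by.get(uid2, []):
            sums[(uid, pid)] += rating
            counts[(uid, pid)] += 1
    return {key: sums[key] / counts[key] for key in sums}

X = social_average(trust, ratings)
```

Here u1 trusts both u2 and u3, so the entry for (u1, p1) averages their two ratings, while (u2, p1) only sees the rating from u3.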
  • 25. Tools I like: hadoop streaming, dumbo, mrjob, hadoopy, C++.
  • 26. Tools I don’t use but other people seem to like: pig, java, hbase, mahout, Eclipse, Cassandra. Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.
  • 27. hadoop streaming. The map function is a program: (key, value) pairs are sent via stdin, and output (key, value) pairs go to stdout. The reduce function is a program: (key, value) pairs are sent via stdin, keys are grouped, and output (key, value) pairs go to stdout.
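To make the stdin/stdout contract concrete, here is a minimal word-count sketch in the streaming style. It assumes the usual streaming convention of one tab-separated key and value per line, with reducer input already sorted by key; the function names and the `map`/`reduce` command-line switch are hypothetical and are not part of the tutorial repository.

```python
import sys
from itertools import groupby

def run_mapper(lines):
    """Map step: emit one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def run_reducer(lines):
    """Reduce step: input lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Run as "python wc.py map" or "python wc.py reduce" (hypothetical file name).
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = run_mapper if sys.argv[1] == "map" else run_reducer
    for out in step(sys.stdin):
        print(out)
```

Locally, the shuffle can be imitated with a sort between the two steps, for example by piping the mapper output through `sort` before the reducer.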
  • 28. mrjob (from Yelp): a wrapper around hadoop streaming for map and reduce functions in Python.
      from mrjob.job import MRJob

      class MRWordFreqCount(MRJob):
          def mapper(self, _, line):
              # emit (word, 1) for every word on the input line
              for word in line.split():
                  yield (word.lower(), 1)

          def reducer(self, word, counts):
              # all counts for the same word arrive together
              yield (word, sum(counts))

      if __name__ == '__main__':
          MRWordFreqCount.run()
  • 29. Connected components in SQL and Hadoop
  • 30. Connected components. There are 3 “components” in this graph. How can we find them algorithmically … on a huge network?
  • 31. Connected components. Algorithm: assign each node a random component id. For each node, take the minimum component id of itself and all neighbors.
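The two-step algorithm above can be tried on a small in-memory graph before scaling it up. The sketch below makes two assumptions that the slide does not state: the random ids are distinct (a shuffled range), so that two different components cannot end up sharing an id, and the update is repeated until no id changes. The function name is hypothetical.

```python
import random

def connected_components(nodes, edges, seed=0):
    """Label each node with the minimum component id reachable from it."""
    rng = random.Random(seed)
    ids = list(range(len(nodes)))
    rng.shuffle(ids)                      # distinct random component ids
    comp = dict(zip(nodes, ids))
    neighbors = {node: set() for node in nodes}
    for u, v in edges:                    # treat edges as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)
    changed = True
    while changed:
        # every node takes the minimum id of itself and all of its neighbors
        new_comp = {
            n: min([comp[n]] + [comp[m] for m in neighbors[n]]) for n in nodes
        }
        changed = new_comp != comp
        comp = new_comp
    return comp

nodes = list("abcdefg")
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")]
comp = connected_components(nodes, edges)
```

The number of rounds needed grows with the diameter of the largest component, which is worth keeping in mind when each round is a full SQL query or Hadoop job.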
  • 32. DEMO
  • 33. Computing Connected Components in SQL. The graph is a table edges : id | head | tail. The “vector” is a table v : id | comp, initialized to a random component.
      CREATE TABLE v2 AS (
        SELECT
          e.tail AS id,
          MIN(v.comp) AS comp
        FROM edges e
        INNER JOIN v
          ON e.head = v.id
        GROUP BY e.tail
      );
      DROP TABLE v;
      ALTER TABLE v2 RENAME TO v;
      ... Repeat ...
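The repeat loop can be driven from Python with the standard-library `sqlite3` module, which is convenient for trying the query on a toy graph. This is a sketch under several assumptions that are not in the slides: the edges table omits the `id` column, every undirected edge is stored in both directions, a self-loop is added for each node so that the join also sees the node's own component id, and the initial ids are just the node ids rather than random values. The function name is hypothetical.

```python
import sqlite3

def components_sql(nodes, undirected_edges):
    """Iterate the slide's query in SQLite until the labels stop changing."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (head INTEGER, tail INTEGER)")
    con.execute("CREATE TABLE v (id INTEGER, comp INTEGER)")
    rows = [(n, n) for n in nodes]            # self-loops (assumption)
    for a, b in undirected_edges:
        rows += [(a, b), (b, a)]              # store both directions
    con.executemany("INSERT INTO edges VALUES (?, ?)", rows)
    # initial component ids: the node id itself (assumption, not random)
    con.executemany("INSERT INTO v VALUES (?, ?)", [(n, n) for n in nodes])
    while True:
        before = con.execute("SELECT id, comp FROM v ORDER BY id").fetchall()
        con.execute(
            "CREATE TABLE v2 AS "
            "SELECT e.tail AS id, MIN(v.comp) AS comp "
            "FROM edges e INNER JOIN v ON e.head = v.id "
            "GROUP BY e.tail"
        )
        con.execute("DROP TABLE v")
        con.execute("ALTER TABLE v2 RENAME TO v")
        after = con.execute("SELECT id, comp FROM v ORDER BY id").fetchall()
        if after == before:
            return dict(after)

labels = components_sql([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Without the self-loops, a node would take the minimum over its neighbors only, and two nodes joined by a single edge would swap ids forever instead of converging.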
  • 34. Matrix-vector product and connected components in Hadoop. See example matrix-hadoop/codes/smatvec.py. The ordinary product $Ax = y$, $y_i = \sum_k A_{ik} x_k$, covers Google’s PageRank, word count, and average rating. Connected components is “$A^T x = y$” with $y_i = \min(x_i, \min_k A_{ki} x_k)$.
  • 35. Matrix-vector product: $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Follow along: matrix-hadoop/codes/smatvec.py. A is stored by “node”: each line is a row index followed by (column, value) pairs.
      $ head samples/smat_5_5.txt
      0 0 0.125 3 1.024 4 0.121
      1 0 0.597
      2 2 1.247
      3 4 -1.45
      4 2 0.061
    v is initially random:
      $ head samples/vec_5.txt
      0 0.241
      1 -0.98
      2 0.237
      3 -0.32
      4 0.080
  • 36. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input: A and x. Map 1: align on columns. Reduce 1: output $A_{ik} x_k$ keyed on row $i$. Reduce 2: output sum($A_{ik} x_k$), which gives y.
  • 37. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input, then Map 1: align on columns.
      def joinmap(self, key, line):
          vals = line.split()
          if len(vals) == 2:
              # the vector: "index value"
              yield (vals[0],               # row
                     (float(vals[1]),))     # xi
          else:
              # the matrix: "row col val col val ..."
              row = vals[0]
              for i in range(1, len(vals), 2):
                  yield (vals[i],               # column
                         (row,                  # i, Aij
                          float(vals[i+1])))
  • 38. “Matrix-vector” for connected components: “$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$. Input, then Map 1: align on rows, keyed by the head of each edge.
      def joinmap(self, key, line):
          vals = line.split()
          if len(vals) == 2:
              # the vector: "index value"
              yield (vals[0],               # row
                     (float(vals[1]),))     # vi
          else:
              # the matrix: "row col val col val ..."
              row = vals[0]
              for i in range(1, len(vals), 2):
                  # keep a 2-tuple so the reducer can tell edges from
                  # vector entries; the edge weight itself is not used
                  yield (row,                   # head
                         (vals[i],              # tail
                          float(vals[i+1])))
  • 39. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input, Map 1: align on columns, then Reduce 1: output $A_{ik} x_k$ keyed on row $i$. Note that you should use a secondary sort to avoid reading both in memory.
      def joinred(self, key, vals):
          vecval = 0.
          matvals = []
          for val in vals:
              if len(val) == 1:
                  vecval += val[0]        # the vector entry for this key
              else:
                  matvals.append(val)     # matrix entries (row, Aik)
          for val in matvals:
              yield (val[0], val[1]*vecval)
  • 40. “Matrix-vector” for connected components: “$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$. Same Reduce 1 step, but the vector value is passed along unchanged instead of being multiplied. Note that you should use a secondary sort to avoid reading both in memory.
      def joinred(self, key, vals):
          vecval = 0.
          matvals = []
          for val in vals:
              if len(val) == 1:
                  vecval += val[0]        # the component id for this key
              else:
                  matvals.append(val)     # edges leaving this node
          for val in matvals:
              yield (val[0], vecval)
  • 41. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input, Map 1: align on columns, Reduce 1: output $A_{ik} x_k$ keyed on row $i$, then Reduce 2: output sum($A_{ik} x_k$).
      def sumred(self, key, vals):
          yield (key, sum(vals))
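The three steps (join map, join reduce, sum reduce) can be chained outside Hadoop to see how the pieces fit. The sketch below rewrites them as plain functions and imitates the shuffle with an in-memory dictionary; it is not the tutorial's smatvec.py, and the `shuffle` and `matvec` helpers and the 2-by-2 example matrix are made up for illustration. The input format follows the sample files: matrix lines are "row col val col val ..." and vector lines are "index value".

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                        # vector entry: "k x_k"
        yield vals[0], (float(vals[1]),)
    else:                                     # matrix row: "i k A_ik k A_ik ..."
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield vals[i], (row, float(vals[i + 1]))

def joinred(key, vals):
    vals = list(vals)
    vecval = sum(v[0] for v in vals if len(v) == 1)   # x_k for this column
    for v in vals:
        if len(v) == 2:
            yield v[0], v[1] * vecval                 # A_ik * x_k keyed on row i

def sumred(key, vals):
    yield key, sum(vals)

def matvec(lines):
    stage1 = [kv for line in lines for kv in joinmap(line)]
    stage2 = [kv for key, vals in shuffle(stage1) for kv in joinred(key, vals)]
    return dict(kv for key, vals in shuffle(stage2) for kv in sumred(key, vals))

matrix = ["0 0 2.0 1 1.0", "1 1 3.0"]   # hypothetical A = [[2, 1], [0, 3]]
vector = ["0 1.0", "1 2.0"]             # x = [1, 2]
y = matvec(matrix + vector)
```

This in-memory `joinred` holds all values for a key at once, which is exactly what the secondary-sort remark on the slides is meant to avoid at scale.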
  • 42. Our social recommender. Follow along: matrix-hadoop/recsys/recsys.py. R is stored entry-wise, with columns Object ID, User ID, Rating:
      $ gunzip -c data/rating.txt.gz
      139431556 591156 5
      139431556 1312460676 5
      139431556 204358 4
      139431556 368725 5
    S is stored entry-wise, with columns My ID, Other ID, Trust:
      $ gunzip -c data/rating.txt.gz
      3287060356 232085 -1
      3288305540 709420 1
      3290337156 204418 -1
      3294138244 269243 -1
  • 43. Matrix-matrix product: $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Follow along: matrix-hadoop/codes/matmat.py. Conceptually, the first step is the same as the matrix-vector product with a block of vectors.
  • 44. Matrix-matrix product (in pictures): $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns. Reduce 1: output $A_{ik} B_{kj}$ keyed on $(i,j)$. Reduce 2: output sum($A_{ik} B_{kj}$).
  • 45. Social recommender (in code). Map 1: align on columns.
      def joinmap(self, key, line):
          parts = line.split('\t')
          if len(parts) == 8: # ratings
              objid = parts[0].strip()
              uid = parts[1].strip()
              rat = int(parts[2])
              yield (uid, (objid, rat))
          elif len(parts) == 4: # trust
              myid = parts[0].strip()
              otherid = parts[1].strip()
              value = int(parts[2])
              if value > 0:
                  # keep only positive trust links
                  yield (otherid, (myid,))
  • 46. Matrix-matrix product (in pictures). Map 1: align on columns, then Reduce 1: output $A_{ik} B_{kj}$ keyed on $(i,j)$. Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.
      def joinred(self, key, vals):
          tusers = [] # uids that trust key
          ratobjs = [] # objs rated by uid=key
          for val in vals:
              if len(val) == 1:
                  tusers.append(val[0])
              else:
                  ratobjs.append(val)

          for (objid, rat) in ratobjs:
              for uid in tusers:
                  yield ((uid, objid), rat)
  • 47. Matrix-matrix product (in pictures): $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns, Reduce 1: output $A_{ik} B_{kj}$ keyed on $(i,j)$, then Reduce 2: instead of sum($A_{ik} B_{kj}$), the recommender averages.
      def avgred(self, key, vals):
          s = 0.
          n = 0
          for val in vals:
              s += val
              n += 1
          # the smoothed average of ratings
          yield key, (s+self.options.avg)/float(n+1)
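The smoothing in `avgred` is easy to read as adding one extra pseudo-rating to every (user, object) pair. The sketch below isolates that arithmetic; `prior` stands in for `self.options.avg`, which is presumably a job option holding a default rating, though the slides do not say how it is set. The function name is hypothetical.

```python
def smoothed_average(ratings, prior):
    """Average of the ratings plus one pseudo-rating equal to `prior`."""
    total = 0.0
    count = 0
    for r in ratings:
        total += r
        count += 1
    return (total + prior) / (count + 1)

one_vote = smoothed_average([5], prior=3.0)           # (5 + 3) / 2
three_votes = smoothed_average([5, 5, 5], prior=3.0)  # (15 + 3) / 4
no_votes = smoothed_average([], prior=3.0)            # falls back to the prior
```

With a single trusted rating the result sits halfway between that rating and the prior, and it moves toward the plain average as more trusted users rate the item.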
  • 48. Better ways to store matrices in Hadoop. No need for “integer” keys that fall between 1 and n! Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.
  • 49. Tall-and-Skinny matrices (m ≫ n). Many rows (like a billion), a few columns (under 10,000). Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, simulation data analysis, and big-data SVD/PCA. [Image: from the tinyimages collection.]
  • 50. Questions? Image from rockysprings, deviantart, CC share-alike.