A talk I gave at ancestry.com on Hadoop, SQL, recommendation and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.
Recommendation and graph algorithms in Hadoop and SQL
1. Recommendation and graph
algorithms in Hadoop and SQL
Code
github.com/dgleich/matrix-hadoop-tutorial
@dgleich
dgleich@purdue.edu
DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE
PURDUE UNIVERSITY
David Gleich · Purdue
Ancestry.com
2. Matrix computations
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
Operations: $Ax$
Linear systems: $Ax = b$
Least squares: $\min \|Ax - b\|$
Eigenvalues: $Ax = \lambda x$
3. Outcomes
Recognize relationships between matrix methods and things you’ve already been doing
Example: SQL queries as matrix computations
See how to work with big graphs as large edge lists in Hadoop and SQL
Example: Connected components
Understand how to use Hadoop to compute these matrix methods at scale for BigData
Example: Recommenders with social network info
4. matrix computations ≠ linear algebra
6. A SQL statement as a matrix computation
How do I find the average rating for each product?
http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
7. A SQL statement as a matrix computation
How do I find the average rating for each product?
http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
    SELECT
      p.product_id,
      p.name,
      AVG(pr.rating) AS rating_average
    FROM products p
    INNER JOIN product_ratings pr
      ON pr.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY rating_average DESC
8. This SQL statement is a matrix computation!
Image from rockysprings, deviantart, CC share-alike
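One way to read this claim: store the ratings as a sparse matrix with one column per product. The GROUP BY average is then each column's sum divided by the number of stored entries in that column. A minimal sketch in plain Python, assuming made-up sample ratings (the data and the name `average_ratings` are hypothetical, not from the talk's repository):

```python
# Hypothetical ratings: (user, product_id, rating). Missing pairs mean "no rating".
ratings = [
    ("u1", "p1", 4), ("u2", "p1", 2),
    ("u1", "p2", 5), ("u3", "p2", 3), ("u2", "p2", 4),
]

def average_ratings(ratings):
    """Column sums of the ratings matrix divided by stored entries per column."""
    col_sum = {}
    col_count = {}
    for _user, product, rating in ratings:
        col_sum[product] = col_sum.get(product, 0) + rating
        col_count[product] = col_count.get(product, 0) + 1
    averages = {p: col_sum[p] / col_count[p] for p in col_sum}
    # Mirrors ORDER BY rating_average DESC
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

print(average_ratings(ratings))
```

The dictionaries play the role of the two column reductions (sum and count) that the SQL engine performs for `AVG`.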
16. The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.
Data scalable: express algorithms in “data-local operations”.
Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.
Fault-tolerance by design: input stored in triplicate, map output persisted to disk before shuffle, reduce input/output on disk.
[Figure: input blocks 1 to 5 feed map tasks (M), whose output is shuffled to reduce tasks (R).]
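The map, shuffle, reduce flow can be mimicked in memory to see what each stage does. This is only a sketch of the programming model: it has no distribution, disk persistence, or fault tolerance, and the names `run_mapreduce`, `max_mapper`, and `max_reducer` are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round in memory."""
    # Map: each record is processed independently (a "data-local" operation).
    mapped = []
    for record in records:
        mapped.extend(mapper(record))
    # Shuffle: all values with the same key end up in the same group.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: one call per key, with all of that key's values.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def max_mapper(record):
    key, value = record
    yield key, value

def max_reducer(key, values):
    yield key, max(values)

print(run_mapreduce([("a", 3), ("b", 1), ("a", 7)], max_mapper, max_reducer))
```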
17. wordcount is a matrix computation too
    map(document) :
        for word in document
            emit (word, 1)
    reduce(word, counts) :
        emit (word, sum(counts))
[Figure: documents D1 to D5 each emit pairs such as (matrix,1), (hadoop,1), (bigdata,1).]
18. wordcount is a matrix computation too
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
The rows of A are the documents doc1, doc2, …, docm.
word count = colsum(A) = $A^T e$
e is the vector of all ones
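To make the identity concrete, here is a small sketch that builds a dense document-by-word count matrix and computes $A^T e$ directly. The three documents and the helper `transpose_times` are made up for illustration.

```python
docs = ["matrix hadoop bigdata", "hadoop bigdata", "matrix matrix hadoop"]

vocab = sorted({w for d in docs for w in d.split()})
# A[i][j] = number of times word j occurs in document i
A = [[d.split().count(w) for w in vocab] for d in docs]
e = [1] * len(docs)  # the vector of all ones

def transpose_times(A, x):
    """Compute A^T x for a dense matrix stored as a list of rows."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]

word_count = dict(zip(vocab, transpose_times(A, e)))
print(word_count)
```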
19. inverted index is a matrix computation too
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
The rows of A are the documents doc1, doc2, …, docm.
20. inverted index is a matrix computation too
$$A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & A_{m-1,n} & A_{m,n} \end{bmatrix}$$
The rows of $A^T$ are the terms term1, term2, …, termn.
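As a sketch of the same idea on sparse data: if the document-term matrix is stored row by row as nested dictionaries, transposing it yields the inverted index (term to documents). The two sample documents and the function name `transpose` are hypothetical.

```python
def transpose(rows):
    """Transpose a sparse matrix stored as {row: {col: value}}."""
    cols = {}
    for row, entries in rows.items():
        for col, value in entries.items():
            cols.setdefault(col, {})[row] = value
    return cols

# A: documents by terms
A = {
    "doc1": {"matrix": 2, "hadoop": 1},
    "doc2": {"hadoop": 1, "bigdata": 1},
}

# A^T: terms by documents, i.e. the inverted index
inverted_index = transpose(A)
print(inverted_index["hadoop"])
```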
23. A recommender system with social info
product_ratings (R)
    pid8 uid2 4
    pid9 uid9 1
    pid2 uid9 5
    pid9 uid5 5
    pid6 uid8 4
    pid1 uid2 4
    pid3 uid4 4
    pid5 uid9 2
    pid9 uid8 4
    pid9 uid9 1
friends_links (S)
    uid6 uid1
    uid8 uid9
    uid7 uid7
    uid7 uid4
    uid6 uid2
    uid7 uid1
    uid3 uid1
    uid1 uid8
    uid7 uid3
    uid9 uid1
24. A recommender system with social info
Recommend each item based on the average rating of all trusted users.
$$X_{uid,pid} = \frac{\sum_{uid_2} S_{uid,uid_2} R_{uid_2,pid}}{\sum_{uid_2} [\, S_{uid,uid_2} \text{ and } R_{uid_2,pid} \neq 0 \,]}$$
“$X = S R^T$” with something that is almost a matrix-matrix product.
[Figure: R drawn as a matrix with columns pid1, pid2, …; S drawn as a matrix with columns uid1, uid2, ….]
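The averaging rule can be written directly on the nonzeros of S and R. The sketch below assumes trust is a 0/1 relation and uses hypothetical names (`recommend`, `trust`, `ratings`) and made-up data; it is a single-machine illustration, not the Hadoop implementation.

```python
def recommend(trust, ratings):
    """X[uid][pid] = mean rating of pid over the users that uid trusts.

    trust:   {uid: set of trusted uid2}   (the nonzeros of S)
    ratings: {uid2: {pid: rating}}        (the nonzeros of R)
    """
    X = {}
    for uid, trusted in trust.items():
        sums, counts = {}, {}
        for uid2 in trusted:
            for pid, rating in ratings.get(uid2, {}).items():
                sums[pid] = sums.get(pid, 0) + rating      # numerator term
                counts[pid] = counts.get(pid, 0) + 1        # denominator term
        X[uid] = {pid: sums[pid] / counts[pid] for pid in sums}
    return X

trust = {"uid1": {"uid2", "uid3"}}
ratings = {"uid2": {"pid1": 4, "pid2": 2}, "uid3": {"pid1": 2}}
X = recommend(trust, ratings)
print(X)
```

The division by `counts` is what makes this "almost" a matrix-matrix product: the numerator alone is the product, and the denominator counts which terms were nonzero.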
25. Tools I like
hadoop streaming
dumbo
mrjob
hadoopy
C++
26. Tools I don’t use but other people seem to like …
pig
java
hbase
mahout
Eclipse
Cassandra
Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there.
I’m a low-level guy
27. hadoop streaming
the map function is a program
(key,value) pairs are sent via stdin
output (key,value) pairs go to stdout
the reduce function is a program
(key,value) pairs are sent via stdin
keys are grouped
output (key,value) pairs go to stdout
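A streaming-style word count might look like the sketch below: both steps read lines and write tab-separated `key<TAB>value` lines. The single-script layout and the `map`/`reduce` command-line argument are assumptions made for this example, not a Hadoop convention.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)

def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: first argument selects the step.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Saved as a hypothetical `wc.py`, a local dry run without Hadoop would be `cat input.txt | python wc.py map | sort | python wc.py reduce`, where `sort` stands in for the shuffle.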
28. mrjob from
a wrapper around hadoop streaming for map and reduce functions in python

    from mrjob.job import MRJob

    class MRWordFreqCount(MRJob):
        def mapper(self, _, line):
            for word in line.split():
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()
29. Connected components in SQL and Hadoop
30. Connected components
3 “components” in this graph
How can we find them algorithmically … on a huge network?
31. Connected components
Algorithm
Assign each node a random component id.
For each node, take the minimum component id of itself and all neighbors.
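A minimal in-memory sketch of these two steps, repeated until no id changes. It assumes an undirected graph and draws distinct random ids (a shuffled range) so that two components cannot collide on the same id; the function name and sample graph are made up.

```python
import random

def connected_components(nodes, edges, seed=0):
    rng = random.Random(seed)
    # Step 1: assign each node a random (here: distinct) component id.
    ids = list(range(len(nodes)))
    rng.shuffle(ids)
    comp = dict(zip(nodes, ids))

    # Undirected adjacency lists.
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Step 2, repeated: take the minimum id over the node and its neighbors.
    changed = True
    while changed:
        new_comp = {
            n: min([comp[n]] + [comp[m] for m in neighbors[n]]) for n in nodes
        }
        changed = new_comp != comp
        comp = new_comp
    return comp

comp = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(comp)
```

The number of rounds needed grows with the diameter of the largest component, which is one reason better algorithms exist for huge graphs.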
33. Computing Connected Components in SQL
Graph
    edges : id | head | tail
“Vector”
    v : id | comp
    initialized to random component

    CREATE TABLE v2 AS (
      SELECT
        e.tail AS id,
        MIN(v.comp) AS comp
      FROM edges e
      INNER JOIN v
        ON e.head = v.id
      GROUP BY e.tail
    );
    DROP TABLE v;
    ALTER TABLE v2
      RENAME TO v;

... Repeat ...
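The repeat loop can be driven from Python with the standard-library `sqlite3` module. This sketch makes several assumptions that are not on the slide: the edge table stores both directions plus a self-loop per node (so a node's own id takes part in the minimum and no node drops out of the join), the initial component id is the node id rather than a random value, and iteration stops when the table no longer changes.

```python
import sqlite3

def cc_sqlite(edge_list, max_iters=100):
    nodes = sorted({n for edge in edge_list for n in edge})
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE edges (head INTEGER, tail INTEGER)")
    cur.execute("CREATE TABLE v (id INTEGER, comp INTEGER)")
    # Assumption: undirected graph, so store both directions, plus self-loops.
    rows = [(a, b) for a, b in edge_list] + [(b, a) for a, b in edge_list]
    rows += [(n, n) for n in nodes]
    cur.executemany("INSERT INTO edges VALUES (?, ?)", rows)
    cur.executemany("INSERT INTO v VALUES (?, ?)", [(n, n) for n in nodes])

    for _ in range(max_iters):
        before = dict(cur.execute("SELECT id, comp FROM v"))
        cur.execute(
            "CREATE TABLE v2 AS "
            "SELECT e.tail AS id, MIN(v.comp) AS comp "
            "FROM edges e INNER JOIN v ON e.head = v.id "
            "GROUP BY e.tail"
        )
        cur.execute("DROP TABLE v")
        cur.execute("ALTER TABLE v2 RENAME TO v")
        after = dict(cur.execute("SELECT id, comp FROM v"))
        if after == before:  # converged: no component id changed
            break
    conn.close()
    return after

print(cc_sqlite([(1, 2), (2, 3), (4, 5)]))
```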
34. Matrix-vector product and connected components in Hadoop
See example: matrix-hadoop/codes/smatvec.py
$Ax = y$, $y_i = \sum_k A_{ik} x_k$: Google’s PageRank, word count, average rating
“$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$: connected components
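Both formulas share one loop over the nonzeros of the matrix; only the "multiply" and "add" operations differ. The sketch below makes that explicit with a hypothetical `general_matvec` helper. For the connected-components case it assumes a symmetric 0/1 adjacency matrix with self-loops, so that $A^T = A$ and the $x_i$ term is covered by the diagonal, and it takes the minimum over nonzero entries only.

```python
def general_matvec(entries, x, combine, reduce_fn):
    """y[i] = reduce_fn over k of combine(A[i, k], x[k]), nonzeros only.

    entries: {(i, k): value}, x: {k: value}
    """
    terms = {}
    for (i, k), a in entries.items():
        terms.setdefault(i, []).append(combine(a, x[k]))
    return {i: reduce_fn(vals) for i, vals in terms.items()}

# Ordinary matrix-vector product: combine = multiply, reduce = sum.
A = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
x = {0: 1.0, 1: 2.0}
y = general_matvec(A, x, lambda a, xk: a * xk, sum)

# Connected components step: combine = take the neighbor's id, reduce = min.
G = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1, (2, 2): 1}
labels = {0: 5, 1: 3, 2: 7}
new_labels = general_matvec(G, labels, lambda a, xk: xk, min)

print(y, new_labels)
```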
35. Matrix-vector product
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Follow along: matrix-hadoop/codes/smatvec.py

A is stored by “node”
    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
    3 4 -1.45
    4 2 0.061

x
    $ head samples/vec_5.txt
    0 0.241
    1 -0.98
    2 0.237
    3 -0.32
    4 0.080
v initially random
36. Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i
Reduce 2: Output sum($A_{ik} x_k$), giving y
37. Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: Align on columns

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],                 # row
                   (float(vals[1]),))       # xi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (vals[i],             # column
                       (row,                # i, Aij
                        float(vals[i+1])))
38. “Matrix-vector” for connected components
“$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$
Input: A and x
Map 1: Align on columns

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],                 # row
                   (float(vals[1]),))       # vi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (row,                 # head
                       (vals[i],            # tail
                        float(vals[i+1])))  # weight; keeps this tuple at length 2
                                            # so the reducer can tell it from vi
39. Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], val[1]*vecval)

Note that you should use a secondary sort to avoid reading both in memory.
40. “Matrix-vector” for connected components
“$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$
Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], vecval)

Note that you should use a secondary sort to avoid reading both in memory.
41. Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i
Reduce 2: Output sum($A_{ik} x_k$), giving y

    def sumred(self, key, vals):
        yield (key, sum(vals))
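To check the two-stage pipeline end to end without Hadoop, the three functions can be run as plain Python with a dictionary standing in for the shuffle. This is a sketch: the `shuffle` and `matvec` helpers and the 2-by-2 test matrix are made up here, and the mrjob `self`/`key` arguments are dropped.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key, as the shuffle between map and reduce would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                     # vector entry: "k x_k"
        yield vals[0], (float(vals[1]),)
    else:                                  # matrix row: "i k A_ik k A_ik ..."
        row = vals[0]
        for j in range(1, len(vals), 2):
            yield vals[j], (row, float(vals[j + 1]))

def joinred(key, vals):
    vecval = 0.0
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for row, a in matvals:
        yield row, a * vecval              # A_ik * x_k keyed on row i

def sumred(key, vals):
    yield key, sum(vals)

def matvec(lines):
    stage1 = [kv for line in lines for kv in joinmap(line)]
    stage2 = [kv for k, vs in shuffle(stage1).items() for kv in joinred(k, vs)]
    return dict(kv for k, vs in shuffle(stage2).items() for kv in sumred(k, vs))

# A = [[1, 2], [0, 3]] stored by row, followed by x = [1, 1]
lines = ["0 0 1.0 1 2.0", "1 1 3.0", "0 1.0", "1 1.0"]
y = matvec(lines)
print(y)
```

Keys stay as strings here, which matches the point that row and column ids do not need to be integers.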
42. Our social recommender
Follow along: matrix-hadoop/recsys/recsys.py

R is stored entry-wise
Columns: Object ID, User ID, Rating
    $ gunzip -c data/rating.txt.gz
    139431556   591156       5
    139431556   1312460676   5
    139431556   204358       4
    139431556   368725       5

S is stored entry-wise
Columns: My ID, Other ID, Trust
    $ gunzip -c data/rating.txt.gz
    3287060356  232085      -1
    3288305540  709420       1
    3290337156  204418      -1
    3294138244  269243      -1
44. Matrix-matrix product (in pictures)
$AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$
Input: A and B
Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: Output sum($A_{ik} B_{kj}$), giving C
45. Social recommender (in code)
Input: A and B
Map 1: Align on columns

    def joinmap(self, key, line):
        parts = line.split('\t')
        if len(parts) == 8:    # ratings
            objid = parts[0].strip()
            uid = parts[1].strip()
            rat = int(parts[2])
            yield (uid, (objid, rat))
        elif len(parts) == 4:  # trust
            myid = parts[0].strip()
            otherid = parts[1].strip()
            value = int(parts[2])
            if value > 0:
                yield (otherid, (myid,))
46. Matrix-matrix product (in pictures)
Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)

    def joinred(self, key, vals):
        tusers = []   # uids that trust key
        ratobjs = []  # objs rated by uid=key
        for val in vals:
            if len(val) == 1:
                tusers.append(val[0])
            else:
                ratobjs.append(val)
        for (objid, rat) in ratobjs:
            for uid in tusers:
                yield ((uid, objid), rat)

Conceptually, the second step is the same as in the matrix-matrix product: we “map” the ratings from each trusted user back to the source.
47. Matrix-matrix product (in pictures)
$AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$
Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: Output sum($A_{ik} B_{kj}$), giving C

    def avgred(self, key, vals):
        s = 0.
        n = 0
        for val in vals:
            s += val
            n += 1
        # the smoothed average of ratings
        yield key, (s + self.options.avg) / float(n + 1)
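The whole recommender pipeline (join map, join reduce, smoothed-average reduce) can be dry-run in memory. In this sketch the input is a list of already-parsed, tagged tuples rather than tab-separated lines, `PRIOR` is a hypothetical stand-in for `self.options.avg`, and the `shuffle` and `recommend` helpers and sample users are made up.

```python
from collections import defaultdict

PRIOR = 3.0  # hypothetical stand-in for self.options.avg

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def joinmap(record):
    if record[0] == "rating":
        _, objid, uid, rat = record
        yield uid, (objid, rat)
    else:  # ("trust", myid, otherid, value)
        _, myid, otherid, value = record
        if value > 0:
            yield otherid, (myid,)

def joinred(key, vals):
    tusers = [v[0] for v in vals if len(v) == 1]   # uids that trust key
    ratobjs = [v for v in vals if len(v) == 2]     # objects rated by key
    for objid, rat in ratobjs:
        for uid in tusers:
            yield (uid, objid), rat

def avgred(key, vals):
    # Smoothed average: one extra pseudo-rating equal to PRIOR.
    yield key, (sum(vals) + PRIOR) / float(len(vals) + 1)

def recommend(records):
    stage1 = [kv for r in records for kv in joinmap(r)]
    stage2 = [kv for k, vs in shuffle(stage1).items() for kv in joinred(k, vs)]
    return dict(kv for k, vs in shuffle(stage2).items() for kv in avgred(k, vs))

records = [
    ("rating", "obj1", "bob", 5),
    ("rating", "obj1", "carol", 4),
    ("trust", "alice", "bob", 1),
    ("trust", "alice", "carol", 1),
    ("trust", "alice", "dave", -1),   # distrust is dropped by the mapper
]
scores = recommend(records)
print(scores)
```

With the prior acting as one extra rating, items rated by only one trusted user are pulled toward `PRIOR` instead of taking that single rating at face value.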
48. Better ways to store matrices in Hadoop
No need for “integer” keys that fall between 1 and n!
Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.
49. Tall-and-Skinny matrices (m ≫ n)
Many rows (like a billion)
A few columns (under 10,000)
Used in
regression and general linear models with many samples
block iterative methods
panel factorizations
simulation data analysis
big-data SVD/PCA
[Image from tinyimages collection]
50. Questions?
Image from rockysprings, deviantart, CC share-alike