- 1. Mining of Massive Datasets
using
Locality Sensitive Hashing (LSH)
J Singh
January 9, 2014
- 2. The problems
• Large scale image search:
– We have a candidate image
– Search the internet to find similar images
• Large scale source repo search:
– We have a candidate source repo
– Search GitHub to find similar source repos
• Large scale document search:
– We have a candidate document
– Search for similar documents to find possible plagiarism
• Large scale X search:
– We have a candidate X
– Search for similar X’s
- 3. A Motivating Example
• People Like You
– Characterize your Facebook friends
– Find Facebook friends and friends-of-friends who like the same things you do
• Disclosure
– This is a pedagogical example, loosely patterned after ShoutFlow
– I have no knowledge of how ShoutFlow actually worked
– I have no connection with the people involved
- 4. A Likeness Score is…
• A number from 0 to 100%
– Likeness between Harry and Sally is 100% if they like exactly the same things
– Technically, the Jaccard similarity:
Likeness(Harry, Sally) = | Likes_Harry ∩ Likes_Sally | / | Likes_Harry ∪ Likes_Sally |
• But mind the n² problem: 1 billion users ≈ 5 × 10¹⁷ pairs!
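A minimal sketch of the score (the names and Like sets below are illustrative, not from the slides):

```python
# Likeness score as Jaccard similarity of two Like sets.
def likeness(likes_a, likes_b):
    """Jaccard similarity: shared Likes divided by all distinct Likes."""
    return len(likes_a & likes_b) / len(likes_a | likes_b)

harry = {"jazz", "pizza", "chess"}
sally = {"jazz", "pizza", "hiking"}
print(f"{likeness(harry, sally):.0%}")  # 50%: 2 shared Likes out of 4 total
```

Computing this for every pair of a billion users is exactly the n² problem the following slides avoid.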
- 5. Basic Algorithm
1. Walk the graph
– Build a data set of all users and their friends
– If access denied, skip
2. Cluster all billion users into “hash buckets” with similar likes
3. When a new user logs in, hash their likes and compare their similarity with other users in that bucket
• The magic is in the hashing!
- 6. The LSH Idea
• Treat n-valued items as vectors in n-dimensional space.
• Draw k random hyperplanes in that space.
• For each hyperplane:
– Is each vector above it (1) or below it (0)?
• Hash(Item1) = 011
• Hash(Item2) = 001
• The magic is in choosing h1, h2, etc.
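A minimal sketch of the hyperplane trick, assuming items are already embedded as numeric vectors (the dimensions, seed, and values are illustrative):

```python
import numpy as np

def hyperplane_hash(vectors, k, seed=0):
    """Hash each row vector to k bits: bit i is 1 if the vector lies on
    the positive side of random hyperplane i (through the origin)."""
    rng = np.random.default_rng(seed)
    # Each hyperplane is represented by its normal vector.
    normals = rng.standard_normal((k, vectors.shape[1]))
    # The sign of the dot product says which side of the plane we are on.
    return (vectors @ normals.T) > 0

items = np.array([[0.9, 0.1, 0.4],
                  [0.8, 0.2, 0.3]])
print(hyperplane_hash(items, k=3).astype(int))  # e.g. [[0 1 1], [0 0 1]]
```

Nearby vectors fall on the same side of most random hyperplanes, so similar items tend to share hash bits.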
- 7. The LSH Hash Code was a Lie…
• …But the idea of boiling down a complex object into something that is quickly and easily compared with other complex objects is what matters.
[Figure: purple blocks sorted into buckets]
• Each purple block represents a person
– Each bucket represents a group of people who are alike
• Members within each bucket still need to be compared to see which ones are the “closest”
- 8. Choosing hash functions
• Introducing minhash
1. Gather the LikeIDs for a person
2. Calculate the hash value for every LikeID
3. Store the minimum hash value found in step 2
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values
• The resulting minhashes are 200 integer values representing a random selection of Likes
– Property of minhashes: if the minhashes for two people are the same, their Likes are likely to be the same
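A minimal sketch of steps 1–4, simulating the 200 hash algorithms by salting one base hash (the hash family and the Like sets are illustrative assumptions):

```python
import hashlib

NUM_HASHES = 200

def minhash_signature(like_ids, num_hashes=NUM_HASHES):
    """Steps 2-4: for each of num_hashes hash functions, hash every
    LikeID and keep only the minimum value."""
    def h(i, like_id):
        salted = f"{i}:{like_id}".encode()
        return int.from_bytes(hashlib.blake2b(salted, digest_size=8).digest(), "big")
    return [min(h(i, like_id) for like_id in like_ids) for i in range(num_hashes)]

harry = minhash_signature({"jazz", "pizza", "hiking", "chess"})
sally = minhash_signature({"jazz", "pizza", "hiking", "sushi"})
matches = sum(a == b for a, b in zip(harry, sally))
print(f"estimated Jaccard similarity ≈ {matches / NUM_HASHES:.2f}")  # ≈ 0.60
```

The fraction of matching minhashes estimates the Jaccard similarity of the two Like sets (here 3 shared Likes out of 5 distinct, so ≈ 0.60).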
- 9. All 200 minhashes must match?
• There is a lot of sampling going on in the algorithm.
• Make sure we catch most cases
– Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
– Sometimes one band will reject a pair and another band will consider it a candidate.
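A minimal sketch of banding over the signatures from the previous slide (b and r are illustrative; b × r must equal the signature length):

```python
from collections import defaultdict

def candidate_pairs(signatures, b=20, r=10):
    """signatures: dict of person -> list of b*r minhash values.
    A pair is a candidate if all r rows of at least one band agree."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for person, sig in signatures.items():
            # The band's r minhash values form the bucket key for this band.
            buckets[tuple(sig[band * r:(band + 1) * r])].append(person)
        for members in buckets.values():
            for i, p in enumerate(members):
                for q in members[i + 1:]:
                    candidates.add((p, q))
    return candidates
```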
- 10. But 200 was just a guess, no?
• Actually, the parameters of the algorithm need to be tuned
– Tune b (number of bands) and r (number of hash functions per band) to catch most similar pairs, but few non-similar pairs.
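The tuning can be made precise with the standard analysis from the referenced MMDS chapter: two people whose Likes have Jaccard similarity s agree on one minhash with probability s, on a whole band of r minhashes with probability s^r, and become a candidate pair if at least one of the b bands agrees. A quick sketch:

```python
def candidate_probability(s, b=20, r=10):
    """Probability that a pair with Jaccard similarity s lands in the
    same bucket for at least one of b bands of r rows each."""
    return 1 - (1 - s**r)**b

for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s={s}: P(candidate) ≈ {candidate_probability(s):.3f}")
# The S-shaped curve: dissimilar pairs are almost never candidates,
# very similar pairs almost always are.
```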
- 11. LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
– False positives → need to examine more pairs that are not really similar. More processing resources, more time.
– False negatives → failed to examine pairs that were similar, didn’t find all similar results. But got done faster!
- 12. LSH Tradeoff Example
• If we had fewer than 20 bands (and more rows / band)
– fewer pairs would be selected for comparison,
– the number of false positives would go down,
– but the number of false negatives would go up,
– performance would go up but so would the error rate!
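Plugging the slide’s scenario into the S-curve sketch from two slides back makes the tradeoff concrete (the similarity value 0.7 is illustrative):

```python
def candidate_probability(s, b, r):
    return 1 - (1 - s**r)**b

# 200 minhashes split two ways: 20 bands of 10 rows vs. 10 bands of 20 rows.
for b, r in ((20, 10), (10, 20)):
    p = candidate_probability(0.7, b, r)
    print(f"b={b:2d}, r={r:2d}: P(catch a 70%-similar pair) ≈ {p:.3f}")
# b=20, r=10 catches ≈ 44% of such pairs; b=10, r=20 catches under 1%.
```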
- 13. Running LSH on a cluster of machines
• Can be implemented on a MapReduce architecture
[Diagram: Map step → Buckets → Reduce step]
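A framework-free sketch of how the two steps could be split (the function names and (key, value) shapes are assumptions, not from the slides):

```python
from itertools import combinations

def map_step(person, signature, b=20, r=10):
    """Emit ((band, bucket_key), person) for each band of the signature."""
    for band in range(b):
        yield (band, tuple(signature[band * r:(band + 1) * r])), person

def reduce_step(bucket_key, people):
    """All people who shared a bucket are compared pairwise."""
    return list(combinations(sorted(people), 2))
```

The shuffle between the two steps groups people by bucket, so each reducer only compares the members of a single bucket.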
- 14. Summary
• Mine the data and place members into hash buckets
• When you need to find a match, hash it and possible nearest neighbors will be in one of the b buckets
• Algorithm performance is O(n), versus O(n²) for comparing all pairs
- 15. Thank you
• J Singh
– Principal, DataThinks • j.singh@datathinks.org
– Adj. Prof., WPI
• References:
– Mining of Massive Datasets, Chapter 3, by Anand Rajaraman and Jeff Ullman. http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
– Matt’s Blog, “Minhash for Dummies”. http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html