- 1. Mining of Massive Datasets
using
Locality Sensitive Hashing (LSH)
J Singh
January 9, 2014
- 2. The problems
• Large scale image search:
– We have a candidate image
– Search the internet to find similar images
• Large scale source repo search:
– We have a candidate source repo
– Search GitHub to find similar source repos
• Large scale document search:
– We have a candidate document
– Search for similar documents to find possible plagiarism
• Large scale X search:
– We have a candidate X
– Search for similar X’s
- 3. A Motivating Example
• People Like You
– Characterize your Facebook friends
– Find Facebook friends and friends-of-friends who like the same things you do
• Disclosure
– This is a pedagogical example, loosely patterned after ShoutFlow
– I have no knowledge of how ShoutFlow actually worked
– I have no connection with the people involved
- 4. A Likeness Score is…
• A number from 0 to 100%
– Likeness between Harry and Sally is 100% if they like exactly the same things
– Technically, the Jaccard similarity:
Likeness(Harry, Sally) = | Likes_Harry ∩ Likes_Sally | / | Likes_Harry ∪ Likes_Sally |
• But mind the n² problem: 1 billion users ≈ 5 × 10¹⁷ pairs!
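A minimal sketch of the score (the names and Like sets below are illustrative, not from the slides):

```python
# Likeness score as Jaccard similarity of two Like sets.
def likeness(likes_a, likes_b):
    """Jaccard similarity: shared Likes divided by all distinct Likes."""
    return len(likes_a & likes_b) / len(likes_a | likes_b)

harry = {"jazz", "pizza", "chess"}
sally = {"jazz", "pizza", "hiking"}
print(f"{likeness(harry, sally):.0%}")  # 50%: 2 shared Likes out of 4 total
```

Computing this for every pair of a billion users is exactly the n² problem the following slides avoid.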
- 5. Basic Algorithm
1. Walk the graph
– Build a data set of all users and their friends
– If access denied, skip
2. Cluster all billion users into “hash buckets” with similar likes
3. When a new user logs in, hash their likes and compare their similarity with other users in that bucket
• The magic is in the hashing!
- 6. The LSH Idea
• Treat n-valued items as vectors in n-dimensional space.
• Draw k random hyperplanes in that space.
• For each hyperplane:
– Is each vector above it (1) or below it (0)?
• Hash(Item1) = 011
• Hash(Item2) = 001
• The magic is in choosing h1, h2, etc.
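A minimal sketch of the hyperplane trick, assuming items are already embedded as numeric vectors (the dimensions, seed, and values are illustrative):

```python
import numpy as np

def hyperplane_hash(vectors, k, seed=0):
    """Hash each row vector to k bits: bit i is 1 if the vector lies on
    the positive side of random hyperplane i (through the origin)."""
    rng = np.random.default_rng(seed)
    # Each hyperplane is represented by its normal vector.
    normals = rng.standard_normal((k, vectors.shape[1]))
    # The sign of the dot product says which side of the plane we are on.
    return (vectors @ normals.T) > 0

items = np.array([[0.9, 0.1, 0.4],
                  [0.8, 0.2, 0.3]])
print(hyperplane_hash(items, k=3).astype(int))  # e.g. [[0 1 1], [0 0 1]]
```

Nearby vectors fall on the same side of most random hyperplanes, so similar items tend to share hash bits.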
- 7. The LSH Hash Code was a Lie…
• …But the idea of boiling down a complex object into something that is quickly and easily compared with other complex objects is what matters.
[Figure: purple blocks sorted into buckets]
• Each purple block represents a person
– Each bucket represents a group of people who are alike
• Members within each bucket still need to be compared to see which ones are the “closest”
- 8. Choosing hash functions
• Introducing minhash
1. Gather the LikeIDs for a person
2. Calculate the hash value for every LikeID
3. Store the minimum hash value found in step 2
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values
• The resulting minhashes are 200 integer values representing a random selection of Likes
– Property of minhashes: if the minhashes for two people are the same, their Likes are likely to be the same
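A minimal sketch of steps 1–4, simulating the 200 hash algorithms by salting one base hash (the hash family and the Like sets are illustrative assumptions):

```python
import hashlib

NUM_HASHES = 200

def minhash_signature(like_ids, num_hashes=NUM_HASHES):
    """Steps 2-4: for each of num_hashes hash functions, hash every
    LikeID and keep only the minimum value."""
    def h(i, like_id):
        salted = f"{i}:{like_id}".encode()
        return int.from_bytes(hashlib.blake2b(salted, digest_size=8).digest(), "big")
    return [min(h(i, like_id) for like_id in like_ids) for i in range(num_hashes)]

harry = minhash_signature({"jazz", "pizza", "hiking", "chess"})
sally = minhash_signature({"jazz", "pizza", "hiking", "sushi"})
matches = sum(a == b for a, b in zip(harry, sally))
print(f"estimated Jaccard similarity ≈ {matches / NUM_HASHES:.2f}")  # ≈ 0.60
```

The fraction of matching minhashes estimates the Jaccard similarity of the two Like sets (here 3 shared Likes out of 5 distinct, so ≈ 0.60).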
- 9. All 200 minhashes must match?
• There is a lot of sampling going on in the algorithm.
• Make sure we catch most cases
– Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
– Sometimes one band will reject a pair and another band will consider it a candidate.
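A minimal sketch of banding over the signatures from the previous slide (b and r are illustrative; b × r must equal the signature length):

```python
from collections import defaultdict

def candidate_pairs(signatures, b=20, r=10):
    """signatures: dict of person -> list of b*r minhash values.
    A pair is a candidate if all r rows of at least one band agree."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for person, sig in signatures.items():
            # The band's r minhash values form the bucket key for this band.
            buckets[tuple(sig[band * r:(band + 1) * r])].append(person)
        for members in buckets.values():
            for i, p in enumerate(members):
                for q in members[i + 1:]:
                    candidates.add((p, q))
    return candidates
```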
- 10. But 200 was just a guess, no?
• Actually, the parameters of the algorithm need to be tuned
– Tune b (number of bands) and r (number of hash functions per band) to catch most similar pairs, but few non-similar pairs.
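The tuning can be made precise with the standard analysis from the referenced MMDS chapter: two people whose Likes have Jaccard similarity s agree on one minhash with probability s, on a whole band of r minhashes with probability s^r, and become a candidate pair if at least one of the b bands agrees. A quick sketch:

```python
def candidate_probability(s, b=20, r=10):
    """Probability that a pair with Jaccard similarity s lands in the
    same bucket for at least one of b bands of r rows each."""
    return 1 - (1 - s**r)**b

for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s={s}: P(candidate) ≈ {candidate_probability(s):.3f}")
# The S-shaped curve: dissimilar pairs are almost never candidates,
# very similar pairs almost always are.
```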
- 11. LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
– False positives → need to examine more pairs that are not really similar. More processing resources, more time.
– False negatives → failed to examine pairs that were similar, didn’t find all similar results. But got done faster!
- 12. LSH Tradeoff Example
• If we had fewer than 20 bands (and more rows / band)
– fewer pairs would be selected for comparison,
– the number of false positives would go down,
– but the number of false negatives would go up,
– performance would go up but so would the error rate!
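Plugging the slide’s scenario into the S-curve sketch from two slides back makes the tradeoff concrete (the similarity value 0.7 is illustrative):

```python
def candidate_probability(s, b, r):
    return 1 - (1 - s**r)**b

# 200 minhashes split two ways: 20 bands of 10 rows vs. 10 bands of 20 rows.
for b, r in ((20, 10), (10, 20)):
    p = candidate_probability(0.7, b, r)
    print(f"b={b:2d}, r={r:2d}: P(catch a 70%-similar pair) ≈ {p:.3f}")
# b=20, r=10 catches ≈ 44% of such pairs; b=10, r=20 catches under 1%.
```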
- 13. Running LSH on a cluster of machines
• Can be implemented on a MapReduce architecture
[Diagram: Map step → Buckets → Reduce step]
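A framework-free sketch of how the two steps could be split (the function names and (key, value) shapes are assumptions, not from the slides):

```python
from itertools import combinations

def map_step(person, signature, b=20, r=10):
    """Emit ((band, bucket_key), person) for each band of the signature."""
    for band in range(b):
        yield (band, tuple(signature[band * r:(band + 1) * r])), person

def reduce_step(bucket_key, people):
    """All people who shared a bucket are compared pairwise."""
    return list(combinations(sorted(people), 2))
```

The shuffle between the two steps groups people by bucket, so each reducer only compares the members of a single bucket.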
- 14. Summary
• Mine the data and place members into hash buckets
• When you need to find a match, hash it and possible nearest neighbors will be in one of the b buckets
• Algorithm performance is O(n), versus O(n²) for comparing all pairs
- 15. Thank you
• J Singh
– Principal, DataThinks • j.singh@datathinks.org
– Adj. Prof., WPI
• References:
– Mining of Massive Datasets, Chapter 3, by Anand Rajaraman and Jeff Ullman. http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
– Matt’s Blog, “Minhash for Dummies”. http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html