SlideShare una empresa de Scribd logo
1 de 20
Descargar para leer sin conexión
Finding Similar Items in high-
dimensional spaces: Locality
Sensitive Hashing
Dmitriy Selivanov
04 Sep 2015
Data Science
Statistics
Domain knowledge
Computer science
·
·
·
/
Today's problem: near duplicates detection
Find all pairs of data points within distance threshold
Given: High dimensional data points
And some distance function
· ( , , . . . )x1 x2
Image
User-Item (rating, whatever) matrix
-
-
· d( , )x1 x2
Euclidean distance
Cosine distance
Jaccard distance
-
-
-
J(A, B) =
|A ∩ B|
|A ∪ B|
,xi xj d( , ) < sxi xj
/
Near-neighbor search
Applications
High Dimensions
High-dimensional spaces are lonely places
Duplicate detection (plagiarism, entity resolution)
Recommender systems
Clustering
·
·
·
Bag-of-words model
Large-Scale Recommender Systems (many users vs many items)
·
·
Curse of dimensionality·
almost all pairs of points are equally far away from one another-
/
Finding similar text documents
Examples
Challenges
Mirror websites
Similar news articles (google news, yandex news?)
·
·
Pieces of one document can appear out of order (headers, footers, etc), diffrent lengths.
Large collection of documents can not fit in RAM
Too many documents to compare all pairs - complexity
·
·
· O( )n
2
/
Pipeline
Pieces of one document can appear out of order (headers, footers, etc) =>
Shingling
Documents as sets
Large collection of documents can not fit in RAM =>
Minhashing
Convert large sets to short signatures, while preserving similarity
Too many documents to compare all pairs - complexity =>
Locality Sensitive Hashing
Focus on pairs of signatures likely to be from similar documents.
·
·
·
·
· O( )n
2
·
/
Document representation
Example phrase:
"To be, or not to be, that is the question"
Documents as sets
Bag-of-words => set of words:
k-shingles or k-gram => unordered set of k-grams:
·
{"to", "be", "or", "not", "that", "is", "the", "question"}
don't work well - need ordering
-
-
·
word level (for k = 2)
caharacter level (for k = 3):
-
{"to be", "be or", "or not", "not to", "be that", "that is", "is the", "the
question"}
-
-
{"to ", "o b", " be", "be ", "e o", " or", "or ", "r n", " no", "not", "ot
", "t t", " to", "e t", " th", "tha", "hat", "at ", "t i", " is", "is ", "s
t", "the", "he ", "e q", " qu", "que", "ues", "est", "sti", "tio", "ion"}
-
/
Practical notes
Optionally can compress (hash!) long shingles into 4 byte integers!
Picking
should be picked large enough that the probability of any given shingle appearing in any given
document is low
k
is OK for short documents
is OK for long documents
· k = 5
· k = 10
k
/
Binary term-document-matrix
shingle s1 s2 intersecton union
……. . . . .
осенняя 0 0 - -
светило 1 0 - +
летнее 1 1 + +
солце 1 1 + +
яркое 0 1 - +
погода 0 0 - -
……. . . . .
D1 = "светило летнее солце" => s1 = {"светило", "летнее", "солце"}
D2 = "яркое летнее солнце" => s2 = {"яркое", "летнее", "солце"}
·
·
/
Type of rows
type s1 s2
a 1 1
b 1 0
c 0 1
d 0 0
A, B, C, D - # rows types a, b, c, d
J( , ) =s1 s2
A
(A + B + C)
/
Minhashing
Convert large sets to short signatures
Minhashing example
1. Random permutation of rows of the input-matrix .
2. Minhash function = # of first row in which column .
3. Use independent permutations we will end with minhash functions. => can construct signature-
matrix from input-matrix using these minhash functions.
M
h(c) c == 1
N N
/
Property
(h( ) = h( )) =???Pperm s1 s2
=
Why? Both
remember this result
·
J( , )s1 s2
· A
(A+B+C)
·
/
Implementation of Minhashing
One-pass implementation
Universal hashing
random permutation
random lookup
·
·
1. Instead of permutations pick independent hash-functions ( )
2. For column and hash-function keep slot . Init with +Inf.
3. Scan rows looking for 1
N N N = O(1/ )ϵ2
c hi Sig(i, c)
Suppose row has 1 in column
Then for each : If =>
· r c
· ki (r) < Sig(i, c)hi Sig(i, c) := (r))hi
(x) = ((ax + b) mod p)hi
- integers
- large prime:
· a, b
· p p > N
/
Algorithm
for each row r do begin
for each hash function hi do
compute hi (r);
for each column c
if c has 1 in row r
for each hash function hi do
if hi(r) is smaller than M(i, c) then
M(i, c) := hi(r);
end;
/
Candidates
Still comlexity
Bruteforce
O( )n
2
library('microbenchmark')
library('magrittr')
jaccard <- function(x, y) {
set_intersection <- intersect(x, y) %>% length
set_union <- length(x) + length(y) - set_intersection
return(set_intersection / set_union)
}
elements <- sapply(seq_len(1e5), function(x) paste(sample(letters, 4), collapse = '')) %>% unique
set_1 <- sample(elements, 100, replace = F)
set_2 <- sample(elements, 100, replace = F)
microbenchmark(jaccard(set_1, set_2))
## Unit: microseconds
## expr min lq mean median uq max neval
## jaccard(set_1, set_2) 56.812 58.693 65.64373 61.8075 69.184 161.986 100
/
Job for LSH:
1. Divide M into bands, rows each
2. Hash each column in band into table with large number of buckets
3. Column become candidate if fall into same bucket for any band
b r
bi
/
Bands number tuning
1. The probability that the signatures agree in all rows of one particular band is
2. The probability that the signatures disagree in at least one row of a particular band is
3. The probability that the signatures disagree in at least one row of each of the bands is
4. The probability that the signatures agree in all the rows of at least one band, and therefore become a
candidate pair, is
s
r
1 − s
r
(1 − s
r
)
b
1 − (1 − s
r
)
b
/
Example
Goal : tune and to catch most similar pairs, but few non-similar pairs
100k documents =>
100 hash-functions => signatures of 100 integers = 40mb
find all documents, similar at least
·
·
· b = 20, r = 5
· s = 0.8
1. Probability C1, C2 identical in one particular band:
2. Probability C1, C2 are not similar in all of the 20 bands:
(0.8 = 0.328)
5
(1 − 0.328 = 0.00035)
20
b r
/
LSH function families
Minhash - jaccard similiarity
Random projections (random hypeplanes) - cosine similiarity
p-stable-distributions - Euclidean distance (and norm)
Edit distance
Hamming distance
·
·
· Lp
·
·
/
References
https://www.coursera.org/course/mmds
R package LSHR - https://github.com/dselivanov/LSHR
Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets. http://mmds.org/
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of
dimensionality.
Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji: Hashing for Similarity Search: A Survey
Alexandr Andoni and Piotr Indyk : Near-Optimal Hashing Algorithms for Approximate Nearest
Neighbor in High Dimensions
Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni: Locality-Sensitive Hashing Scheme
Based on p-Stable Distributions
·
·
·
·
·
/

Más contenido relacionado

La actualidad más candente

CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306Yasuo Tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceYasuo Tabei
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009Yasuo Tabei
 
Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data StructuresSHAKOOR AB
 
Data Structure and Algorithms Hashing
Data Structure and Algorithms HashingData Structure and Algorithms Hashing
Data Structure and Algorithms HashingManishPrajapati78
 
Analysis Of Algorithms - Hashing
Analysis Of Algorithms - HashingAnalysis Of Algorithms - Hashing
Analysis Of Algorithms - HashingSam Light
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...Menlo Systems GmbH
 
Hashing Algorithm
Hashing AlgorithmHashing Algorithm
Hashing AlgorithmHayi Nukman
 
Introduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applicationsIntroduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applicationsYu Liu
 

La actualidad más candente (20)

CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
 
Hashing
HashingHashing
Hashing
 
Rehashing
RehashingRehashing
Rehashing
 
Hashing PPT
Hashing PPTHashing PPT
Hashing PPT
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
 
Hashing 1
Hashing 1Hashing 1
Hashing 1
 
Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data Structures
 
Data Structure and Algorithms Hashing
Data Structure and Algorithms HashingData Structure and Algorithms Hashing
Data Structure and Algorithms Hashing
 
Analysis Of Algorithms - Hashing
Analysis Of Algorithms - HashingAnalysis Of Algorithms - Hashing
Analysis Of Algorithms - Hashing
 
Hashing
HashingHashing
Hashing
 
Hashing
HashingHashing
Hashing
 
Hash table
Hash tableHash table
Hash table
 
Hashing
HashingHashing
Hashing
 
Hashing
HashingHashing
Hashing
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
 
Hashing Algorithm
Hashing AlgorithmHashing Algorithm
Hashing Algorithm
 
Hashing
HashingHashing
Hashing
 
Introduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applicationsIntroduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applications
 
Hash tables
Hash tablesHash tables
Hash tables
 

Similar a LSH for Finding Similar Items in High-Dimensional Spaces

Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of codesource{d}
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).pptssuserf11a32
 
Similarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via HashingSimilarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via HashingMaruf Aytekin
 
Strinng Classes in c++
Strinng Classes in c++Strinng Classes in c++
Strinng Classes in c++Vikash Dhal
 
Hashing in datastructure
Hashing in datastructureHashing in datastructure
Hashing in datastructurerajshreemuthiah
 
presentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxpresentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxjainaaru59
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functionsRebekah Mercer
 
1.Buffer Overflows
1.Buffer Overflows1.Buffer Overflows
1.Buffer Overflowsphanleson
 
Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10FredrikRonquist
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevJavaDayUA
 

Similar a LSH for Finding Similar Items in High-Dimensional Spaces (20)

Nn
NnNn
Nn
 
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
 
similarity1 (6).ppt
similarity1 (6).pptsimilarity1 (6).ppt
similarity1 (6).ppt
 
Similarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via HashingSimilarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via Hashing
 
Python Cheat Sheet
Python Cheat SheetPython Cheat Sheet
Python Cheat Sheet
 
I1
I1I1
I1
 
Strinng Classes in c++
Strinng Classes in c++Strinng Classes in c++
Strinng Classes in c++
 
lecture5.ppt
lecture5.pptlecture5.ppt
lecture5.ppt
 
The string class
The string classThe string class
The string class
 
Hashing in datastructure
Hashing in datastructureHashing in datastructure
Hashing in datastructure
 
Lec5
Lec5Lec5
Lec5
 
presentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxpresentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptx
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functions
 
1.Buffer Overflows
1.Buffer Overflows1.Buffer Overflows
1.Buffer Overflows
 
Algorithms Exam Help
Algorithms Exam HelpAlgorithms Exam Help
Algorithms Exam Help
 
R language introduction
R language introductionR language introduction
R language introduction
 
Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10
 
Monads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy DyagilevMonads and Monoids by Oleksiy Dyagilev
Monads and Monoids by Oleksiy Dyagilev
 
R Basics
R BasicsR Basics
R Basics
 

Último

The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxnoordubaliya2003
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 

Último (20)

The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptx
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 

LSH for Finding Similar Items in High-Dimensional Spaces

  • 1. Finding Similar Items in high- dimensional spaces: Locality Sensitive Hashing Dmitriy Selivanov 04 Sep 2015
  • 3. Today's problem: near duplicates detection Find all pairs of data points within distance threshold Given: High dimensional data points And some distance function · ( , , . . . )x1 x2 Image User-Item (rating, whatever) matrix - - · d( , )x1 x2 Euclidean distance Cosine distance Jaccard distance - - - J(A, B) = |A ∩ B| |A ∪ B| ,xi xj d( , ) < sxi xj /
  • 4. Near-neighbor search Applications High Dimensions High-dimensional spaces are lonely places Duplicate detection (plagiarism, entity resolution) Recommender systems Clustering · · · Bag-of-words model Large-Scale Recommender Systems (many users vs many items) · · Curse of dimensionality· almost all pairs of points are equally far away from one another- /
  • 5. Finding similar text documents Examples Challenges Mirror websites Similar news articles (google news, yandex news?) · · Pieces of one document can appear out of order (headers, footers, etc), diffrent lengths. Large collection of documents can not fit in RAM Too many documents to compare all pairs - complexity · · · O( )n 2 /
  • 6. Pipeline Pieces of one document can appear out of order (headers, footers, etc) => Shingling Documents as sets Large collection of documents can not fit in RAM => Minhashing Convert large sets to short signatures, while preserving similarity Too many documents to compare all pairs - complexity => Locality Sensitive Hashing Focus on pairs of signatures likely to be from similar documents. · · · · · O( )n 2 · /
  • 7. Document representation Example phrase: "To be, or not to be, that is the question" Documents as sets Bag-of-words => set of words: k-shingles or k-gram => unordered set of k-grams: · {"to", "be", "or", "not", "that", "is", "the", "question"} don't work well - need ordering - - · word level (for k = 2) caharacter level (for k = 3): - {"to be", "be or", "or not", "not to", "be that", "that is", "is the", "the question"} - - {"to ", "o b", " be", "be ", "e o", " or", "or ", "r n", " no", "not", "ot ", "t t", " to", "e t", " th", "tha", "hat", "at ", "t i", " is", "is ", "s t", "the", "he ", "e q", " qu", "que", "ues", "est", "sti", "tio", "ion"} - /
  • 8. Practical notes Optionally can compress (hash!) long shingles into 4 byte integers! Picking should be picked large enough that the probability of any given shingle appearing in any given document is low k is OK for short documents is OK for long documents · k = 5 · k = 10 k /
  • 9. Binary term-document-matrix shingle s1 s2 intersecton union ……. . . . . осенняя 0 0 - - светило 1 0 - + летнее 1 1 + + солце 1 1 + + яркое 0 1 - + погода 0 0 - - ……. . . . . D1 = "светило летнее солце" => s1 = {"светило", "летнее", "солце"} D2 = "яркое летнее солнце" => s2 = {"яркое", "летнее", "солце"} · · /
  • 10. Type of rows type s1 s2 a 1 1 b 1 0 c 0 1 d 0 0 A, B, C, D - # rows types a, b, c, d J( , ) =s1 s2 A (A + B + C) /
  • 11. Minhashing Convert large sets to short signatures Minhashing example 1. Random permutation of rows of the input-matrix . 2. Minhash function = # of first row in which column . 3. Use independent permutations we will end with minhash functions. => can construct signature- matrix from input-matrix using these minhash functions. M h(c) c == 1 N N /
  • 12. Property (h( ) = h( )) =???Pperm s1 s2 = Why? Both remember this result · J( , )s1 s2 · A (A+B+C) · /
  • 13. Implementation of Minhashing One-pass implementation Universal hashing random permutation random lookup · · 1. Instead of permutations pick independent hash-functions ( ) 2. For column and hash-function keep slot . Init with +Inf. 3. Scan rows looking for 1 N N N = O(1/ )ϵ2 c hi Sig(i, c) Suppose row has 1 in column Then for each : If => · r c · ki (r) < Sig(i, c)hi Sig(i, c) := (r))hi (x) = ((ax + b) mod p)hi - integers - large prime: · a, b · p p > N /
  • 14. Algorithm for each row r do begin for each hash function hi do compute hi (r); for each column c if c has 1 in row r for each hash function hi do if hi(r) is smaller than M(i, c) then M(i, c) := hi(r); end; /
  • 15. Candidates Still comlexity Bruteforce O( )n 2 library('microbenchmark') library('magrittr') jaccard <- function(x, y) { set_intersection <- intersect(x, y) %>% length set_union <- length(x) + length(y) - set_intersection return(set_intersection / set_union) } elements <- sapply(seq_len(1e5), function(x) paste(sample(letters, 4), collapse = '')) %>% unique set_1 <- sample(elements, 100, replace = F) set_2 <- sample(elements, 100, replace = F) microbenchmark(jaccard(set_1, set_2)) ## Unit: microseconds ## expr min lq mean median uq max neval ## jaccard(set_1, set_2) 56.812 58.693 65.64373 61.8075 69.184 161.986 100 /
  • 16. Job for LSH: 1. Divide M into bands, rows each 2. Hash each column in band into table with large number of buckets 3. Column become candidate if fall into same bucket for any band b r bi /
  • 17. Bands number tuning 1. The probability that the signatures agree in all rows of one particular band is 2. The probability that the signatures disagree in at least one row of a particular band is 3. The probability that the signatures disagree in at least one row of each of the bands is 4. The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is s r 1 − s r (1 − s r ) b 1 − (1 − s r ) b /
  • 18. Example Goal : tune and to catch most similar pairs, but few non-similar pairs 100k documents => 100 hash-functions => signatures of 100 integers = 40mb find all documents, similar at least · · · b = 20, r = 5 · s = 0.8 1. Probability C1, C2 identical in one particular band: 2. Probability C1, C2 are not similar in all of the 20 bands: (0.8 = 0.328) 5 (1 − 0.328 = 0.00035) 20 b r /
  • 19. LSH function families Minhash - jaccard similiarity Random projections (random hypeplanes) - cosine similiarity p-stable-distributions - Euclidean distance (and norm) Edit distance Hamming distance · · · Lp · · /
  • 20. References https://www.coursera.org/course/mmds R package LSHR - https://github.com/dselivanov/LSHR Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets. http://mmds.org/ P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji: Hashing for Similarity Search: A Survey Alexandr Andoni and Piotr Indyk : Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni: Locality-Sensitive Hashing Scheme Based on p-Stable Distributions · · · · · /