SlideShare a Scribd company logo
1 of 31
Download to read offline
Near-Optimal Hashing
        Algorithms for Approximate
        Near(est) Neighbor Problem
                        Piotr Indyk
                           MIT

Joint work with: Alex Andoni, Mayur Datar, Nicole Immorlica,
                 Vahab Mirrokni
Definition
• Given: a set P of points in Rd
• Nearest Neighbor: for any
  query q, returns a point p∈P      r

  minimizing ||p-q||
                                        q
• r-Near Neighbor: for any
  query q, returns a point p∈P
  s.t. ||p-q|| ≤ r (if it exists)
Nearest Neighbor: Motivation
• Learning: nearest
  neighbor rule           ?
• Database retrieval
• Vector quantization,
  a.k.a. compression
Brief History of NN
The case of d=2
• Compute Voronoi diagram
• Given q, perform point
  location
• Performance:
  – Space: O(n)
  – Query time: O(log n)
The case of d>2
• Voronoi diagram has size nO(d)
• We can also perform a linear scan: O(dn) time
• That is pretty much all what known for exact
  algorithms with theoretical guarantees
• In practice:
  – kd-trees work “well” in “low-medium” dimensions
  – Near-linear query time for high dimensions
Approximate Near Neighbor
• c-Approximate r-Near Neighbor: build data
  structure which, for any query q:
   – If there is a point p∈P, ||p-q|| ≤ r
   – it returns p’∈P, ||p-q|| ≤ cr
                                                     r
• Reductions:
                                                     q
   – c-Approx Nearest Neighbor reduces to c-Approx       cr
      Near Neighbor
     (log overhead)
   – One can enumerate all approx near neighbors
    → can solve exact near neighbor problem
   – Other apps: c-approximate Minimum Spanning
      Tree, clustering, etc.
Approximate algorithms
• Space/time exponential in d [Arya-Mount-et al],
  [Kleinberg’97], [Har-Peled’02], [Arya-Mount-…]
• Space/time polynomial in d [Kushilevitz-Ostrovsky-
  Rabani’98], [Indyk-Motwani’98], [Indyk’98], [Gionis-Indyk-Motwani’99],
  [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-
  Regev’04], [Panigrahy’06], [Ailon-Chazelle’06]…
 Space              Time                Comment            Norm       Ref
               2
 dn+n4/ε            d * logn /ε2 or 1   c=1+ ε             Hamm, l2   [KOR’98, IM’98]
          2)
 nΩ(1/ε                         O(1)                                  [AIP’0?]
 dn+n1+ρ(c)         dnρ(c)              ρ(c)=1/c           Hamm, l2   [IM’98], [Cha’02]
                                        ρ(c)<1/c           l2         [DIIM’04]
 dn * logs          dnσ(c)              σ(c)=O(log c/c)    Hamm, l2   [Ind’01]
                                        σ(c)=O(1/c)        l2         [Pan’06]

 dn+n1+ρ(c)         dnρ(c)              ρ(c)=1/c2 + o(1)   l2         [AI’06]
 dn * logs          dnσ(c)              σ(c)=O(1/c2)       l2         [AI’06]
Locality-Sensitive Hashing
                                         q
• Idea: construct hash
  functions g: Rd → U such that          p

  for any points p,q:
  – If ||p-q|| ≤ r, then Pr[g(p)=g(q)]
    is “high” “not-so-small”
  – If ||p-q|| >cr, then Pr[g(p)=g(q)]
    is “small”
• Then we can solve the
  problem by hashing
LSH [Indyk-Motwani’98]
• A family H of functions h: Rd → U is called
  (P1,P2,r,cr)-sensitive, if for any p,q:
  – if ||p-q|| <r then Pr[ h(p)=h(q) ] > P1
  – if ||p-q|| >cr then Pr[ h(p)=h(q) ] < P2

• Example: Hamming distance
  – LSH functions: h(p)=pi, i.e., the i-th bit of p
  – Probabilities: Pr[ h(p)=h(q) ] = 1-D(p,q)/d
                            p=10010010
                            q=11010110
LSH Algorithm
• We use functions of the form
           g(p)=<h1(p),h2(p),…,hk(p)>
• Preprocessing:
  – Select g1…gL
  – For all p∈P, hash p to buckets g1(p)…gL(p)

• Query:
  – Retrieve the points from buckets g1(q), g2(q), … , until
     • Either the points from all L buckets have been retrieved, or
     • Total number of points retrieved exceeds 2L
  – Answer the query based on the retrieved points
  – Total time: O(dL)
Analysis
• LSH solves c-approximate NN with:
  – Number of hash fun: L=nρ, ρ=log(1/P1)/log(1/P2)
  – E.g., for the Hamming distance we have ρ=1/c
  – Constant success probability per query q
• Questions:
  – Can we extend this beyond Hamming distance ?
     • Yes:
        – embed l2 into l1 (random projections)
        – l1 into Hamming (discretization)
  – Can we reduce the exponent ρ ?
Projection-based LSH
                [Datar-Immorlica-Indyk-Mirrokni’04]


• Define hX,b(p)=⎣(p*X+b)/w⎦:
    – w≈r
                                                           p                 X
    – X=(X1…Xd) , where Xi is chosen
                                                                         w
      from:
        • Gaussian distribution (for l2 norm)
        • “s-stable” distribution* (for ls norm)
                                                                     w
    – b is a scalar
• Similar to the l2 → l1 →Hamming
  route



 * I.e., p*X has same distribution as ||p|| Z, where Z is s-stable
                                           s
Analysis
• Need to:
  – Compute Pr[h(p)=h(q)] as a function of ||p-q||
    and w; this defines P1 and P2
  – For each c choose w that minimizes               w

                    ρ=log1/P2(1/P1)
                                               w
• Method:
  – For l2: computational
  – For general ls: analytic
ρ(w) for various c’s: l1
       1
                                      c=1.1
                                      c=1.5
                                      c=2.5
      0.9                               c=5
                                       c=10

      0.8


      0.7                                              w

      0.6
pxe




      0.5                                          w

      0.4


      0.3


      0.2


      0.1
            0       5     10     15           20
                          wr
ρ(w) for various c’s: l2
       1
                                     c=1.1
                                     c=1.5
      0.9                            c=2.5
                                       c=5
                                      c=10
      0.8


      0.7


      0.6                                             w
pxe




      0.5


      0.4
                                                  w
      0.3


      0.2


      0.1


       0
            0      5      10    15           20
                          w
                          r
ρ(c) for l2
 1
                                                            rho
                                                            1/c
0.9


0.8


0.7


0.6


0.5


0.4


0.3


0.2


0.1


 0
      1   2   3    4       5          6         7   8   9         10
                       Approximation factor c
New LSH scheme
                      [Andoni-Indyk’06]

• Instead of projecting onto R1,              p           X
   project onto Rt , for constant t                   w
• Intervals → lattice of balls
   – Can hit empty space, so hash until           w
     a ball is hit
• Analysis:
   – ρ=1/c2 + O( log t / t1/2 )           p
   – Time to hash is tO(t)
   – Total query time: dn1/c2+o(1)
• [Motwani-Naor-Panigrahy’06]:
  LSH in l2 must have ρ ≥ 0.45/c2
Connections to

•   [Charikar-Chekuri-Goel-Guha-Plotkin’98]
    – Consider partitioning of Rd using balls of radius R
    – Show that Pr[ Ball(p) ≠ Ball(q) ] ≤ ||p-q||/R * d1/2
        • Linear dependence on the distance – same as Hamming
        • Need to analyze R≈||p-q|| to achieve non-linear behavior!
          (as for the projection on the line)
•   [Karger-Motwani-Sudan’94]
                                                                                p
    – Consider partitioning of the sphere via random vectors u                      q
      from Nd(0,1) :
                       p is in Cap(u) if u*p ≥ T
    – Showed Pr[ Cap(p) = Cap(q) ] ≤ exp[ - (2T/||p+q||)2/2 ]               o
        • Large relative changes to ||p-q|| can yield only small relative
          changes to ||p+q||
p

                       Proof idea
• Claim: ρ=log(P1)/log(P2)→1/c2
   – P1=Pr(1), P2=Pr(c)
   – Pr(z)=prob. of collision when distance z
• Proof idea:
   – Assumption: ignore effects of mapping into Rt            x
                                                          q       p
   – Pr(z) is proportional to the volume of the cap
   – Fraction of mass in a cap is proportional to
     the probability that the x-coordinate of a
     random point u from a ball exceeds x
   – Approximation: the x-coordinate of u has
     approximately normal distribution
     → Pr(x) ≈ exp(-A x2)
   – ρ=log[ exp(-A12) ] / log [ exp(-Ac2) ] = 1/c2
New LSH scheme, ctd.
• How does it work in practice ?
• The time tO(t)dn1/c2+f(t) is not very
  practical
   – Need t≈30 to see some improvement
• Idea: a different decomposition of Rt
   – Replace random balls by Voronoi
     diagram of a lattice
   – For specific lattices, finding a cell
     containing a point can be very fast
     →fast hashing
Leech Lattice LSH
• Use Leech lattice in R24 , t=24
  – Largest kissing number in 24D: 196560
  – Conjectured largest packing density in 24D
  – 24 is 42 in reverse…
• Very fast (bounded) decoder: about 519
  operations [Amrani-Beery’94]
• Performance of that decoder for c=2:
  –   1/c2                                  0.25
  –   1/c                                   0.50
  –   Leech LSH, any dimension:         ρ ≈ 0.36
  –   Leech LSH, 24D (no projection):   ρ ≈ 0.26
Conclusions
• We have seen:
                                     2+o(1)
   – Algorithm for c-NN with dn1/c            query time
    (and reasonable space)
       • Exponent tight up to a constant
   – (More) practical algorithms based on Leech lattice
• We haven’t seen:
                                       2
   – Algorithm for c-NN with dnO(1/c ) query time and dn log n space
• Immediate questions:
   – Get rid of the o(1)
   – …or came up with a really neat lattice…
   – Tight lower bound
• Non-immediate questions:
   – Other ways of solving proximity problems
Advertisement
• See LSH web page (linked from my web
  page for):
  – Experimental results (for the ’04 version)
  – Pointers to code
Experiments
Experiments (with ’04 version)
• E2LSH: Exact Euclidean LSH (with Alex Andoni)
   – Near Neighbor
   – User sets r and P = probability of NOT reporting a point within
     distance r (=10%)
   – Program finds parameters k,L,w so that:
        • Probability of failure is at most P
        • Expected query time is minimized
• Nearest neighbor: set radius (radiae) to accommodate
  90% queries (results for 98% are similar)
   –   1 radius: 90%
   –   2 radiae: 40%, 90%
   –   3 radiae: 40%, 65%, 90%
   –   4 radiae: 25%, 50%, 75%, 90%
Data sets
• MNIST OCR data, normalized (LeCun et al)
   – d=784
   – n=60,000
• Corel_hist
   – d=64
   – n=20,000
• Corel_uci
   – d=64
   – n=68,040
• Aerial data (Manjunath)
   – d=60
   – n=275,476
Other NN packages
• ANN (by Arya & Mount):
  – Based on kd-tree
  – Supports exact and approximate NN
• Metric trees (by Moore et al):
  – Splits along arbitrary directions (not just x,y,..)
  – Further optimizations
Running times

          MNIST    Speedup Corel_hist Speedup Corel_uci Speedup Aerial    Speedup
E2LSH-1    0.00960
E2LSH-2    0.00851           0.00024            0.00070           0.07400
E2LSH-3                      0.00018            0.00055           0.00833
E2LSH-4                                                           0.00668
ANN        0.25300 29.72274 0.00018 1.011236 0.00274 4.954792 0.00741 1.109281
MT         0.20900 24.55357 0.00130 7.303371 0.00650 11.75407 0.01700 2.54491
LSH vs kd-tree (MNIST)

 0.2
0.18
0.16
0.14
0.12
 0.1
0.08
0.06
0.04
0.02
   0
       0   10   20   30   40   50   60   70
Caveats
• For ANN (MNIST), setting ε=1000% results in:
  – Query time comparable to LSH
  – Correct NN in about 65% cases, small error otherwise
• However, no guarantees
• LSH eats much more space (for optimal
  performance):
  – LSH: 1.2 GB
  – Kd-tree: 360 MB

More Related Content

What's hot

Response Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationResponse Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationAlexander Litvinenko
 
Hiroyuki Sato
Hiroyuki SatoHiroyuki Sato
Hiroyuki SatoSuurist
 
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...Matt Moores
 
Phase diagram at finite T & Mu in strong coupling limit of lattice QCD
Phase diagram at finite T & Mu in strong coupling limit of lattice QCDPhase diagram at finite T & Mu in strong coupling limit of lattice QCD
Phase diagram at finite T & Mu in strong coupling limit of lattice QCDBenjamin Jaedon Choi
 
Multilayer Neural Networks
Multilayer Neural NetworksMultilayer Neural Networks
Multilayer Neural NetworksESCOM
 
SIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsSIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsJagadeeswaran Rathinavel
 
Hiroaki Shiokawa
Hiroaki ShiokawaHiroaki Shiokawa
Hiroaki ShiokawaSuurist
 
BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...
BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...
BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...Laurent Duval
 
[MIRU2018] Global Average Poolingの特性を用いたAttention Branch Network
[MIRU2018] Global Average Poolingの特性を用いたAttention Branch Network[MIRU2018] Global Average Poolingの特性を用いたAttention Branch Network
[MIRU2018] Global Average Poolingの特性を用いたAttention Branch NetworkHiroshi Fukui
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functionsRebekah Mercer
 
Tetsunao Matsuta
Tetsunao MatsutaTetsunao Matsuta
Tetsunao MatsutaSuurist
 
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...Matt Moores
 
Goldberg-Coxeter construction for 3- or 4-valent plane maps
Goldberg-Coxeter construction for 3- or 4-valent plane mapsGoldberg-Coxeter construction for 3- or 4-valent plane maps
Goldberg-Coxeter construction for 3- or 4-valent plane mapsMathieu Dutour Sikiric
 
asymptotic notations i
asymptotic notations iasymptotic notations i
asymptotic notations iAli mahmood
 
Hideitsu Hino
Hideitsu HinoHideitsu Hino
Hideitsu HinoSuurist
 
Non-local Neural Network
Non-local Neural NetworkNon-local Neural Network
Non-local Neural NetworkHiroshi Fukui
 

What's hot (20)

Response Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationResponse Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty Quantification
 
Os 2 cycle
Os 2 cycleOs 2 cycle
Os 2 cycle
 
2018 MUMS Fall Course - Bayesian inference for model calibration in UQ - Ralp...
2018 MUMS Fall Course - Bayesian inference for model calibration in UQ - Ralp...2018 MUMS Fall Course - Bayesian inference for model calibration in UQ - Ralp...
2018 MUMS Fall Course - Bayesian inference for model calibration in UQ - Ralp...
 
Hiroyuki Sato
Hiroyuki SatoHiroyuki Sato
Hiroyuki Sato
 
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
 
Phase diagram at finite T & Mu in strong coupling limit of lattice QCD
Phase diagram at finite T & Mu in strong coupling limit of lattice QCDPhase diagram at finite T & Mu in strong coupling limit of lattice QCD
Phase diagram at finite T & Mu in strong coupling limit of lattice QCD
 
Multilayer Neural Networks
Multilayer Neural NetworksMultilayer Neural Networks
Multilayer Neural Networks
 
SIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsSIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithms
 
Hiroaki Shiokawa
Hiroaki ShiokawaHiroaki Shiokawa
Hiroaki Shiokawa
 
BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...
BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...
BEADS : filtrage asymétrique de ligne de base (tendance) et débruitage pour d...
 
[MIRU2018] Global Average Poolingの特性を用いたAttention Branch Network
[MIRU2018] Global Average Poolingの特性を用いたAttention Branch Network[MIRU2018] Global Average Poolingの特性を用いたAttention Branch Network
[MIRU2018] Global Average Poolingの特性を用いたAttention Branch Network
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functions
 
Ec gate 13
Ec gate 13Ec gate 13
Ec gate 13
 
Tetsunao Matsuta
Tetsunao MatsutaTetsunao Matsuta
Tetsunao Matsuta
 
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
bayesImageS: Bayesian computation for medical Image Segmentation using a hidd...
 
ABC in Roma
ABC in RomaABC in Roma
ABC in Roma
 
Goldberg-Coxeter construction for 3- or 4-valent plane maps
Goldberg-Coxeter construction for 3- or 4-valent plane mapsGoldberg-Coxeter construction for 3- or 4-valent plane maps
Goldberg-Coxeter construction for 3- or 4-valent plane maps
 
asymptotic notations i
asymptotic notations iasymptotic notations i
asymptotic notations i
 
Hideitsu Hino
Hideitsu HinoHideitsu Hino
Hideitsu Hino
 
Non-local Neural Network
Non-local Neural NetworkNon-local Neural Network
Non-local Neural Network
 

Similar to mmds

Trilinear embedding for divergence-form operators
Trilinear embedding for divergence-form operatorsTrilinear embedding for divergence-form operators
Trilinear embedding for divergence-form operatorsVjekoslavKovac1
 
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfreservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfRTEFGDFGJU
 
PushdownAutomata.ppt
PushdownAutomata.pptPushdownAutomata.ppt
PushdownAutomata.pptRSRS39
 
Pushdown automata
Pushdown automataPushdown automata
Pushdown automataeugenesri
 
Pushdown automata
Pushdown automataPushdown automata
Pushdown automataparmeet834
 
Lecture6
Lecture6Lecture6
Lecture6voracle
 
Computational complexity and simulation of rare events of Ising spin glasses
Computational complexity and simulation of rare events of Ising spin glasses Computational complexity and simulation of rare events of Ising spin glasses
Computational complexity and simulation of rare events of Ising spin glasses Martin Pelikan
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph MotifAMR koura
 
Unit-1 DAA_Notes.pdf
Unit-1 DAA_Notes.pdfUnit-1 DAA_Notes.pdf
Unit-1 DAA_Notes.pdfAmayJaiswal4
 
Volume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensionsVolume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensionsVissarion Fisikopoulos
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slidesHJ DS
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Vissarion Fisikopoulos
 
Sep logic slide
Sep logic slideSep logic slide
Sep logic sliderainoftime
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Austin Benson
 

Similar to mmds (20)

Trilinear embedding for divergence-form operators
Trilinear embedding for divergence-form operatorsTrilinear embedding for divergence-form operators
Trilinear embedding for divergence-form operators
 
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfreservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
 
Asymptotic notation
Asymptotic notationAsymptotic notation
Asymptotic notation
 
Presentation
PresentationPresentation
Presentation
 
PushdownAutomata.ppt
PushdownAutomata.pptPushdownAutomata.ppt
PushdownAutomata.ppt
 
Pushdown automata
Pushdown automataPushdown automata
Pushdown automata
 
Pushdown automata
Pushdown automataPushdown automata
Pushdown automata
 
Lecture6
Lecture6Lecture6
Lecture6
 
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
 
Computational complexity and simulation of rare events of Ising spin glasses
Computational complexity and simulation of rare events of Ising spin glasses Computational complexity and simulation of rare events of Ising spin glasses
Computational complexity and simulation of rare events of Ising spin glasses
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph Motif
 
Unit-1 DAA_Notes.pdf
Unit-1 DAA_Notes.pdfUnit-1 DAA_Notes.pdf
Unit-1 DAA_Notes.pdf
 
Volume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensionsVolume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensions
 
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
 
Asymptotic Notation
Asymptotic NotationAsymptotic Notation
Asymptotic Notation
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slides
 
Shell theory
Shell theoryShell theory
Shell theory
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.
 
Sep logic slide
Sep logic slideSep logic slide
Sep logic slide
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
 

More from Hiroshi Ono

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipediaHiroshi Ono
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説Hiroshi Ono
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitectureHiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdfHiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdfHiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfHiroshi Ono
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdfHiroshi Ono
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdfHiroshi Ono
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfHiroshi Ono
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdfHiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdfHiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 

More from Hiroshi Ono (20)

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

Recently uploaded

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

mmds

  • 1. Near-Optimal Hashing Algorithms for Approximate Near(est) Neighbor Problem Piotr Indyk MIT Joint work with: Alex Andoni, Mayur Datar, Nicole Immorlica, Vahab Mirrokni
  • 2. Definition • Given: a set P of points in Rd • Nearest Neighbor: for any query q, returns a point p∈P r minimizing ||p-q|| q • r-Near Neighbor: for any query q, returns a point p∈P s.t. ||p-q|| ≤ r (if it exists)
  • 3. Nearest Neighbor: Motivation • Learning: nearest neighbor rule ? • Database retrieval • Vector quantization, a.k.a. compression
  • 5. The case of d=2 • Compute Voronoi diagram • Given q, perform point location • Performance: – Space: O(n) – Query time: O(log n)
  • 6. The case of d>2 • Voronoi diagram has size nO(d) • We can also perform a linear scan: O(dn) time • That is pretty much all what known for exact algorithms with theoretical guarantees • In practice: – kd-trees work “well” in “low-medium” dimensions – Near-linear query time for high dimensions
  • 7. Approximate Near Neighbor • c-Approximate r-Near Neighbor: build data structure which, for any query q: – If there is a point p∈P, ||p-q|| ≤ r – it returns p’∈P, ||p-q|| ≤ cr r • Reductions: q – c-Approx Nearest Neighbor reduces to c-Approx cr Near Neighbor (log overhead) – One can enumerate all approx near neighbors → can solve exact near neighbor problem – Other apps: c-approximate Minimum Spanning Tree, clustering, etc.
  • 8. Approximate algorithms • Space/time exponential in d [Arya-Mount-et al], [Kleinberg’97], [Har-Peled’02], [Arya-Mount-…] • Space/time polynomial in d [Kushilevitz-Ostrovsky- Rabani’98], [Indyk-Motwani’98], [Indyk’98], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti- Regev’04], [Panigrahy’06], [Ailon-Chazelle’06]… Space Time Comment Norm Ref 2 dn+n4/ε d * logn /ε2 or 1 c=1+ ε Hamm, l2 [KOR’98, IM’98] 2) nΩ(1/ε O(1) [AIP’0?] dn+n1+ρ(c) dnρ(c) ρ(c)=1/c Hamm, l2 [IM’98], [Cha’02] ρ(c)<1/c l2 [DIIM’04] dn * logs dnσ(c) σ(c)=O(log c/c) Hamm, l2 [Ind’01] σ(c)=O(1/c) l2 [Pan’06] dn+n1+ρ(c) dnρ(c) ρ(c)=1/c2 + o(1) l2 [AI’06] dn * logs dnσ(c) σ(c)=O(1/c2) l2 [AI’06]
  • 9. Locality-Sensitive Hashing q • Idea: construct hash functions g: Rd → U such that p for any points p,q: – If ||p-q|| ≤ r, then Pr[g(p)=g(q)] is “high” “not-so-small” – If ||p-q|| >cr, then Pr[g(p)=g(q)] is “small” • Then we can solve the problem by hashing
  • 10. LSH [Indyk-Motwani’98] • A family H of functions h: Rd → U is called (P1,P2,r,cr)-sensitive, if for any p,q: – if ||p-q|| <r then Pr[ h(p)=h(q) ] > P1 – if ||p-q|| >cr then Pr[ h(p)=h(q) ] < P2 • Example: Hamming distance – LSH functions: h(p)=pi, i.e., the i-th bit of p – Probabilities: Pr[ h(p)=h(q) ] = 1-D(p,q)/d p=10010010 q=11010110
  • 11. LSH Algorithm • We use functions of the form g(p)=<h1(p),h2(p),…,hk(p)> • Preprocessing: – Select g1…gL – For all p∈P, hash p to buckets g1(p)…gL(p) • Query: – Retrieve the points from buckets g1(q), g2(q), … , until • Either the points from all L buckets have been retrieved, or • Total number of points retrieved exceeds 2L – Answer the query based on the retrieved points – Total time: O(dL)
  • 12. Analysis • LSH solves c-approximate NN with: – Number of hash fun: L=nρ, ρ=log(1/P1)/log(1/P2) – E.g., for the Hamming distance we have ρ=1/c – Constant success probability per query q • Questions: – Can we extend this beyond Hamming distance ? • Yes: – embed l2 into l1 (random projections) – l1 into Hamming (discretization) – Can we reduce the exponent ρ ?
  • 13. Projection-based LSH [Datar-Immorlica-Indyk-Mirrokni’04] • Define hX,b(p)=⎣(p*X+b)/w⎦: – w≈r p X – X=(X1…Xd) , where Xi is chosen w from: • Gaussian distribution (for l2 norm) • “s-stable” distribution* (for ls norm) w – b is a scalar • Similar to the l2 → l1 →Hamming route * I.e., p*X has same distribution as ||p|| Z, where Z is s-stable s
  • 14. Analysis • Need to: – Compute Pr[h(p)=h(q)] as a function of ||p-q|| and w; this defines P1 and P2 – For each c choose w that minimizes w ρ=log1/P2(1/P1) w • Method: – For l2: computational – For general ls: analytic
  • 15. ρ(w) for various c’s: l1 1 c=1.1 c=1.5 c=2.5 0.9 c=5 c=10 0.8 0.7 w 0.6 pxe 0.5 w 0.4 0.3 0.2 0.1 0 5 10 15 20 wr
  • 16. ρ(w) for various c’s: l2 1 c=1.1 c=1.5 0.9 c=2.5 c=5 c=10 0.8 0.7 0.6 w pxe 0.5 0.4 w 0.3 0.2 0.1 0 0 5 10 15 20 w r
  • 17. ρ(c) for l2 1 rho 1/c 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 2 3 4 5 6 7 8 9 10 Approximation factor c
  • 18. New LSH scheme [Andoni-Indyk’06] • Instead of projecting onto R1, p X project onto Rt , for constant t w • Intervals → lattice of balls – Can hit empty space, so hash until w a ball is hit • Analysis: – ρ=1/c2 + O( log t / t1/2 ) p – Time to hash is tO(t) – Total query time: dn1/c2+o(1) • [Motwani-Naor-Panigrahy’06]: LSH in l2 must have ρ ≥ 0.45/c2
  • 19. Connections to • [Charikar-Chekuri-Goel-Guha-Plotkin’98] – Consider partitioning of Rd using balls of radius R – Show that Pr[ Ball(p) ≠ Ball(q) ] ≤ ||p-q||/R * d1/2 • Linear dependence on the distance – same as Hamming • Need to analyze R≈||p-q|| to achieve non-linear behavior! (as for the projection on the line) • [Karger-Motwani-Sudan’94] p – Consider partitioning of the sphere via random vectors u q from Nd(0,1) : p is in Cap(u) if u*p ≥ T – Showed Pr[ Cap(p) = Cap(q) ] ≤ exp[ - (2T/||p+q||)2/2 ] o • Large relative changes to ||p-q|| can yield only small relative changes to ||p+q||
  • 20. p Proof idea • Claim: ρ=log(P1)/log(P2)→1/c2 – P1=Pr(1), P2=Pr(c) – Pr(z)=prob. of collision when distance z • Proof idea: – Assumption: ignore effects of mapping into Rt x q p – Pr(z) is proportional to the volume of the cap – Fraction of mass in a cap is proportional to the probability that the x-coordinate of a random point u from a ball exceeds x – Approximation: the x-coordinate of u has approximately normal distribution → Pr(x) ≈ exp(-A x2) – ρ=log[ exp(-A12) ] / log [ exp(-Ac2) ] = 1/c2
  • 21. New LSH scheme, ctd. • How does it work in practice ? • The time tO(t)dn1/c2+f(t) is not very practical – Need t≈30 to see some improvement • Idea: a different decomposition of Rt – Replace random balls by Voronoi diagram of a lattice – For specific lattices, finding a cell containing a point can be very fast →fast hashing
  • 22. Leech Lattice LSH • Use Leech lattice in R24 , t=24 – Largest kissing number in 24D: 196560 – Conjectured largest packing density in 24D – 24 is 42 in reverse… • Very fast (bounded) decoder: about 519 operations [Amrani-Beery’94] • Performance of that decoder for c=2: – 1/c2 0.25 – 1/c 0.50 – Leech LSH, any dimension: ρ ≈ 0.36 – Leech LSH, 24D (no projection): ρ ≈ 0.26
  • 23. Conclusions • We have seen: 2+o(1) – Algorithm for c-NN with dn1/c query time (and reasonable space) • Exponent tight up to a constant – (More) practical algorithms based on Leech lattice • We haven’t seen: 2 – Algorithm for c-NN with dnO(1/c ) query time and dn log n space • Immediate questions: – Get rid of the o(1) – …or came up with a really neat lattice… – Tight lower bound • Non-immediate questions: – Other ways of solving proximity problems
  • 24. Advertisement • See LSH web page (linked from my web page for): – Experimental results (for the ’04 version) – Pointers to code
  • 26. Experiments (with ’04 version) • E2LSH: Exact Euclidean LSH (with Alex Andoni) – Near Neighbor – User sets r and P = probability of NOT reporting a point within distance r (=10%) – Program finds parameters k,L,w so that: • Probability of failure is at most P • Expected query time is minimized • Nearest neighbor: set radius (radiae) to accommodate 90% queries (results for 98% are similar) – 1 radius: 90% – 2 radiae: 40%, 90% – 3 radiae: 40%, 65%, 90% – 4 radiae: 25%, 50%, 75%, 90%
  • 27. Data sets • MNIST OCR data, normalized (LeCun et al) – d=784 – n=60,000 • Corel_hist – d=64 – n=20,000 • Corel_uci – d=64 – n=68,040 • Aerial data (Manjunath) – d=60 – n=275,476
  • 28. Other NN packages • ANN (by Arya & Mount): – Based on kd-tree – Supports exact and approximate NN • Metric trees (by Moore et al): – Splits along arbitrary directions (not just x,y,..) – Further optimizations
  • 29. Running times MNIST Speedup Corel_hist Speedup Corel_uci Speedup Aerial Speedup E2LSH-1 0.00960 E2LSH-2 0.00851 0.00024 0.00070 0.07400 E2LSH-3 0.00018 0.00055 0.00833 E2LSH-4 0.00668 ANN 0.25300 29.72274 0.00018 1.011236 0.00274 4.954792 0.00741 1.109281 MT 0.20900 24.55357 0.00130 7.303371 0.00650 11.75407 0.01700 2.54491
  • 30. LSH vs kd-tree (MNIST) 0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0 0 10 20 30 40 50 60 70
  • 31. Caveats • For ANN (MNIST), setting ε=1000% results in: – Query time comparable to LSH – Correct NN in about 65% cases, small error otherwise • However, no guarantees • LSH eats much more space (for optimal performance): – LSH: 1.2 GB – Kd-tree: 360 MB