The Post Office Problem

k-d trees, k-nn search, and the
Johnson-Lindenstrauss lemma
Who am I?
   Jeremy Holland
   Senior lead developer at
    Centresource
   Math and algorithms nerd
   @awebneck,
    github.com/awebneck, freenode:
    awebneck, (you get the idea)
If you like the talk...
   I like scotch. Just putting it out there.
What is the Post Office Problem?
   Don Knuth, professional CS badass.
   TAOCP, vol. 3
   Otherwise known as "nearest neighbor search"
   Let's say you've...
Just moved to Denmark!
But you need to mail a letter!
      Which post office do you go to?
Finding free images of post
offices is hard, so...
        We'll just reduce it to this:
        [Figure: a query point q among the points to be searched]
Naive implementation
   Calculate distance to all points, find smallest
# Linear scan: compare the query against every point, keep the closest.
def naive_nearest(points, q)
  best = nil
  min = Float::INFINITY
  points.each do |p|
    # Euclidean distance between q and p, summed over every dimension
    dist = Math.sqrt(q.zip(p).sum { |a, b| (a - b)**2 })
    if dist < min
      min = dist
      best = p
    end
  end
  best
end
With a little preprocessing...
   But that takes O(n) time! Can we do better?
   You bet!
   k-d tree
   Binary tree (each node has at most two
    children)
   Each node represents a single point in the set
    to be searched
Each node looks like...
   Domain: the vector describing the point (i.e.
    [p[0], p[1], … p[k-1]])
   Range: Some identifying characteristic (e.g. PK
    in a database)
   Split: A chosen dimension from 0 ≤ split < k
   Left: The left child (left.domain[split] <
    self.domain[split])
   Right: The right child (right.domain[split] ≥
    self.domain[split])
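As a rough sketch, the node described above could be modeled with a Ruby Struct. The name KdNode, the side_for helper, and the "post-office-1" identifier are illustrative assumptions, not something from the talk.

```ruby
# Hypothetical node layout mirroring the fields listed on the slide.
KdNode = Struct.new(:domain, :range, :split, :left, :right) do
  # Which child slot a point belongs in, per the left/right rule:
  # strictly smaller on the split axis goes left, everything else right.
  def side_for(point)
    point[split] < domain[split] ? :left : :right
  end
end

root = KdNode.new([20, 10], "post-office-1", 0, nil, nil)
puts root.side_for([10, 5])   # left
puts root.side_for([25, 3])   # right
puts root.side_for([20, 99])  # right (ties on the split axis go right)
```

Keeping the split axis on the node itself means a search never has to recompute which dimension a given level was divided on.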
Let's build a k-d tree!
             Point 1: [20,10]
Let's build a k-d tree!
          Let's split on the x axis
Let's build a k-d tree!
          Add a new point: [10,5]
Let's build a k-d tree!
 The new point is the Left Child of the first point
Let's build a k-d tree!
        Let's split him on the y axis
Let's build a k-d tree!
         And add a 3rd point: [25,3]
Let's build a k-d tree!
 The new point is the Right Child of the first point
Let's build a k-d tree!
           So on and so forth...
Let's build a k-d tree!
            Giving you a tree:
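The walkthrough above can be sketched as a small recursive insert. This is a minimal illustration using the same three points; the Node struct and the insert function are assumed names, and the split axis is simply cycled by depth.

```ruby
# Minimal insertion sketch; names are illustrative, not from the talk.
Node = Struct.new(:domain, :split, :left, :right)

# Insert a point into the subtree rooted at node, cycling the split axis by depth.
def insert(node, point, depth = 0)
  return Node.new(point, depth % point.length, nil, nil) if node.nil?

  if point[node.split] < node.domain[node.split]
    node.left = insert(node.left, point, depth + 1)
  else
    node.right = insert(node.right, point, depth + 1)
  end
  node
end

root = nil
[[20, 10], [10, 5], [25, 3]].each { |pt| root = insert(root, pt) }

p root.domain        # [20, 10], split on x (axis 0)
p root.left.domain   # [10, 5],  split on y (axis 1)
p root.right.domain  # [25, 3]
```

Inserting one point at a time like this is easy to follow, but it gives no balance guarantee; the tree's shape depends on insertion order.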
How do we search it?
 Step 1: Find the best bin (where the query point
           would otherwise be inserted)


              q

                               root
How do we search it?
 NOTE: There is no node for this bin – just the
     space where a node would be if it existed!


             q

                              root
How do we search it?
 Step 2: Make the current leaf node the current
                 "best guess"




                     Best guess
How do we search it?
  … and set the "best guess radius" to be the
  distance between the query and that point




                      Best guess radius
How do we search it?
      Step 3: Back up the tree 1 node




                    Current node
How do we search it?
 If the distance between the query and the new
     node is less than the best guess radius...
How do we search it?
   Then set the best guess radius to the new
 distance, and make the current node the best guess
How do we search it?
Step 4: If the hypersphere described by the best
    guess radius crosses the current split...



                        Oh nooooh!
How do we search it?
 And the current node has a child on the other
                    side...




                      Oh snap!
How do we search it?
 … then make that node the current node, and
                   repeat:
How do we search it?
Here, the distance is not less than the best guess
                     radius...
How do we search it?
  … and the hypersphere neither crosses the
                  split ...
                             Whew, missed it!
How do we search it?
… nor does the current node have any children ...

                               Whew, missed it!
How do we search it?
So we can eliminate it and back up the tree again!
How do we search it?
We've already compared this node, so let's keep
            going back up the tree
How do we search it?
 Again, the distance is bigger than the best guess
   radius, and there is no crossing – back up again!
How do we search it?
           ...and again...
How do we search it?
       All the way back to the root!
How do we search it?
And you have your nearest neighbor, with a good
         case of O(log n) running time!


            I'm the answer!
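The four steps above can be sketched recursively: descend to the query's bin first, then test each node on the way back up, and only cross a split when the best-guess hypersphere reaches it. This is a minimal sketch under assumptions: the Node struct, the dist and nearest names, and the hand-built sample tree are illustrative and not taken from the talk.

```ruby
Node = Struct.new(:domain, :split, :left, :right)

def dist(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Returns [best_node, best_radius] for query q in the subtree at node.
def nearest(node, q, best = nil, radius = Float::INFINITY)
  return [best, radius] if node.nil?

  # Step 1: descend toward the bin where q would be inserted.
  diff = q[node.split] - node.domain[node.split]
  near, far = diff < 0 ? [node.left, node.right] : [node.right, node.left]
  best, radius = nearest(near, q, best, radius)

  # Steps 2-3: on the way back up, test the current node.
  d = dist(q, node.domain)
  best, radius = node, d if d < radius

  # Step 4: only cross the split if the best-guess hypersphere reaches it.
  best, radius = nearest(far, q, best, radius) if diff.abs < radius
  [best, radius]
end

# Hand-built sample tree (hypothetical points).
leaf = ->(pt, axis) { Node.new(pt, axis, nil, nil) }
root = Node.new([20, 10], 0,
                Node.new([10, 5], 1, leaf.([8, 2], 0), leaf.([12, 9], 0)),
                Node.new([25, 3], 1, nil, leaf.([30, 8], 0)))

best, radius = nearest(root, [11, 8])
p best.domain  # [12, 9]
```

Because the recursion unwinds through each ancestor exactly once, there is no need to track which nodes have already been compared.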
But that was a pretty good case...
   We barely had to backtrack at all – best case is
    O(log n)
   Worst case (lots of backtracking – examining
    almost every node) can get up to O(n)
   Amount of backtracking is directly proportional
    to k!
   If k is small (say 2, as in this example) and n is
    large, we see a huge improvement over linear
    search
   As k becomes large, the benefits of this over a
    naive implementation virtually disappear!
The Curse of Dimensionality
   Curse you, dimensionality!
   High-dimensional vector spaces are darned
    hard to search!
   Why? Too many dimensions! Why are there so
    many dimensions!?!
   What can we do about it?
   Get rid of the extra weight!
   Enter Messrs. Johnson and Lindenstrauss
It turns out...
   Your vectors have a high dimension
   Absolute distance and precise location versus
    relative distance between points
   Relative distance can be largely preserved by a
    lower dimensional space
   Reduce k dimensions to k_proj dimensions,
     k_proj << k
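To make the reduction concrete, here is a minimal sketch of one random projection. It assumes a matrix of Gaussian entries scaled by 1/sqrt(k_proj), which is one common construction; the talk does not say which one it used. The function names, the seed, and the 256-to-15 sizes (borrowed from the experiment quoted later in the deck) are illustrative choices.

```ruby
# Build a kproj x k matrix of Gaussian entries, scaled by 1/sqrt(kproj).
def random_projection(k, kproj, rng = Random.new)
  gauss = lambda do
    # Box-Muller transform: two uniform samples -> one standard normal sample.
    Math.sqrt(-2 * Math.log(1 - rng.rand)) * Math.cos(2 * Math::PI * rng.rand)
  end
  Array.new(kproj) { Array.new(k) { gauss.call / Math.sqrt(kproj) } }
end

# Project a k-dimensional vector down to kproj dimensions.
def project(matrix, v)
  matrix.map { |row| row.zip(v).sum { |a, b| a * b } }
end

def dist(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

rng = Random.new(42)
k, kproj = 256, 15
r = random_projection(k, kproj, rng)
a = Array.new(k) { rng.rand }
b = Array.new(k) { rng.rand }
puts dist(a, b)                          # distance in 256 dimensions
puts dist(project(r, a), project(r, b))  # distance after projecting to 15
```

The two printed distances will generally be close but not equal, which is the trade being described: much cheaper vectors in exchange for some distortion.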
Example: 2d to 1d
   [Figure: points in the plane with their pairwise distances labeled:
    11.180, 5.000, 8.246, 7.280, 17.464, 7.071, 10.296, 3.162, 13.342, 14.000]
Example: 2d to 1d, 1st attempt
   [Figure: the same points and distances, with a projection plane drawn in]
Example: 2d to 1d, 1st attempt
   [Figure: the points dropped onto that plane, giving the finished 1-d projection]
Example: 2d to 1d, 2nd attempt
   [Figure: the same points and distances, with a second projection plane]
Example: 2d to 1d, 2nd attempt
   [Figure: the finished 1-d projection for the second plane]
Example: 2d to 1d, 3rd attempt
   [Figure: the same points and distances, with a third projection plane]
Example: 2d to 1d, 3rd attempt
   [Figure: the finished 1-d projection for the third plane]
It turns out...




   Relative distance can be largely but not
    completely preserved by a lower dimensional
    space
   Every projection will have errors
   How do you choose one with the fewest?
   Trick question: Let fate decide!
Multiple random projection
   Choose the projections randomly
   Multiple projections
   Exchange cost in resources for cost in accuracy
       More projections = greater resource cost = greater
        accuracy
       Fewer projections = lesser resource cost = lesser
        accuracy
   Trivially parallelizable
   Learn to be happy with "good enough"
Multiple random projections
 Get the nearest from each projection, then run a
       naive nearest on the results thereof.
nns = []
P = <projections>
q = <query point>
for p in P do
  pq = <project q to the same plane as p>
  nns << <nearest neighbor to pq from projection p>
end
nn = <execute naive nearest on nns to find nearest of result>
return nn



                         Et voilà!
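For something runnable, the outline above might look like the sketch below. Several things here are assumptions made for brevity: the projections are Gaussian random matrices, a plain linear scan stands in for the per-projection k-d tree query, the points are re-projected on every call instead of being indexed ahead of time, and all names, sizes, and seeds are made up.

```ruby
def dist(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def project(matrix, v)
  matrix.map { |row| row.zip(v).sum { |a, b| a * b } }
end

# kproj x k matrix of standard normal entries (Box-Muller transform).
def random_matrix(k, kproj, rng)
  Array.new(kproj) do
    Array.new(k) do
      Math.sqrt(-2 * Math.log(1 - rng.rand)) * Math.cos(2 * Math::PI * rng.rand)
    end
  end
end

def approx_nearest(points, q, matrices)
  candidates = matrices.map do |m|
    pq = project(m, q)
    # Stand-in for a k-d tree query over the pre-projected points.
    points.min_by { |pt| dist(project(m, pt), pq) }
  end
  # Naive nearest over the few candidates, back in the original space.
  candidates.uniq.min_by { |pt| dist(pt, q) }
end

rng = Random.new(1)
k, kproj = 64, 8
points = Array.new(200) { Array.new(k) { rng.rand } }
matrices = Array.new(5) { random_matrix(k, kproj, rng) }
q = Array.new(k) { rng.rand }

guess = approx_nearest(points, q, matrices)
exact = points.min_by { |pt| dist(pt, q) }
puts dist(guess, q) >= dist(exact, q)  # true: the guess is never closer than the true nearest
```

Each projection contributes one candidate, so the final naive pass only looks at a handful of points no matter how large the original set is.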
Multiple random projection
   Experiments yield > 98% accuracy when
    multiple nearest neighbors are selected from
    each projection and d is reduced from 256 to
    15, with approximately 30% of the calculation.
    (see credits)
   Additional experiments yielded similar results,
    as did my own
   That's pretty darn-tootin' good
Stuff to watch out for
   Balancing is vitally important (assuming uniform
    distribution of points): careful attention must be
    paid to selection of nodes (node with median
    coordinate for split axis)
   Cycle through axes for each level of the tree –
    root should split on 0, lvl 1 on 1, lvl 2 on 2, etc.
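Both tips can be combined in a bulk build: sort on the current axis, take the median as the node, and recurse with the next axis. This is a minimal sketch with assumed names (Node, build, height) and made-up points. It re-sorts at every level for simplicity, which is not the fastest way to find a median.

```ruby
Node = Struct.new(:domain, :split, :left, :right)

# Build a balanced tree: split on axis (depth mod k) at the median point.
def build(points, depth = 0)
  return nil if points.empty?

  axis = depth % points.first.length
  sorted = points.sort_by { |pt| pt[axis] }
  mid = sorted.length / 2
  # Slide mid left so every point equal to the median on this axis
  # ends up in the right subtree, matching the >= rule for right children.
  mid -= 1 while mid > 0 && sorted[mid - 1][axis] == sorted[mid][axis]

  Node.new(sorted[mid], axis,
           build(sorted[0...mid], depth + 1),
           build(sorted[(mid + 1)..-1], depth + 1))
end

def height(node)
  node.nil? ? 0 : 1 + [height(node.left), height(node.right)].max
end

pts = [[20, 10], [10, 5], [25, 3], [8, 2], [12, 9], [30, 8], [15, 7]]
tree = build(pts)
p tree.domain      # [15, 7]
puts height(tree)  # 3
```

Seven points fit in a tree of height 3, which is the most compact shape a binary tree of that size can have.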
Stuff to watch out for
   Building the trees still takes some time
       Building the projections is effectively matrix
        multiplication, time in O(n^2.807) (Strassen's
        algorithm)
       Building the (balanced) trees from the projections
        takes time in approximately O(n log n)
   Solution: build the trees ahead of time and
    store them for later querying (i.e. index those
    bad boys!)
Thanks!
   Credits:
       Based in large part on research conducted by
        Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
       K-d trees: J. L. Bentley, Stanford U.:
        http://bit.ly/Mpy05p
       Dimensionality reduction: W. B. Johnson and J.
        Lindenstrauss: http://bit.ly/m9SGPN
       Research Fuel: Ardbeg Uigeadail:
        http://bit.ly/fcag0E

The Post Office Problem

  • 1. The Post Office Problem: k-d trees, k-nn search, and the Johnson-Lindenstrauss lemma
  • 2. Who am I?
      • Jeremy Holland
      • Senior lead developer at Centresource
      • Math and algorithms nerd
      • @awebneck, github.com/awebneck, freenode: awebneck (you get the idea)
  • 3. If you like the talk...
      • I like scotch. Just putting it out there.
  • 4. What is the Post Office Problem?
      • Don Knuth, professional CS badass.
      • TAOCP, vol. 3
      • Otherwise known as “Nearest Neighbor search”
      • Let's say you've...
  • 5. Just moved to Denmark!
  • 6. But you need to mail a letter! Which post office do you go to?
  • 7. Finding free images of post offices is hard, so... We'll just reduce it to this: [figure: points, with the query point labeled q]
  • 8. Naive implementation
      • Calculate distance to all points, find smallest

        min = INFINITY
        P = <points to be searched>
        K = <dimensionality of points, e.g. 2>
        q = <query point>
        best = nil
        for p in P do
          # sum the squared differences over every dimension 0..K-1
          dimDistSum = 0
          for k in 0...K do
            dimDistSum += (q[k]-p[k])**2
          dist = dimDistSum.sqrt
          # keep the closest point seen so far
          if dist < min
            min = dist
            best = p
        return best
  • 9. With a little preprocessing...
      • But that takes O(nk) time! Can we do better?
      • You bet!
      • k-d tree
      • Binary tree (each node has at most two children)
      • Each node represents a single point in the set to be searched
  • 10. Each node looks like...
      • Domain: the vector describing the point (i.e. [p[0], p[1], … p[k-1]])
      • Range: Some identifying characteristic (e.g. PK in a database)
      • Split: A chosen dimension from 0 ≤ split < k
      • Left: The left child (left.domain[split] < self.domain[split])
      • Right: The right child (right.domain[split] ≥ self.domain[split])
  • 11. Let's build a k-d tree! Point 1: [20,10]
  • 12. Let's build a k-d tree! Let's split on the x axis
  • 13. Let's build a k-d tree! Add a new point: [10,5]
  • 14. Let's build a k-d tree! The new point is the Left Child of the first point
  • 15. Let's build a k-d tree! Let's split him on the y axis
  • 16. Let's build a k-d tree! And add a 3rd point: [25,3]
  • 17. Let's build a k-d tree! The new point is the Right Child of the first point
  • 18. Let's build a k-d tree! So on and so forth...
  • 19. Let's build a k-d tree! Giving you a tree:
  • 20. How do we search it? Step 1: Find the best bin (where the query point would otherwise be inserted) [figure labels: q, root]
  • 21. How do we search it? NOTE: There is no node for this bin – just the space where a node would be if it existed! [figure labels: q, root]
  • 22. How do we search it? Step 2: Make the current leaf node the current “best guess” [figure label: Best guess]
  • 23. How do we search it? … and set the “best guess radius” to be the distance between the query and that point [figure label: Best guess radius]
  • 24. How do we search it? Step 3: Back up the tree 1 node [figure label: Current node]
  • 25. How do we search it? If the distance between the query and the new node is less than the best guess radius...
  • 26. How do we search it? Then set the best guess radius to the new distance, and make the current node the best guess
  • 27. How do we search it? Step 4: If the hypersphere described by the best guess radius crosses the current split... Oh nooooh!
  • 28. How do we search it? And the current node has a child on the other side... Oh snap!
  • 29. How do we search it? … then make that node the current node, and repeat:
  • 30. How do we search it? Here, the distance is not less than the best guess radius...
  • 31. How do we search it? … and the hypersphere neither crosses the split ... Whew, missed it!
  • 32. How do we search it? … nor does the current node have any children ... Whew, missed it!
  • 33. How do we search it? So we can eliminate it and back up the tree again!
  • 34. How do we search it? We've already compared this node, so let's keep going back up the tree
  • 35. How do we search it? Again, the distance is bigger than the best guess radius, and there is no crossing – back up again!
  • 36. How do we search it? ...and again...
  • 37. How do we search it? All the way back to the root!
  • 38. How do we search it? And you have your nearest neighbor, with a good-case running time of O(log n)! [figure label: I'm the answer!]
  • 39. But that was a pretty good case...
      • We barely had to backtrack at all – best case is O(log n)
      • Worst case (lots of backtracking – examining almost every node) can get up to O(n)
      • Amount of backtracking grows with k!
      • If k is small (say 2, as in this example) and n is large, we see a huge improvement over linear search
      • As k becomes large, the benefits of this over a naive implementation virtually disappear!
  • 40. The Curse of Dimensionality
      • Curse you, dimensionality!
      • High-dimensional vector spaces are darned hard to search!
      • Why? Too many dimensions! Why are there so many dimensions!?!
      • What can we do about it?
      • Get rid of the extra weight!
      • Enter Messrs. Johnson and Lindenstrauss
  • 41. It turns out...
      • Your vectors have a high dimension
      • Absolute distance and precise location versus relative distance between points
      • Relative distance can be largely preserved by a lower dimensional space
      • Reduce k dimensions to k_proj dimensions, k_proj << k
  • 42. Example: 2d to 1d [figure: 2-d points with their pairwise distances labeled]
  • 43. Example: 2d to 1d, 1st attempt [figure: the same points with a projection plane drawn]
  • 44. Example: 2d to 1d, 1st attempt [figure: finished 1-d projection]
  • 45. Example: 2d to 1d, 2nd attempt [figure: the same points with a projection plane drawn]
  • 46. Example: 2d to 1d, 2nd attempt [figure: finished 1-d projection]
  • 47. Example: 2d to 1d, 3rd attempt [figure: the same points with a projection plane drawn]
  • 48. Example: 2d to 1d, 3rd attempt [figure: projection plane and finished 1-d projection]
  • 49. It turns out...
      • Relative distance can be largely but not completely preserved by a lower dimensional space
      • Every projection will have errors
      • How do you choose one with the fewest?
      • Trick question: Let fate decide!
  • 50. Multiple random projection
      • Choose the projections randomly
      • Multiple projections
      • Exchange cost in resources for cost in accuracy
      • More projections = greater resource cost = greater accuracy
      • Fewer projections = lesser resource cost = lesser accuracy
      • Trivially parallelizable
      • Learn to be happy with “good enough”
  • 51. Multiple random projections
      • Get the nearest from each projection, then run a naive nearest on the results thereof.

        nns = []
        P = <projections>
        q = <query point>
        for p in P do
          pq = <project q to the same plane as p>
          nns << <nearest neighbor to pq from projection>
        nn = <execute naive nearest on nns to find nearest of result>
        return nn

      • Et voilà!
  • 52. Multiple random projection
      • Experiments yield > 98% accuracy when multiple nearest neighbors are selected from each projection and the dimensionality is reduced from 256 to 15, with approximately 30% of the calculation (see credits).
      • Additional experiments yielded similar results, as did my own
      • That's pretty darn-tootin' good
  • 53. Stuff to watch out for
      • Balancing is vitally important (assuming uniform distribution of points): careful attention must be paid to selection of nodes (node with median coordinate for split axis)
      • Cycle through axes for each level of the tree – root should split on 0, lvl 1 on 1, lvl 2 on 2, etc.
  • 54. Stuff to watch out for
      • Building the trees still takes some time
      • Building the projections is effectively matrix multiplication, time in O(n^2.807) (Strassen's algorithm)
      • Building the (balanced) trees from the projections takes time in approximately O(n log n) per tree
      • Solution: build the trees ahead of time and store them for later querying (i.e. index those bad boys!)
  • 55. Thanks!
      • Credits:
      • Based in large part on research conducted by Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
      • K-d trees: J. L. Bentley, Stanford U.: http://bit.ly/Mpy05p
      • Dimensionality reduction: W. B. Johnson and J. Lindenstrauss: http://bit.ly/m9SGPN
      • Research Fuel: Ardbeg Uigeadail: http://bit.ly/fcag0E
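
The naive scan from slide 8 can be sketched in runnable form. The slides use Ruby-flavored pseudocode; this is a Python rendering, and the function name `naive_nearest` and the sample query are assumptions made for illustration. The three points are the ones used on the tree-building slides.

```python
import math


def naive_nearest(points, q):
    """Linear scan: return the point in `points` closest to `q` (None if empty)."""
    best, best_dist = None, math.inf
    for p in points:
        # Sum of squared per-dimension differences, then the square root,
        # exactly as in the slide's inner loop.
        d = math.sqrt(sum((qk - pk) ** 2 for qk, pk in zip(q, p)))
        if d < best_dist:
            best, best_dist = p, d
    return best


points = [(20, 10), (10, 5), (25, 3)]
print(naive_nearest(points, (12, 6)))  # → (10, 5)
```

Every query touches every coordinate of every point, which is the cost the k-d tree is meant to avoid.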
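
The node layout (slide 10), the balanced build (slides 11-19 and 53) and the backtracking search (slides 20-38) might look roughly like this in Python. It is a minimal sketch, not the speaker's implementation: the names `Node`, `build`, `nearest` and `dist` are assumptions, the "Range" payload is left out, and the build re-sorts at every level for brevity rather than using a faster median selection.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]


def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Node:
    domain: Point                    # the point itself
    split: int                       # axis this node splits on
    left: Optional["Node"] = None    # points with domain[split] <  ours
    right: Optional["Node"] = None   # points with domain[split] >= ours


def build(points, depth=0):
    """Balanced build: median on the split axis, axes cycled per level."""
    if not points:
        return None
    split = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[split])
    mid = len(pts) // 2
    # Step left over equal coordinates so ties end up in the right subtree.
    while mid > 0 and pts[mid - 1][split] == pts[mid][split]:
        mid -= 1
    return Node(pts[mid], split,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def nearest(node, q, best=None, best_dist=math.inf):
    """Return (nearest point, distance) for query q."""
    if node is None:
        return best, best_dist
    # Step 1: descend toward the bin where q would be inserted.
    diff = q[node.split] - node.domain[node.split]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best, best_dist = nearest(near, q, best, best_dist)
    # Steps 2-3: backing up, compare this node against the best guess.
    d = dist(q, node.domain)
    if d < best_dist:
        best, best_dist = node.domain, d
    # Step 4: only visit the other side if the best-guess hypersphere
    # crosses this node's split.
    if abs(diff) < best_dist:
        best, best_dist = nearest(far, q, best, best_dist)
    return best, best_dist


tree = build([(20, 10), (10, 5), (25, 3)])
print(tree.domain, tree.left.domain, tree.right.domain)  # → (20, 10) (10, 5) (25, 3)
print(nearest(tree, (12, 6))[0])  # → (10, 5)
```

With the three points from the slides, the build reproduces the same shape: [20,10] at the root splitting on x, [10,5] as its left child and [25,3] as its right child, both splitting on y.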
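
For reference, the guarantee that slide 41 leans on is usually stated along the following lines. The symbols ε, n, X, f and u, v do not appear in the slides and are introduced here only to state the lemma; k and k_proj follow the slides.

```latex
\text{For any } 0 < \varepsilon < 1 \text{ and any set } X \subset \mathbb{R}^{k} \text{ of } n \text{ points, there is a linear map }
f : \mathbb{R}^{k} \to \mathbb{R}^{k_{\mathrm{proj}}}, \qquad k_{\mathrm{proj}} = O\!\left(\varepsilon^{-2} \log n\right),
\text{ such that for all } u, v \in X:
\[
  (1 - \varepsilon)\,\lVert u - v \rVert^{2}
  \;\le\; \lVert f(u) - f(v) \rVert^{2}
  \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^{2}.
\]
```

Note that the target dimension depends on the number of points and the tolerated distortion, not on the original dimension k.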
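
To get a feel for how well relative distances survive a random projection (slides 41-49), here is a small Python experiment. The slides do not say how the projections are drawn; a matrix of independent Gaussian entries with variance 1/k_proj is assumed here as one common choice, and `random_projection` and `project` are illustrative names. The 256-to-15 reduction mirrors the figures quoted on slide 52.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_projection(k, k_proj, rng):
    """k_proj x k matrix with N(0, 1/k_proj) entries (assumed construction)."""
    sigma = 1.0 / math.sqrt(k_proj)
    return [[rng.gauss(0.0, sigma) for _ in range(k)] for _ in range(k_proj)]


def project(matrix, v):
    """Matrix-vector product: maps a k-vector to a k_proj-vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


rng = random.Random(42)
k, k_proj = 256, 15
points = [[rng.random() for _ in range(k)] for _ in range(5)]
R = random_projection(k, k_proj, rng)
low = [project(R, p) for p in points]

# Ratios near 1.0 mean the pairwise distance survived the projection well.
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        ratio = dist(low[i], low[j]) / dist(points[i], points[j])
        print(f"pair ({i},{j}): projected/original distance = {ratio:.3f}")
```

Rerunning with a different seed gives a different projection and different errors, which is the motivation for using several projections at once.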
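
The multiple-projection search on slide 51 could be fleshed out as below. Treat it as a sketch under stated assumptions: `ProjectionIndex`, `per_projection` and the Gaussian projection matrices are not from the slides, and each projection is searched with a plain scan to keep the example short, where the talk would use a k-d tree built ahead of time per projection (slide 54).

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def project(matrix, v):
    """Matrix-vector product: maps a k-vector to a k_proj-vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


class ProjectionIndex:
    """Hypothetical index: several random projections of one point set."""

    def __init__(self, points, k_proj, n_projections, seed=0):
        rng = random.Random(seed)
        k = len(points[0])
        sigma = 1.0 / math.sqrt(k_proj)
        self.points = points
        self.matrices = [
            [[rng.gauss(0.0, sigma) for _ in range(k)] for _ in range(k_proj)]
            for _ in range(n_projections)
        ]
        # Preprocessing: project every point once per projection.
        self.projected = [[project(m, p) for p in points] for m in self.matrices]

    def query(self, q, per_projection=1):
        """Approximate nearest neighbor of q in the original space."""
        candidates = set()
        for m, proj_points in zip(self.matrices, self.projected):
            pq = project(m, q)  # project q the same way as the points
            ids = heapq.nsmallest(
                per_projection,
                range(len(proj_points)),
                key=lambda i: dist(proj_points[i], pq),
            )
            candidates.update(ids)
        # Naive nearest over the few candidates, at full dimensionality.
        best = min(candidates, key=lambda i: dist(self.points[i], q))
        return self.points[best]


rng = random.Random(1)
points = [tuple(rng.random() for _ in range(64)) for _ in range(100)]
index = ProjectionIndex(points, k_proj=8, n_projections=5)
query = tuple(rng.random() for _ in range(64))
print(index.query(query, per_projection=3) in points)  # → True
```

Raising `n_projections` or `per_projection` trades more work for a better chance that the true nearest neighbor is among the candidates, which is the resource-versus-accuracy exchange described on slide 50.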