
# Approximate nearest neighbor methods and vector models – NYC ML meetup

Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.

This presentation covers a library called Annoy, built by me, that helps you do (approximate) nearest neighbor queries in high-dimensional spaces. We'll go through vector models, how to measure similarity, and why nearest neighbor queries are useful.

### Approximate nearest neighbor methods and vector models – NYC ML meetup

1. 1. Approximate nearest neighbors & vector models
2. 2. I’m Erik • @fulhack • Author of Annoy, Luigi • Currently CTO of Better • Previously 5 years at Spotify
3. 3. What’s nearest neighbor(s) • Let’s say you have a bunch of points
4. 4. Grab a bunch of points
5. 5. 5 nearest neighbors
6. 6. 20 nearest neighbors
7. 7. 100 nearest neighbors
8. 8. …But what’s the point? • vector models are everywhere • lots of applications (language processing, recommender systems, computer vision)
9. 9. MNIST example • 28x28 = 784-dimensional dataset • Define distance in terms of pixels:
10. 10. MNIST neighbors
11. 11. …Much better approach 1. Start with high dimensional data 2. Run dimensionality reduction to 10-1000 dims 3. Do stuff in a small dimensional space
12. 12. Deep learning for food • Deep model trained on a GPU on 6M random pics downloaded from Yelp • [Architecture diagram: stacks of 3x3 convolutions with 2x2 maxpools, from 156x156x32 down through 2x2x512; fully connected layers with dropout (2048, 2048); a 128-dimensional bottleneck layer; 1244 outputs]
13. 13. Distance in smaller space 1. Run image through the network 2. Use the 128-dimensional bottleneck layer as an item vector 3. Use cosine distance in the reduced space
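The cosine distance in step 3 is just one minus the cosine of the angle between two item vectors. A minimal sketch:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Vectors pointing the same way have distance 0; orthogonal ones have distance 1.
print(cosine_distance([1, 0], [2, 0]))  # → 0.0
print(cosine_distance([1, 0], [0, 1]))  # → 1.0
```

Because cosine distance ignores vector length, it only compares directions, which is usually what you want for embedding vectors.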
14. 14. Nearest food pics
15. 15. Vector methods for text • TF-IDF (old) – no dimensionality reduction • Latent Semantic Analysis (1988) • Probabilistic Latent Semantic Analysis (2000) • Semantic Hashing (2007) • word2vec (2013), RNN, LSTM, …
16. 16. Represent documents and/or words as f-dimensional vectors • [Scatter plot: axes “Latent factor 1” and “Latent factor 2”, with points for “banana”, “apple”, “boat”]
17. 17. Vector methods for collaborative filtering • Supervised methods: See everything from the Netflix Prize • Unsupervised: Use NLP methods
18. 18. CF vectors – examples • IPMF item-item: P(i → j) = exp(b_j^T b_i)/Z_i = exp(b_j^T b_i) / Σ_k exp(b_k^T b_i) • Vectors: p_ui = a_u^T b_i, sim_ij = cos(b_i, b_j) = b_i^T b_j / (|b_i| |b_j|), computed in O(f) • IPMF item-item MDS: P(i → j) = exp(−|b_j − b_i|²)/Z_i = exp(−|b_j − b_i|²) / Σ_k exp(−|b_k − b_i|²), sim_ij = −|b_j − b_i|² • [Table of similarities: sim(2pac, 2pac) = 1.0, sim(2pac, Notorious B.I.G.) = 0.91, sim(2pac, Dr. Dre) = 0.87, sim(2pac, Florence + the Machine) = 0.26, sim(Florence + the Machine, Lana Del Rey) = 0.81]
19. 19. Geospatial indexing • Ping the world: https://github.com/erikbern/ping • k-NN regression using Annoy
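k-NN regression, as used for the ping project, predicts a query point’s value as the mean value of its k nearest neighbors. A toy sketch with synthetic 2-D data (the coordinates and target function here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 2-D coordinates (think lat/long) with a measured value each.
coords = rng.uniform(-1, 1, size=(500, 2))
values = np.sin(coords[:, 0]) + rng.normal(0, 0.01, size=500)

def knn_regress(query, k=5):
    """k-NN regression: average the values of the k closest points."""
    dists = np.linalg.norm(coords - query, axis=1)
    return values[np.argpartition(dists, k)[:k]].mean()

pred = knn_regress(np.array([0.5, 0.0]))   # should be close to sin(0.5)
```

In the real project the brute-force distance scan would be replaced by an Annoy index lookup; the averaging step stays the same.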
20. 20. Nearest neighbors the brute force way • we can always do an exhaustive search to find the nearest neighbors • imagine MySQL doing a linear scan for every query…
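A sketch of that exhaustive search on synthetic data (sizes are illustrative): even fully vectorized with NumPy, every single query touches all n vectors, which is what makes brute force slow.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 128))   # hypothetical item vectors
query = rng.standard_normal(128)

# One distance per item: O(n * f) work for every query.
dists = np.linalg.norm(corpus - query, axis=1)
k = 10
idx = np.argpartition(dists, k)[:k]            # unordered k smallest distances
nearest = idx[np.argsort(dists[idx])]          # order those k by distance
```

`argpartition` avoids a full sort of all 100,000 distances, but the distance computation itself is still linear in the corpus size.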
21. 21. Using word2vec’s brute force search: $ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin → Qiantang_River 0.597229, Yangtse 0.587990, Yangtze_River 0.576738, lake 0.567611, rivers 0.567264, creek 0.567135, Mekong_river 0.550916, Xiangjiang_River 0.550451, Beas_river 0.549198, Minjiang_River 0.548721 (real 2m34.346s, user 1m36.235s, sys 0m16.362s)
22. 22. Introducing Annoy • https://github.com/spotify/annoy • mmap-based ANN library • Written in C++, with Python and R bindings • 585 stars on Github
23. 23. Using Annoy’s search: $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000 → Yangtse 0.907756, Yangtze_River 0.920067, rivers 0.930308, creek 0.930447, Mekong_river 0.947718, Huangpu_River 0.951850, Ganges 0.959261, Thu_Bon 0.960545, Yangtze 0.966199, Yangtze_river 0.978978 (real 0m0.470s, user 0m0.285s, sys 0m0.162s)
24. 24. Using Annoy’s search: $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000 → Qiantang_River 0.897519, Yangtse 0.907756, Yangtze_River 0.920067, lake 0.929934, rivers 0.930308, creek 0.930447, Mekong_river 0.947718, Xiangjiang_River 0.948208, Beas_river 0.949528, Minjiang_River 0.950031 (real 0m2.013s, user 0m1.386s, sys 0m0.614s)
25. 25. (performance)
26. 26. 1. Building an Annoy index
28. 28. Split it in two halves
29. 29. Split again
30. 30. Again…
31. 31. …more iterations later
32. 32. Side note: making trees small • Split until there are at most K items in each leaf (K ≈ 100) • Takes O(n/K) memory instead of O(n)
33. 33. Binary tree
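The construction in the slides above can be sketched as a recursive function: sample two points, split on the hyperplane equidistant between them, and recurse until each leaf holds at most K items. This is a toy version for illustration, not Annoy’s actual C++ implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def build_tree(vectors, ids, leaf_size=100):
    """Recursively split by random hyperplanes until leaves are small.
    A leaf is an array of item ids; an internal node is a
    (normal, midpoint, left, right) tuple describing its splitting plane."""
    if len(ids) <= leaf_size:
        return ids                               # leaf: a bucket of item ids
    i, j = rng.choice(ids, size=2, replace=False)
    normal = vectors[i] - vectors[j]             # plane normal
    midpoint = (vectors[i] + vectors[j]) / 2     # a point on the plane
    side = (vectors[ids] - midpoint) @ normal > 0
    if side.all() or not side.any():             # degenerate split: stop here
        return ids
    return (normal, midpoint,
            build_tree(vectors, ids[side], leaf_size),
            build_tree(vectors, ids[~side], leaf_size))

vectors = rng.standard_normal((10_000, 64))
tree = build_tree(vectors, np.arange(10_000))
```

Because the splits are equidistant between two random points, denser regions get split more often, which tends to keep the tree roughly balanced without any global optimization.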
34. 34. 2. Searching
35. 35. Nearest neighbors
36. 36. Searching the tree
37. 37. Problemo • The point that’s the closest isn’t necessarily in the same leaf of the binary tree • Two points that are really close may end up on different sides of a split • Solution: go to both sides of a split if it’s close
38. 38. Trick 1: Priority queue • Traverse the tree using a priority queue • sort by min(margin) for the path from the root
39. 39. Trick 2: many trees • Construct trees randomly many times • Use the same priority queue to search all of them at the same time
40. 40. heap + forest = best • Since we use a priority queue, we dive down the best splits with the biggest distance first • More trees always help! • The only constraint is that more trees require more RAM
41. 41. Annoy query structure 1. Use priority queue to search all trees until we’ve found k items 2. Take union and remove duplicates (a lot) 3. Compute distance for remaining items 4. Return the nearest n items
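The four steps above can be sketched end to end. This toy version assumes a tree layout where a leaf is an array of item ids and an internal node is a (normal, midpoint, left, right) tuple for its splitting plane; it illustrates the idea, not Annoy’s actual data structure:

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

def build(vectors, ids, leaf_size=20):
    """Random-hyperplane tree, as described in the building slides."""
    if len(ids) <= leaf_size:
        return ids
    i, j = rng.choice(ids, size=2, replace=False)
    normal, midpoint = vectors[i] - vectors[j], (vectors[i] + vectors[j]) / 2
    side = (vectors[ids] - midpoint) @ normal > 0
    if side.all() or not side.any():
        return ids
    return (normal, midpoint,
            build(vectors, ids[side], leaf_size),
            build(vectors, ids[~side], leaf_size))

def query(forest, vectors, q, k, search_k):
    # Step 1: best-first search over ALL trees with one priority queue,
    # ordered by the smallest plane margin seen on the path from the root.
    heap = [(-np.inf, n, root) for n, root in enumerate(forest)]
    heapq.heapify(heap)
    tiebreak = len(forest)
    candidates = set()                        # step 2: a set de-duplicates
    while heap and len(candidates) < search_k:
        neg_margin, _, node = heapq.heappop(heap)
        if not isinstance(node, tuple):       # leaf: collect candidate ids
            candidates.update(int(i) for i in node)
            continue
        normal, midpoint, left, right = node
        m = float((q - midpoint) @ normal)    # signed distance to the plane
        near, far = (left, right) if m > 0 else (right, left)
        margin = -neg_margin
        heapq.heappush(heap, (-min(margin, abs(m)), tiebreak, near))
        heapq.heappush(heap, (-min(margin, -abs(m)), tiebreak + 1, far))
        tiebreak += 2
    # Steps 3-4: exact distances for the candidates, return the k nearest.
    ids = np.fromiter(candidates, dtype=int)
    dists = np.linalg.norm(vectors[ids] - q, axis=1)
    return ids[np.argsort(dists)[:k]]

vectors = rng.standard_normal((2000, 16))
forest = [build(vectors, np.arange(2000)) for _ in range(10)]
nearest = query(forest, vectors, vectors[0], k=5, search_k=400)
```

Note how the far side of each split is pushed with a penalized priority (its margin made negative), so unpromising branches are only explored once the promising ones are exhausted.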
42. 42. Find candidates
43. 43. Take union of all leaves
44. 44. Compute distances
45. 45. Return nearest neighbors
46. 46. “Curse of dimensionality”
47. 47. Are we screwed? • Would be nice if the data has a much smaller “intrinsic dimension”!
48. 48. Improving the algorithm • [Chart: queries/s vs. 1-NN accuracy; up is faster, right is more accurate]
49. 49. ann-benchmarks • https://github.com/erikbern/ann-benchmarks
50. 50. perf/accuracy tradeoffs • [Chart: queries/s vs. 1-NN accuracy; curves for searching more nodes and for using more trees]
51. 51. Things that work • Smarter plane splitting • Priority queue heuristics • Search more nodes than number of results • Align nodes closer together
52. 52. Things that don’t work • Use lower-precision arithmetic • Priority queue by other heuristics (number of trees) • Precompute vector norms
53. 53. Things for the future • Use an optimization scheme for tree building • Add more distance functions (e.g. edit distance) • Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core operation, and arbitrary keys: https://github.com/Houzz/annoy2
54. 54. Thanks! • https://github.com/spotify/annoy • https://github.com/erikbern/ann-benchmarks • https://github.com/erikbern/ann-presentation • erikbern.com • @fulhack