5. Why k-means?
• Clustering allows fast search
– k-nn models allow agile modeling
– lots of data points, 10^8 typical
– lots of clusters, 10^4 typical
• Model features
– Distance to nearest centroids
– Poor man’s manifold discovery
6. What is Quality?
• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization to unseen data critical
– number of points per cluster
– distance distributions
– target function distributions
– model performance stability
8. The Problem
• Spirals are a classic “counterexample” for k-means
• Classic low-dimensional manifold with added noise
• But clustering still makes modeling work well
11. The Cluster Proximity Features
• Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing the number of clusters
• Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
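The proximity features above can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function names are hypothetical, the sign bit mentioned on the slide is left out, and the note that 4.3 bits matches roughly 20 clusters (log2 20 ≈ 4.32) is an inference, since the slide does not state the cluster count.

```python
import math

def proximity_features(point, centroids):
    """Return (nearest_id, second_id, d1, d2) for the two nearest centroids.

    Hypothetical helper: the slide's extra sign bit is not modeled here.
    """
    dists = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = dists[0], dists[1]
    return i1, i2, d1, d2

def bits_per_point(n_clusters):
    """Bits needed to name one centroid out of n_clusters."""
    return math.log2(n_clusters)
```

Each point is then represented by two small integers and two distances instead of its raw coordinates.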
14. The Limiting Case
• Too many clusters lead to over-fitting
• Which we mitigate by averaging over several nearby clusters
• In the limit we get k-nn modeling
– and probably use k-means to speed up search
16. Intuitive Theory
• Traditionally, minimize over all distributions
– optimization is NP-complete
– that isn’t like real data
• Recently, assume well-clusterable data
$\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$
• Interesting approximation bounds provable
$1 + O(\sigma^2)$
17. For Example
$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$
Grouping these two clusters seriously hurts squared distance
19. Lloyd’s Algorithm
• Part of CS folklore
• Developed in the late 50’s for signal quantization, published in the 80’s
    initialize k cluster centroids somehow
    for each of many iterations:
        for each data point:
            assign point to nearest cluster
        recompute cluster centroids from points assigned to clusters
• Highly variable quality, several restarts recommended
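The pseudocode above translates almost line for line into Python. The version below is a sketch under stated assumptions: "somehow" is taken to mean picking k random data points as seeds, an empty cluster keeps its previous centroid, and the function name and parameters are illustrative rather than taken from the talk.

```python
import math
import random

def lloyd(points, k, iterations=10, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Assumption: initialize "somehow" = k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: each centroid becomes the mean of its assigned points.
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids
```

Running it several times with different `seed` values and keeping the best result is one simple way to act on the "several restarts" advice.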
20. Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s
Result is that these two clusters get glued together
21. Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation
    initialize centroids randomly with distance maximizing tendency
    for each of a very few iterations:
        for each data point:
            assign point to nearest cluster
        recompute centroids using only points much closer to their own centroid than to the closest other cluster
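One way to read the trimmed centroid update is sketched below. It is an assumption-laden illustration, not the published algorithm: the "core" of a cluster is taken to be a ball whose radius is a fraction (`ball_fraction`, a hypothetical parameter) of the distance to the nearest other centroid, and the seeding step is not shown.

```python
import math

def ball_update(points, centroids, ball_fraction=0.5):
    """One trimmed update: recompute each centroid from its core ball only.

    Needs at least two centroids. With ball_fraction <= 0.5 every point
    inside a ball is automatically nearest to that ball's centroid, so no
    separate assignment step is required.
    """
    updated = []
    for i, c in enumerate(centroids):
        # Distance from this centroid to its closest neighbouring centroid.
        gap = min(math.dist(c, o) for j, o in enumerate(centroids) if j != i)
        core = [p for p in points if math.dist(p, c) < ball_fraction * gap]
        if core:
            updated.append([sum(col) / len(core) for col in zip(*core)])
        else:
            updated.append(list(c))  # nothing in the ball: keep the centroid
    return updated
```

Points lying between clusters fall outside every ball and so never pull a centroid toward its neighbour.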
22. Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
23. Surrogate Method
• Start with sloppy clustering into κ = k log n
clusters
• Use this sketch as a weighted surrogate for the
data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable
data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time
24. Algorithm Costs
• O(k d) per point per iteration for Lloyd’s algorithm
• Number of iterations not well known
• More than log n iterations is a reasonable assumption, so roughly O(k d log n) per point overall
25. Algorithm Costs
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
    O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
    O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy clusters may suffice
26. Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000,000
– k d log n = 2000 x 10 x 26 ≈ 500,000
– log k + log log n = 11 + 5 = 16
– 30,000 times faster is a bona fide big deal
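The arithmetic can be checked with a short script. It assumes base-2 logarithms (the slides do not state the base) and uses n = 10^8, the value for which log n is about 26. The function names are illustrative. The second quantity is log κ alone, as tabulated on the slide; the O(d log κ) bound quoted earlier would multiply it by d.

```python
import math

def lloyd_cost_per_point(k, d, n):
    """k d log n: Lloyd's cost per point, assuming about log2(n) iterations."""
    return k * d * math.log2(n)

def sketch_log_kappa(k, n):
    """log kappa = log k + log log n, with kappa = k log2(n)."""
    return math.log2(k * math.log2(n))

k, d, n = 2000, 10, 10**8
ratio = lloyd_cost_per_point(k, d, n) / sketch_log_kappa(k, n)
print(round(lloyd_cost_per_point(k, d, n)), round(sketch_log_kappa(k, n), 1), round(ratio))
```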
27. Pragmatics
• But this requires a fast search internally
• Have to cluster on the fly for sketch
• Have to guarantee sketch quality
• Previous methods had very high complexity
28. How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new centroid, where u is uniform random in [0, 1)
– Else add to nearest centroid
• If number of centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
• Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of
clustering original
– or we can just use the result directly
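The single-pass procedure can be sketched as follows. Several choices here are assumptions rather than details from the talk: the nearest centroid is found by exact search instead of an approximate one, a new centroid is opened with probability distance/threshold, the threshold grows by a fixed factor (`growth`) whenever the sketch overflows, and all names are hypothetical. `threshold` must be positive and `max_centroids` at least 1.

```python
import math
import random

def add_point(centroids, point, weight, threshold, rng):
    """Merge a weighted point into its nearest centroid or open a new one."""
    if centroids:
        # Exact search for brevity; the slides use an approximate search.
        dist, j = min((math.dist(point, c), j) for j, (c, _) in enumerate(centroids))
        if dist <= threshold and rng.random() >= dist / threshold:
            c, w = centroids[j]
            total = w + weight
            merged = [(a * w + b * weight) / total for a, b in zip(c, point)]
            centroids[j] = [merged, total]
            return
    centroids.append([list(point), weight])

def sketch(points, threshold, max_centroids, growth=1.5, seed=0):
    """Single pass over points; returns a list of [centroid, weight] pairs."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: re-cluster them with a higher threshold.
            threshold *= growth
            old, centroids = centroids, []
            for c, w in old:
                add_point(centroids, c, w, threshold, rng)
    return centroids
```

The weights always sum to the number of points seen, which is what lets the result stand in for the original data in a later weighted clustering pass.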
30. How Can We Search Faster?
• First rule: don’t do it
– If we can eliminate most candidates, we can do less work
– Projection search and k-means search
• Second rule: don’t do it
– We can convert big floating point math to clever bit-wise
integer math
– Locality sensitive hashing
• Third rule: reduce dimensionality
– Projection search
– Random projection for very high dimension
33. LSH Search
• Each random projection produces an independent sign bit
• If two vectors have the same projected sign bits, they
probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine
$\|x - y\|^2 = \|x\|^2 - 2 (x \cdot y) + \|y\|^2$
$= \|x\|^2 - 2 \|x\| \|y\| \cos\theta + \|y\|^2$
• We can replace (some) vector dot products with long integer XOR
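A minimal sketch of sign-bit hashing shows how XOR stands in for a dot product. It relies on the standard random-hyperplane result that two vectors at angle θ disagree on any one sign bit with probability θ/π. The function names, the Gaussian projections, and the 64-bit default in the usage note are illustrative choices, not details from the talk.

```python
import math
import random

def random_planes(dim, bits, seed=0):
    """One random Gaussian projection direction per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def signature(vec, planes):
    """Pack the sign of each projection into one Python integer."""
    sig = 0
    for plane in planes:
        dot = sum(a * b for a, b in zip(vec, plane))
        sig = (sig << 1) | (dot >= 0)
    return sig

def estimated_cosine(sig_a, sig_b, bits):
    """Estimate cos(theta) from the number of differing sign bits."""
    differing = bin(sig_a ^ sig_b).count("1")  # XOR, then popcount
    return math.cos(math.pi * differing / bits)
```

With `planes = random_planes(dim, 64)`, each vector is hashed once and every later comparison costs one XOR and one bit count.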
34. LSH Bit-match Versus Cosine
[Figure: plot with X axis running from 0 to 64 and Y axis from -1 to 1]
37. Parallel Speedup?
[Figure: time per point (μs) versus number of threads, with series labeled “Non-threaded”, “Threaded version” and “Perfect Scaling”]
38. What About Map-Reduce?
• Map-reduce implementation is nearly trivial
– Compute surrogate on each split
– Total surrogate is union of all partial surrogates
– Do in-memory clustering on total surrogate
• Threaded version shows linear speedup
already
– Map-reduce speedup is likely, not entirely
guaranteed
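The three map-reduce steps can be sketched in a few lines, assuming each split has already produced a surrogate as a list of (centroid, weight) pairs. Weighted Lloyd's iterations stand in here for the in-memory clustering step (the slides use ball k-means), the seeds are supplied by the caller, and all names are hypothetical.

```python
import math

def merge_surrogates(partials):
    """Reduce step: the total surrogate is the union of the partial ones."""
    return [entry for part in partials for entry in part]

def weighted_lloyd(surrogate, seeds, iterations=10):
    """Cluster weighted centroids in memory; a stand-in for ball k-means."""
    centroids = [list(s) for s in seeds]
    for _ in range(iterations):
        sums = [[0.0] * len(c) for c in centroids]
        totals = [0.0] * len(centroids)
        for point, weight in surrogate:
            j = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            totals[j] += weight
            for axis, x in enumerate(point):
                sums[j][axis] += weight * x
        # Weighted mean per cluster; an empty cluster keeps its old centroid.
        centroids = [[s / t for s in row] if t else c
                     for row, t, c in zip(sums, totals, centroids)]
    return centroids
```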
39. How Well Does it Work?
• Theoretical guarantees for well clusterable
data
– Shindler, Wong and Meyerson, NIPS, 2011
• Evaluation on synthetic data
– Rough clustering produces correct surrogates
– Ball k-means strategy 1 performance is very good
with large k
41. The Business Case
• Our customer has 100 million cards in
circulation
• Quick and accurate decision-making is key.
– Marketing offers
– Fraud prevention
42. Opportunity
• Demand for modeling is increasing rapidly
• So they are testing something simpler and
more agile
• Like k-nearest neighbor
43. What’s that?
• Find the k nearest training examples – lookalike
customers
• This is easy … but hard
– easy because it is so conceptually simple and you don’t
have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
• Initial rapid prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
45. Required Scale and Speed and Accuracy
• Want 20 million queries against 25 million
references in 10,000 s
• Should be able to search > 100 million
references
• Should be linearly and horizontally scalable
• Must have >50% overlap against reference
search
46. How Hard is That?
• 20 M x 25 M x 100 Flop = 50 P Flop
• 1 CPU = 5 Gflops
• We need 10 M CPU seconds => 1,000 CPUs to finish in 10,000 s
• Real-world efficiency losses may increase that by
10x
• Not good!
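The estimate is easy to redo from the stated inputs. The script below uses the 10,000 s deadline from the requirements slide; with these figures the division comes to about 1,000 CPUs before any efficiency losses. Variable names are illustrative.

```python
queries = 20_000_000          # 20 M queries
references = 25_000_000       # 25 M reference points
flop_per_comparison = 100     # per query-reference pair
cpu_flops = 5_000_000_000     # 5 Gflops per CPU
deadline_s = 10_000           # time budget from the requirements

total_flop = queries * references * flop_per_comparison   # 5e16 = 50 PFlop
cpu_seconds = total_flop / cpu_flops
cpus_needed = cpu_seconds / deadline_s
print(total_flop, cpu_seconds, cpus_needed)
```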
47. K-means Search
• First do clustering with lots (thousands) of clusters
• Then search nearest clusters to find nearest points
• We win if we find >50% overlap with “true” answer
• We lose if we can’t cluster super-fast
– more on this later
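A minimal sketch of k-means search and the overlap metric follows. It assumes the centroids already exist, probes a fixed number of nearest clusters (`n_probe`, a hypothetical parameter), and uses exhaustive distance computation inside those clusters. None of the names come from the talk.

```python
import math

def build_index(points, centroids):
    """Group point ids by their nearest centroid."""
    index = [[] for _ in centroids]
    for pid, p in enumerate(points):
        j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        index[j].append(pid)
    return index

def kmeans_search(query, points, centroids, index, n_results, n_probe):
    """Search only the n_probe clusters whose centroids are nearest the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [pid for j in order[:n_probe] for pid in index[j]]
    candidates.sort(key=lambda pid: math.dist(query, points[pid]))
    return candidates[:n_results]

def overlap(found, reference):
    """Fraction of the reference ("true") neighbours that were found."""
    return len(set(found) & set(reference)) / len(reference)
```

Raising `n_probe` trades speed for overlap: probing every cluster reproduces the exhaustive answer.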
50. Some Details
• Clumpy data works better
– Real data is clumpy
• Speedups of 100-200x seem practical with
50% overlap
– Projection search and LSH give additional 100x
• More experiments needed
51. Summary
• Nearest neighbor algorithms can be blazing
fast
• But you need blazing fast clustering
– Which we now have
52. Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Come get the slides at http://www.mapr.com/company/events/speaking/strata-10-2-12
• Contact me at tdunning@maprtech.com or
@ted_dunning
Editor's notes
The basic idea here is that I have colored slides to be presented by you in blue. You should substitute and reword those slides as you like. In a few places, I imagined that we would have fast back and forth as in the introduction or final slide where we can each say we are hiring in turn.
The overall thrust of the presentation is for you to make these points:
• Amex does lots of modeling
• it is expensive
• having a way to quickly test models and new variables would be awesome
• so we worked on a new project with MapR
My part will say the following:
• Knn basic pictorial motivation (could move to you if you like)
• describe knn quality metric of overlap
• show how bad metric breaks knn (optional)
• quick description of LSH and projection search
• picture of why k-means search is cool
• motivate k-means speed as tool for k-means search
• describe single pass k-means algorithm
• describe basic data structures
• show parallel speedup
Our summary should state that we have achieved:
• super-fast k-means clustering
• initial version of super-fast knn search with good overlap
The sub-bullets are just for reference and should be deleted later
This slide is red to indicate missing data
The idea here is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple. The second cyan. The third is uncertain, but probably isn’t green or cyan and probably is a bit more likely to be red than purple.