Oxford 05-oct-2012

Fast Single-pass k-means
Clustering

whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software
Foundation
– particularly Mahout, Zookeeper and Drill

• Contact me at
tdunning@maprtech.com
tdunning@apache.org
ted.dunning@gmail.com
@ted_dunning

Agenda
• Rationale
• Theory
– clusterable data, k-means failure modes, sketches
• Algorithms
– ball k-means, surrogate methods
• Implementation
– searchers, vectors, clusterers
• Results
• Application

Why k-means?
• Clustering allows fast search
– k-nn models allow agile modeling
– lots of data points, 10^8 typical
– lots of clusters, 10^4 typical
• Model features
– Distance to nearest centroids
– Poor man’s manifold discovery

What is Quality?
• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization to unseen data critical
– number of points per cluster
– distance distributions
– target function distributions
– model performance stability

The Problem
• Spirals are a classic “counter” example for
k-means
• Classic low dimensional manifold with added
noise

• But clustering still makes modeling work well

The Cluster Proximity Features
• Every point can be described by the nearest
cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point)
by increasing number of clusters
• Or by the proximity to the 2 nearest clusters (2
x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
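The two-nearest-cluster description above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a toy set of centroids; the name `proximity_features` is hypothetical, and the sign bit mentioned on the slide is left out because the slide does not define it.

```python
import math

def proximity_features(point, centroids):
    """Indices of, and distances to, the two nearest centroids."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return i1, i2, d1, d2

# Toy data (an assumption): three centroids in the plane and one point.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
nearest, second, d_nearest, d_second = proximity_features((0.9, 0.1), centroids)

# Bits needed to name one cluster; 20 clusters would need about 4.3 bits.
bits_per_index = math.log2(len(centroids))
```

The four returned values are the features a downstream model would consume in place of the raw coordinates.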

Diagonalized Cluster Proximity

The Limiting Case
• Too many clusters lead to over-fitting
• Which we mediate by averaging over several
nearby clusters
• In the limit we get k-nn modeling
– and probably use k-means to speed up search

Intuitive Theory
• Traditionally, minimize over all distributions
– optimization is NP-complete
– that isn’t like real data
• Recently, assume well-clusterable data

$\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$

• Interesting approximation bounds provable
$1 + O(\sigma^2)$

For Example
$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$

Grouping these
two clusters
seriously hurts
squared distance

Lloyd’s Algorithm
• Part of CS folk-lore
• Developed in the late 50’s for signal quantization, published
in the 80’s

initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters

• Highly variable quality, several restarts recommended
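A minimal runnable sketch of Lloyd's algorithm, assuming Euclidean distance and taking "somehow" to mean k distinct points drawn at random from the data. The function name, iteration count and toy data are illustrative assumptions.

```python
import math
import random

def lloyd(points, k, iterations=10, seed=0):
    """Plain Lloyd iterations on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # naive initialization
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        for j, members in enumerate(clusters):
            if members:                        # keep old centroid if empty
                centroids[j] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids

# Toy data (an assumption): two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(lloyd(points, k=2))
```

On data this well separated the starting seeds hardly matter; on harder data the result depends on them, which is why restarts are recommended.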

Typical k-means Failure

Selecting two seeds
here cannot be
fixed with Lloyd’s

Result is that these two
clusters get glued
together

Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real
cluster
• Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own
    centroid than to the closest other cluster
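The following is a sketch of the ball k-means idea, not the talk's implementation. Two choices are assumptions: seeding picks each new seed with probability proportional to squared distance from the existing seeds (as in k-means++), and "much closer" is read as "within `radius_fraction` (default 1/3) of the distance from the centroid to its closest other centroid". All function names are hypothetical.

```python
import math
import random

def seed_far_apart(points, k, rng):
    """Pick seeds one at a time, favoring points far from existing seeds."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(math.dist(p, s) for s in seeds) ** 2 for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

def ball_update(points, centroids, radius_fraction=1 / 3):
    """Recompute each centroid from the points inside its ball only."""
    # Distance from each centroid to its closest other centroid (needs k >= 2).
    gap = [min(math.dist(c, o) for i, o in enumerate(centroids) if i != j)
           for j, c in enumerate(centroids)]
    balls = [[] for _ in centroids]
    for p in points:
        d, j = min((math.dist(p, c), j) for j, c in enumerate(centroids))
        if d <= radius_fraction * gap[j]:      # outliers are skipped
            balls[j].append(p)
    return [tuple(sum(xs) / len(b) for xs in zip(*b)) if b else c
            for b, c in zip(balls, centroids)]

def ball_kmeans(points, k, iterations=2, seed=0):
    rng = random.Random(seed)
    centroids = seed_far_apart(points, k, rng)
    for _ in range(iterations):                # "a very few iterations"
        centroids = ball_update(points, centroids)
    return centroids

# Toy data (an assumption): two tight groups plus one outlier at (5, 5).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (5.0, 5.0)]
updated = ball_update(points, [(0.5, 0.5), (10.5, 10.5)])
```

In the toy call the outlier falls outside both balls, so `updated` holds the means of the two tight groups alone.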

Still Not a Win
• Ball k-means is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops
exponentially with k
• Alternative strategy has high probability of
success, but takes O(nkd + k^3 d) time

Surrogate Method
• Start with sloppy clustering into κ = k log n
clusters
• Use this sketch as a weighted surrogate for the
data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable
data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time

Algorithm Costs
• O(k d) per point per iteration for Lloyd’s
algorithm
• Number of iterations not well known
• Assuming at least log n iterations is reasonable, giving
O(k d log n) per point overall

Algorithm Costs
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d
(log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted
centroids
O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy clusters may suffice

Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 ≈ 500,000
– log k + log log n = 11 + 5 = 16
– 30,000 times faster is a bona fide big deal

Pragmatics
• But this requires a fast search internally
• Have to cluster on the fly for sketch
• Have to guarantee sketch quality
• Previous methods had very high complexity

How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new centroid, with u uniform random in [0, 1)
– Else add to nearest centroid
• If number of centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold

• Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of
clustering original
– or we can just use the result directly
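The steps above can be sketched as a single-pass routine. This is an illustration under stated assumptions, not the talk's code: a brute-force scan stands in for the approximate nearest-centroid search, a new centroid is started with probability d/threshold, and the threshold grows by a hypothetical factor `growth` each time the sketch overflows. All names are made up.

```python
import math
import random

def add_point(sketch, point, weight, threshold, rng):
    """Merge a weighted point into the sketch or start a new centroid."""
    if not sketch:
        sketch.append([point, weight])
        return
    d, j = min((math.dist(point, c), j) for j, (c, _) in enumerate(sketch))
    if d > threshold or rng.random() < d / threshold:
        sketch.append([point, weight])         # far away, or unlucky draw
    else:
        c, w = sketch[j]
        total = w + weight                     # weighted mean of the two
        merged = tuple((w * a + weight * b) / total for a, b in zip(c, point))
        sketch[j] = [merged, total]

def streaming_sketch(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over the data; returns weighted centroids and final threshold."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        add_point(sketch, tuple(p), 1.0, threshold, rng)
        while len(sketch) > max_centroids:
            # Too many centroids: re-cluster them with a higher threshold.
            threshold *= growth
            old, sketch = sketch, []
            for c, w in old:
                add_point(sketch, c, w, threshold, rng)
    return sketch, threshold

# Toy data (an assumption): two Gaussian blobs of 100 points each.
data_rng = random.Random(1)
data = [(data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
        for cx, cy in [(0.0, 0.0), (5.0, 5.0)] for _ in range(100)]
sketch, final_threshold = streaming_sketch(data, max_centroids=20, threshold=0.01)
```

The weights in `sketch` always sum to the number of points seen, so the result can be handed to a weighted clusterer such as ball k-means, or used directly.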

How Can We Search Faster?
• First rule: don’t do it
– If we can eliminate most candidates, we can do less work
– Projection search and k-means search

• Second rule: don’t do it
– We can convert big floating point math to clever bit-wise
integer math
– Locality sensitive hashing

• Third rule: reduce dimensionality
– Projection search
– Random projection for very high dimension

Projection Search
total ordering!
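A minimal sketch of the idea behind projection search: projecting every vector onto one random direction gives a total ordering, so candidates near a query can be found by binary search and then re-ranked by true distance. The class name `ProjectionIndex` and the `window` parameter are assumptions; a production searcher would typically combine several projections.

```python
import bisect
import math
import random

class ProjectionIndex:
    """Points sorted by their projection onto one random direction."""

    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.direction = [rng.gauss(0, 1) for _ in range(dim)]
        self.entries = sorted((self._project(p), p) for p in points)
        self.keys = [key for key, _ in self.entries]

    def _project(self, p):
        return sum(a * b for a, b in zip(self.direction, p))

    def nearest(self, query, window=4):
        """Best true-distance match among points with nearby projections."""
        pos = bisect.bisect_left(self.keys, self._project(query))
        lo = max(0, pos - window)
        hi = min(len(self.entries), pos + window)
        candidates = [p for _, p in self.entries[lo:hi]]
        return min(candidates, key=lambda p: math.dist(p, query))

# Toy data (an assumption): 50 random points in 5 dimensions.
data_rng = random.Random(2)
points = [tuple(data_rng.gauss(0, 1) for _ in range(5)) for _ in range(50)]
index = ProjectionIndex(points)
query = tuple(data_rng.gauss(0, 1) for _ in range(5))
approx = index.nearest(query, window=8)
exact = index.nearest(query, window=len(points))   # window covers everything
```

A small `window` trades accuracy for speed: only 2 x `window` true distances are computed per query instead of one per point.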

LSH Search
• Each random projection produces independent sign bit
• If two vectors have the same projected sign bits, they
probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine

$\|x - y\|^2 = \|x\|^2 - 2 (x \cdot y) + \|y\|^2$
$= \|x\|^2 - 2 \|x\| \|y\| \cos\theta + \|y\|^2$
• We can replace (some) vector dot products with long
integer XOR
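A small sketch of sign-bit hashing with 64 random projections, so that comparing two vectors reduces to an XOR and a bit count. It relies on the standard property that a random hyperplane separates two vectors with probability θ/π; the function names and the choice of Gaussian projections are assumptions.

```python
import math
import random

BITS = 64

def make_projections(dim, seed=0):
    """BITS random Gaussian directions in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(BITS)]

def sign_bits(vector, projections):
    """Pack the sign of each random projection into one integer."""
    h = 0
    for proj in projections:
        dot = sum(a * b for a, b in zip(proj, vector))
        h = (h << 1) | (1 if dot >= 0 else 0)
    return h

def matching_bits(h1, h2):
    """Number of agreeing bits: XOR marks the disagreements."""
    return BITS - bin(h1 ^ h2).count("1")

def estimated_cosine(h1, h2):
    """Each bit differs with probability theta / pi, so invert that."""
    theta = math.pi * (1 - matching_bits(h1, h2) / BITS)
    return math.cos(theta)

projections = make_projections(dim=10)
x = [float(i + 1) for i in range(10)]
hx = sign_bits(x, projections)
h_scaled = sign_bits([2 * v for v in x], projections)   # same direction
h_flipped = sign_bits([-v for v in x], projections)     # opposite direction
```

Scaling a vector leaves every bit unchanged and negating it flips every bit, which is why the bit-match count tracks the cosine rather than the length.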

LSH Bit-match Versus Cosine
[Plot: cosine (-1 to 1) against number of matching bits (0 to 64)]

The Internals
• Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

• Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute

• Super-fast clustering
– Kmeans, StreamingKmeans

Parallel Speedup?
[Plot: time per point (μs) against number of threads, comparing the non-threaded version, the threaded version, and perfect scaling]

What About Map-Reduce?
• Map-reduce implementation is nearly trivial
– Compute surrogate on each split
– Total surrogate is union of all partial surrogates
– Do in-memory clustering on total surrogate
• Threaded version shows linear speedup
already
– Map-reduce speedup is likely, not entirely
guaranteed

How Well Does it Work?
• Theoretical guarantees for well clusterable
data
– Shindler, Wong and Meyerson, NIPS, 2011

• Evaluation on synthetic data
– Rough clustering produces correct surrogates
– Ball k-means strategy 1 performance is very good
with large k

The Business Case
• Our customer has 100 million cards in
circulation

• Quick and accurate decision-making is key.
– Marketing offers
– Fraud prevention

Opportunity
• Demand for modeling is increasing rapidly

• So they are testing something simpler and
more agile

• Like k-nearest neighbor

What’s that?
• Find the k nearest training examples – lookalike
customers

• This is easy … but hard
– easy because it is so conceptually simple and you don’t
have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results

• Initial rapid prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time

Required Scale and Speed and
Accuracy
• Want 20 million queries against 25 million
references in 10,000 s
• Should be able to search > 100 million
references
• Should be linearly and horizontally scalable
• Must have >50% overlap against reference
search

How Hard is That?
• 20 M x 25 M x 100 Flop = 50 P Flop

• 1 CPU = 5 Gflops

• We need 10 M CPU seconds => 1,000 CPU’s to finish in 10,000 s

• Real-world efficiency losses may increase that by
10x

• Not good!

K-means Search
• First do clustering with lots (thousands) of clusters

• Then search nearest clusters to find nearest points

• We win if we find >50% overlap with “true” answer

• We lose if we can’t cluster super-fast
– more on this later
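A minimal sketch of k-means search, assuming the centroids have already been computed by some fast clusterer. Points are bucketed by nearest centroid, and a query scans only the buckets of its `n_probe` nearest centroids. Function and parameter names are hypothetical.

```python
import math

def build_index(points, centroids):
    """Group points by their nearest centroid."""
    buckets = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        buckets[j].append(p)
    return buckets

def kmeans_search(query, centroids, buckets, n_results=3, n_probe=1):
    """Scan only the buckets of the n_probe centroids nearest the query."""
    order = sorted(range(len(centroids)),
                   key=lambda j: math.dist(query, centroids[j]))
    candidates = [p for j in order[:n_probe] for p in buckets[j]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:n_results]

def overlap(approx, exact):
    """Fraction of the exact answer that the approximate answer recovered."""
    return len(set(approx) & set(exact)) / len(exact)

# Toy data (an assumption): six points on a line and two centroids.
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0),
          (10.0, 0.0), (11.5, 0.0), (13.0, 0.0)]
centroids = [(1.0, 0.0), (11.5, 0.0)]
buckets = build_index(points, centroids)

query = (2.0, 0.0)
approx = kmeans_search(query, centroids, buckets)
exact = sorted(points, key=lambda p: math.dist(query, p))[:3]
score = overlap(approx, exact)
```

Raising `n_probe` increases overlap with the exact answer at the cost of scanning more buckets; the 50% overlap target sets how small it can be.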

Some Details
• Clumpy data works better
– Real data is clumpy 

• Speedups of 100-200x seem practical with
50% overlap
– Projection search and LSH give additional 100x

• More experiments needed

Summary
• Nearest neighbor algorithms can be blazing
fast

• But you need blazing fast clustering
– Which we now have

Contact Me!
• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Come get the slides at
http://www.mapr.com/company/events/speaking/strata-10-2-12

• Contact me at tdunning@maprtech.com or
@ted_dunning
