Customer analysis at scale

Customer Behavior Analysis with
Large Scale k-means Analysis

whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
• Contact me at
tdunning@maprtech.com
tdunning@apache.org
ted.dunning@gmail.com
@ted_dunning
• Get slides and more info at
http://www.mapr.com/company/events/speaking/strata-10-2-12

Agenda
• Nearest neighbor models
• K-means algorithms
– O(k d log n) per point for Lloyd’s algorithm
– Surrogate (sketch) methods
• Results

Context
• Digital transformation.
• Data helps us better serve our customers.
• Privacy is paramount.

The Business Case
• Our customer has 100 million cards in
circulation
• Quick and accurate decision-making is key.
– Marketing offers
– Fraud prevention

Opportunity
• Demand for modeling is increasing rapidly
• So we are testing something simpler and more
agile
• Like k-nearest neighbor

What’s that?
• Find the k nearest training examples – lookalike
customers
• This is easy … but hard
– easy because it is so conceptually simple and you don’t
have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
• Initial rapid prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time

Required Scale, Speed, and Accuracy
• Want 20 million queries against 25 million
references in 10,000 s
• Should be able to search > 100 million
references
• Should be linearly and horizontally scalable
• Must have >50% overlap against reference
search
• Evaluation by sub-sampling is viable, but tricky

How Hard is That?
• 20 M x 25 M x 100 Flop = 50 P Flop
• 1 CPU = 5 Gflops
• We need 10 M CPU seconds => 1,000 CPU’s to finish in 10,000 s
• Real-world efficiency losses may increase that by 10x, to 10,000 CPU’s
• Not good!
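
The arithmetic on this slide can be checked in a few lines. This is a minimal sketch that uses only the slide's figures plus the 10,000 s time budget from the requirements slide; the variable names are illustrative.

```python
# Back-of-envelope cost of brute-force nearest neighbor search.
queries = 20e6           # 20 M queries
references = 25e6        # 25 M reference points
flop_per_pair = 100      # flop per query/reference comparison (slide's figure)
cpu_flops = 5e9          # 5 Gflops per CPU (slide's figure)
time_budget_s = 10_000   # wall-clock budget from the requirements slide

total_flop = queries * references * flop_per_pair   # 5e16 flop = 50 PFlop
cpu_seconds = total_flop / cpu_flops                # 1e7 = 10 M CPU-seconds
cpus_needed = cpu_seconds / time_budget_s           # at perfect efficiency
cpus_with_losses = 10 * cpus_needed                 # with a 10x efficiency loss

print(f"{total_flop:.0e} flop, {cpu_seconds:.0e} CPU-s, "
      f"{cpus_needed:.0f} to {cpus_with_losses:.0f} CPUs")
```

At perfect efficiency this comes to about 1,000 CPUs for the 10,000 s budget, and about 10,000 once a 10x efficiency loss is assumed.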

How Can We Search Faster?
• First rule: don’t do it
– If we can eliminate most candidates, we can do less work
– Projection search and k-means search
• Second rule: don’t do it
– We can convert big floating point math to clever bit-wise
integer math
– Locality sensitive hashing
• Third rule: reduce dimensionality
– Projection search
– Random projection for very high dimension

Projection Search
total ordering!
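
One way to read "total ordering" is that projecting every vector onto a single random direction gives a scalar that can be sorted, so candidates near the query can be found by position in the sorted list. The sketch below is an illustration under that assumption, not the Mahout implementation; the function names, the window size, and the use of several projections are all hypothetical choices.

```python
import bisect
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_projection_index(points, n_projections, rng):
    """For each random direction, keep (projection, point id) pairs in sorted order."""
    dim = len(points[0])
    index = []
    for _ in range(n_projections):
        direction = [rng.gauss(0, 1) for _ in range(dim)]
        ordered = sorted((dot(p, direction), i) for i, p in enumerate(points))
        index.append((direction, ordered))
    return index

def projection_search(index, points, query, window, k):
    """Collect points close to the query in each ordering, then rank them by true distance."""
    candidates = set()
    for direction, ordered in index:
        pos = bisect.bisect_left(ordered, (dot(query, direction), -1))
        lo, hi = max(0, pos - window), min(len(ordered), pos + window)
        candidates.update(i for _, i in ordered[lo:hi])
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]
```

Only the candidates inside the window get a full distance computation, which is where the saving over brute force would come from.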

LSH Search
• Each random projection produces independent sign bit
• If two vectors have the same projected sign bits, they
probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine
• We can replace (some) vector dot products with long
integer XOR
$$\|x - y\|^2 = \|x\|^2 - 2\,(x \cdot y) + \|y\|^2 = \|x\|^2 - 2\,\|x\|\,\|y\|\cos\theta + \|y\|^2$$
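
The sign-bit idea can be sketched in a few lines of Python. This is an illustration rather than the code behind the talk: the function names are hypothetical, and the cosine estimate relies on the standard result that a random hyperplane separates two vectors with probability θ/π.

```python
import math
import random

def random_hyperplanes(dim, n_bits, rng):
    """One Gaussian random direction per hash bit."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def sign_bits(vector, planes):
    """Pack the sign of each random projection into one integer."""
    bits = 0
    for plane in planes:
        bits <<= 1
        if sum(v * p for v, p in zip(vector, plane)) >= 0:
            bits |= 1
    return bits

def matching_bits(a_bits, b_bits, n_bits):
    """XOR marks the differing bits; the rest match."""
    return n_bits - bin(a_bits ^ b_bits).count("1")

def estimated_cosine(a_bits, b_bits, n_bits):
    """Estimate cos(theta) from the fraction of differing bits (about theta / pi)."""
    differing = n_bits - matching_bits(a_bits, b_bits, n_bits)
    return math.cos(math.pi * differing / n_bits)
```

With 64 bits, a vector compared with itself matches on all 64 bits (estimated cosine 1), and compared with its negation matches on none (estimated cosine -1).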

LSH Bit-match Versus Cosine
[Figure: number of matching LSH sign bits (x axis, 0 to 64) versus cosine (y axis, -1 to 1)]

K-means Search
• First do clustering with lots (thousands) of clusters
• Then search nearest clusters to find nearest points
• We win if we find >50% overlap with “true” answer
• We lose if we can’t cluster super-fast
– more on this later
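
The search step above can be sketched as follows. This is a minimal illustration that assumes the centroids already exist; the function names and the `n_probe` parameter (how many nearby clusters to scan) are hypothetical, and `overlap` is one simple way to measure agreement with a brute-force answer.

```python
import math

def build_cluster_index(points, centroids):
    """Record which points belong to each centroid (nearest-centroid assignment)."""
    members = [[] for _ in centroids]
    for i, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        members[nearest].append(i)
    return members

def kmeans_search(points, centroids, members, query, n_probe, k):
    """Scan only the n_probe clusters nearest to the query, then rank by true distance."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:n_probe]
    candidates = [i for c in probe for i in members[c]]
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]

def overlap(approx, exact):
    """Fraction of the exact answer that the approximate answer recovered."""
    return len(set(approx) & set(exact)) / len(exact)
```

Raising `n_probe` trades speed for overlap; the 50% target on the slide would be checked by comparing against a brute-force search on a sample of queries.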

Some Details
• Clumpy data works better
– Real data is clumpy 
• Speedups of 100-200x seem practical with 50% overlap
– Projection search and LSH can be used to accelerate that (somewhat)
• More experiments needed
• Definitely need fast search

Lloyd’s Algorithm
• Part of CS folk-lore
• Developed in the late 50’s for signal quantization, published
in 80’s
initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters
• Highly variable quality, several restarts recommended
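
The pseudocode above translates almost line for line into Python. In this minimal sketch the "somehow" of initialization is taken to be k distinct random data points, and an empty cluster keeps its previous centroid; both are assumptions, as are the function names.

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd(points, k, iterations, rng):
    centroids = rng.sample(points, k)          # initialize "somehow"
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign point to nearest cluster
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # recompute centroids; an empty cluster keeps its old centroid
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def total_cost(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
```

Because quality varies with the starting centroids, a caller would typically run `lloyd` several times with different seeds and keep the result with the lowest `total_cost`.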

Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of real
clusters
• Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
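
A rough Python sketch of these two ideas follows. It is an illustration under stated assumptions, not the talk's implementation: the "distance maximizing tendency" is taken to be sampling with probability proportional to squared distance from the centroids chosen so far, and "much closer" is taken to mean within a fraction `trim` of the distance to the nearest other centroid. All names and the default `trim` value are hypothetical.

```python
import math
import random

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def seed_centroids(points, k, rng):
    """Pick seeds at random, favoring points far from the seeds chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(math.dist(p, c) for c in centroids) ** 2 for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_update(points, centroids, trim):
    """Recompute each centroid from the points inside its ball only.

    With trim <= 0.5, every point inside a ball is also nearest to that
    centroid, so the assignment step is implicit. Requires at least 2 centroids.
    """
    updated = []
    for i, c in enumerate(centroids):
        nearest_other = min(math.dist(c, o) for j, o in enumerate(centroids) if j != i)
        ball = [p for p in points if math.dist(p, c) <= trim * nearest_other]
        updated.append(mean(ball) if ball else c)
    return updated

def ball_kmeans(points, k, iterations, rng, trim=1 / 3):
    centroids = seed_centroids(points, k, rng)
    for _ in range(iterations):
        centroids = ball_update(points, centroids, trim)
    return centroids
```

Points that fall between clusters land outside every ball, so they do not pull the centroids away from the cluster cores.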

Surrogate Method
• Start with sloppy clustering into κ = k log n
clusters
• Use this sketch as a weighted surrogate for the
data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable
data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time

Algorithm Costs
• O(k d log n) per point for Lloyd’s algorithm
… not so good for k = 2000, n = 10^8
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k +
log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
– result consists of k high-quality centroids
• This is a big deal:
– k log n = 2000 x 26 ≈ 50,000 (the factor d = 10 appears in both costs and cancels)
– log k + log log n ≈ 11 + 5 = 16
– roughly 3000 times faster makes the grade as a bona fide big deal
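
The cost comparison can be reproduced directly. This small sketch assumes base-2 logarithms (which is what the slide's value of 26 for log n implies) and treats the big-O expressions as exact operation counts, which is only a rough guide.

```python
import math

k, d, n = 2000, 10, 10**8

log_n = math.log2(n)                                         # about 26.6
lloyd_per_point = k * d * log_n                              # O(k d log n)
surrogate_per_point = d * (math.log2(k) + math.log2(log_n))  # O(d (log k + log log n))
speedup = lloyd_per_point / surrogate_per_point              # d cancels in the ratio

print(f"Lloyd: {lloyd_per_point:,.0f}  surrogate: {surrogate_per_point:,.0f}  "
      f"speedup: {speedup:,.0f}x")
```

Without rounding, the ratio comes out at a few thousand, in line with the slide's figure of roughly 3000.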

The Internals
• Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
• Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
• Super-fast clustering
– Kmeans, StreamingKmeans

How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new cluster
– Else add to nearest centroid
• If centroids > K ~ C log N
– Recursively cluster centroids with higher threshold
• Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of
clustering original
– or we can just use the result directly
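
The steps above can be sketched as a single-pass function. This is an illustration, not Mahout's StreamingKmeans: the nearest centroid is found by brute force rather than approximately, the "possibly new cluster" branch is assumed to fire with probability d / threshold, and the names and the `growth` factor are hypothetical.

```python
import math
import random

def streaming_sketch(weighted_points, max_centroids, threshold, rng, growth=1.5):
    """One pass over (point, weight) pairs; returns ([point, weight] centroids, threshold)."""
    centroids = []
    for point, weight in weighted_points:
        if not centroids:
            centroids.append([point, weight])
            continue
        j = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i][0]))
        d = math.dist(point, centroids[j][0])
        # Far points always start a centroid; nearer ones do so with probability d / threshold.
        if d > threshold or rng.random() < d / threshold:
            centroids.append([point, weight])
        else:
            pos, w = centroids[j]
            total = w + weight
            merged = tuple((c * w + x * weight) / total for c, x in zip(pos, point))
            centroids[j] = [merged, total]
        if len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the centroids themselves.
            centroids, threshold = streaming_sketch(
                centroids, max_centroids, threshold * growth, rng, growth)
    return centroids, threshold
```

The weights of the returned centroids always sum to the total input weight, which is what lets the sketch stand in for the original data in a later clustering pass.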

Parallel Speedup?
[Figure: time per point (μs) versus number of threads; series: threaded version, non-threaded, perfect scaling]

What About Map-Reduce
• Map-reduce implementation is nearly trivial
– Compute surrogate on each split
– Total surrogate is union of all partial surrogates
– Do in-memory clustering on total surrogate
• Threaded version shows linear speedup
already
– Map-reduce speedup is likely, not entirely
guaranteed
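
The three map-reduce steps can be mimicked in plain Python. This sketch runs in a single process and uses a simple greedy threshold sketch as a stand-in for the real surrogate computation; the function names, the threshold, and the choice of weighted Lloyd iterations for the final in-memory step are assumptions made for illustration.

```python
import math

def partial_surrogate(split, threshold):
    """Map step: greedy one-pass sketch of one split as [centroid, weight] pairs."""
    sketch = []
    for p in split:
        best = min(sketch, key=lambda cw: math.dist(p, cw[0]), default=None)
        if best is None or math.dist(p, best[0]) > threshold:
            sketch.append([p, 1.0])
        else:
            c, w = best
            best[0] = tuple((ci * w + xi) / (w + 1) for ci, xi in zip(c, p))
            best[1] = w + 1
    return sketch

def weighted_lloyd(weighted, centroids, iterations):
    """Final step: in-memory clustering of the weighted surrogate points."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p, w in weighted:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append((p, w))
        centroids = [
            tuple(sum(p[j] * w for p, w in g) / sum(w for _, w in g)
                  for j in range(len(c))) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

splits = [[(0, 0), (0, 1), (10, 10)], [(1, 0), (10, 11), (11, 10)]]
# The total surrogate is simply the union of the partial surrogates.
total = [cw for split in splits for cw in partial_surrogate(split, threshold=2.0)]
final = weighted_lloyd(total, [total[0][0], total[1][0]], iterations=5)
print(final)
```

Because each surrogate carries weights, the union needs no coordination between splits, and on this toy data the final centroids equal the means of the two original clumps.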

How Well Does it Work?
• Theoretical guarantees for well clusterable
data
– Shindler, Wong and Meyerson, NIPS, 2011
• Evaluation on synthetic data
– Rough clustering produces correct surrogates
– Possible issue in ball k-means initialization (still
produces good clustering on test data)

Summary
• Nearest neighbor algorithms can be blazing
fast
• But you need blazing fast clustering
– Which we now have

Contact Us!
• We’re hiring at MapR in US and Europe
• Amex is hiring in Phoenix and New York
• Come get the slides at
http://www.mapr.com/company/events/speaking/strata-10-2-12
• Contact Ted at tdunning@maprtech.com or
@ted_dunning
