Fast Single-pass k-means Clustering
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software
  Foundation
  – particularly Mahout, Zookeeper and Drill

• Contact me at
 tdunning@maprtech.com
 tdunning@apache.com
 ted.dunning@gmail.com
 @ted_dunning
Agenda
• Rationale
• Theory
  – clusterable data, k-means failure modes, sketches
• Algorithms
  – ball k-means, surrogate methods
• Implementation
  – searchers, vectors, clusterers
• Results
• Application
RATIONALE
Why k-means?
• Clustering allows fast search
  – k-nn models allow agile modeling
  – lots of data points, 10^8 typical
  – lots of clusters, 10^4 typical
• Model features
  – Distance to nearest centroids
  – Poor man’s manifold discovery
What is Quality?
• Robust clustering not a goal
  – we don’t care if the same clustering is replicated
• Generalization to unseen data critical
  – number of points per cluster
  – distance distributions
  – target function distributions
  – model performance stability
An Example
The Problem
• Spirals are a classic “counter” example for k-means
• Classic low dimensional manifold with added
  noise

• But clustering still makes modeling work well
An Example
The Cluster Proximity Features
• Every point can be described by the nearest
  cluster
  – 4.3 bits per point in this case
  – Significant error that can be decreased (to a point)
    by increasing number of clusters
• Or by the proximity to the 2 nearest clusters (2
  x 4.3 bits + 1 sign bit + 2 proximities)
  – Error is negligible
  – Unwinds the data into a simple representation
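The proximity features above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the cluster count of 20 is an assumption chosen because log2(20) ≈ 4.3 matches the bits quoted on the slide, which does not state the count itself.

```python
import math

def proximity_features(point, centroids, n_nearest=2):
    """Return [(cluster_index, distance), ...] for the n_nearest centroids."""
    ranked = sorted(range(len(centroids)),
                    key=lambda j: math.dist(point, centroids[j]))
    return [(j, math.dist(point, centroids[j])) for j in ranked[:n_nearest]]

def bits_per_cluster_id(n_clusters):
    """Bits needed to name one cluster out of n_clusters."""
    return math.log2(n_clusters)

# Describe the origin by its two nearest centroids (index and distance).
features = proximity_features((0, 0), [(3, 4), (1, 0), (0, 2)])
```

With one nearest cluster the point is summarized by an index alone; with two, the pair of distances says where the point sits between the two centroids.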
Diagonalized Cluster Proximity
Lots of Clusters Are Fine
The Limiting Case
• Too many clusters lead to over-fitting
• Which we mediate by averaging over several
  nearby clusters
• In the limit we get k-nn modeling
  – and probably use k-means to speed up search
THEORY
Intuitive Theory
• Traditionally, minimize over all distributions
  – optimization is NP-complete
  – that isn’t like real data
• Recently, assume well-clusterable data

            $\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$

• Interesting approximation bounds provable

            $1 + O(\sigma^2)$
For Example
            $\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$

  [Figure annotation: grouping these two clusters seriously hurts squared distance]
ALGORITHMS
Lloyd’s Algorithm
• Part of CS folk-lore
• Developed in the late 50’s for signal quantization, published
  in 80’s

  initialize k cluster centroids somehow
  for each of many iterations:
    for each data point:
         assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters

• Highly variable quality, several restarts recommended
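The pseudocode above can be made runnable as follows. This is a minimal sketch, not the talk's implementation: "somehow" is taken to mean k randomly sampled data points, and an empty cluster simply keeps its previous centroid. Both choices are assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def lloyd(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Initialize "somehow": here, k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for p in points:
            members[nearest(p, centroids)].append(p)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = [mean(m) if m else c for m, c in zip(members, centroids)]
    return centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = lloyd(points, 2)
```

On data this well separated any seeding recovers the two groups; on harder data the result depends on the seeds, which is why restarts are recommended.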
Typical k-means Failure

  [Figure annotation: selecting two seeds here cannot be fixed with Lloyd’s]
  [Figure annotation: the result is that these two clusters get glued together]
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation

  initialize centroids randomly with distance maximizing tendency
  for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own
    centroid than to the closest other cluster
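One way to read the ball k-means pseudocode is sketched below. The details are assumptions rather than the talk's implementation: seeding uses k-means++-style sampling as the "distance maximizing tendency", and "much closer" is modeled by a hypothetical `trim` fraction of the gap to the nearest other centroid. It expects k of at least 2 and at least k distinct points.

```python
import math
import random

def seed_far_apart(points, k, rng):
    """Pick seeds with probability proportional to squared distance
    from the seeds chosen so far (k-means++-style, an assumption)."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(math.dist(p, s) for s in seeds) ** 2 for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return [list(s) for s in seeds]

def ball_update(points, centroids, trim=0.33):
    """Recompute each centroid from the points inside its ball only."""
    updated = []
    for i, c in enumerate(centroids):
        gap = min(math.dist(c, o) for j, o in enumerate(centroids) if j != i)
        # With trim <= 0.5, any point in this ball is also nearest to c,
        # so the assignment step is implicit.
        core = [p for p in points if math.dist(p, c) <= trim * gap]
        updated.append([sum(col) / len(core) for col in zip(*core)]
                       if core else list(c))
    return updated

def ball_kmeans(points, k, iterations=3, trim=0.33, seed=0):
    rng = random.Random(seed)
    centroids = seed_far_apart(points, k, rng)
    for _ in range(iterations):
        centroids = ball_update(points, centroids, trim)
    return centroids
```

Because points outside the ball are ignored, a stray point between two clusters does not drag either centroid toward it, which a plain mean would allow.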
Still Not a Win
• Ball k-means is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops
  exponentially with k
• Alternative strategy has high probability of
  success, but takes O(nkd + k^3 d) time
Surrogate Method
• Start with sloppy clustering into κ = k log n
  clusters
• Use this sketch as a weighted surrogate for the
  data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable
  data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time
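To illustrate how a sketch can stand in for the data, the snippet below clusters weighted centroids, treating each one as if it were `weight` identical points. For brevity it swaps in a weighted Lloyd-style refinement rather than the ball k-means the slide calls for, and the function names and seed values are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_mean(pairs):
    """Mean of (centroid, weight) pairs, weighting each centroid."""
    total = sum(w for _, w in pairs)
    dim = len(pairs[0][0])
    return [sum(w * c[i] for c, w in pairs) / total for i in range(dim)]

def cluster_surrogate(sketch, seeds, iterations=10):
    """Refine seeds against a sketch of (centroid, weight) pairs."""
    centroids = [list(s) for s in seeds]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for c, w in sketch:
            i = min(range(len(centroids)),
                    key=lambda j: squared_distance(c, centroids[j]))
            groups[i].append((c, w))
        centroids = [weighted_mean(g) if g else old
                     for g, old in zip(groups, centroids)]
    return centroids

# Four sketch centroids summarizing eight original points.
sketch = [([0, 0], 3), ([2, 0], 1), ([10, 0], 2), ([12, 0], 2)]
final = cluster_surrogate(sketch, seeds=[[0, 0], [12, 0]])
```

The sketch holds only κ entries, so this pass touches far less data than the original n points and can be rerun at any time.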
Algorithm Costs
• O(k d) per point per iteration for Lloyd’s
  algorithm
• Number of iterations not well known
• More than log n iterations is a reasonable assumption,
  giving O(k d log n) per point
Algorithm Costs
• Surrogate methods
  – fast, sloppy single pass clustering with κ = k log n
  – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of κ weighted
    centroids
     O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
     O(κ d log k) or O(d log κ log k) for larger k, looser quality
  – result is k high-quality centroids
     • Even the sloppy clusters may suffice
Algorithm Costs
• How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 10^8 (log2 n ≈ 26)
  – Lloyd’s: k d log n = 2000 x 10 x 26 ≈ 500,000 operations per point
  – sketch: d (log k + log log n) = 10 x (11 + 5) = 160 operations per point
  – roughly 3,000 times faster is a bona fide big deal
Pragmatics
• But this requires a fast search internally
• Have to cluster on the fly for sketch
• Have to guarantee sketch quality
• Previous methods had very high complexity
How It Works
• For each point
   – Find approximately nearest centroid (distance = d)
   – If (d > threshold) new centroid
   – Else if (u < d/threshold) new centroid, where u is uniform random in [0, 1)
   – Else add to nearest centroid
• If number of centroids > κ ≈ C log N
   – Recursively cluster centroids with higher threshold

• Result is large set of centroids
   – these provide approximation of original distribution
   – we can cluster centroids to get a close approximation of
     clustering original
   – or we can just use the result directly
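The single-pass procedure above might look like the following sketch. Several details are assumptions: u is taken to be a uniform random number in [0, 1) and a new centroid is started when u < d/threshold, so that farther points are more likely to start one; the nearest centroid is found by brute force rather than an approximate search; and the `growth` factor for raising the threshold is a hypothetical parameter. It expects `max_centroids` of at least 1.

```python
import math
import random

def _add(sketch, point, weight, threshold, rng):
    """Add one weighted point: start a new centroid or merge into the nearest."""
    if not sketch:
        sketch.append([point, weight])
        return
    # Brute-force nearest centroid; a fast approximate search would go here.
    i = min(range(len(sketch)), key=lambda j: math.dist(point, sketch[j][0]))
    d = math.dist(point, sketch[i][0])
    if rng.random() < d / threshold:  # always true once d >= threshold
        sketch.append([point, weight])
    else:
        c, w = sketch[i]
        merged = [(w * a + weight * b) / (w + weight) for a, b in zip(c, point)]
        sketch[i] = [merged, w + weight]

def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One pass over points, returning [centroid, weight] pairs."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        _add(sketch, list(p), 1.0, threshold, rng)
        while len(sketch) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the sketch.
            threshold *= growth
            old, sketch = sketch, []
            for c, w in old:
                _add(sketch, c, w, threshold, rng)
    return sketch
```

The weights always sum to the number of points seen, so the sketch can be handed to a weighted clustering step or used directly as an approximation of the original distribution.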
IMPLEMENTATION
How Can We Search Faster?
• First rule: don’t do it
   – If we can eliminate most candidates, we can do less work
   – Projection search and k-means search

• Second rule: don’t do it
   – We can convert big floating point math to clever bit-wise
     integer math
   – Locality sensitive hashing

• Third rule: reduce dimensionality
   – Projection search
   – Random projection for very high dimension
Projection Search
               total ordering!
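Projecting every vector onto one direction gives each a single number, and numbers can be sorted, so a query only needs to look at its neighbors in that ordering. The sketch below shows the idea with a single random projection and a fixed window. The class name, the `window` parameter, and the use of one projection are assumptions for illustration and are not Mahout's `ProjectionSearch`.

```python
import bisect
import math
import random

class ProjectionSearch:
    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.direction = [rng.gauss(0, 1) for _ in range(dim)]
        # Sort points by their projection: this is the total ordering.
        self.entries = sorted((self._project(p), tuple(p)) for p in points)
        self.keys = [key for key, _ in self.entries]

    def _project(self, p):
        return sum(a * b for a, b in zip(p, self.direction))

    def nearest(self, query, window=4):
        """Check only the points whose projections are close to the query's."""
        pos = bisect.bisect_left(self.keys, self._project(query))
        lo, hi = max(0, pos - window), min(len(self.entries), pos + window)
        candidates = [p for _, p in self.entries[lo:hi]]
        return min(candidates, key=lambda p: math.dist(p, query))
```

Points that are close in space are close in every projection, but not the reverse, so a wider window or more projections trades speed for accuracy.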
How Many Projections?
LSH Search
• Each random projection produces independent sign bit
• If two vectors have the same projected sign bits, they
  probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine

      $\|x - y\|^2 = \|x\|^2 - 2 (x \cdot y) + \|y\|^2 = \|x\|^2 - 2 \|x\| \|y\| \cos\theta + \|y\|^2$
• We can replace (some) vector dot products with long
  integer XOR
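The sign-bit idea can be sketched as below. Each random projection contributes one bit, two signatures are compared with XOR and a bit count, and the fraction of differing bits estimates the angle, since a random hyperplane separates two vectors with probability θ/π. The function names and the 64-bit default are assumptions for illustration.

```python
import math
import random

def random_hyperplanes(dim, bits=64, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

def signature(vector, planes):
    """Pack one sign bit per random projection into a single integer."""
    sig = 0
    for plane in planes:
        dot = sum(a * b for a, b in zip(vector, plane))
        sig = (sig << 1) | (dot > 0)
    return sig

def estimated_cosine(sig_a, sig_b, bits=64):
    differing = bin(sig_a ^ sig_b).count("1")  # XOR, then count set bits
    return math.cos(math.pi * differing / bits)

planes = random_hyperplanes(dim=3)
a = signature([1.0, 2.0, 3.0], planes)
b = signature([1.1, 2.0, 2.9], planes)
approx = estimated_cosine(a, b)
```

Once signatures are stored, comparing a query against a candidate costs one XOR and one bit count instead of a d-dimensional dot product.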
[Figure: LSH Bit-match Versus Cosine; x axis from 0 to 64, y axis from -1 to 1]
Results with 32 Bits
The Internals
• Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid

• Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute

• Super-fast clustering
  – Kmeans, StreamingKmeans
Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the non-threaded version, the threaded version, and perfect scaling]
What About Map-Reduce?
• Map-reduce implementation is nearly trivial
  – Compute surrogate on each split
  – Total surrogate is union of all partial surrogates
  – Do in-memory clustering on total surrogate
• Threaded version shows linear speedup
  already
  – Map-reduce speedup is likely, not entirely
    guaranteed
How Well Does it Work?
• Theoretical guarantees for well clusterable
  data
  – Shindler, Wong and Meyerson, NIPS, 2011


• Evaluation on synthetic data
  – Rough clustering produces correct surrogates
  – Ball k-means strategy 1 performance is very good
    with large k
APPLICATION
The Business Case
• Our customer has 100 million cards in
  circulation

• Quick and accurate decision-making is key.
  – Marketing offers
  – Fraud prevention
Opportunity
• Demand for modeling is increasing rapidly

• So they are testing something simpler and
  more agile

• Like k-nearest neighbor
What’s that?
• Find the k nearest training examples – lookalike
  customers

• This is easy … but hard
   – easy because it is so conceptually simple and you don’t
     have knobs to turn or models to build
   – hard because of the stunning amount of math
   – also hard because we need top 50,000 results

• Initial rapid prototype was massively too slow
   – 3K queries x 200K examples takes hours
   – needed 20M x 25M in the same time
K-Nearest Neighbor Example
Required Scale and Speed and Accuracy
• Want 20 million queries against 25 million
  references in 10,000 s
• Should be able to search > 100 million
  references
• Should be linearly and horizontally scalable
• Must have >50% overlap against reference
  search
How Hard is That?
• 20 M x 25 M x 100 Flop = 50 P Flop

• 1 CPU = 5 Gflops

• We need 10 M CPU seconds => 1,000 CPUs for 10,000 s

• Real-world efficiency losses may increase that by
  10x

• Not good!
K-means Search
• First do clustering with lots (thousands) of clusters

• Then search nearest clusters to find nearest points

• We win if we find >50% overlap with “true” answer

• We lose if we can’t cluster super-fast
   – more on this later
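A sketch of the search scheme described above: points are grouped by nearest centroid ahead of time, and a query scans only the members of its few nearest clusters. The function names, the `probes` parameter, and the overlap measure as written here are assumptions for illustration. Points are tuples so they can be compared as set members.

```python
import math

def build_index(points, centroids):
    """Group points by their nearest centroid."""
    index = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]))
        index[i].append(p)
    return index

def kmeans_search(query, centroids, index, n_neighbors=3, probes=2):
    """Search only the members of the `probes` nearest clusters."""
    order = sorted(range(len(centroids)),
                   key=lambda j: math.dist(query, centroids[j]))
    candidates = [p for j in order[:probes] for p in index[j]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:n_neighbors]

def overlap(approx, exact):
    """Fraction of the exact neighbors that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)
```

Raising `probes` scans more points and raises the overlap with the exact answer, so it is the knob that trades speed against the 50% overlap target.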
Lots of Clusters Are Fine
Some Details
• Clumpy data works better
  – Real data is clumpy


• Speedups of 100-200x seem practical with
  50% overlap
  – Projection search and LSH give additional 100x


• More experiments needed
Summary
• Nearest neighbor algorithms can be blazing
  fast

• But you need blazing fast clustering
  – Which we now have
Contact Me!
• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Come get the slides at
   http://www.mapr.com/company/events/speaking/strata-10-
   2-12

• Contact me at tdunning@maprtech.com or
  @ted_dunning

Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Último (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Oxford 05-oct-2012

  • 2. whoami – Ted Dunning • Chief Application Architect, MapR Technologies • Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill • Contact me at tdunning@maprtech.com tdunning@apache.com ted.dunning@gmail.com @ted_dunning
  • 3. Agenda • Rationale • Theory – clusterable data, k-means failure modes, sketches • Algorithms – ball k-means, surrogate methods • Implementation – searchers, vectors, clusterers • Results • Application
  • 5. Why k-means? • Clustering allows fast search – k-nn models allow agile modeling – lots of data points, 10^8 typical – lots of clusters, 10^4 typical • Model features – Distance to nearest centroids – Poor man’s manifold discovery
  • 6. What is Quality? • Robust clustering not a goal – we don’t care if the same clustering is replicated • Generalization to unseen data critical – number of points per cluster – distance distributions – target function distributions – model performance stability
  • 8. The Problem • Spirals are a classic “counter” example for k-means • Classic low dimensional manifold with added noise • But clustering still makes modeling work well
  • 11. The Cluster Proximity Features • Every point can be described by the nearest cluster – 4.3 bits per point in this case – Significant error that can be decreased (to a point) by increasing number of clusters • Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities) – Error is negligible – Unwinds the data into a simple representation
  • 13. Lots of Clusters Are Fine
  • 14. The Limiting Case • Too many clusters lead to over-fitting • Which we mediate by averaging over several nearby clusters • In the limit we get k-nn modeling – and probably use k-means to speed up search
  • 16. Intuitive Theory • Traditionally, minimize over all distributions – optimization is NP-complete – that isn’t like real data • Recently, assume well-clusterable data: $\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$ • Interesting approximation bounds provable: $1 + O(\sigma^2)$
  • 17. For Example: $\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$. Grouping these two clusters seriously hurts squared distance
  • 19. Lloyd’s Algorithm • Part of CS folk-lore • Developed in the late 50’s for signal quantization, published in 80’s initialize k cluster centroids somehow for each of many iterations: for each data point: assign point to nearest cluster recompute cluster centroids from points assigned to clusters • Highly variable quality, several restarts recommended
  • 20. Typical k-means Failure • Selecting two seeds here cannot be fixed with Lloyd’s • Result is that these two clusters get glued together
  • 21. Ball k-means • Provably better for highly clusterable data • Tries to find initial centroids in the “core” of each real cluster • Avoids outliers in centroid computation initialize centroids randomly with distance maximizing tendency for each of a very few iterations: for each data point: assign point to nearest cluster recompute centroids using only points much closer than closest cluster
  • 22. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
  • 23. Surrogate Method • Start with sloppy clustering into κ = k log n clusters • Use this sketch as a weighted surrogate for the data • Cluster surrogate data using ball k-means • Results are provably good for highly clusterable data • Sloppy clustering is on-line • Surrogate can be kept in memory • Ball k-means pass can be done at any time
  • 24. Algorithm Costs • O(k d log n) per point per iteration for Lloyd’s algorithm • Number of iterations not well known • Iteration > log n reasonable assumption
  • 25. Algorithm Costs • Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point – fast, in-memory, high-quality clustering of κ weighted centroids O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality O(κ d log k) or O(d log κ log k) for larger k, looser quality – result is k high-quality centroids • Even the sloppy clusters may suffice
  • 26. Algorithm Costs • How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000 – k d log n = 2000 x 10 x 26 = 520,000 ≈ 500,000 – log k + log log n = 11 + 5 = 16 – roughly 30,000 times faster is a bona fide big deal
  • 27. Pragmatics • But this requires a fast search internally • Have to cluster on the fly for sketch • Have to guarantee sketch quality • Previous methods had very high complexity
  • 28. How It Works • For each point – Find approximately nearest centroid (distance = d) – If (d > threshold) new centroid – Else if (u < d/threshold) new centroid, where u is a uniform random number in [0, 1) – Else add to nearest centroid • If number of centroids > κ ≈ C log N – Recursively cluster centroids with higher threshold • Result is large set of centroids – these provide approximation of original distribution – we can cluster centroids to get a close approximation of clustering original – or we can just use the result directly
  • 30. How Can We Search Faster? • First rule: don’t do it – If we can eliminate most candidates, we can do less work – Projection search and k-means search • Second rule: don’t do it – We can convert big floating point math to clever bit-wise integer math – Locality sensitive hashing • Third rule: reduce dimensionality – Projection search – Random projection for very high dimension
  • 31. Projection Search total ordering!
  • 33. LSH Search • Each random projection produces independent sign bit • If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1) • Distance in L2 is closely related to cosine: $\|x - y\|^2 = \|x\|^2 - 2 (x \cdot y) + \|y\|^2 = \|x\|^2 - 2 \|x\| \|y\| \cos\theta + \|y\|^2$ • We can replace (some) vector dot products with long integer XOR
  • 34. LSH Bit-match Versus Cosine (plot; X axis from 0 to 64, Y axis from -1 to 1)
  • 36. The Internals • Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid • Searcher interface – ProjectionSearch, KmeansSearch, LshSearch, Brute • Super-fast clustering – Kmeans, StreamingKmeans
  • 37. Parallel Speedup? (plot; time per point (μs) versus number of threads, comparing the non-threaded version, the threaded version and perfect scaling)
  • 38. What About Map-Reduce? • Map-reduce implementation is nearly trivial – Compute surrogate on each split – Total surrogate is union of all partial surrogates – Do in-memory clustering on total surrogate • Threaded version shows linear speedup already – Map-reduce speedup is likely, not entirely guaranteed
  • 39. How Well Does it Work? • Theoretical guarantees for well clusterable data – Shindler, Wong and Meyerson, NIPS, 2011 • Evaluation on synthetic data – Rough clustering produces correct surrogates – Ball k-means strategy 1 performance is very good with large k
  • 41. The Business Case • Our customer has 100 million cards in circulation • Quick and accurate decision-making is key. – Marketing offers – Fraud prevention
  • 42. Opportunity • Demand for modeling is increasing rapidly • So they are testing something simpler and more agile • Like k-nearest neighbor
  • 43. What’s that? • Find the k nearest training examples – lookalike customers • This is easy … but hard – easy because it is so conceptually simple and you don’t have knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need top 50,000 results • Initial rapid prototype was massively too slow – 3K queries x 200K examples takes hours – needed 20M x 25M in the same time
  • 45. Required Scale and Speed and Accuracy • Want 20 million queries against 25 million references in 10,000 s • Should be able to search > 100 million references • Should be linearly and horizontally scalable • Must have >50% overlap against reference search
  • 46. How Hard is That? • 20 M x 25 M x 100 Flop = 50 P Flop • 1 CPU = 5 Gflops • We need 10 M CPU seconds => 10,000 CPU’s • Real-world efficiency losses may increase that by 10x • Not good!
  • 47. K-means Search • First do clustering with lots (thousands) of clusters • Then search nearest clusters to find nearest points • We win if we find >50% overlap with “true” answer • We lose if we can’t cluster super-fast – more on this later
  • 48. Lots of Clusters Are Fine
  • 49. Lots of Clusters Are Fine
  • 50. Some Details • Clumpy data works better – Real data is clumpy • Speedups of 100-200x seem practical with 50% overlap – Projection search and LSH give additional 100x • More experiments needed
  • 51. Summary • Nearest neighbor algorithms can be blazing fast • But you need blazing fast clustering – Which we now have
  • 52. Contact Me! • We’re hiring at MapR in US and Europe • MapR software available for research use • Come get the slides at http://www.mapr.com/company/events/speaking/strata-10-2-12 • Contact me at tdunning@maprtech.com or @ted_dunning
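
The per-point procedure on slide 28 ("How It Works") can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Mahout implementation: the function name `sketch` and its parameters are hypothetical, the nearest-centroid lookup is brute force rather than the approximate search the talk relies on, `u` is assumed to be a uniform draw in [0, 1) that creates a new centroid when `u < d/threshold`, and the recursive collapse that runs when the number of centroids exceeds κ is left out.

```python
import math
import random


def sketch(points, threshold, rng):
    """Single pass over `points`; returns a list of [centroid, weight] pairs.

    Hypothetical helper for illustration only. The recursive re-clustering
    step (triggered when there are more than kappa centroids) is omitted.
    """
    centroids = []  # each entry is [coordinates as a list of floats, weight]
    for p in points:
        if not centroids:
            centroids.append([list(p), 1.0])
            continue
        # Brute-force nearest centroid; a real sketch would use a fast
        # approximate searcher here.
        best = min(centroids, key=lambda c: math.dist(c[0], p))
        d = math.dist(best[0], p)
        u = rng.random()  # assumed: uniform random number in [0, 1)
        if d > threshold or u < d / threshold:
            # Far from everything, or selected with probability d/threshold:
            # start a new centroid at this point.
            centroids.append([list(p), 1.0])
        else:
            # Fold the point into the nearest centroid as a weighted mean.
            w = best[1]
            best[0] = [(w * c + x) / (w + 1.0) for c, x in zip(best[0], p)]
            best[1] = w + 1.0
    return centroids
```

The weights are what make the result usable as a surrogate: every input point is counted exactly once, so a later ball k-means pass can cluster the weighted centroids in place of the original data.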

Editor's notes

  1. The basic idea here is that I have colored slides to be presented by you in blue. You should substitute and reword those slides as you like. In a few places, I imagined that we would have fast back and forth as in the introduction or final slide where we can each say we are hiring in turn. The overall thrust of the presentation is for you to make these points: Amex does lots of modeling; it is expensive; having a way to quickly test models and new variables would be awesome; so we worked on a new project with MapR. My part will say the following: k-nn basic pictorial motivation (could move to you if you like); describe k-nn quality metric of overlap; show how bad metric breaks k-nn (optional); quick description of LSH and projection search; picture of why k-means search is cool; motivate k-means speed as tool for k-means search; describe single pass k-means algorithm; describe basic data structures; show parallel speedup. Our summary should state that we have achieved: super-fast k-means clustering; initial version of super-fast k-nn search with good overlap.
  2. The sub-bullets are just for reference and should be deleted later
  3. This slide is red to indicate missing data
  4. The idea here is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple. The second cyan. The third is uncertain, but probably isn’t green or cyan and probably is a bit more likely to be red than purple.
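
The guessing procedure in note 4 amounts to a vote among nearby labeled points. A minimal sketch follows, assuming the neighborhood is the k nearest examples rather than a fixed circle; the function name `knn_vote` and the toy data in the usage note are hypothetical and not taken from the slides.

```python
import math
from collections import Counter


def knn_vote(query, examples, k):
    """Count the labels of the k examples nearest to `query`.

    `examples` is a list of (point, label) pairs. Brute-force search is used
    for clarity; the talk's point is that this search must be made much faster.
    """
    nearest = sorted(examples, key=lambda e: math.dist(e[0], query))[:k]
    return Counter(label for _, label in nearest)
```

For example, `knn_vote((0.2, 0.2), examples, 3).most_common(1)` gives the most frequent color near the query, and the full counts show how uncertain the guess is when several colors are close.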