Practical Machine Learning with Mahout
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
  – particularly Mahout, Zookeeper and Drill

     (we’re hiring)

• Contact me at
 tdunning@maprtech.com
 tdunning@apache.com
 ted.dunning@gmail.com
 @ted_dunning
Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering
What Works at Scale
• Logging
• Counting
• Session grouping

• Really. Don’t bet on anything much more
  complex than these
• These are harder than they look
Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also
  bought y”

• But soooo much more is possible
Examples
• Customers buying books (Linden et al.)
• Web visitors rating music (Shardanand and Maes)
  or movies (Riedl et al.; Netflix)
• Internet radio listeners not skipping songs
  (Musicmatch)
• Internet video watchers watching >30 s
Dyadic Structure
• Functional
  – Interaction: actor -> item*
• Relational
  – Interaction ⊆ Actors x Items
• Matrix
  – Rows indexed by actor, columns by item
  – Value is count of interactions
• Predict missing observations
Recommendations Analysis
• R(x,y) = number of people who bought both x and y

select A.item_id as x, B.item_id as y, count(*) as n
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
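The same self-join can be sketched outside SQL. This is a minimal illustration, assuming the log is a list of (user_id, item_id) pairs; the function name `cooccurrence_counts` and the sample log are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

def cooccurrence_counts(log):
    """Count, for each ordered item pair (x, y), the users who have both."""
    items_by_user = defaultdict(set)  # a set per user plays the role of "distinct"
    for user_id, item_id in log:
        items_by_user[user_id].add(item_id)
    counts = Counter()
    for items in items_by_user.values():
        # self-join on user_id: every ordered pair of this user's items
        for x, y in product(items, repeat=2):
            counts[(x, y)] += 1
    return counts

# Hypothetical log; the repeated ("u1", "a") row is counted once.
log = [("u1", "a"), ("u1", "b"), ("u2", "a"),
       ("u2", "b"), ("u2", "c"), ("u1", "a")]
R = cooccurrence_counts(log)
```

As in the self-join, the diagonal entries R[(x, x)] simply count the users who have x.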
Recommendations Analysis

   $R_{ij} = \sum_u A_{ui} B_{uj}$

   $R = A^T B$
Fundamental Algorithmic Structure
• Cooccurrence
        $K = A^T A$

• Matrix approximation by factoring
        $A \approx U S V^T$

        $K \approx V S^2 V^T$
        $r = V S^2 V^T h$

• LLR
        $r = \mathrm{sparsify}(A^T A)\, h$
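One way to read the LLR step is as a filter that keeps only cooccurrences whose log-likelihood ratio (the G² statistic on a 2x2 contingency table) is large. The sketch below is an illustration under assumptions: the function names, the threshold parameter, and the toy counts are hypothetical, and it is not presented as Mahout's implementation.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy term used by the G^2 statistic."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11: users with both items, k12: x only, k21: y only, k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative rounding error

def sparsify(cooc, item_users, n_users, threshold):
    """Keep item pairs whose cooccurrence is anomalous according to LLR."""
    kept = {}
    for (x, y), k11 in cooc.items():
        if x == y:
            continue
        k12 = item_users[x] - k11
        k21 = item_users[y] - k11
        k22 = n_users - k11 - k12 - k21
        if llr(k11, k12, k21, k22) >= threshold:
            kept[(x, y)] = 1
    return kept

# Toy counts (hypothetical): a and b always occur together, a and c look independent.
cooc = {("a", "b"): 10, ("a", "c"): 5}
item_users = {"a": 10, "b": 10, "c": 10}
kept = sparsify(cooc, item_users, n_users=20, threshold=10.0)
```

The kept pairs form a sparse indicator matrix, which is then multiplied by the user history h as in the formula above.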
But Wait!
• Cooccurrence
      $K = A^T A$

• Cross occurrence
      $K = B^T A$
For example
• Users enter queries (A)
  – (actor = user, item=query)
• Users view videos (B)
  – (actor = user, item=video)
• A’A gives query recommendation
  – “did you mean to ask for”
• B’B gives video recommendation
  – “you might like these videos”
The punch-line
• B’A recommends videos in response to a
  query
  – (isn’t that a search engine?)
  – (not quite, it doesn’t look at content or meta-data)
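A toy version of the three products, as a sketch only: the matrices are made-up data with one row per user, and the helper names are hypothetical. Plain lists are used so the example has no dependencies.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Rows are users. A: users x queries, B: users x videos (hypothetical data).
A = [[1, 0],
     [1, 0],
     [0, 1]]
B = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]

query_to_query = matmul(transpose(A), A)   # A'A: query recommendation
video_to_video = matmul(transpose(B), B)   # B'B: video recommendation
video_for_query = matmul(transpose(B), A)  # B'A: rows are videos, columns are queries

# Score videos for a user who has just issued query 0.
h = [1, 0]
scores = matvec(video_for_query, h)
```

Note that nothing here looks at query text or video meta-data; the scores come only from which users did both things.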
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
  – “hombres del paco” times 400
  – not much else
• Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
  – This gives A = users x label clicks
• Remember viewing history
  – This gives B = users x items
• Cross recommend
  – B’A = label to item mapping
• After several users click, results are whatever
  users think they should be
Super-fast k-means Clustering
RATIONALE
What is Quality?
• Robust clustering not a goal
  – we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
THEORY
For Example

   $\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$

• Grouping these two clusters seriously hurts squared distance
ALGORITHMS
Typical k-means Failure
• Selecting two seeds here cannot be fixed with Lloyd's algorithm
• Result is that these two clusters get glued together
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real
  cluster
• Avoids outliers in centroid computation

  initialize centroids randomly with distance maximizing
  tendency
  for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their
    own centroid than to the closest other cluster
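The pseudocode above can be sketched in Python as follows. Several choices here are assumptions rather than details from the slides: the seeding uses k-means++-style sampling as the "distance maximizing tendency", the ball radius is a fraction `trim` of the distance to the nearest other centroid, and all names and sample points are hypothetical. It assumes k >= 2 and at least k distinct points.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def seed_centroids(points, k, rng):
    """Pick seeds one at a time, favoring points far from the seeds so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    rng = random.Random(seed)
    centroids = [list(c) for c in seed_centroids(points, k, rng)]
    for _ in range(iterations):  # only a very few iterations
        # Squared ball radius: a fraction of the distance to the nearest other centroid.
        radius2 = [trim * trim * min(sq_dist(c, o)
                                     for j, o in enumerate(centroids) if j != i)
                   for i, c in enumerate(centroids)]
        balls = [[] for _ in centroids]
        for p in points:
            i = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            if sq_dist(p, centroids[i]) <= radius2[i]:
                balls[i].append(p)  # points outside the ball are ignored
        for i, ball in enumerate(balls):
            if ball:  # keep the old centroid if its ball is empty
                centroids[i] = [sum(col) / len(ball) for col in zip(*ball)]
    return centroids

# Two tight groups plus one point in between (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (5.5, 5.5)]
centroids = ball_kmeans(points, k=2)
```

Because the centroid update uses only the points inside each ball, a stray point between clusters does not drag a centroid toward it. The result still depends on the seeding.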
Still Not a Win
• Ball k-means is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops
  exponentially with k
• Alternative strategy has high probability of
  success, but takes O(n k d + k^3 d) time

• But for big data, k gets large
Surrogate Method
• Start with sloppy clustering into lots of
  clusters
     κ = k log n clusters
• Use this sketch as a weighted surrogate for the
  data
• Results are provably good for highly
  clusterable data
Algorithm Costs
• Surrogate methods
  – fast, sloppy single pass clustering with κ = k log n
  – fast, sloppy search for nearest cluster,
     O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of κ weighted
    centroids
     O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
     O(κ d log k) or O(d log k (log k + log log n)) for larger
     k, looser quality
  – result is k high-quality centroids
     • For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000
  – k d log n = 2000 x 10 x 26 = 520,000
  – d (log k + log log n) = 10 (11 + 5) = 160
  – roughly 3,000 times faster is a bona fide big deal
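The comparison can be recomputed for other values of k, d and n with a few lines of Python. This is a sketch assuming base-2 logarithms and ignoring constant factors; the function name is hypothetical, and the exact figures depend on the log base and rounding, so they need not match the rounded numbers above.

```python
import math

def per_point_costs(k, d, n):
    """Rough per-point cost of finding the nearest sketch centroid."""
    kappa = k * math.log2(n)      # sketch size, kappa = k log n
    brute = d * kappa             # compare against every sketch centroid
    fast = d * math.log2(kappa)   # d log kappa = d (log k + log log n)
    return brute, fast

brute, fast = per_point_costs(k=2000, d=10, n=100_000)
speedup = brute / fast
```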
How It Works
• For each point
  – Find approximately nearest centroid (distance = d)
  – If (d > threshold) new centroid
  – Else if (u < d/threshold) new centroid, with u uniform random in [0, 1)
  – Else add to nearest centroid
• If number of centroids > κ ≈ C log N
  – Recursively cluster centroids with higher threshold
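The loop above can be sketched as a single-pass routine. This is an illustration under stated assumptions: it uses an exact nearest-centroid search in place of the approximate one, starts a new centroid with probability d/threshold, and multiplies the threshold by a hypothetical `growth` factor when the sketch gets too big. All names and defaults are made up for the example.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def absorb(centroids, point, weight, threshold, rng):
    """Add one weighted point: either start a new centroid or merge into the nearest."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    i = min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j][0]))
    dist = sq_dist(point, centroids[i][0]) ** 0.5
    if dist > threshold or rng.random() < dist / threshold:
        centroids.append([list(point), weight])  # far points tend to start new centroids
    else:
        mean, w = centroids[i]
        total = w + weight
        merged = [(m * w + x * weight) / total for m, x in zip(mean, point)]
        centroids[i] = [merged, total]           # weighted mean update

def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    if max_centroids < 1:
        raise ValueError("max_centroids must be at least 1")
    rng = random.Random(seed)
    centroids = []
    for p in points:
        absorb(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the sketch itself.
            threshold *= growth
            old, centroids = centroids, []
            for mean, w in old:
                absorb(centroids, mean, w, threshold, rng)
    return centroids

# Hypothetical data: 200 points on a small grid.
pts = [(float(i % 7), float(i % 11)) for i in range(200)]
sketch = streaming_sketch(pts, max_centroids=5)
```

Each entry of the result is a [mean, weight] pair, and the weights always sum to the number of points seen, which is what lets the sketch stand in for the data as a weighted surrogate.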
IMPLEMENTATION
But Wait, …
• Finding nearest centroid is inner loop

• This could take O( d κ ) per point and κ can be
  big

• Happily, approximate nearest centroid works
  fine
Projection Search
               total ordering!
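The general idea can be sketched as follows: projecting every vector onto a random direction gives a total ordering, and points that are close in space stay close in that ordering, so only a small window of candidates needs an exact distance check. The class below is an assumption-laden illustration with hypothetical names and parameters, not a description of Mahout's implementation.

```python
import bisect
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

class ProjectionSearch:
    def __init__(self, vectors, n_projections=4, window=8, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = vectors
        self.window = window
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(n_projections)]
        # Each random direction induces a total ordering of the vectors.
        self.orderings = [sorted((dot(u, v), i) for i, v in enumerate(vectors))
                          for u in self.directions]

    def nearest(self, query):
        """Return the index of the (approximately) nearest vector to query."""
        candidates = set()
        for u, order in zip(self.directions, self.orderings):
            pos = bisect.bisect_left(order, (dot(u, query),))
            lo, hi = max(0, pos - self.window), pos + self.window
            candidates.update(i for _, i in order[lo:hi])
        # Exact distance is computed only for the few candidates.
        return min(candidates, key=lambda i: sq_dist(query, self.vectors[i]))

# Hypothetical data: a 5 x 5 grid; a window covering everything makes the search exact.
vectors = [(float(x), float(y)) for x in range(5) for y in range(5)]
index = ProjectionSearch(vectors, window=25)
best = index.nearest((2.1, 2.9))
```

A smaller window trades accuracy for speed; more projections recover some of that accuracy.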
LSH Bit-match Versus Cosine
[Figure: plot with x axis from 0 to 64 and y axis from -1 to 1]
RESULTS
Parallel Speedup?
[Figure: time per point (μs, 10 to 200) versus number of threads (1 to 20), showing the non-threaded version, the threaded version at 2 to 16 threads, and a perfect-scaling line]
Quality
• Ball k-means implementation appears significantly
  better than simple k-means

• Streaming k-means + ball k-means appears to be about
  as good as ball k-means alone

• All evaluations on 20 newsgroups with held-out data

• Figure of merit is mean and median squared distance
  to nearest cluster
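The figure of merit is simple to compute. A minimal sketch, assuming held-out points and centroids are plain tuples of numbers; the function name and the sample values are hypothetical.

```python
from statistics import mean, median

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def clustering_quality(held_out, centroids):
    """Mean and median squared distance from each held-out point to its nearest centroid."""
    d = [min(sq_dist(p, c) for c in centroids) for p in held_out]
    return mean(d), median(d)

# Hypothetical held-out points and centroids.
held_out = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
centroids = [(0.0, 1.0), (10.0, 10.0)]
mean_d2, median_d2 = clustering_quality(held_out, centroids)
```

Scoring on held-out data measures generalization rather than agreement with any reference clustering.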
Contact Me!
• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Get the code as part of Mahout trunk (or 0.8 very soon)

• Contact me at tdunning@maprtech.com or @ted_dunning


• Share news with @apachemahout

More Related Content

What's hot

Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureDr. Christian Betz
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)Dongheon Lee
 
Java 8 - Return of the Java
Java 8 - Return of the JavaJava 8 - Return of the Java
Java 8 - Return of the JavaFredrik Vraalsen
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to pythonActiveState
 
[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務台灣資料科學年會
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)Tech in Asia ID
 
Generative Adversarial Networks 2
Generative Adversarial Networks 2Generative Adversarial Networks 2
Generative Adversarial Networks 2Alireza Shafaei
 
Foilsを使ってみた。
Foilsを使ってみた。Foilsを使ってみた。
Foilsを使ってみた。Keisuke OTAKI
 
Deep Generative Models
Deep Generative ModelsDeep Generative Models
Deep Generative ModelsMijung Kim
 

What's hot (9)

Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
 
Java 8 - Return of the Java
Java 8 - Return of the JavaJava 8 - Return of the Java
Java 8 - Return of the Java
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 
[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
 
Generative Adversarial Networks 2
Generative Adversarial Networks 2Generative Adversarial Networks 2
Generative Adversarial Networks 2
 
Foilsを使ってみた。
Foilsを使ってみた。Foilsを使ってみた。
Foilsを使ってみた。
 
Deep Generative Models
Deep Generative ModelsDeep Generative Models
Deep Generative Models
 

Viewers also liked

Couchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep diveCouchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep diveDipti Borkar
 
Development Platform as a Service - erfarenheter efter ett års användning - ...
Development Platform as a Service - erfarenheter efter ett års användning -  ...Development Platform as a Service - erfarenheter efter ett års användning -  ...
Development Platform as a Service - erfarenheter efter ett års användning - ...IBM Sverige
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
OpenStack Heat slides
OpenStack Heat slidesOpenStack Heat slides
OpenStack Heat slidesdbelova
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Rick Branson
 
A user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsA user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsSaltStack
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDBMongoDB
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Bolke de Bruin
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

Viewers also liked (13)

Couchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep diveCouchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep dive
 
Development Platform as a Service - erfarenheter efter ett års användning - ...
Development Platform as a Service - erfarenheter efter ett års användning -  ...Development Platform as a Service - erfarenheter efter ett års användning -  ...
Development Platform as a Service - erfarenheter efter ett års användning - ...
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
OpenStack Heat slides
OpenStack Heat slidesOpenStack Heat slides
OpenStack Heat slides
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)
 
A user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsA user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management tools
 
storm at twitter
storm at twitterstorm at twitter
storm at twitter
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Hadoop on the Cloud
Hadoop on the CloudHadoop on the Cloud
Hadoop on the Cloud
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDB
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Similar to Paris data-geeks-2013-03-28

Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksDatabricks
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectMorningstar Tech Talks
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016Duyhai Doan
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in PythonValerio Maggio
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithmsDuyhai Doan
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxSubrata Kumer Paul
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Gruter
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)Ali-ziane Myriam
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Hemant Jha
 

Similar to Paris data-geeks-2013-03-28 (20)

Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural Networks
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra Project
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
 
Class3
Class3Class3
Class3
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithms
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptx
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
 

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

More from Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Paris data-geeks-2013-03-28

  • 2. whoami – Ted Dunning • Chief Application Architect, MapR Technologies • Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill (we’re hiring) • Contact me at tdunning@maprtech.com tdunning@apache.com ted.dunning@gmail.com @ted_dunning
  • 3. Agenda • What works at scale • Recommendation • Unsupervised - Clustering
  • 4. What Works at Scale • Logging • Counting • Session grouping
  • 5. What Works at Scale • Logging • Counting • Session grouping • Really. Don’t bet on anything much more complex than these
  • 6. What Works at Scale • Logging • Counting • Session grouping • Really. Don’t bet on anything much more complex than these • These are harder than they look
  • 8. Recommendations • Special case of reflected intelligence • Traditionally “people who bought x also bought y” • But soooo much more is possible
  • 9. Examples • Customers buying books (Linden et al) • Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix) • Internet radio listeners not skipping songs (Musicmatch) • Internet video watchers watching >30 s
  • 10. Dyadic Structure • Functional – Interaction: actor -> item* • Relational – Interaction ⊆ Actors x Items • Matrix – Rows indexed by actor, columns by item – Value is count of interactions • Predict missing observations
  • 11. Recommendations Analysis • R(x,y) = # of people who bought x and also bought y
      select A.item_id as x, B.item_id as y, count(*)
      from (select distinct user_id, item_id from log) A
      join (select distinct user_id, item_id from log) B
        on A.user_id = B.user_id
      group by A.item_id, B.item_id
  • 17. Recommendations Analysis • $R_{ij} = \sum_u A_{ui} B_{uj}$, that is, $R = A^T B$
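The sum over users on this slide can be written out directly. The following is a minimal sketch, assuming A and B are small dense lists of lists with one row per user and one column per item; the function name `cross_counts` and the toy purchase data are illustrative and not from the talk.

```python
def cross_counts(A, B):
    """Return R with R[i][j] = sum over users u of A[u][i] * B[u][j]."""
    n_users = len(A)
    n_i, n_j = len(A[0]), len(B[0])
    R = [[0] * n_j for _ in range(n_i)]
    for u in range(n_users):
        for i in range(n_i):
            if A[u][i]:  # skip items this user never touched
                for j in range(n_j):
                    R[i][j] += A[u][i] * B[u][j]
    return R


# Hypothetical data: three users, three items; 1 means the user bought the item.
purchases = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

# With A = B this is the same count that the self-join query produces.
R = cross_counts(purchases, purchases)
for row in R:
    print(row)
```

Passing the same matrix twice gives the plain cooccurrence counts; passing two different matrices that share the same users gives counts across two kinds of behavior.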
  • 18. Fundamental Algorithmic Structure • Cooccurrence: $K = A^T A$ • Matrix approximation by factoring: $A \approx U S V^T$, $K \approx V S^2 V^T$, $r = V S^2 V^T h$ • LLR: $r = \mathrm{sparsify}(A^T A)\, h$
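The slide does not spell out the sparsify step. One common way to do it, sketched here as an assumption rather than as the talk's exact method, is to score each item pair with the log-likelihood ratio (G²) test on its 2x2 contingency table and keep only the pairs that score above a cutoff. The names `llr` and `sparsify`, the dictionary layout, and the threshold are all illustrative.

```python
from math import log


def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp round-off below zero


def sparsify(cooc, item_counts, n_users, threshold):
    """Keep only item pairs whose cooccurrence looks anomalous."""
    kept = {}
    for (i, j), k11 in cooc.items():
        k12 = item_counts[i] - k11          # users with i but not j
        k21 = item_counts[j] - k11          # users with j but not i
        k22 = n_users - k11 - k12 - k21     # users with neither
        if llr(k11, k12, k21, k22) > threshold:
            kept[(i, j)] = 1
    return kept


# Hypothetical counts: a and b always occur together, a and c are independent.
cooc = {("a", "b"): 10, ("a", "c"): 5}
item_counts = {"a": 10, "b": 10, "c": 10}
print(sparsify(cooc, item_counts, n_users=20, threshold=5.0))
```

The surviving pairs are stored as 1 rather than as a score, on the assumption that the test is used only to decide which entries to keep.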
  • 19. But Wait! • Cooccurrence: $K = A^T A$ • Cross occurrence: $K = B^T A$
  • 20. For example • Users enter queries (A) – (actor = user, item=query) • Users view videos (B) – (actor = user, item=video) • A’A gives query recommendation – “did you mean to ask for” • B’B gives video recommendation – “you might like these videos”
  • 21. The punch-line • B’A recommends videos in response to a query – (isn’t that a search engine?) – (not quite, it doesn’t look at content or meta-data)
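The query-to-video idea can be sketched with plain counting. This is a minimal illustration, assuming two logs of (user, query) and (user, video) pairs; the function names and the sample log entries are made up for the example, and a real system would also apply a significance test before trusting the counts.

```python
from collections import Counter, defaultdict


def cross_occurrence(query_log, view_log):
    """Count, for each query, how many users issued it and also viewed each video."""
    queries_by_user = defaultdict(set)
    for user, query in query_log:
        queries_by_user[user].add(query)
    videos_by_user = defaultdict(set)
    for user, video in view_log:
        videos_by_user[user].add(video)

    table = defaultdict(Counter)
    for user, queries in queries_by_user.items():
        for q in queries:
            for v in videos_by_user.get(user, ()):
                table[q][v] += 1
    return table


def recommend(table, query, n=3):
    """Return up to n videos most often viewed by users who issued the query."""
    return [video for video, _ in table[query].most_common(n)]


# Hypothetical logs.
query_log = [("u1", "paco"), ("u2", "paco"), ("u3", "cats")]
view_log = [("u1", "flamenco"), ("u1", "guitar"), ("u2", "flamenco"), ("u3", "kittens")]

table = cross_occurrence(query_log, view_log)
print(recommend(table, "paco"))
```

Nothing in this lookup inspects titles or other meta-data; the ranking comes entirely from what users who typed the query went on to watch.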
  • 22. Real-life example • Query: “Paco de Lucia” • Conventional meta-data search results: – “hombres del paco” times 400 – not much else • Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 24. Hypothetical Example • Want a navigational ontology? • Just put labels on a web page with traffic – This gives A = users x label clicks • Remember viewing history – This gives B = users x items • Cross recommend – B’A = label to item mapping • After several users click, results are whatever users think they should be
  • 27. What is Quality? • Robust clustering not a goal – we don’t care if the same clustering is replicated • Generalization is critical • Agreement to “gold standard” is a non-issue
  • 34. For Example • $\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$ • Grouping these two clusters seriously hurts squared distance
  • 36. Typical k-means Failure • Selecting two seeds here cannot be fixed with Lloyd's algorithm • Result is that these two clusters get glued together
  • 37. Ball k-means • Provably better for highly clusterable data • Tries to find initial centroids in the “core” of each real cluster • Avoids outliers in centroid computation
      initialize centroids randomly with distance maximizing tendency
      for each of a very few iterations:
        for each data point:
          assign point to nearest cluster
        recompute centroids using only points much closer than closest cluster
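The pseudocode above can be turned into a small runnable sketch. Two choices here are assumptions and not taken from the slides: seeding uses a deterministic farthest-first rule as a stand-in for the randomized distance-maximizing initialization, and "much closer" is read as "within a fraction `trim` of the distance from the assigned centroid to its nearest other centroid".

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_farthest_first(points, k):
    """Deterministic stand-in for distance-maximizing random seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    return centroids


def ball_kmeans(points, k, iterations=3, trim=1 / 3):
    centroids = seed_farthest_first(points, k)
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for p in points:
            d = [dist2(p, c) for c in centroids]
            nearest = min(range(k), key=d.__getitem__)
            # squared distance from the assigned centroid to its closest rival
            gap = min((dist2(centroids[nearest], c)
                       for i, c in enumerate(centroids) if i != nearest),
                      default=float("inf"))
            # use the point only if it sits well inside this centroid's ball
            if d[nearest] <= trim ** 2 * gap:
                members[nearest].append(p)
        for i, pts in enumerate(members):
            if pts:  # leave a centroid in place if its ball is empty
                centroids[i] = tuple(sum(x) / len(pts) for x in zip(*pts))
    return centroids


# Two tight groups plus one stray point between them.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 4)]
print(ball_kmeans(points, k=2))
```

In this toy data the stray point is still assigned to a cluster, but it falls outside the ball and so never pulls the centroid away from the core.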
  • 38. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes $O(nkd + k^3 d)$ time
  • 39. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes $O(nkd + k^3 d)$ time • But for big data, k gets large
  • 40. Surrogate Method • Start with sloppy clustering into lots of clusters: κ = k log n • Use this sketch as a weighted surrogate for the data • Results are provably good for highly clusterable data
  • 41. Algorithm Costs • Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, $O(d \log \kappa) = O(d (\log k + \log \log n))$ per point – fast, in-memory, high-quality clustering of κ weighted centroids: $O(\kappa k d + k^3 d) = O(k^2 d \log n + k^3 d)$ for small k, high quality; $O(\kappa d \log k)$ or $O(d \log \kappa \log k)$ for larger k, looser quality – result is k high-quality centroids • Even the sloppy surrogate may suffice
  • 42. Algorithm Costs • Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, $O(d \log \kappa) = O(d (\log k + \log \log n))$ per point – fast, in-memory, high-quality clustering of κ weighted centroids: $O(\kappa k d + k^3 d) = O(k^2 d \log n + k^3 d)$ for small k, high quality; $O(\kappa d \log k)$ or $O(d \log k (\log k + \log \log n))$ for larger k, looser quality – result is k high-quality centroids • For many purposes, even the sloppy surrogate may suffice
  • 43. Algorithm Costs • How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000,000 (base-2 logs) – k d log n = 2000 x 10 x 26 ≈ 500,000 – d (log k + log log n) = 10 (11 + 5) = 160 – 3,000 times faster is a bona fide big deal
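The per-point cost comparison can be checked numerically. This sketch assumes base-2 logarithms and picks n = 2^26 so that log n is exactly the 26 used in the slide arithmetic; the function name `sketch_speedup` is illustrative.

```python
from math import log2


def sketch_speedup(k, d, n):
    """Compare per-point cost of exact search over k log n centroids
    with approximate search costing d (log k + log log n)."""
    direct = k * d * log2(n)
    approx = d * (log2(k) + log2(log2(n)))
    return direct, approx, direct / approx


# n chosen so that log2(n) = 26, matching the factor of 26 on the slide
direct, approx, ratio = sketch_speedup(k=2000, d=10, n=2 ** 26)
print(f"direct ~ {direct:,.0f}, approximate ~ {approx:.0f}, ratio ~ {ratio:,.0f}")
```

Without rounding the logarithms the ratio comes out a little above 3,000, which agrees with the order of magnitude claimed on the slide.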
  • 45. How It Works • For each point – Find approximately nearest centroid (distance = d) – If (d > threshold) new centroid – Else if (u < d/threshold), with u drawn uniformly from [0, 1), new centroid – Else add to nearest centroid • If number of centroids > κ ≈ C log N – Recursively cluster centroids with higher threshold
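The single-pass procedure on this slide can be sketched in a few lines. Several details are assumptions made for the example: the nearest centroid is found exactly rather than approximately, a new centroid is opened with probability d/threshold (which also covers the d > threshold case), and the threshold grows by a fixed factor `growth` whenever the sketch is collapsed. All names are illustrative.

```python
import random


def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def add_point(centroids, p, w, threshold, rng):
    """Either open a new centroid at p or fold p into its nearest centroid."""
    if not centroids:
        centroids.append([p, w])
        return
    i = min(range(len(centroids)), key=lambda j: dist(p, centroids[j][0]))
    d = dist(p, centroids[i][0])
    if rng.random() < d / threshold:  # always true once d >= threshold
        centroids.append([p, w])
    else:
        c, cw = centroids[i]
        total = cw + w
        merged = tuple((a * cw + b * w) / total for a, b in zip(c, p))
        centroids[i] = [merged, total]


def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    rng = random.Random(seed)
    centroids = []  # each entry is [center, weight]
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # too many centroids: re-cluster them with a looser threshold
            threshold *= growth
            old, centroids = centroids, []
            for c, w in old:
                add_point(centroids, c, w, threshold, rng)
    return centroids


# Hypothetical data: 150 points around three centers.
data_rng = random.Random(1)
data = [(data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
        for cx, cy in [(0, 0), (5, 5), (9, 0)] for _ in range(50)]
sketch = streaming_sketch(data, max_centroids=20)
print(len(sketch), sum(w for _, w in sketch))
```

The result is a list of weighted centroids whose weights add up to the number of input points, so it can stand in for the data in a later high-quality clustering pass.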
  • 47. But Wait, … • Finding nearest centroid is inner loop • This could take O( d κ ) per point and κ can be big • Happily, approximate nearest centroid works fine
  • 48. Projection Search • total ordering!
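One way to read this slide is that projecting every vector onto a single direction gives a total ordering, so a binary search can find the neighborhood of a query and only a few candidates need a full distance computation. The sketch below follows that reading under stated assumptions: it uses one random projection and a fixed candidate window, and the class name and parameters are illustrative rather than taken from any library.

```python
import bisect
import random


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


class ProjectionSearch:
    """Approximate nearest-neighbor search using one random projection."""

    def __init__(self, vectors, window=4, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.direction = [rng.gauss(0, 1) for _ in range(dim)]
        self.window = window
        # sorting by the projected value gives a total ordering of the vectors
        self.entries = sorted((self._project(v), v) for v in vectors)
        self.keys = [key for key, _ in self.entries]

    def _project(self, v):
        return sum(a * b for a, b in zip(self.direction, v))

    def nearest(self, q):
        pos = bisect.bisect_left(self.keys, self._project(q))
        lo = max(0, pos - self.window)
        hi = min(len(self.entries), pos + self.window)
        # compute true distances only for the few candidates near q's projection
        candidates = [v for _, v in self.entries[lo:hi]]
        return min(candidates, key=lambda v: dist2(v, q))


# Hypothetical data: 50 random 5-dimensional vectors.
rng = random.Random(3)
vectors = [tuple(rng.gauss(0, 1) for _ in range(5)) for _ in range(50)]
index = ProjectionSearch(vectors, window=4)
print(index.nearest((0.1, -0.2, 0.3, 0.0, 0.5)))
```

A wider window trades speed for accuracy; with the window as large as the data set the search degenerates to an exact scan.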
  • 49. LSH Bit-match Versus Cosine • [Figure: plot with X axis from 0 to 64 and Y axis from -1 to 1]
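The relationship between matching hash bits and cosine can be illustrated with random-hyperplane hashing. This is a sketch under assumptions: each bit is the sign of a dot product with a random Gaussian vector, 64 bits are used to match the range of the plot's X axis, and the estimate relies on the standard fact that two vectors agree on a bit with probability 1 - θ/π, where θ is the angle between them. The function names are illustrative.

```python
import math
import random


def make_planes(dim, bits=64, seed=0):
    """One random hyperplane normal per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]


def signature(planes, v):
    """Bit i records which side of hyperplane i the vector falls on."""
    return [sum(a * b for a, b in zip(p, v)) >= 0 for p in planes]


def bit_matches(s1, s2):
    return sum(a == b for a, b in zip(s1, s2))


def estimated_cosine(matches, bits):
    # P(bit matches) = 1 - theta / pi, so theta is about pi * (1 - matches / bits)
    return math.cos(math.pi * (1 - matches / bits))


planes = make_planes(dim=10)
x = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5, 0.2, -0.7]
y = [-c for c in x]  # points in exactly the opposite direction

same = bit_matches(signature(planes, x), signature(planes, x))
opposite = bit_matches(signature(planes, x), signature(planes, y))
print(estimated_cosine(same, 64), estimated_cosine(opposite, 64))
```

For vectors at intermediate angles the estimate is noisy with only 64 bits, which is consistent with the scatter one would expect in a plot of bit matches against true cosine.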
  • 51. Parallel Speedup? • [Figure: time per point (μs) versus number of threads, comparing the non-threaded and threaded versions against perfect scaling]
  • 52. Quality • Ball k-means implementation appears significantly better than simple k-means • Streaming k-means + ball k-means appears to be about as good as ball k-means alone • All evaluations on 20 newsgroups with held-out data • Figure of merit is mean and median squared distance to nearest cluster
  • 53. Contact Me! • We’re hiring at MapR in US and Europe • MapR software available for research use • Get the code as part of Mahout trunk (or 0.8 very soon) • Contact me at tdunning@maprtech.com or @ted_dunning • Share news with @apachemahout