TunUp: A Distributed Cloud-based
Genetic Evolutionary Tuning for Data
Clustering

Gianmario Spacagna
gm.spacagna@gmail.com

March 2013



AgilOne, Inc.
1091 N Shoreline Blvd. #250
Mountain View, CA 94043
Agenda
1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions
Big Data
Business Intelligence
Why? Where? What? How?
Insights into customers, products and companies

Can someone else know your customers better than you?
Do you have the domain knowledge and the proper computing infrastructure?
Big Data as a Service (BDaaS)
Problem Description




[Figure: customers plotted along income and cost dimensions]
Tuning of Clustering
Algorithms
We need tuning when:
➢ a new algorithm or version is released
➢ we want to improve accuracy and/or performance
➢ a new customer comes and the system must be adapted to the new dataset and requirements
TunUp
Java framework integrating JavaML and Watchmaker

Main features:
➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Evaluation technique validation (Pearson correlation t-test)
➢ Full search space tuning
➢ Genetic algorithm tuning (local and parallel implementations)
➢ RESTful API for web service deployment (Tomcat on Amazon EC2)

Open source: http://github.com/gm-spacagna/tunup
k-means
Geometric hard-assignment clustering algorithm:
It partitions n data points into k clusters in which each point belongs to the cluster with the nearest centroid (mean).

Given k clusters S = {S1,...,Sk}, where xj is the jth point assigned to cluster Si with centroid μi, the goal of k-means is to minimize the Within-Cluster Sum of Squares:

WCSS = Σ_{i=1..k} Σ_{xj ∈ Si} ||xj − μi||²

Algorithm:
1. Initialization: a set of k random centroids is generated
2. Assignment: each point is assigned to the closest centroid
3. Update: the new centroids are computed as the means of the new clusters
4. Repeat from 2 until convergence (centroids are stable and do not change)
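The loop above can be sketched in a few lines. This is a minimal NumPy illustration of the same three steps, not TunUp's Java implementation; the optional `init` argument is an addition here so a run can be made deterministic:

```python
import numpy as np

def kmeans(points, k, max_iter=20, init=None, seed=0):
    """Minimal k-means; `init` optionally fixes the starting centroids."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: k random data points become the starting centroids
    centroids = (np.asarray(init, dtype=float) if init is not None
                 else points[rng.choice(len(points), k, replace=False)].astype(float))
    for _ in range(max_iter):
        # 2. Assignment: each point joins the cluster of its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each new centroid is the mean of its assigned points
        new = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        # 4. Convergence: stop when the centroids no longer move
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Note that step 2 uses plain Euclidean distance; the next slide generalizes exactly this step to other distance measures.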
k-means tuning
Input parameters required:
1. k = (2,...,40)
2. Distance measure, one of: Angular, Chebyshev, Cosine, Euclidean, Jaccard Index, Manhattan, Pearson Correlation Coefficient, Radial Basis Function Kernel, Spearman Footrule
3. Max iterations = 20 (fixed)

Different input parameters → very different outcomes!
Clustering Evaluation
Definition of cluster:
“A group of the same or similar elements gathered or occurring closely
together”

How do we evaluate whether a set of clusters is good or not?

“Clustering is in the eye of the beholder” [E. Castro, 2002]

Two main categories:
➢ Internal criterion: based only on the clustered data itself
➢ External criterion: based on benchmarks of pre-classified items
Internal Evaluation
Common goal is assigning better scores when there is:
➢ high intra-cluster similarity
➢ low inter-cluster similarity

The choice of the evaluation technique depends on the nature of the data and on the cluster model of the algorithm.

Cluster models:
➢ Distance-based (k-means)
➢ Distribution-based (EM clustering)
➢ Density-based (DBSCAN)
➢ Connectivity-based (linkage clustering)
Proposed techniques
AIC: measure of the relative amount of information lost by a statistical model. The clustering algorithm is modelled as a Gaussian mixture process. (inverted fn.: lower is better)

Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (natural fn.: higher is better)

Davies-Bouldin: average similarity between each cluster and its most similar one. (inverted fn.)

Silhouette: measure of how well each point lies within its cluster; indicates whether the object is correctly clustered or would fit better in the neighbouring cluster. (natural fn.)
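As a concrete example of one of these internal criteria, the Dunn index defined above can be computed naively as follows (a back-of-the-envelope Python sketch; TunUp relies on the JavaML implementations):

```python
import numpy as np
from itertools import combinations

def dunn_index(points, labels):
    """Dunn index: min inter-cluster distance / max cluster diameter."""
    clusters = [points[labels == c] for c in np.unique(labels)]
    # Maximum cluster diameter: largest within-cluster pairwise distance
    diameter = max(max((np.linalg.norm(a - b) for a, b in combinations(c, 2)),
                       default=0.0) for c in clusters)
    # Minimum inter-cluster distance: closest pair of points in different clusters
    separation = min(np.linalg.norm(a - b)
                     for ci, cj in combinations(clusters, 2)
                     for a in ci for b in cj)
    return separation / diameter
```

Compact, well-separated clusters give a large ratio, which is why it is a "natural" function: higher scores mean better clusterings.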
External criterion:
AdjustedRand
Given a set of n elements S = {o1,...,on} and two partitions to compare,
X = {X1,...,Xr} and Y = {Y1,...,Ys}:

RandIndex = (number of agreements between X and Y) / (total number of possible pair combinations)

AdjustedRandIndex = (RandIndex − ExpectedIndex) / (MaxIndex − ExpectedIndex)

We can use AdjustedRand as the reference for the best clustering evaluation and use it to validate the internal criteria.
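The two formulas above can be implemented directly with pair counting (a small illustrative Python sketch, using the standard contingency-table form of the adjusted index rather than TunUp's own code):

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(x, y):
    # Fraction of point pairs on which the two partitions agree:
    # together in both, or apart in both
    pairs = list(combinations(range(len(x)), 2))
    agree = sum((x[i] == x[j]) == (y[i] == y[j]) for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(x, y):
    # Pair-counting form of (RandIndex - Expected) / (Max - Expected):
    # corrects the Rand index for chance agreement
    n = len(x)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(x, y)).values())
    sum_a = sum(comb(v, 2) for v in Counter(x).values())
    sum_b = sum(comb(v, 2) for v in Counter(y).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions (up to label renaming) score 1.0, while a random partition scores around 0, which makes the adjusted index a usable ground-truth yardstick for the internal criteria.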
Correlation t-test
Pearson correlation over a set of 120 random k-means configuration evaluations.

Average correlations with AdjustedRand:
AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49
Dataset
D31: 3100 vectors, 2 dimensions, 31 clusters
S1: 5000 vectors, 2 dimensions, 15 clusters

Source: http://cs.joensuu.fi/sipu/datasets/
Initial Centroids issue
N. observations = 200
Input configuration: k = 31, Distance Measure = Euclidean

[Figure: distributions of the AdjustedRand and AIC scores over the runs]

We can consider the median value!
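The remedy on this slide amounts to repeating the evaluation over several random initializations and keeping the median score. A minimal sketch, where the hypothetical `evaluate` callable stands in for one k-means run plus its scoring (e.g. AIC):

```python
import statistics

def median_score(evaluate, n_runs=200):
    # evaluate(seed) -> evaluation score of one k-means run whose random
    # initial centroids depend on `seed`; the median damps the outliers
    # produced by unlucky initializations, unlike the mean
    return statistics.median(evaluate(seed) for seed in range(n_runs))
```

Using the median rather than the mean is the point: a few degenerate runs with pathological initial centroids barely move it.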
Full space evaluation
N executions averaged = 20

The global optimum is:
k = 36
DistanceMeasure = Euclidean
Genetic Algorithm Tuning
Selection: elitism + roulette wheel

Crossover:
[x1,x2,x3,x4,...,xm]
[y1,y2,y3,y4,...,ym]
        ↓
[x1,x2,x3,y4,...,ym]
[y1,y2,y3,x4,...,xm]

Mutation:
Pr(mutate ki → kj) ∝ 1 / distance(ki, kj)
Pr(mutate di → dj) = 1 / (Ndist − 1)
Tuning parameters:
Fitness evaluation: AIC
Prob. mutation: 0.5
Prob. crossover: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10

Relevant results:
➢ Best fitness value always decreasing
➢ Mean fitness value trend decreasing
➢ High standard deviation in the previous population often generates a better mean in the next population
Results

Test1:
k = 39, Distance Measure = Manhattan

Test2:
k = 33, Distance Measure = RBF Kernel

Test3:
k = 36, Distance Measure = Euclidean




Different results due to:
1. Early convergence
2. Random initial centroids
Parallel GA
Simulation: 10 evolutions, POP_SIZE = 5, no elitism
Amazon Elastic Compute Cloud (EC2): 10 × Micro instances

Optimal n. of servers = POP_SIZE − ELITISM

E[T single evolution] ≤
Conclusions
We developed, tested and analysed TunUp, an open-source solution for the evaluation, validation and tuning of data clustering algorithms.

Future applications:
➢ Tuning of existing algorithms
➢ Supporting the design of new algorithms
➢ Evaluation and comparison of different algorithms

Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master/slave parallel execution
➢ Random initial centroids
Questions?
Thank you! Tack! Grazie!
