CS-GN-TEAM: internal presentation




detecting novel associations in large data sets
Michele Filannino + You

Presented paper:
D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334,
no. 6062, pp. 1518-1524, 2011.

Manchester, 05/03/2012
presentation my research taster project




where we are




                         05/03/2012, Michele Filannino   2 / 36
Introduction
novel association
 ■ two variables, X and Y, are associated if there is a
   relationship between them
    ●   functional
    ●   non-functional

 ■ novel: not previously known

example
        f0      f1      f2       f3       f4      f5
s0    4.00   -0.76    5.00    12.00     8.22    1.83
s1    9.00    0.41   10.00    23.00    27.12    4.30
s2    3.00    0.14    4.00     0.00     0.56   -0.43
s3   10.00   -0.54   11.00   100.00    94.02    6.24
s4    5.00   -0.96    6.00    45.00    39.25    3.56
s5    2.00    0.91    3.00   123.00   125.73    2.97
s6    7.00    0.66    8.00     4.00     9.26    2.56
s7    8.00    0.99    9.00    -2.00     6.90    2.37
s8    1.00    0.84    2.00    36.00    37.68    1.58
s9    6.00   -0.28    7.00     0.00    -1.96    0.71


Data set 10x6

scatter plot: f0 vs. f2




f2(x) = f0(x) + 1

scatter plot: f0 vs. f1




no relation

correlation coefficients
           Pearson   Mutual Information   MI (normalised)
   f0-f5    0.63          2.45                 0.74
   f0-f1   -0.17          1.57                 0.47
   f0-f2    1.00          3.32                 1.00
   f2-f3   -0.08          3.12                 0.94
   f0-f3   -0.08          3.12                 0.94
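
The Pearson column can be recomputed from the 10x6 example data set. The following is a minimal sketch in plain Python; the helper name `pearson` is illustrative and not taken from the slides. Rounded to two decimals, the three pairs printed here agree with the table.

```python
from math import sqrt

# Columns f0..f3 copied from the 10x6 example data set (rows s0..s9).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]


def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


for name, a, b in [("f0-f1", f0, f1), ("f0-f2", f0, f2), ("f0-f3", f0, f3)]:
    print(name, round(pearson(a, b), 2))
```

Because f2 = f0 + 1, the f2-f3 row is necessarily identical to the f0-f3 row: Pearson's coefficient is unchanged by shifting a variable.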


pros. & cons.

 ■ Pearson’s coeff.
   ✔   closed interval result
   ✖   only linear relations
   ✖   feature independency

 ■ Mutual Information
   ✔   non linear relations
   ✖   only categorical data
   ✖   biased towards higher arity features
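
The "only categorical data" and "higher arity" drawbacks can be seen on the example data set. The sketch below treats every distinct numeric value as its own category and normalises by log2(n); both choices are assumptions inferred from the numbers in the correlation table, not something the slides state. Under them, the f0-f2 and f0-f3 rows are reproduced (3.32 / 1.00 and 3.12 / 0.94). The f0-f1 and f0-f5 rows are not, so those were presumably computed with some binning that the slides do not describe.

```python
from collections import Counter
from math import log2

# Columns copied from the 10x6 example data set (rows s0..s9).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), each distinct value being one category."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


for name, a, b in [("f0-f2", f0, f2), ("f0-f3", f0, f3)]:
    mi = mutual_information(a, b)
    # Assumed normalisation: divide by log2(n), the largest value MI can reach here.
    print(name, round(mi, 2), round(mi / log2(len(a)), 2))
```

With ten samples and almost all values distinct, f0 and f3 score 0.94 although their Pearson coefficient is close to zero: the score reflects the number of categories rather than any relationship.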




the new measure

motivations

■ generality:
   ●   capture a wide range of interesting associations, not
       limited to specific function types

■ equitability:
   ●   give similar scores to equally noisy relationships of
       different types


definition of MIC
 ■ Given a finite set D of ordered pairs, we can
    partition the X-values of D into x bins and the
    Y-values of D into y bins

 ■ We obtain a pair of partitions called an x-by-y grid
 D = (F0, F1)
 F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
 F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)
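
The partition above can be sketched in code by placing one cut point between each pair of neighbouring groups. The cut values below (midpoints between groups) and the helper name `bin_index` are assumptions made for illustration; any value between two neighbouring groups would give the same bins.

```python
from bisect import bisect
from collections import Counter

# Columns f0 and f1 from the 10x6 example data set (rows s0..s9).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]

# Cut points placed midway between the groups shown on the slide (assumed values).
x_cuts = [3.5, 6.5]                     # 3 bins over F0
y_cuts = [-0.65, -0.07, 0.275, 0.535]   # 5 bins over F1


def bin_index(value, cuts):
    """Index of the bin a value falls into, given sorted cut points."""
    return bisect(cuts, value)


# One (column, row) cell per sample: this is the grid applied to D.
cells = [(bin_index(x, x_cuts), bin_index(y, y_cuts)) for x, y in zip(f0, f1)]
x_sizes = Counter(col for col, _ in cells)
y_sizes = Counter(row for _, row in cells)

print(cells)
print(sorted(x_sizes.items()), sorted(y_sizes.items()))
```

The bin sizes printed at the end match the group sizes in F0 and F1 on the slide.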


x-by-y grid




2-by-4 grid

definition of MIC


 ■ given the grid we could calculate D|G, the frequency
   distribution induced by the points in D on the cells
   of G
    ●   different grids G result in different distributions D|G
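
A minimal sketch of D|G and of its mutual information follows, using columns f0 and f1 of the example data set. The two grids (a fine 3-by-5 grid and a coarse 2-by-2 grid) and all function names are illustrative assumptions. On this data the fine grid gives roughly 0.97 bits and the coarse one gives 0, which shows how strongly the result depends on the grid.

```python
from bisect import bisect
from collections import Counter
from math import log2

# Columns f0 and f1 from the 10x6 example data set (rows s0..s9).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]


def induced_distribution(xs, ys, x_cuts, y_cuts):
    """D|G: fraction of the points of D that fall in each cell of the grid G."""
    n = len(xs)
    counts = Counter((bisect(x_cuts, x), bisect(y_cuts, y)) for x, y in zip(xs, ys))
    return {cell: c / n for cell, c in counts.items()}


def mutual_information(dist):
    """Mutual information (bits) of a joint distribution over grid cells."""
    px, py = Counter(), Counter()
    for (i, j), p in dist.items():
        px[i] += p   # marginal over columns
        py[j] += p   # marginal over rows
    return sum(p * log2(p / (px[i] * py[j])) for (i, j), p in dist.items())


# Two hypothetical grids over the same data.
fine = induced_distribution(f0, f1, [3.5, 6.5], [-0.65, -0.07, 0.275, 0.535])
coarse = induced_distribution(f0, f1, [5.5], [0.0])

print(round(mutual_information(fine), 3))
print(round(abs(mutual_information(coarse)), 3))
```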




maximal MI over all grids




  number of columns   number of rows
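
The formula on this slide was an image and is missing from the text. As defined in the presented paper (Reshef et al.), with x the number of columns and y the number of rows, it should read as follows. The symbol for the set of all x-by-y grids is notation assumed here for readability.

```latex
I^{*}(D, x, y) \;=\; \max_{G \,\in\, \mathcal{G}(x, y)} I(D|_{G}),
\qquad
\mathcal{G}(x, y) = \{\text{grids with } x \text{ columns and } y \text{ rows}\}
```

Here I(D|G) is the mutual information of the frequency distribution that D induces on the cells of G.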




characteristic matrix




  Infinite matrix!
                           normalisation factor
                             (derived by MI definition)
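
The expression annotated on this slide is not preserved in the text. Following the presented paper, the entry of the characteristic matrix for an x-by-y grid is the maximal mutual information over such grids, written I*(D, x, y), divided by a normalisation factor:

```latex
M(D)_{x,y} \;=\; \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}
```

The normalisation follows from the definition of mutual information: I(X;Y) is at most min(H(X), H(Y)), and a variable with k bins has entropy at most log k, so every entry lies in [0, 1]. The matrix is infinite because x and y range over all integers from 2 upwards.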




Maximal Information Coeff.




       max grid size
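
The defining formula was an image on the original slide. In the presented paper, MIC is the largest entry of the characteristic matrix M(D) among grids whose number of cells stays below the maximal grid size B(n), where n is the number of samples in D:

```latex
\mathrm{MIC}(D) \;=\; \max_{x\,y \,<\, B(n)} M(D)_{x,y}
```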




matrix computation

■ space of grids grows exponentially
   ●   B(n) ≤ O(n^(1-ε)) for 0 < ε < 1

■ approximation of MIC
   ●   heuristic dynamic programming
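
To see why an approximation is needed, the sketch below searches every grid exhaustively. It is not the heuristic dynamic programming mentioned above, only a brute-force illustration for tiny inputs. The function names and the `max_cells` parameter (used here as "at most this many cells") are assumptions. Each axis with k distinct values has C(k-1, b-1) ways to form b bins, so the loops become infeasible very quickly.

```python
from bisect import bisect
from collections import Counter
from itertools import combinations
from math import log2

# Columns f0 and f2 from the 10x6 example data set (f2 = f0 + 1).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]


def candidate_cuts(values):
    """Midpoints between consecutive distinct values: every place a bin edge can go."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def grid_mi(xs, ys, x_cuts, y_cuts):
    """Mutual information (bits) of the distribution the points induce on one grid."""
    n = len(xs)
    cells = Counter((bisect(x_cuts, x), bisect(y_cuts, y)) for x, y in zip(xs, ys))
    cols, rows = Counter(), Counter()
    for (i, j), count in cells.items():
        cols[i] += count
        rows[j] += count
    return sum(count / n * log2(count * n / (cols[i] * rows[j]))
               for (i, j), count in cells.items())


def brute_force_mic(xs, ys, max_cells):
    """Largest normalised MI over every grid with at most max_cells cells."""
    x_all, y_all = candidate_cuts(xs), candidate_cuts(ys)
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):
            norm = log2(min(nx, ny))  # normalisation factor for this grid shape
            for x_cuts in combinations(x_all, nx - 1):
                for y_cuts in combinations(y_all, ny - 1):
                    best = max(best, grid_mi(xs, ys, x_cuts, y_cuts) / norm)
    return best


print(brute_force_mic(f0, f2, max_cells=4))
```

For the perfectly linear pair f0-f2 a 2-by-2 grid already reaches the maximum score of 1. Scores obtained this way for the other pairs need not match an approximate implementation, since the search strategy differs.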




MIC summary
✔   closed interval result
✔   non linear relations
✔   all types of data
✖   B(n) is crucial
     ✖   too high: non-zero scores even for random data
     ✖   too low: we search only for simple patterns
✖   still univariate

B(n) behaviour




B(n) behaviour




how to use it

python
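     from __future__ import print_function  # keeps print() working on Python 2 as well

     import xstats.MINE as MINE

     # paired observations; None entries are kept as on the slide
     # (presumably missing values)
     x = [40,50,None,70,80,90,100,110,120,130,140,150,
          160,170,180,190,200,210,220,230,240,250,260]

     y = [-0.07,-0.23,-0.1,0.03,-0.04,None,-0.28,-0.44,
          -0.09,0.12,0.06,-0.04,0.31,0.59,0.34,-0.28,-0.09,
          -0.44,0.31,0.03,0.57,0,0.01]

     # analyze_pair returns a dictionary of MINE statistics for the pair
     print("x y", MINE.analyze_pair(x, y))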



https://github.com/ajmazurie/xstats.MINE

python: result

 {'MCN': 2.5849625999999999,
  'MAS': 0.040419996,
  'pearson': 0.31553724,
  'MIC': 0.38196000000000002,
  'MEV': 0.27117000000000002,
  'non_linearity': 0.28239626000000001}
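
The numbers in this result suggest that `non_linearity` is MIC minus the squared Pearson coefficient, the MIC − ρ² statistic that the presented paper uses as a measure of nonlinearity. The check below only redoes the arithmetic on the values shown on the slide; that xstats.MINE computes the field this way is an inference from those values, not something confirmed by the slides.

```python
# Values copied from the result dictionary shown on the slide.
result = {
    'MCN': 2.5849625999999999,
    'MAS': 0.040419996,
    'pearson': 0.31553724,
    'MIC': 0.38196000000000002,
    'MEV': 0.27117000000000002,
    'non_linearity': 0.28239626000000001,
}

# MIC minus squared Pearson coefficient, compared with the reported field.
recomputed = result['MIC'] - result['pearson'] ** 2
print(recomputed, result['non_linearity'])
```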




correlation coefficients
           Pearson   Mutual Information   MI (normalised)   MIC
   f0-f5    0.63          2.45                 0.74          0.24
   f0-f1   -0.17          1.57                 0.47          0.24
   f0-f2    1.00          3.32                 1.00          1.00
   f2-f3   -0.08          3.12                 0.94          0.24
   f0-f3   -0.08          3.12                 0.94          0.24




MIC summary
✔   closed interval result
✔   non linear relations
✔   all types of data
✖   B(n) is crucial
     ✖   n is too low!
✖   still univariate


python
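 from __future__ import print_function  # keeps print() working on Python 2 as well

 import math
 import xstats.MINE as MINE

 # 1999 samples of a noiseless sine wave on (0, 20)
 x = [n*0.01 for n in range(1,2000)]
 y = [math.sin(v) for v in x]
 result = MINE.analyze_pair(x, y)

 print("MIC:", result['MIC'])
 print("Pearson:", result['pearson'])

 # output reported on the slide:
 #   MIC: 0.99999
 #   Pearson: -0.16366038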

conclusion

relationship types




Source: paper

relationship types




Source: paper

real application




Source: paper

suggestions

■ use MIC only when you have lots of samples
   ●   samples > 2000

■ use B(n) = n^0.6
■ don’t use it for all the possible pairs of features
   ●   it is slower than Pearson’s correlation coefficient or
       Mutual Information
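
The first two suggestions are linked: B(n) = n^0.6 bounds the number of grid cells, so small samples leave almost no grids to search. The sketch below lists the grid shapes (x, y) with x, y ≥ 2 and x·y < B(n), the condition used in the paper's definition of MIC. The function names are illustrative, and how a particular implementation treats very small n is not stated on the slides.

```python
def grid_budget(n, alpha=0.6):
    """B(n) = n**alpha, the bound on the number of grid cells x*y."""
    return n ** alpha


def allowed_shapes(n, alpha=0.6):
    """All grid shapes (x, y) with x, y >= 2 and x*y < B(n)."""
    b = grid_budget(n, alpha)
    limit = int(b) + 1
    return [(x, y) for x in range(2, limit) for y in range(2, limit) if x * y < b]


for n in (10, 100, 2000):
    print(n, round(grid_budget(n), 2), len(allowed_shapes(n)))
```

With n = 10 the bound is just under 4, so not even a 2-by-2 grid satisfies the strict inequality, while n = 2000 allows grids of up to about 95 cells.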


Thank you.

references

■ D. N. Reshef et al., “Detecting Novel Associations in
  Large Data Sets,” Science, vol. 334, no. 6062, pp.
  1518-1524, 2011.

■ D. N. Reshef et al., “Supporting Online Material for
  Detecting Novel Associations in Large Data Sets”


