Detecting novel associations in large data sets

CS-GN-TEAM: internal presentation

Michele Filannino + You

Presented paper:
D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334,
no. 6062, pp. 1518-1524, 2011.

Manchester, 05/03/2012

where we are

[diagram: presentation, my research, taster project]

05/03/2012, Michele Filannino 2 / 36


novel association
■ two variables, X and Y, are associated if there is a relationship between them
● functional
● non-functional

■ novel: not previously known



example
      f0      f1     f2      f3      f4     f5
s0   4.00   -0.76   5.00   12.00    8.22   1.83
s1   9.00    0.41  10.00   23.00   27.12   4.30
s2   3.00    0.14   4.00    0.00    0.56  -0.43
s3  10.00   -0.54  11.00  100.00   94.02   6.24
s4   5.00   -0.96   6.00   45.00   39.25   3.56
s5   2.00    0.91   3.00  123.00  125.73   2.97
s6   7.00    0.66   8.00    4.00    9.26   2.56
s7   8.00    0.99   9.00   -2.00    6.90   2.37
s8   1.00    0.84   2.00   36.00   37.68   1.58
s9   6.00   -0.28   7.00    0.00   -1.96   0.71

Data set 10x6





scatter plot: f0 vs. f2

f2(x) = f0(x) + 1





scatter plot: f0 vs. f1

no relationship


correlation coefficients
        Pearson   Mutual Information   MI (normalised)
f0-f5     0.63          2.45                0.74
f0-f1    -0.17          1.57                0.47
f0-f2     1.00          3.32                1.00
f2-f3    -0.08          3.12                0.94
f0-f3    -0.08          3.12                0.94
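The table can be recomputed from the example data set with a few lines of plain Python. This is a minimal sketch under two assumptions that are not stated on the slides: values are rounded to the nearest integer before being treated as categories for mutual information, and the normalised score divides by log2 of the number of samples. With these assumptions the figures agree with the table to two decimals. The function names are illustrative only.

```python
import math
from collections import Counter

# Columns copied from the 10x6 example data set.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]
f5 = [1.83, 4.30, -0.43, 6.24, 3.56, 2.97, 2.56, 2.37, 1.58, 0.71]


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mutual_information(xs, ys):
    """MI in bits; values are rounded to integers first (assumption)."""
    xs = [round(x) for x in xs]
    ys = [round(y) for y in ys]
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def normalised_mi(xs, ys):
    """MI divided by log2(number of samples) (assumption)."""
    return mutual_information(xs, ys) / math.log2(len(xs))


pairs = [("f0-f5", f0, f5), ("f0-f1", f0, f1), ("f0-f2", f0, f2),
         ("f2-f3", f2, f3), ("f0-f3", f0, f3)]
for name, a, b in pairs:
    print("%s  %6.2f  %5.2f  %5.2f" % (
        name, pearson(a, b), mutual_information(a, b), normalised_mi(a, b)))
```

Rounding is what makes mutual information usable here at all: on the raw continuous values every sample would be its own category.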



pros. & cons.

■ Pearson's coeff.
✔ closed interval result
✖ only linear relations
✖ feature independency

■ Mutual Information
✔ non linear relations
✖ only categorical data
✖ biased towards higher arity features



motivations

■ generality:
● capture a wide range of interesting associations, not limited to specific function types

■ equitability:
● give similar scores to equally noisy relationships of different types



definition of MIC
■ Given a finite set D of ordered pairs, we can partition the X-values of D into x bins and the Y-values of D into y bins

■ We obtain a pair of partitions called an x-by-y grid
D = (F0, F1)
F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)



x-by-y grid

2-by-4 grid


definition of MIC

■ given a grid G, we can compute D|G, the frequency distribution induced by the points of D on the cells of G
● different grids G result in different distributions D|G
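As a small illustration of D|G, the sketch below takes the (f0, f1) pairs of the example data set and the F0/F1 partitions shown earlier, counts the points falling into each grid cell, and computes the mutual information of the resulting distribution. The cut points are an assumption: they are placed midway between neighbouring bins, and the helper names are made up for this example.

```python
import math
from bisect import bisect_right
from collections import Counter

f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]

# Assumed cut points reproducing the partitions
# F0 = (1,2,3 | 4,5,6 | 7,8,9,10) and
# F1 = (-0.96,-0.76 | -0.54,-0.28 | 0.14 | 0.41 | 0.66,...,0.99).
x_cuts = [3.5, 6.5]
y_cuts = [-0.65, -0.07, 0.275, 0.535]


def induced_distribution(xs, ys, x_cuts, y_cuts):
    """Return D|G as a dict mapping (column, row) cells to frequencies."""
    n = len(xs)
    counts = Counter((bisect_right(x_cuts, x), bisect_right(y_cuts, y))
                     for x, y in zip(xs, ys))
    return {cell: c / n for cell, c in counts.items()}


def mutual_information(dist):
    """Mutual information (in bits) of a joint distribution over grid cells."""
    px, py = Counter(), Counter()
    for (i, j), p in dist.items():
        px[i] += p
        py[j] += p
    return sum(p * math.log2(p / (px[i] * py[j]))
               for (i, j), p in dist.items())


dist = induced_distribution(f0, f1, x_cuts, y_cuts)
for cell in sorted(dist):
    print(cell, dist[cell])
print("I(D|G) = %.3f bits" % mutual_information(dist))
```

For this particular grid the mutual information comes out at roughly 0.97 bits; moving any cut point changes D|G and therefore the score, which is why MIC searches over grids.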



maximal MI over all grids

(formula labels: number of columns, number of rows)



characteristic matrix

Infinite matrix!
(formula label: normalisation factor, derived from the MI definition)



Maximal Information Coefficient

(formula label: max grid size)
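In the notation of the presented paper, the three quantities named above can be written as follows. This is a reconstruction from the paper's definitions rather than a copy of the slides: $D|_G$ is the distribution induced by $D$ on a grid $G$, $I$ is mutual information, $n$ is the number of samples, and $B(n)$ is the maximal grid size.

```latex
% maximal MI over all grids with x columns and y rows
I^{*}(D, x, y) = \max_{G \,:\, G \text{ is an } x\text{-by-}y \text{ grid}} I(D|_G)

% characteristic matrix, with the normalisation factor \log \min\{x, y\}
M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}

% Maximal Information Coefficient, restricted to grids of size xy < B(n)
\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}
```

The normalisation keeps every entry of the characteristic matrix in [0, 1], which is where the "closed interval result" property of MIC comes from.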



matrix computation

■ space of grids grows exponentially
● B(n) ≤ O(n^(1-ε)) for 0 < ε < 1

■ approximation of MIC
● heuristic dynamic programming
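To make the cost of the search concrete, here is a deliberately simplified sketch. It is not the heuristic dynamic programming of the paper: it only tries equal-frequency grids on both axes, so it yields a lower bound on MIC. The function names and the default exponent 0.6 are assumptions made for this example, and ties between equal values are split arbitrarily.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal numbers of points."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def grid_mutual_information(xbins, ybins):
    """Mutual information (bits) of the distribution induced on the grid cells."""
    n = len(xbins)
    joint, px, py = Counter(zip(xbins, ybins)), Counter(xbins), Counter(ybins)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mic_equipartition(xs, ys, alpha=0.6):
    """Lower bound on MIC using only equal-frequency x-by-y grids with x*y < n**alpha."""
    budget = len(xs) ** alpha
    best = 0.0
    x = 2
    while x * 2 < budget:
        xbins = equal_frequency_bins(xs, x)
        y = 2
        while x * y < budget:
            ybins = equal_frequency_bins(ys, y)
            score = grid_mutual_information(xbins, ybins) / math.log2(min(x, y))
            best = max(best, score)
            y += 1
        x += 1
    return best


xs = list(range(100))
ys = [2 * v + 1 for v in xs]          # noiseless linear relationship
print("estimate: %.3f" % mic_equipartition(xs, ys))
```

Even this restricted search evaluates every admissible grid shape; the real algorithm additionally optimises where the cut points go, which is why an approximation is needed.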



MIC summary
✔ closed interval result
✔ non linear relations
✔ all types of data
✖ B(n) is crucial
● too high: non-zero scores even for random data
● too low: only simple patterns are searched for
✖ still univariate



B(n) behaviour



python
import xstats.MINE as MINE

x = [40,50,None,70,80,90,100,110,120,130,140,150,
160,170,180,190,200,210,220,230,240,250,260]

y = [-0.07,-0.23,-0.1,0.03,-0.04,None,-0.28,-0.44,
-0.09,0.12,0.06,-0.04,0.31,0.59,0.34,-0.28,-0.09,
-0.44,0.31,0.03,0.57,0,0.01]

# analyze_pair returns a dictionary of MINE statistics for the pair
result = MINE.analyze_pair(x, y)
print("x y %s" % (result,))

https://github.com/ajmazurie/xstats.MINE


python: result

{'MCN': 2.5849625999999999,
'MAS': 0.040419996,
'pearson': 0.31553724,
'MIC': 0.38196000000000002,
'MEV': 0.27117000000000002,
'non_linearity': 0.28239626000000001}



correlation coefficients
        Pearson   Mutual Information   MI norm.   MIC
f0-f5     0.63          2.45             0.74     0.24
f0-f1    -0.17          1.57             0.47     0.24
f0-f2     1.00          3.32             1.00     1.00
f2-f3    -0.08          3.12             0.94     0.24
f0-f3    -0.08          3.12             0.94     0.24



MIC summary
✔ closed interval result
✔ non linear relations
✔ all types of data
✖ B(n) is crucial
✖ n is too low!
✖ still univariate



python
import xstats.MINE as MINE
import math

x = [n*0.01 for n in range(1,2000)]
y = [math.sin(n) for n in x]
result = MINE.analyze_pair(x, y)

print("MIC: %s" % result['MIC'])
print("Pearson: %s" % result['pearson'])

>>> MIC: 0.99999
>>> Pearson: -0.16366038
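The Pearson figure above can be checked without the MINE library. The sketch below rebuilds the same sine data and computes Pearson's coefficient by hand; the `pearson` helper is written for this example and is not part of xstats.MINE.

```python
import math

# Same data as in the slide: x = 0.01 .. 19.99, y = sin(x).
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(v) for v in x]


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson(x, y)
print("Pearson: %.4f" % r)
```

Although y is a noiseless function of x, the linear coefficient stays close to -0.16, which is the gap MIC is meant to close.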



relationship types

Source: paper





real application



suggestions

■ use MIC only when you have lots of samples
● samples > 2000

■ use B(n) = n^0.6
■ don't use it for all the possible pairs of features
● it is slower than Pearson's correlation coefficient or Mutual Information
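To see why the sample size matters with B(n) = n^0.6, the sketch below lists the grid shapes that satisfy x*y < B(n) for a few values of n. It assumes each axis needs at least two bins (otherwise the normalisation log min{x, y} is zero); the function names are illustrative.

```python
def max_grid_size(n, alpha=0.6):
    """B(n) = n**alpha, the upper bound on the number of grid cells."""
    return n ** alpha


def admissible_grids(n, alpha=0.6):
    """All (x, y) shapes with x, y >= 2 (assumption) and x * y < B(n)."""
    budget = max_grid_size(n, alpha)
    limit = int(budget) + 1
    return [(x, y)
            for x in range(2, limit)
            for y in range(2, limit)
            if x * y < budget]


for n in (10, 100, 2000):
    grids = admissible_grids(n)
    print("n = %4d  B(n) = %6.2f  admissible shapes: %d"
          % (n, max_grid_size(n), len(grids)))
```

With n = 10, B(n) is just below 4, so not even a 2-by-2 grid is admissible under this rule; with n = 2000 the bound is about 95.6 cells, leaving many shapes to explore. This is consistent with the advice to use MIC only when there are lots of samples.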



references

■ D. N. Reshef et al., “Detecting Novel Associations in
Large Data Sets,” Science, vol. 334, no. 6062, pp.
1518-1524, 2011.

■ D. N. Reshef et al., “Supporting Online Material for
Detecting Novel Associations in Large Data Sets”

