1. CS-GN-TEAM: internal presentation
detecting novel associations
in large data sets
Michele Filannino + You
Presented paper:
D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334,
no. 6062, pp. 1518-1524, 2011.
Manchester, 05/03/2012
4. presentation my research taster project
novel association
■ two variables, X and Y, are associated if there is a
relationship between them
● functional
● non-functional
■ novel: unknown
05/03/2012, Michele Filannino 4 / 36
11. presentation my research taster project
pros. & cons.
■ Pearson’s coeff.
✔ closed interval result
✖ only linear relations
✖ feature independency
■ Mutual Information
✔ non-linear relations
✖ only categorical data
✖ biased towards higher arity features
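A minimal sketch of the two limitations above, assuming a toy parabola data set and a crude equal-width discretisation (both chosen only for illustration, neither comes from the paper): Pearson's coefficient is close to zero for a deterministic non-linear relation, while mutual information needs the data to be binned first.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson's correlation coefficient, always in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def equal_width_bins(values, k):
    """Hypothetical helper: turn continuous values into k categories."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def mutual_information(a, b):
    """Mutual information (in bits) of two categorical sequences."""
    n = len(a)
    pab, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(c / n * math.log2(c * n / (pa[i] * pb[j]))
               for (i, j), c in pab.items())

xs = [i / 10 for i in range(-50, 51)]  # symmetric around zero
ys = [x * x for x in xs]               # deterministic, but not linear

r = pearson(xs, ys)
mi = mutual_information(equal_width_bins(xs, 4), equal_width_bins(ys, 4))
print("Pearson:", r)
print("MI (4x4 equal-width bins):", mi)
```

The MI value depends on how many bins are used, which is the arity bias listed above.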
13. presentation my research taster project
motivations
■ generality:
● capture a wide range of interesting associations, not
limited to specific function types
■ equitability:
● give similar scores to equally noisy relationships of
different types
14. presentation my research taster project
definition of MIC
■ Given a finite set D of ordered pairs, we can
partition the X-values of D into x bins and the
Y-values of D into y bins
■ We obtain a pair of partitions called x-by-y grid
D = (F0, F1)
F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)
15. presentation my research taster project
x-by-y grid
2-by-4 grid
16. presentation my research taster project
definition of MIC
■ given a grid G, we can compute D|G, the frequency
distribution induced by the points in D on the cells
of G
● different grids G result in different distributions D|G
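The idea of D|G can be sketched in a few lines. This is an illustration only: the function name `grid_distribution`, the cut points and the data set of points (x, sin x) for x = 1..10 are assumptions made for the example, not part of the paper's algorithm.

```python
import math
from bisect import bisect_right

def grid_distribution(points, x_cuts, y_cuts):
    """Return D|G as {(column, row): fraction of points in that cell}.

    x_cuts and y_cuts are sorted cut points; k cuts define k + 1 bins.
    """
    counts = {}
    for x, y in points:
        cell = (bisect_right(x_cuts, x), bisect_right(y_cuts, y))
        counts[cell] = counts.get(cell, 0) + 1
    n = len(points)
    return {cell: c / n for cell, c in counts.items()}

# Illustrative data: ten points on a sine curve
points = [(float(x), math.sin(x)) for x in range(1, 11)]

# Two different grids over the same data
dist_a = grid_distribution(points, x_cuts=[3.5, 6.5], y_cuts=[0.0])   # 3-by-2
dist_b = grid_distribution(points, x_cuts=[5.5], y_cuts=[-0.5, 0.5])  # 2-by-3

print(dist_a)
print(dist_b)
```

The two printed distributions differ even though D is the same, which is why the next step searches over grids.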
17. presentation my research taster project
maximal MI over all grids
number of columns number of rows
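For reference, the quantity named on this slide can be written as below. The notation follows the presented paper as far as I can tell; the symbol $\mathcal{G}(x, y)$ for the set of all grids with x columns and y rows is an assumption introduced here for readability.

```latex
I^{*}(D, x, y) = \max_{G \in \mathcal{G}(x, y)} I(D|_{G})
```

Here $I(D|_{G})$ is the mutual information of the frequency distribution that D induces on the cells of G.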
18. presentation my research taster project
characteristic matrix
Infinite matrix!
normalisation factor
(derived by MI definition)
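A sketch of the entry of the characteristic matrix, in the notation of the presented paper, where $I^{*}(D, x, y)$ denotes the maximal mutual information achieved by any x-by-y grid on D:

```latex
M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}
```

The denominator is the largest value the mutual information can take on an x-by-y grid, so every entry lies in [0, 1].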
19. presentation my research taster project
Maximal Information Coeff.
max grid size
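Written out in the notation of the presented paper, with $M(D)_{x,y}$ the entry of the characteristic matrix for x-by-y grids, n the number of samples and B(n) the maximum grid size:

```latex
\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}
```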
20. presentation my research taster project
matrix computation
■ space of grids grows exponentially
● B(n) ≤ O(n^(1-ε)) for 0 < ε < 1
■ approximation of MIC
● heuristic dynamic programming
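To make the search concrete, here is a deliberately naive sketch. It is not the paper's heuristic dynamic programming: it only tries equal-frequency grids on both axes, so it can only underestimate the true maximum. The function names and the toy data are assumptions made for the example.

```python
import math
import random
from collections import Counter

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def mutual_information(a, b):
    """Mutual information (in bits) of two categorical sequences."""
    n = len(a)
    pab, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(c / n * math.log2(c * n / (pa[i] * pb[j]))
               for (i, j), c in pab.items())

def naive_mic(xs, ys, exponent=0.6):
    """Crude MIC estimate over equal-frequency grids with x * y < n ** exponent."""
    budget = len(xs) ** exponent
    best = 0.0
    x = 2
    while x * 2 < budget:
        xb = equal_frequency_bins(xs, x)
        y = 2
        while x * y < budget:
            yb = equal_frequency_bins(ys, y)
            # Normalise by the largest MI an x-by-y grid can reach
            score = mutual_information(xb, yb) / math.log2(min(x, y))
            best = max(best, score)
            y += 1
        x += 1
    return best

rng = random.Random(0)
xs = [i / 100 for i in range(500)]
linear = [2 * x + 1 for x in xs]
noise = [rng.random() for _ in xs]

print("linear:", naive_mic(xs, linear))
print("noise: ", naive_mic(xs, noise))
```

Even this restricted search evaluates every grid shape under the budget, which hints at why the exhaustive version is intractable and an approximation is needed.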
21. presentation my research taster project
MIC summary
✔ closed interval result
✔ non-linear relations
✔ all types of data
✖ B(n) is crucial
● too high: non-zero scores even for random data
● too low: we search only for simple patterns
✖ still univariate
28. presentation my research taster project
MIC summary
✔ closed interval result
✔ non-linear relations
✔ all types of data
✖ B(n) is crucial
✖ n is too low!
✖ still univariate
29. presentation my research taster project
python
import math
import xstats.MINE as MINE

# A sine wave sampled on (0, 20): a strong but non-linear relationship
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(n) for n in x]
result = MINE.analyze_pair(x, y)
print("MIC:", result['MIC'])
print("Pearson:", result['pearson'])
>>> MIC: 0.99999
>>> Pearson: -0.16366038
31. presentation my research taster project
relationship types
Source: paper
32. presentation my research taster project
relationship types
Source: paper
33. presentation my research taster project
real application
Source: paper
34. presentation my research taster project
suggestions
■ use MIC only when you have lots of samples
● samples > 2000
■ use B(n) = n^0.6
■ don’t use it for all the possible pairs of features
● it is slower than Pearson’s correlation coefficient or
Mutual Information
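A small back-of-the-envelope sketch of what these suggestions imply. The helper names are hypothetical and the sample sizes are arbitrary examples: the grid budget n^0.6 grows slowly with the number of samples, while the number of feature pairs grows quadratically with the number of features.

```python
def grid_budget(n, exponent=0.6):
    """Maximum number of grid cells B(n) = n ** exponent for n samples."""
    return n ** exponent

def feature_pairs(p):
    """Number of distinct pairs among p features."""
    return p * (p - 1) // 2

for n in (100, 2000, 100000):
    print(f"n = {n:>6}: B(n) = {grid_budget(n):.1f}")

for p in (10, 100, 1000):
    print(f"{p:>4} features: {feature_pairs(p)} pairs to score")
```

With 1000 features there are 499500 pairs, each requiring its own grid search, which is why scoring every pair is discouraged.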
36. presentation my research taster project
references
■ D. N. Reshef et al., “Detecting Novel Associations in
Large Data Sets,” Science, vol. 334, no. 6062, pp.
1518-1524, 2011.
■ D. N. Reshef et al., “Supporting Online Material for
Detecting Novel Associations in Large Data Sets”