Anonymization of graph-based data has been widely studied in recent years, and several anonymization methods have been developed. In this presentation the authors (Jordi Casas-Roma, Jordi Herrera-Joancomartí, Vicenç Torra) study several generic information loss measures for graphs and compare them with clustering-specific ones. They evaluate whether the generic information loss measures are indicative of the usefulness of the data for subsequent data mining processes.
Anonymizing Graphs: Measuring Quality for Clustering
1. Motivation
Information loss measures
Experimental framework
Correlating GIL and SIL measures
Conclusions
Anonymizing graphs: measuring quality for clustering
Jordi Casas-Roma (1), Jordi Herrera-Joancomartí (2), Vicenç Torra (3)
(1) Universitat Oberta de Catalunya (UOC), jcasasr@uoc.edu
(2) Universitat Autònoma de Barcelona (UAB), jherrera@deic.uab.cat
(3) Artificial Intelligence Research Institute (IIIA), Spanish National Research Council (CSIC), vtorra@iiia.csic.es
UOC Research Showcase 2015. February 11, 2015
Overview
1 Motivation
2 Information loss measures
3 Experimental framework
4 Correlating GIL and SIL measures
5 Conclusions
Motivation
We observe...
There are several graph-mining tasks and several methods to
compute each task.
How can we evaluate the real data utility?
Question
Can we use generic graph metrics to predict how useful anonymized data
will be for real graph-mining tasks?
Generic information loss (GIL)
Specific information loss (SIL)
Generic information loss measures (GIL)
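[Figure: Framework for evaluating generic information loss measures. The anonymization process p turns the original graph G into the perturbed graph Gp; the same metric m is applied to both graphs and the two results are compared as m(G, Gp).]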
Network metrics
average distance (dist)
diameter (d)
harmonic mean of the shortest distance (h)
sub-graph centrality (SC)
transitivity (T)
edge intersection (EI)
clustering coefficient (C)
modularity (Q)
$$m(G, G_p) = \left| m(G) - m(G_p) \right| \tag{1}$$
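Equation (1) can be illustrated with a minimal sketch, assuming an undirected, connected graph stored as an adjacency map. The helper names (`make_graph`, `bfs_distances`, `scalar_loss`) and the toy graphs are hypothetical and do not come from the slides; only average distance and diameter are covered.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_distance(adj):
    """Mean shortest-path length over all reachable ordered vertex pairs."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def diameter(adj):
    """Largest shortest-path length (the graph is assumed to be connected)."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)


def scalar_loss(metric, original, perturbed):
    """Equation (1): absolute difference of a scalar metric on both graphs."""
    return abs(metric(original) - metric(perturbed))


# Toy data (not from the slides): a 4-vertex path, and the same path closed into a cycle.
G = make_graph([(0, 1), (1, 2), (2, 3)])
Gp = make_graph([(0, 1), (1, 2), (2, 3), (3, 0)])

loss_dist = scalar_loss(average_distance, G, Gp)
loss_diam = scalar_loss(diameter, G, Gp)
```

Passing the metric as a function keeps `scalar_loss` identical for every scalar network metric in the list above.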
Spectral metrics
the largest eigenvalue of the adjacency matrix A (λ1)
the second smallest eigenvalue of the Laplacian matrix L (µ2)
Vertex metrics
betweenness centrality (CB)
closeness centrality (CC)
degree centrality (CD)
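$$m(G, G_p) = \frac{1}{n} \sum_{i=1}^{n} \left( m(v_i) - m(v_i^p) \right)^2 \tag{2}$$
where $v_i^p$ denotes vertex $v_i$ in the perturbed graph $G_p$.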
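The per-vertex aggregation of equation (2) can be sketched as follows, taking the formula as it appears on the slide (a mean of squared differences). The sketch assumes both graphs share the same vertex set and uses degree centrality normalised by n - 1, a common convention that the slides do not confirm; all function names and the toy graphs are hypothetical.

```python
def make_graph(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """Degree of each vertex divided by n - 1 (assumed normalisation)."""
    n = len(adj)
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}


def vertex_loss(metric, original, perturbed):
    """Mean squared per-vertex difference, following equation (2) as printed."""
    before = metric(original)
    after = metric(perturbed)
    return sum((before[v] - after[v]) ** 2 for v in before) / len(before)


# Toy data (not from the slides): a 4-vertex path, and the same path closed into a cycle.
G = make_graph([(0, 1), (1, 2), (2, 3)])
Gp = make_graph([(0, 1), (1, 2), (2, 3), (3, 0)])

loss_cd = vertex_loss(degree_centrality, G, Gp)
```

Only the two end vertices of the path change degree, so only they contribute to the sum.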
Clustering-specific information loss measures (SIL)
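[Figure: Framework for evaluating clustering-specific information loss measures. The anonymization process p turns the original graph G into the perturbed graph Gp; the same clustering method c is applied to both, and the precision index compares the original clusters c(G) with the perturbed clusters c(Gp).]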
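$$\text{precision index}(G, G_p) = \frac{1}{n} \sum_{v=1}^{n} \mathbb{1}_{l_{tc}(v) = l_{pc}(v)} \tag{3}$$
where the indicator equals 1 when the label $l_{tc}(v)$ of vertex $v$ in the original clusters matches its label $l_{pc}(v)$ in the perturbed clusters, and 0 otherwise.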
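A minimal sketch of the precision index follows. Cluster labels produced by two independent runs are arbitrary, so the sketch assumes that each perturbed cluster is relabelled with the original label most frequent among its members before matches are counted. That alignment rule is an assumption, not something stated on the slides, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def precision_index(original_labels, perturbed_labels):
    """Fraction of vertices whose label matches after majority relabelling.

    Both arguments map each vertex to a cluster label and are assumed to
    cover the same set of vertices.
    """
    members = {}
    for vertex, cluster in perturbed_labels.items():
        members.setdefault(cluster, []).append(vertex)

    matches = 0
    for vertices in members.values():
        counts = Counter(original_labels[v] for v in vertices)
        # Vertices carrying the majority original label count as matches.
        matches += counts.most_common(1)[0][1]
    return matches / len(original_labels)


# Toy labels (not from the slides): vertex 2 moves to the other cluster.
original = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
perturbed = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y", 5: "y"}

score = precision_index(original, perturbed)
```

With these labels five of the six vertices keep their cluster, so the index drops below 1 by exactly one vertex.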
Clustering algorithms
Markov Cluster Algorithm (MCL)
Algorithm of Girvan and Newman (Girvan-Newman or GN)
Fast greedy modularity optimization (Fastgreedy or FG)
Walktrap (WT)
Infomap (IM)
Multilevel (ML)
Experimental framework
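[Figure: Experimental framework for testing the correlation between GIL and SIL. A perturbation process turns the original graph into anonymized versions ranging from 1% to 25% of anonymization. A graph assessment yields the GIL values and a clustering assessment yields the SIL values at each level, and the two are compared ("Are they equal?").]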
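The perturbation sweep in this framework can be sketched as below. The slides do not say which perturbation process was used, so the sketch assumes a simple random one that removes a fraction of the edges and adds the same number of edges absent from the original graph. The `shared_edge_fraction` helper is in the spirit of the edge intersection measure, but its exact form is also an assumption, as are all names and the ring graph.

```python
import random


def normalise(edges):
    """Undirected edges as sorted (low, high) tuples without duplicates."""
    return sorted({(min(u, v), max(u, v)) for u, v in edges})


def perturb(edges, fraction, seed=0):
    """Replace a fraction of the edges with edges absent from the original."""
    rng = random.Random(seed)
    original = normalise(edges)
    original_set = set(original)
    vertices = sorted({v for edge in original for v in edge})
    k = round(fraction * len(original))
    result = original_set - set(rng.sample(original, k))
    # Add random new edges until the edge count is restored.
    while len(result) < len(original):
        u, v = rng.sample(vertices, 2)
        candidate = (min(u, v), max(u, v))
        if candidate not in original_set:
            result.add(candidate)
    return sorted(result)


def shared_edge_fraction(edges_a, edges_b):
    """Share of edges common to both graphs (assumed form of edge intersection)."""
    a, b = set(normalise(edges_a)), set(normalise(edges_b))
    return len(a & b) / max(len(a), len(b))


# Toy graph (not from the slides): a ring of 100 vertices, hence 100 edges.
ring = [(i, (i + 1) % 100) for i in range(100)]
levels = [0.01, 0.05, 0.10, 0.25]
sweep = {level: shared_edge_fraction(ring, perturb(ring, level)) for level in levels}
```

In a full experiment each perturbed graph would feed both the GIL measures and the clustering methods, giving one GIL series and one SIL series per perturbation level.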
GIL Self-correlation
SIL Self-correlation
GIL vs. SIL
Comparing datasets
GIL Self-correlation
Do the generic information loss measures behave in a similar way
independently of the dataset?
Pearson   dist   d      CB     CC     CD     EI     C      T      λ1     µ2
r         0.85   0.15   0.96   0.90   0.99   0.99   0.97   0.94   0.24   0.09
p-value   0      0.007  0      0      0      0      0      0      0      0.006
Pearson self-correlation value (r) and its associated p-value for each GIL measure.
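The Pearson coefficient r reported here is the standard one and can be computed as sketched below. The two loss series are made-up numbers standing in for the same measure evaluated on two datasets at increasing perturbation levels, which is only one possible reading of "self-correlation". The p-value is not computed.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical loss values of one measure on two datasets (not from the slides).
dataset_a = [0.00, 0.10, 0.20, 0.30]
dataset_b = [0.00, 0.20, 0.40, 0.60]

r_same_trend = pearson_r(dataset_a, dataset_b)
r_opposite = pearson_r(dataset_a, list(reversed(dataset_b)))
```

A value of r close to 1 means the measure grows in the same way on both datasets, whatever the absolute scale of the loss.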
SIL self-correlation
Do the clustering-specific information loss measures behave in a
similar way independently of the dataset?
Pearson   MCL    IM     ML     GN     FG     WT
r         0.287  0.626  0.777  0.828  0.782  0.656
p-value   0      0      0      0      0      0
Pearson self-correlation value (r) and its associated p-value for the precision index.
GIL vs. SIL
Are GIL and SIL measures correlated?
Pearson   MCL      IM       ML       GN      FG      WT       µ
dist      0.580    0.716    0.807    0.785   0.747   0.755    0.732
d         0.201    0.101 *  0.098 *  0.134   0.218   0.014 *  0.128
CB        0.559    0.687    0.854    0.865   0.831   0.724    0.753
CC        0.667    0.833    0.903    0.909   0.874   0.899    0.848
CD        0.296    0.380    0.416    0.504   0.481   0.457    0.422
EI        0.581    0.820    0.861    0.887   0.814   0.748    0.785
C         0.614    0.833    0.889    0.909   0.836   0.802    0.814
T         0.557    0.763    0.840    0.840   0.770   0.690    0.743
λ1        0.191    0.482    0.509    0.546   0.529   0.397    0.442
µ2        0.086 *  0.152    0.131    0.154   0.135   0.040 *  0.116
µ         0.433    0.577    0.631    0.653   0.624   0.553    NA
Pearson correlation values (r) and their average values µ. An asterisk indicates
a p-value ≥ 0.05, i.e., a result that is not statistically significant.
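The averages in this table can be checked and ranked with a few lines. The values are copied from the table; the 0.7 cut-off for a "strong" correlation is an assumption made for this sketch, not a threshold given by the authors.

```python
# Row of the table for average distance (dist), one value per clustering method.
dist_row = [0.580, 0.716, 0.807, 0.785, 0.747, 0.755]
dist_mean = round(sum(dist_row) / len(dist_row), 3)

# Average correlation per GIL measure, taken from the last column of the table.
mean_r = {
    "dist": 0.732, "d": 0.128, "CB": 0.753, "CC": 0.848, "CD": 0.422,
    "EI": 0.785, "C": 0.814, "T": 0.743, "lambda1": 0.442, "mu2": 0.116,
}

# Rank measures from the highest to the lowest average correlation.
ranking = sorted(mean_r, key=mean_r.get, reverse=True)

# Assumed cut-off: treat an average r of at least 0.7 as a strong correlation.
strong = [name for name in ranking if mean_r[name] >= 0.7]
```

Under that assumed cut-off, `strong` holds CC, C, EI, CB, T and dist, which is the same ordering as the list on the Conclusions slide.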
Aggregated GIL vs. SIL
Can we use more than one GIL measure to improve correlation?
Num.  GIL measures              r-square  σ
1     CC                        0.725     0.146
2     CB + CC                   0.742     0.150
3     CB + CC + EI              0.765     0.155
4     d + CB + CC + EI          0.777     0.127
5     dist + d + CB + CC + EI   0.787     0.117
Multivariate regression analysis: r-square is indicative of the aggregate
correlation and σ is the standard deviation.
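How an r-square value arises from a multivariate regression can be sketched with ordinary least squares, solved here through the normal equations. The slides do not describe the regression setup, so this is a generic illustration: the function names are hypothetical and the predictor and target values are made-up numbers, not the authors' data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(predictors, target):
    """R^2 of a least-squares fit of target on the predictors plus an intercept."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: two GIL series as predictors and one SIL series as target.
gil_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gil_2 = [2.0, 1.0, 4.0, 3.0, 6.0]
sil = [1.0, 2.1, 2.9, 4.2, 4.8]

r2_one = r_squared([gil_1], sil)
r2_two = r_squared([gil_1, gil_2], sil)
```

Adding a predictor to a least-squares fit cannot lower r-square, which is consistent with the small, steady increases in the table.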
GIL vs. SIL
Are the results independent of the data to which they are applied?
Pearson   Karate  Football  Jazz   Flickr  URV Email
µ         0.716   0.796     0.717  0.780   0.729
σ         0.247   0.119     0.170  0.184   0.163
Averaged Pearson correlation values (µ) and standard deviation (σ) for each dataset.
Conclusions
Some measures behave in a similar way independently of the data to
which they are applied.
There is a strong correlation between some GIL and SIL measures:
1 closeness centrality
2 clustering coefficient
3 edge intersection
4 betweenness centrality
5 transitivity
6 average distance
Considering more than one metric yields slightly higher correlation
values, at the price of additional computational cost.
The End
Thanks for your attention
Jordi Casas-Roma UOC jcasasr@uoc.edu
Jordi Herrera-Joancomartí UAB jherrera@deic.uab.cat
Vicenç Torra IIIA-CSIC vtorra@iiia.csic.es