Pei Lee, ICDE 2012
On Top-k Structural Similarity Search
Pei Lee, Laks V.S. Lakshmanan
University of British Columbia
Vancouver, BC, Canada
Jeffrey Xu Yu
Chinese University of Hong Kong
Hong Kong, China
2014-4-16
Outline
• Problem definition
  • Structural similarity
  • Top-k structural similarity search
• Existing top-k structural similarity search methods
  • SimRank, P-Rank
  • Constraints
• TopSim: a family of efficient top-k structural similarity search algorithms with accuracy guarantees
  • Truncated TopSim, Prioritized TopSim
• Experiments
Problem Statement
Graph structures are ubiquitous
 Social networks, citation networks, web graphs, etc
What is structural similarity?
• Structural similarity: the pairwise similarity between nodes in a graph
• Applications: link prediction, recommendation, etc.
• Intuition: two nodes are similar if their neighbors are similar
  • Derived from PageRank’s intuition: a node is important if it is referenced by many other important nodes
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v]
How do we quantify the similarity between nodes u and v?
Problem Definition:
Input: $G(V, E)$, $u \in V$, $v \in V$
Output: $S(u, v)$
What’s top-k structural similarity search?
Problem Definition:
Input: $G(V, E)$, $v \in V$, $k$
Output: the top-$k$ nodes most similar to $v$
• Given a node v in a huge graph
• Find the top-k nodes most similar to v
• But
  • We definitely do not want to compare v with every node
  • The accuracy of the results should be guaranteed
Existing Structural Similarity Measures
• Neighbor-based approaches
  • Jaccard coefficient, cosine similarity, Pearson correlation, co-citation, etc.
  • Cons: no common neighbors, no similarity!
• Random-walk-based approaches
  • SimRank (Jeh & Widom, KDD’02)
  • P-Rank (Zhao et al., CIKM’09), which extends SimRank
  • Cons:
    • High computational cost
    • Not designed for top-k similarity search
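The weakness of neighbor-based measures can be seen in a few lines. This is a minimal sketch, assuming a hypothetical toy graph stored as in-neighbor sets; the function name `jaccard` and the graph are illustrative, not taken from the talk. Nodes u and v are referenced by different nodes that themselves share a reference, yet their Jaccard coefficient is zero.

```python
def jaccard(in_nbrs, x, y):
    """Jaccard coefficient of the in-neighbor sets of x and y."""
    a, b = in_nbrs.get(x, set()), in_nbrs.get(y, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy graph: a -> b, a -> c, b -> u, c -> v
in_nbrs = {"b": {"a"}, "c": {"a"}, "u": {"b"}, "v": {"c"}}

print(jaccard(in_nbrs, "b", "c"))  # b and c share the in-neighbor a
print(jaccard(in_nbrs, "u", "v"))  # u and v share no in-neighbor
```

A random-walk-based measure can still assign u and v a positive score, because their in-neighbors b and c are similar to each other.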
Related Work
SimRank & P-Rank
• SimRank: two nodes are similar if they are referenced by similar nodes
[Figure: example graph with nodes a, b, c, u, v]
$S(a, a) = 1$, $S(b, c) = 0.5 > 0$, $S(u, v) = 0.25 > 0$
Pairwise iterative form, where $I(u)$ denotes the in-neighbors of $u$:
$$S_{n+1}(u, v) = \frac{C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j)$$
Matrix form, where $W$ is the transition matrix and a correction matrix is applied to the result:
$$S_{n+1} = C\, W^{T} S_n W$$
• P-Rank: two nodes are similar if they are related to similar nodes
$$S_{n+1}(u, v) = \frac{\lambda C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j) + \frac{(1 - \lambda)\, C}{|O(u)|\,|O(v)|} \sum_{i \in O(u)} \sum_{j \in O(v)} S_n(i, j)$$
with $0 < C < 1$ and $0 < \lambda < 1$. The first term is SimRank (in-neighbors $I$); the second term is reversed SimRank (out-neighbors $O$).
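The pairwise iteration can be sketched directly. The code below is a naive all-pairs implementation for illustration only (it is exactly the expensive computation TopSim tries to avoid). The function name `p_rank`, the parameter `lam` (standing in for λ), and the edge list are assumptions; the edges a → b, a → c, b → u, c → v were chosen because they reproduce the values S(b, c) = 0.5 and S(u, v) = 0.25 quoted on the slide. Setting `lam=1.0` reduces P-Rank to SimRank.

```python
from itertools import product


def p_rank(edges, C=0.5, lam=1.0, iterations=10):
    """Naive pairwise iteration; lam=1.0 gives plain SimRank."""
    nodes = sorted({x for e in edges for x in e})
    ins = {x: [] for x in nodes}
    outs = {x: [] for x in nodes}
    for src, dst in edges:
        outs[src].append(dst)
        ins[dst].append(src)

    # S_0: every node is similar only to itself.
    S = {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, repeat=2)}

    def avg(S_prev, xs, ys):
        # Average similarity over all neighbor pairs; 0 if either side is empty.
        if not xs or not ys:
            return 0.0
        return sum(S_prev[i, j] for i in xs for j in ys) / (len(xs) * len(ys))

    for _ in range(iterations):
        S_next = {}
        for x, y in S:
            if x == y:
                S_next[x, y] = 1.0  # S(x, x) stays fixed at 1
            else:
                S_next[x, y] = C * (lam * avg(S, ins[x], ins[y])
                                    + (1 - lam) * avg(S, outs[x], outs[y]))
        S = S_next
    return S


# Assumed example graph: a -> b, a -> c, b -> u, c -> v
edges = [("a", "b"), ("a", "c"), ("b", "u"), ("c", "v")]
S = p_rank(edges, C=0.5, lam=1.0)
print(S["b", "c"], S["u", "v"])
```

Every iteration touches all |V|² pairs, which illustrates why this form does not scale to top-k queries on large graphs.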
Top-k similarity search: challenges
• Matrix-based approach (KDD’02, VLDB’08)
  • Offline: compute a |V|-by-|V| similarity matrix
  • SimRank/P-Rank takes $O(|E|^2)$ time, which degenerates to $O(|V|^4)$ in the worst case
  • Space cost: the huge similarity matrix is hard to store
• Vector-based approach (SDM’10)
  • Offline: compute a vector of length |V|
  • Takes $O(|V| D^{2n})$ time in the worst case, where n is the number of iterations and D is the average node degree
• All these approaches need to access the whole graph to find the exact top-k similar nodes
Contributions
• Transform the computation of pairwise similarity on graph G into the computation of authority on G × G, based on a propagation & aggregation process
• Propose TopSim, a local top-k structural similarity search algorithm that avoids accessing the whole graph while guaranteeing accuracy
• Propose Trun-TopSim-SM and Prio-TopSim-SM, two approximations that trade accuracy for speed
How TopSim works
Coupling random walk on G → Single random walk on G × G → Propagation & Aggregation → Similarity Path → Similarity Score
Product of graphs: G × G
• Given G(V, E), G × G is defined as follows:
  • For nodes u and v in G, uv is a node in G × G
  • For edges (e, u) and (e, v) in G, (ee, uv) is an edge in G × G
[Figure: example graph G with nodes a, b, c, d, e, u, v, and its product graph G × G with nodes such as uv, ce, ee, bd, uu, vu, vv, dd, cb, aa, da, ec, cc, ae, ea]
• Each node pair in G is a node in G × G
• Each edge pair in G is an edge in G × G
• No need to materialize G × G: it exists only conceptually, to facilitate analysis
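Because G × G is never materialized, its neighborhoods can be generated on demand from the adjacency lists of G. The following is a minimal sketch; the function name `product_neighbors` and the toy in-neighbor lists are illustrative assumptions, not taken from the slides.

```python
def product_neighbors(nbrs, pair):
    """Neighbors of the product-graph node `pair` = (u, v), built lazily.

    `nbrs` maps each node of G to a list of its neighbors in one direction
    (pass in-neighbor lists for backward steps, out-neighbor lists for
    forward steps). Every pair of edges in G yields one edge in G x G.
    """
    u, v = pair
    return [(i, j) for i in nbrs.get(u, ()) for j in nbrs.get(v, ())]


# Hypothetical in-neighbor lists: c -> u, e -> u, e -> v
in_nbrs = {"u": ["c", "e"], "v": ["e"]}

print(product_neighbors(in_nbrs, ("u", "v")))  # in-neighbors of uv in G x G
```

Only the pairs that a query actually reaches are ever constructed, so the |V|² nodes of the product graph never need to be stored.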
Coupling random walk
• Coupling random walk: two random surfers (Surf1, Surf2) walk simultaneously and follow the same edge direction
• A coupling random walk on G can be equivalently transformed into a single random walk on G × G
• SimRank: S(u, v) is the first-meeting probability of two random surfers starting from u and v, respectively, and following backward links
[Figure: graph G and its product graph G × G]
Compute similarity based on coupling random walk
• We transform a similarity ranking problem on G into an authority ranking problem on G × G
  • R(uv) = S(u, v)
• Initialization:
  • Source node (u = v): R(uv) = 1 is fixed
  • Target node (u ≠ v): R(uv) = 0 initially and is then updated
• How is R(uv) updated?
  • Propagation & aggregation process on G × G
  • Propagation: nodes propagate their authority to their neighbors following random walk steps
  • Aggregation: nodes receive and aggregate the authority propagated in from their neighbors
Compute S(u,v)?
• Similarity path: a path from a source node to a target node that does not pass through any other source node
• Probability of a transition step:
• Similarity:
  • Sum over the similarity paths ending at node uv
[Figure: two views of the product graph G × G]
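The two formulas on this slide did not survive extraction. The block below is a reconstruction that agrees with the SimRank recurrence and with the worked example that follows; it is an assumption, not the slide's exact notation. Here $l(\rho)$ denotes the number of steps in path $\rho$.

```latex
% One step of the walk on G x G, from node ab to node uv (a in I(u), b in I(v)):
P(ab \rightarrow uv) = \frac{1}{|I(u)|\,|I(v)|}

% Probability of a similarity path \rho = (w_0, w_1, \dots, w_l) and the n-step score:
P(\rho) = \prod_{t=1}^{l} P(w_{t-1} \rightarrow w_t), \qquad
S_n(u, v) = \sum_{\rho \text{ ends at } uv,\; l(\rho) \le n} C^{\,l(\rho)}\, P(\rho)
```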
Compute S(u,v): example
• If we only consider 3 steps, with C = 0.5:
Path 1: (ee, uv)
P(ee, uv) = 0.5
Path 2: (aa, bd, ce, uv)
P(aa, bd, ce, uv) = 0.5 × 1 × 0.5 = 0.25
$S_3(u, v) = P(ee, uv) \cdot C + P(aa, bd, ce, uv) \cdot C^3 = 0.5 \cdot 0.5 + 0.25 \cdot 0.125 \approx 0.28$
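The example can be checked by enumerating similarity paths backward from uv. This is a sketch under stated assumptions: the in-neighbor lists below are a hypothetical graph chosen so that it yields the two paths and probabilities quoted above (the slide's figure itself is not recoverable), and the function name `similarity_paths` is illustrative.

```python
def similarity_paths(in_nbrs, u, v, max_steps):
    """Yield (path, probability) for every similarity path ending at (u, v).

    Walks backward from (u, v) in G x G and stops as soon as a source
    node (x, x) is reached, so no path passes through a source node.
    """
    def walk(pair, path, prob):
        x, y = pair
        preds = [(i, j) for i in in_nbrs.get(x, ()) for j in in_nbrs.get(y, ())]
        if not preds or len(path) > max_steps:
            return
        step = prob / len(preds)  # each predecessor pair is equally likely
        for p in preds:
            if p[0] == p[1]:
                yield [p] + path, step
            else:
                yield from walk(p, [p] + path, step)

    yield from walk((u, v), [(u, v)], 1.0)


# Assumed graph: a -> b, a -> d, e -> d, b -> c, d -> e, c -> u, e -> u, e -> v
in_nbrs = {"b": ["a"], "d": ["a", "e"], "c": ["b"],
           "e": ["d"], "u": ["c", "e"], "v": ["e"]}
C = 0.5

paths = list(similarity_paths(in_nbrs, "u", "v", max_steps=3))
score = sum(C ** (len(path) - 1) * prob for path, prob in paths)
for path, prob in paths:
    print(path, prob)
print(score)
```

Each path contributes its probability damped by C once per step, so the short path through ee dominates the score.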
TopSim
Optimization based on SimMap
• Observation: many similarity paths overlap
[Figure: example graph with nodes a to h, u, v and step labels 0 to 3, with similarity paths marked]
• SimMap SM(u) = {(key, value)}
  • key is the node visited by Surf2 at step i when Surf1 visits node u
  • value = $S_i(\text{key}, u)$
  • SM(v) is exactly the result list
• TopSim-SM
• Example:
  • Start from c
  • SM(b) = {(d, 1/2), (f, 1/2)}
  • SM(a) = {(e, 1/8)}
  • SM(v) = {(u, 1/32)}
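The idea of sharing overlapping paths through per-node maps can be illustrated with a memoized recursion. This is not the paper's TopSim-SM algorithm; it is a simplified sketch in the same spirit, assuming plain SimRank, a fixed number of steps, and a hypothetical edge list. The names `top_k_similar` and `sim_map` are illustrative. Each call `sim_map(x, n)` returns a map from nodes y to $S_n(x, y)$ and is computed once, no matter how many paths pass through x.

```python
import heapq
from collections import defaultdict
from functools import lru_cache


def top_k_similar(edges, query, k, C=0.5, steps=3):
    ins, outs = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        outs[src].append(dst)
        ins[dst].append(src)

    @lru_cache(maxsize=None)
    def sim_map(x, n):
        """Return {y: S_n(x, y)} for y != x, touching only nodes near x."""
        scores = defaultdict(float)
        if n == 0:
            return scores
        for i in ins[x]:
            # Surf1 steps i -> x; Surf2 sits on j with score S_{n-1}(i, j).
            reached = dict(sim_map(i, n - 1))
            reached[i] = 1.0  # both surfers on i: the source node (i, i)
            for j, val in reached.items():
                for y in outs[j]:
                    if y != x:
                        scores[y] += C * val / (len(ins[x]) * len(ins[y]))
        return scores

    result = sim_map(query, steps)
    return heapq.nlargest(k, result.items(), key=lambda kv: kv[1])


# Assumed graph: a -> b, a -> d, e -> d, b -> c, d -> e, c -> u, e -> u, e -> v
edges = [("a", "b"), ("a", "d"), ("e", "d"), ("b", "c"),
         ("d", "e"), ("c", "u"), ("e", "u"), ("e", "v")]
print(top_k_similar(edges, "v", k=2))
```

Only nodes within `steps` backward hops of the query and their out-neighbors are visited, which is the locality property the slides emphasize.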
Family of TopSim Algorithms
Algorithm       | Quality                  | Performance
TopSim          | Exact                    | Slow if the graph is not sparse
TopSim-SM       | Exact                    | More efficient than TopSim
Trun-TopSim-SM  | Trade accuracy for speed | More efficient than TopSim-SM
Prio-TopSim-SM  | Trade accuracy for speed | More efficient than TopSim-SM
TopSim approximations for scale-free graphs
• Scale-free graphs
  • Some nodes have very high degree
  • Web graphs, citation networks, etc.
• Random surfers get trapped by high-degree nodes
• The size of the SimMaps explodes
• Revisit the transition probabilities:
TopSim approximations
• Basic idea:
  • Only consider similarity paths with higher probability
• Truncated TopSim-SM
  • If $P(u_0u_0, \ldots, u_iv_i) < \eta$, stop and ignore this path
• Prioritized TopSim-SM
  • Set a buffer size H for each SimMap
  • Only expand the top H nodes in each SimMap: if |SM(u)| > H, keep only the top H entries of SM(u)
• The accuracy and complexity analysis can be found in the paper
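Both pruning rules are simple filters. The sketch below applies them after the fact to already computed paths and maps, whereas the slides describe stopping during expansion; the function names are illustrative assumptions, and the default values η = 0.001 and H = 100 are the settings listed in the experiments.

```python
import heapq


def truncate_paths(paths, eta=0.001):
    """Truncated variant: ignore every path whose probability is below eta."""
    return [(path, p) for path, p in paths if p >= eta]


def prioritize(sim_map, H=100):
    """Prioritized variant: keep only the H highest-scoring SimMap entries."""
    if len(sim_map) <= H:
        return dict(sim_map)
    return dict(heapq.nlargest(H, sim_map.items(), key=lambda kv: kv[1]))


# Hypothetical inputs for illustration
paths = [(["ee", "uv"], 0.5), (["aa", "bd", "ce", "uv"], 0.0005)]
print(truncate_paths(paths))
print(prioritize({"d": 0.5, "f": 0.25, "g": 0.1}, H=2))
```

Truncation bounds the error contributed by each dropped path, while prioritization bounds the memory used per SimMap.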
Experiments
 Datasets
 Arxiv High Energy Physics paper citation network,
including 34,546 nodes and 421,578 edges
 DBLP co-author graph, with 0.92M nodes, 6.1M edges
 DBLP citation network, with 1.5M papers and 2.1M
citations
 Live Journal social network, with 4.84M users and
68.99M friendship ties
 Factors
 C = 0.5, η = 0.001, H = 100
Accuracy of similarity scores
[Figure: two panels, accuracy ratio and accuracy loss, running on the Arxiv citation network]
3 steps/iterations are good enough for the accuracy of the top-20 list
Precision@k
(Running on DBLP citation network)
k around 20~30 yields the highest
precision
3 steps/iterations yields a high
precision
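The slides do not define Precision@k. A common definition, assumed here, is the fraction of the approximate top-k list that also appears in the exact top-k list. The function name and the sample rankings are illustrative.

```python
def precision_at_k(approx_ranking, exact_ranking, k):
    """Fraction of the approximate top-k that also appears in the exact top-k."""
    return len(set(approx_ranking[:k]) & set(exact_ranking[:k])) / k


# Hypothetical rankings, best match first
approx = ["u", "d", "b"]
exact = ["u", "b", "c"]
print(precision_at_k(approx, exact, k=2))
```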
Kendall Tau distance
(care more about the ranking order …)
[Figure: items a and b in two ranked lists; the pair is concordant if both lists order it the same way, discordant otherwise]
The higher, the better
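Since a higher value is better, the measure is presumably the Kendall tau rank correlation rather than a distance; that reading is an assumption. A minimal sketch that counts concordant and discordant pairs over the items common to two rankings without ties (the function name is illustrative):

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """(concordant - discordant) / total pairs over items in both rankings."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    common = [x for x in rank_a if x in pos_b]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 0.0
    concordant = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0 for x, y in pairs
    )
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


print(kendall_tau(["a", "b", "c"], ["a", "b", "c"]))  # identical order
print(kendall_tau(["a", "b", "c"], ["c", "b", "a"]))  # reversed order
```

Unlike Precision@k, this score changes when two lists contain the same items in a different order.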
k around 20~30 yields the highest
precision
3 steps/iterations yields a high
precision
Running time with different node sizes and node degrees
TopSim algorithms are not very sensitive to the graph size
TopSim approximations can handle high-degree nodes
Running time and accessed nodes
Excitements
• We transform a similarity problem on graph G into an equivalent authority ranking problem on the product graph G × G to facilitate analysis
• We propose a family of TopSim algorithms that:
  • Produce top-k results with an accuracy guarantee
  • Only access a small portion of the graph
  • Handle both SimRank and P-Rank under the same top-k framework
• Questions?
[Figure: TopSim shown as a common framework over SimRank and P-Rank]
TopSim-SM
• Start from v and find the source nodes at each step
• From level n-1 to 0:
  • Let Surf1 start from a source node and walk to node v
  • Let Surf2 start from the same source node and put the nodes it visits into SimMaps
• When Surf1 visits v, Surf2 visits exactly the nodes similar to v in the same step
[Figure: example graph with nodes a to h, u, v and step labels 0 to 3]
• Example:
  • Start from c
  • SM(b) = {(d, 1/2), (f, 1/2)}
  • SM(a) = {(e, 1/8)}
  • SM(v) = {(u, 1/32)}

Más contenido relacionado

La actualidad más candente

Higher-Order Voronoi Diagrams of Polygonal Objects. Dissertation
Higher-Order Voronoi Diagrams of Polygonal Objects. DissertationHigher-Order Voronoi Diagrams of Polygonal Objects. Dissertation
Higher-Order Voronoi Diagrams of Polygonal Objects. DissertationMaksym Zavershynskyi
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneDoug Needham
 
A Sweepline Algorithm for Higher Order Voronoi Diagrams
A Sweepline Algorithm for Higher Order Voronoi DiagramsA Sweepline Algorithm for Higher Order Voronoi Diagrams
A Sweepline Algorithm for Higher Order Voronoi DiagramsMaksym Zavershynskyi
 
Reduction of multiple subsystem [compatibility mode]
Reduction of multiple subsystem [compatibility mode]Reduction of multiple subsystem [compatibility mode]
Reduction of multiple subsystem [compatibility mode]azroyyazid
 
The shortest path is not always a straight line
The shortest path is not always a straight lineThe shortest path is not always a straight line
The shortest path is not always a straight lineVasia Kalavri
 
Self-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingSelf-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingVasia Kalavri
 
Ads unit 2 ppt
Ads unit 2 pptAds unit 2 ppt
Ads unit 2 pptpraveena p
 
Lecture 10 11-signal_flow_graphs
Lecture 10 11-signal_flow_graphsLecture 10 11-signal_flow_graphs
Lecture 10 11-signal_flow_graphsSaifullah Memon
 
Randomized Algorithms for Higher-Order Voronoi Diagrams
Randomized Algorithms for Higher-Order Voronoi DiagramsRandomized Algorithms for Higher-Order Voronoi Diagrams
Randomized Algorithms for Higher-Order Voronoi DiagramsMaksym Zavershynskyi
 
Routing algorithm
Routing algorithmRouting algorithm
Routing algorithmfarimoin
 
On the higher order Voronoi diagram of line-segments (ISAAC2012)
On the higher order Voronoi diagram of line-segments (ISAAC2012)On the higher order Voronoi diagram of line-segments (ISAAC2012)
On the higher order Voronoi diagram of line-segments (ISAAC2012)Maksym Zavershynskyi
 
Shortest path search for real road networks and dynamic costs with pgRouting
Shortest path search for real road networks and dynamic costs with pgRoutingShortest path search for real road networks and dynamic costs with pgRouting
Shortest path search for real road networks and dynamic costs with pgRoutingantonpa
 

La actualidad más candente (19)

Higher-Order Voronoi Diagrams of Polygonal Objects. Dissertation
Higher-Order Voronoi Diagrams of Polygonal Objects. DissertationHigher-Order Voronoi Diagrams of Polygonal Objects. Dissertation
Higher-Order Voronoi Diagrams of Polygonal Objects. Dissertation
 
Temporal graph
Temporal graphTemporal graph
Temporal graph
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
A Sweepline Algorithm for Higher Order Voronoi Diagrams
A Sweepline Algorithm for Higher Order Voronoi DiagramsA Sweepline Algorithm for Higher Order Voronoi Diagrams
A Sweepline Algorithm for Higher Order Voronoi Diagrams
 
Reduction of multiple subsystem [compatibility mode]
Reduction of multiple subsystem [compatibility mode]Reduction of multiple subsystem [compatibility mode]
Reduction of multiple subsystem [compatibility mode]
 
Signal flow graph
Signal flow graphSignal flow graph
Signal flow graph
 
The shortest path is not always a straight line
The shortest path is not always a straight lineThe shortest path is not always a straight line
The shortest path is not always a straight line
 
Self-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingSelf-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processing
 
Ads unit 2 ppt
Ads unit 2 pptAds unit 2 ppt
Ads unit 2 ppt
 
Lecture 10 11-signal_flow_graphs
Lecture 10 11-signal_flow_graphsLecture 10 11-signal_flow_graphs
Lecture 10 11-signal_flow_graphs
 
Randomized Algorithms for Higher-Order Voronoi Diagrams
Randomized Algorithms for Higher-Order Voronoi DiagramsRandomized Algorithms for Higher-Order Voronoi Diagrams
Randomized Algorithms for Higher-Order Voronoi Diagrams
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
signal flow graph
signal flow graphsignal flow graph
signal flow graph
 
Routing algorithm
Routing algorithmRouting algorithm
Routing algorithm
 
Spsp fw
Spsp fwSpsp fw
Spsp fw
 
On the higher order Voronoi diagram of line-segments (ISAAC2012)
On the higher order Voronoi diagram of line-segments (ISAAC2012)On the higher order Voronoi diagram of line-segments (ISAAC2012)
On the higher order Voronoi diagram of line-segments (ISAAC2012)
 
object 3d(2)
object 3d(2)object 3d(2)
object 3d(2)
 
Shortest path search for real road networks and dynamic costs with pgRouting
Shortest path search for real road networks and dynamic costs with pgRoutingShortest path search for real road networks and dynamic costs with pgRouting
Shortest path search for real road networks and dynamic costs with pgRouting
 
Lecture25
Lecture25Lecture25
Lecture25
 

Destacado

Analysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAnalysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAbhishek Mungoli
 
Similarity distance measures
Similarity  distance measuresSimilarity  distance measures
Similarity distance measuresthilagasna
 
Chemical features: how do we describe a compound to a computer?
Chemical features: how do we describe a compound to a computer?Chemical features: how do we describe a compound to a computer?
Chemical features: how do we describe a compound to a computer?Richard Lewis
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical StructuresRajarshi Guha
 
Detection of novel metabolites and enzyme functions though in silico expansio...
Detection of novel metabolites and enzyme functions though in silico expansio...Detection of novel metabolites and enzyme functions though in silico expansio...
Detection of novel metabolites and enzyme functions though in silico expansio...James Jeffryes
 
Molecular Representation, Similarity and Search
Molecular Representation, Similarity and SearchMolecular Representation, Similarity and Search
Molecular Representation, Similarity and SearchRajarshi Guha
 
Molecular similarity searching methods, seminar
Molecular similarity searching methods, seminarMolecular similarity searching methods, seminar
Molecular similarity searching methods, seminarHaitham Hijazi
 
EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...
EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...
EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...ChemAxon
 
Interaction fingerprint: 1D representation of 3D protein-ligand complexes
Interaction fingerprint: 1D representation of 3D protein-ligand complexesInteraction fingerprint: 1D representation of 3D protein-ligand complexes
Interaction fingerprint: 1D representation of 3D protein-ligand complexesVladimir Chupakhin
 
Fingerprinting
FingerprintingFingerprinting
Fingerprintingannperry09
 
Mainframe refresher-part-1
Mainframe refresher-part-1Mainframe refresher-part-1
Mainframe refresher-part-1vishwas17
 
Advanced Blogging Overview
Advanced Blogging OverviewAdvanced Blogging Overview
Advanced Blogging OverviewMrAppleby
 
Cheminformatics II
Cheminformatics IICheminformatics II
Cheminformatics IIbaoilleach
 
A project on taxi
A  project on taxi A  project on taxi
A project on taxi 549manish
 
Fingerprints
FingerprintsFingerprints
Fingerprintsabelts
 

Destacado (16)

Analysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAnalysis of different similarity measures: Simrank
Analysis of different similarity measures: Simrank
 
Similarity distance measures
Similarity  distance measuresSimilarity  distance measures
Similarity distance measures
 
Chemical features: how do we describe a compound to a computer?
Chemical features: how do we describe a compound to a computer?Chemical features: how do we describe a compound to a computer?
Chemical features: how do we describe a compound to a computer?
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical Structures
 
Detection of novel metabolites and enzyme functions though in silico expansio...
Detection of novel metabolites and enzyme functions though in silico expansio...Detection of novel metabolites and enzyme functions though in silico expansio...
Detection of novel metabolites and enzyme functions though in silico expansio...
 
Molecular Representation, Similarity and Search
Molecular Representation, Similarity and SearchMolecular Representation, Similarity and Search
Molecular Representation, Similarity and Search
 
Molecular similarity searching methods, seminar
Molecular similarity searching methods, seminarMolecular similarity searching methods, seminar
Molecular similarity searching methods, seminar
 
EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...
EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...
EUGM 2013 - Björn Windshügel (European ScreeningPort): Chemoinformatic tools ...
 
Interaction fingerprint: 1D representation of 3D protein-ligand complexes
Interaction fingerprint: 1D representation of 3D protein-ligand complexesInteraction fingerprint: 1D representation of 3D protein-ligand complexes
Interaction fingerprint: 1D representation of 3D protein-ligand complexes
 
Fingerprinting
FingerprintingFingerprinting
Fingerprinting
 
Fingerprints
FingerprintsFingerprints
Fingerprints
 
Mainframe refresher-part-1
Mainframe refresher-part-1Mainframe refresher-part-1
Mainframe refresher-part-1
 
Advanced Blogging Overview
Advanced Blogging OverviewAdvanced Blogging Overview
Advanced Blogging Overview
 
Cheminformatics II
Cheminformatics IICheminformatics II
Cheminformatics II
 
A project on taxi
A  project on taxi A  project on taxi
A project on taxi
 
Fingerprints
FingerprintsFingerprints
Fingerprints
 

Similar a [ICDE 2012] On Top-k Structural Similarity Search

1 chayes
1 chayes1 chayes
1 chayesYandex
 
Lec5 pagerank
Lec5 pagerankLec5 pagerank
Lec5 pagerankCarlos
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerankmobius.cn
 
Pagerank (from Google)
Pagerank (from Google)Pagerank (from Google)
Pagerank (from Google)Sri Prasanna
 
Graphs in Data Structure
 Graphs in Data Structure Graphs in Data Structure
Graphs in Data Structurehafsa komal
 
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptxthanhdowork
 
lec 09-graphs-bfs-dfs.ppt
lec 09-graphs-bfs-dfs.pptlec 09-graphs-bfs-dfs.ppt
lec 09-graphs-bfs-dfs.pptTalhaFarooqui12
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Saad Liaqat
 
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemTMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemIosif Itkin
 
Talk on Graph Theory - I
Talk on Graph Theory - ITalk on Graph Theory - I
Talk on Graph Theory - IAnirudh Raja
 
Ranking systems
Ranking systemsRanking systems
Ranking systemsMafer
 
Ranking systems
Ranking systemsRanking systems
Ranking systemsJoyce
 
Higher-order clustering coefficients
Higher-order clustering coefficientsHigher-order clustering coefficients
Higher-order clustering coefficientsAustin Benson
 
Lecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.pptLecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.pptWahyuAde4
 
Scalable Online Betweenness Centrality in Evolving Graphs
Scalable Online Betweenness Centrality in Evolving GraphsScalable Online Betweenness Centrality in Evolving Graphs
Scalable Online Betweenness Centrality in Evolving GraphsNicolas Kourtellis
 
Lecture 5b graphs and hashing
Lecture 5b graphs and hashingLecture 5b graphs and hashing
Lecture 5b graphs and hashingVictor Palmar
 

Similar a [ICDE 2012] On Top-k Structural Similarity Search (20)

Lecture13
Lecture13Lecture13
Lecture13
 
1 chayes
1 chayes1 chayes
1 chayes
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerank
 
Lec5 pagerank
Lec5 pagerankLec5 pagerank
Lec5 pagerank
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerank
 
Pagerank (from Google)
Pagerank (from Google)Pagerank (from Google)
Pagerank (from Google)
 
Graphs in Data Structure
 Graphs in Data Structure Graphs in Data Structure
Graphs in Data Structure
 
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx
240401_JW_labseminar[LINE: Large-scale Information Network Embeddin].pptx
 
lec 09-graphs-bfs-dfs.ppt
lec 09-graphs-bfs-dfs.pptlec 09-graphs-bfs-dfs.ppt
lec 09-graphs-bfs-dfs.ppt
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01
 
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemTMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
 
Talk on Graph Theory - I
Talk on Graph Theory - ITalk on Graph Theory - I
Talk on Graph Theory - I
 
Ranking systems
Ranking systemsRanking systems
Ranking systems
 
Ranking systems
Ranking systemsRanking systems
Ranking systems
 
Higher-order clustering coefficients
Higher-order clustering coefficientsHigher-order clustering coefficients
Higher-order clustering coefficients
 
Lecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.pptLecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.ppt
 
Scalable Online Betweenness Centrality in Evolving Graphs
Scalable Online Betweenness Centrality in Evolving GraphsScalable Online Betweenness Centrality in Evolving Graphs
Scalable Online Betweenness Centrality in Evolving Graphs
 
Line
LineLine
Line
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Lecture 5b graphs and hashing
Lecture 5b graphs and hashingLecture 5b graphs and hashing
Lecture 5b graphs and hashing
 

Último

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 

Último (20)

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

[ICDE 2012] On Top-k Structural Similarity Search

  • 1. On Top-k Structural Similarity Search. Pei Lee, Laks V.S. Lakshmanan (University of British Columbia, Vancouver, BC, Canada); Jeffrey Xu Yu (Chinese University of Hong Kong, Hong Kong, China). ICDE 2012. 2014-4-16
  • 2. Outline. Problem definition: structural similarity; top-k structural similarity search. Existing top-k structural similarity search methods: SimRank, P-Rank; constraints. TopSim: a family of efficient top-k structural similarity search algorithms with accuracy guarantee (Truncated TopSim, Prioritized TopSim). Experiments.
  • 3. Problem Statement. Graph structures are ubiquitous: social networks, citation networks, web graphs, etc.
  • 4. What’s structural similarity? Structural similarity: the pairwise similarity between nodes in a graph. Applications: link prediction, recommendation, etc. Intuition: two nodes are similar if their neighbors are similar. This is derived from PageRank’s intuition: a node is important if it is referenced by many other important nodes. How to quantify the similarity between nodes u and v? [Figure: example graph on nodes a, b, c, d, e, f, g, h, u, v.] Problem definition: Input: $G(V, E)$, $u \in V$, $v \in V$. Output: $S(u, v)$.
  • 5. What’s top-k structural similarity search? Problem definition: Input: $G(V, E)$, $v \in V$, $k$. Output: the top-$k$ similar nodes for $v$. Given a node v in a huge graph, find the top-k nodes most similar to v. But: we definitely do not want to compare v with every node, and the accuracy of the results should be guaranteed.
  • 6. Related Work: Existing Structural Similarity Measures. Neighbor-based approaches: Jaccard coefficient, cosine similarity, Pearson correlation, co-citation, etc. Cons: no common neighbors, no similarity! Random-walk-based approaches: SimRank (Jeh & Widom, KDD’02); P-Rank (Zhao et al., CIKM’09), which extends SimRank. Cons: high computational cost; not designed for top-k similarity search.
  • 7. SimRank & P-Rank. SimRank: two nodes are similar if they are referenced by similar nodes. Example [figure: small graph on nodes a, b, c, u, v]: $S(a, a) = 1$, $S(b, c) = 0.5 > 0$, $S(u, v) = 0.25 > 0$. Pairwise iterative form: $S_{n+1}(u, v) = \frac{C}{|I(u)||I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j)$, where $I(\cdot)$ denotes the in-neighbors. Matrix form: $S_{n+1} = C\, W^{T} S_n W$, where $W$ is the transition matrix (the slide also labels a correction matrix). P-Rank: two nodes are similar if they are related to similar nodes: $S_{n+1}(u, v) = \frac{\lambda C}{|I(u)||I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j) + \frac{(1 - \lambda) C}{|O(u)||O(v)|} \sum_{i \in O(u)} \sum_{j \in O(v)} S_n(i, j)$, with $0 < C < 1$ and $0 < \lambda < 1$. The first term is SimRank; the second is reversed SimRank, taken over the out-neighbors $O(\cdot)$.
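The pairwise iterative form of SimRank can be sketched in a few lines of Python. This is a naive all-pairs illustration, not the TopSim method. The function name `simrank` and the edge set of the example graph (a→b, a→c, b→u, c→v) are assumptions, chosen so that the scores quoted on the slide come out with C = 0.5.

```python
def simrank(in_neighbors, C=0.5, iterations=3):
    """Naive all-pairs SimRank using the pairwise iterative form."""
    nodes = list(in_neighbors)
    # S_0(u, v) = 1 if u == v, else 0.
    S = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        nxt = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0  # a node is fully similar to itself
                    continue
                Iu, Iv = in_neighbors[u], in_neighbors[v]
                if not Iu or not Iv:
                    nxt[(u, v)] = 0.0  # no in-neighbors, no similarity
                    continue
                total = sum(S[(i, j)] for i in Iu for j in Iv)
                nxt[(u, v)] = C * total / (len(Iu) * len(Iv))
        S = nxt
    return S


# Assumed example graph: a -> b, a -> c, b -> u, c -> v (in-neighbor lists).
example = {"a": [], "b": ["a"], "c": ["a"], "u": ["b"], "v": ["c"]}
scores = simrank(example)
print(scores[("b", "c")], scores[("u", "v")])
```

Every iteration touches all node pairs, which is the cost that a top-k search for a single query node tries to avoid.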
  • 8. Top-k similarity search: challenges. Matrix-based approach (KDD’02, VLDB’08): offline, compute a |V|-by-|V| similarity matrix; SimRank/P-Rank takes $O(|E|^2)$ time, which degenerates to $O(|V|^4)$ in the worst case; space cost: this huge similarity matrix is hard to store. Vector-based approach (SDM’10): offline, compute a vector of length |V|; takes O(|V|D2n) time in the worst case, where n is the number of iterations and D is the average degree. All these approaches need to access the whole graph to find the exact top-k similar nodes.
  • 9. Contributions. (1) Transform the computation of pairwise similarity on graph G into the computation of authority on the product graph G × G, based on a propagation & aggregation process. (2) Propose TopSim, a local top-k structural similarity search algorithm that avoids accessing the whole graph while guaranteeing accuracy. (3) Propose Trun-TopSim-SM and Prio-TopSim-SM, two approximations that trade accuracy for speed.
  • 10. How TopSim works: coupling random walk on G → single random walk on G × G → propagation & aggregation → similarity path → similarity score.
  • 11. Product of graphs: G × G. Given G(V, E), G × G is defined as follows: for nodes u and v in G, uv is a node in G × G; for edges (e, u) and (e, v) in G, (ee, uv) is an edge in G × G. In other words, each node pair in G is a node in G × G, and each edge pair in G is an edge in G × G. There is no need to materialize G × G: it exists only conceptually, to facilitate analysis. [Figure: an example graph G on nodes a, b, c, d, e, u, v and its product graph G × G.]
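A minimal sketch of the product-graph definition, for illustration only, since the slide stresses that the product graph never has to be materialized. The function name `product_graph_edges` and the two-edge input are assumptions.

```python
from itertools import product


def product_graph_edges(edges):
    """Pair every edge (a, c) of G with every edge (b, d) of G.

    Each pair yields the product-graph edge (ab, cd), where a node of the
    product graph is represented as a tuple of two nodes of G.
    """
    return {((a, b), (c, d)) for (a, c), (b, d) in product(edges, repeat=2)}


# The slide's case: edges (e, u) and (e, v) in G give the edge (ee, uv).
g_edges = [("e", "u"), ("e", "v")]
pg_edges = product_graph_edges(g_edges)
print(sorted(pg_edges))
```

With |E| edges in G the product graph has |E|^2 edges, which is why it is used for analysis rather than built explicitly.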
  • 12. Coupling random walk. Coupling random walk: two random surfers (Surf1, Surf2) walk simultaneously and follow the same edge direction. A coupling random walk on G can be equivalently transformed into a single random walk on G × G. SimRank: S(u, v) is the first-meeting probability of two random surfers that start from u and v respectively and follow backward links. [Figure: graph G and its product graph G × G.]
  • 13. Compute similarity based on coupling random walk. We actually transform a similarity ranking problem on G into an authority ranking problem on G × G, with R(uv) = S(u, v). Initialization: for a source node (u = v), R(uv) = 1 is fixed; for a target node (u ≠ v), R(uv) = 0 and R(uv) will be updated. How is R(uv) updated? By a propagation & aggregation process on G × G. Propagation: nodes propagate their authority to their neighbors following random walk steps. Aggregation: nodes receive and aggregate the authorities propagated in from their neighbors.
  • 14. Compute S(u, v)? Similarity path: a path from a source node to the target node that does not pass through other source nodes. Probability of a transition step and similarity: [formulas not preserved in the transcript]. The similarity S(u, v) is the sum over the similarity paths with end node uv. [Figure: product graph G × G.]
  • 15. Compute S(u, v): example, with C = 0.5. [Figure: product graph G × G with the similarity paths marked.] If we only consider 3 steps: Path 1: (ee, uv), with P(ee, uv) = 0.5. Path 2: (aa, bd, ce, uv), with P(aa, bd, ce, uv) = 0.5 × 1 × 0.5 = 0.25. $S_3(u, v) = P(ee, uv) \cdot C + P(aa, bd, ce, uv) \cdot C^3 \approx 0.28$.
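The path sum in this example can be reproduced with a short sketch that walks the node pair backward, level by level, without building the product graph. The function name `path_sum_similarity` is hypothetical, and the edge set below is an assumption: the slide's graph is not recoverable from the transcript, so the edges were chosen only to give the same two similarity paths and probabilities.

```python
def path_sum_similarity(in_nbrs, u, v, C=0.5, steps=3):
    """Sum P(path) * C**len(path) over similarity paths of length <= steps."""
    if u == v:
        return 1.0
    frontier = {(u, v): 1.0}  # probability mass of pairs that have not met yet
    score = 0.0
    for step in range(1, steps + 1):
        nxt = {}
        for (x, y), p in frontier.items():
            Ix, Iy = in_nbrs.get(x, []), in_nbrs.get(y, [])
            if not Ix or not Iy:
                continue  # dead end: one surfer cannot move backward
            share = p / (len(Ix) * len(Iy))  # probability of one coupled step
            for i in Ix:
                for j in Iy:
                    if i == j:
                        # First meeting at source node (i, i): the path ends here.
                        score += share * C ** step
                    else:
                        nxt[(i, j)] = nxt.get((i, j), 0.0) + share
        frontier = nxt
    return score


# Assumed in-neighbor lists (hypothetical graph, see the note above).
in_nbrs = {"u": ["c", "e"], "v": ["e"], "c": ["b"],
           "e": ["d"], "b": ["a"], "d": ["a", "c"]}
s3 = path_sum_similarity(in_nbrs, "u", "v")
print(s3)  # 0.5 * 0.5 + 0.25 * 0.5**3 = 0.28125
```

Pairs that meet are not expanded further, which matches the requirement that a similarity path does not pass through other source nodes.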
  • 17. Optimization based on SimMap. Observation: many similarity paths overlap. SimMap: SM(u) = {(key, value)}, where key is the node visited by Surf2 on step i when Surf1 visits node u, and value = S_i(key, u). SM(v) is exactly the result list. TopSim-SM example [figure: graph on nodes a to h, u, v, with levels 0 to 3 and the similarity paths marked]: start from c; SM(b) = {(d, 1/2), (f, 1/2)}; SM(a) = {(e, 1/8)}; SM(v) = {(u, 1/32)}.
  • 18. Family of TopSim Algorithms
      Algorithm        Quality                    Performance
      TopSim           Exact                      Slow if the graph is not sparse
      TopSim-SM        Exact                      More efficient than TopSim
      Trun-TopSim-SM   Trade accuracy for speed   More efficient than TopSim-SM
      Prio-TopSim-SM   Trade accuracy for speed   More efficient than TopSim-SM
  • 19. TopSim approximations for scale-free graphs. Scale-free graphs: some nodes have very high degree (web graphs, citation networks, etc.). Random surfers will be trapped by high-degree nodes, and the size of the SimMaps will explode. Revisit the transition probabilities.
  • 20. TopSim approximations. Basic idea: only consider similarity paths with higher probability. Truncated TopSim-SM: if $P(u_0u_0, \ldots, u_iv_i) < \eta$, stop and ignore this path. Prioritized TopSim-SM: set a buffer size H for each SimMap and only expand the top H nodes in each SimMap, that is, if |SM(u)| > H, keep only the H highest-valued entries. See the paper for the accuracy and complexity analysis.
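The two pruning rules can be sketched as filters over a dictionary of candidate entries. This is a simplification under stated assumptions: the helper names `truncate` and `prioritize` are hypothetical, and each entry's value is treated as the probability mass of the paths reaching it, whereas the slide states the truncation test per path.

```python
import heapq


def truncate(entries, eta):
    """Truncated rule: drop entries whose probability is below eta."""
    return {key: p for key, p in entries.items() if p >= eta}


def prioritize(entries, H):
    """Prioritized rule: keep only the H highest-valued entries."""
    if len(entries) <= H:
        return dict(entries)
    return dict(heapq.nlargest(H, entries.items(), key=lambda kv: kv[1]))


# Hypothetical entries, for illustration only.
sim_map = {"d": 0.5, "f": 0.25, "e": 0.125, "g": 0.0005}
print(truncate(sim_map, eta=0.001))
print(prioritize(sim_map, H=2))
```

Truncation bounds the loss per dropped path by η, while prioritization bounds the size of each map by H regardless of node degree.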
  • 21. Experiments. Datasets: Arxiv High Energy Physics paper citation network, with 34,546 nodes and 421,578 edges; DBLP co-author graph, with 0.92M nodes and 6.1M edges; DBLP citation network, with 1.5M papers and 2.1M citations; LiveJournal social network, with 4.84M users and 68.99M friendship ties. Parameters: C = 0.5, η = 0.001, H = 100.
  • 22. Accuracy of similarity scores (running on the Arxiv citation network). [Plots: accuracy ratio; accuracy loss.] 3 steps/iterations are good enough for the accuracy of the top-20 list.
  • 23. Precision@k (running on the DBLP citation network). k around 20~30 yields the highest precision; 3 steps/iterations yield a high precision.
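For reference, a common way to compute precision@k is sketched below. It assumes the approximate top-k list is scored against an exact top-k list; the slides do not spell out the exact definition used, so treat this as one plausible reading. The function name and the sample rankings are hypothetical.

```python
def precision_at_k(approx_ranking, exact_ranking, k):
    """Fraction of the approximate top-k that also appears in the exact top-k."""
    top_approx = set(approx_ranking[:k])
    top_exact = set(exact_ranking[:k])
    return len(top_approx & top_exact) / k


# Hypothetical rankings, for illustration only.
approx = ["u", "e", "d", "f"]
exact = ["u", "d", "e", "g"]
print(precision_at_k(approx, exact, 3), precision_at_k(approx, exact, 4))
```

Precision@k ignores the order inside the top-k list, which is why a rank-sensitive measure is reported separately.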
  • 24. Kendall Tau distance (care more about the ranking order …). A pair (a, b) is concordant if both lists rank it in the same order (a b vs. a b) and discordant if the orders differ (a b vs. b a). The higher, the better.
  • 25. Kendall Tau distance (care more about the ranking order …). k around 20~30 yields the highest precision; 3 steps/iterations yield a high precision.
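The concordant and discordant pair counts can be turned into a score as sketched below. Because the slides say "the higher, the better", this sketch assumes the Kendall tau coefficient, (concordant − discordant) / number of pairs, computed over the items that appear in both lists; the exact variant used in the experiments is not stated. The function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall tau coefficient over the items common to both rankings."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    common = [item for item in ranking_a if item in pos_b]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 0.0  # fewer than two shared items: nothing to compare
    # A pair is concordant when both rankings order it the same way.
    concordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
    )
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


print(kendall_tau(["a", "b"], ["a", "b"]))  # concordant pair
print(kendall_tau(["a", "b"], ["b", "a"]))  # discordant pair
```

The score ranges from -1 (reversed order) to 1 (identical order).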
  • 26. Running time with different node sizes and node degrees. TopSim algorithms are not very sensitive to the graph size. TopSim approximations can handle high-degree nodes.
  • 27. Running time and accessed nodes
  • 28. Excitements. We transform a similarity problem on graph G into an equivalent authority ranking problem on the product graph G × G to facilitate analysis. We propose a family of TopSim algorithms that: produce top-k results with an accuracy guarantee; only access a small portion of the graph; handle both SimRank and P-Rank under the same top-k framework. Questions?
  • 29. TopSim-SM. Start from v and find the source nodes at each step. From level n-1 down to 0: let Surf1 start from a source node and walk to node v; let Surf2 start from the same source node and put the visited nodes into SimMaps. When Surf1 visits v, Surf2 visits exactly the nodes similar to v in the same step. Example [figure: graph on nodes a to h, u, v, with levels 0 to 3]: start from c; SM(b) = {(d, 1/2), (f, 1/2)}; SM(a) = {(e, 1/8)}; SM(v) = {(u, 1/32)}.

Editor's notes

  1. The problem, challenges, theory, experiments, conclusion, (related work, theory proofs)