SlideShare una empresa de Scribd logo
1 de 49
Descargar para leer sin conexión
Nesreen  K.  Ahmed
Department  of  Computer  Science,  Purdue  University    
Joint  work  with
Nick  Duffield1,  Jennifer  Neville2,  Ramana Kompella2
1Texas  A&M  University,  2Purdue  University
KDD’14,  New  York
August  25th,  2014
Social  
Network Internet  (AS)
BiologicalPolitical  Blogs
Graph
Analytics
Studying  and  analyzing  complex  networks
is  a  challenging and  computational  intensive task
Studying  and  analyzing  complex  networks
is  a  challenging and  computational  intensive task
Ø Today’s  networks  are  massive  in  size  
-­‐ e.g.,  online  social  networks  have  billions  of  users
Ø Today’s  networks  are  dynamic/streaming  over  time
-­‐ e.g.,  Twitter  streams,  email  communications  
Ø Today’s  networks  are  massive  in  size  
-­‐ e.g.,  online  social  networks  have  billions  of  users
Ø Today’s  networks  are  dynamic/streaming  over  time
-­‐ e.g.,  Twitter  streams,  email  communications  
Studying  and  analyzing  complex  networks
is  a  challenging and  computational  intensive task
Studying  and  analyzing  complex  networks
is  a  challenging and  computational  intensive task
Ø Today’s  networks  are  massive  in  size  
-­‐ e.g.,  online  social  networks  have  billions  of  users
Ø Today’s  networks  are  dynamic/streaming  over  time
-­‐ e.g.,  Twitter  streams,  email  communications  
Ø Today’s  networks  are  massive  in  size  
-­‐ e.g.,  online  social  networks  have  billions  of  users
Ø Today’s  networks  are  dynamic/streaming  over  time
-­‐ e.g.,  Twitter  streams,  email  communications  
Due  to  these  challenges,  we  usually  need  to  sampleDue  to  these  challenges,  we  usually  need  to  sample
Statistical  
Sampling
Graph  G Sample  S
e.g. Uniform Random
Sampling
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
No.  Edges No.  TrianglesNo.  Wedges
Frequent  connected  subsets  of  edges
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
Given a large graph G represented as a stream of edges e1, e2, e3…
We show how to efficiently sample from G while limiting memory
space to calculate unbiased estimates of various graph properties
No.  Edges No.  TrianglesNo.  Wedges
Frequent  connected  subsets  of  edges
Transitivity
§ Random  Sampling
• Uniform  random  sampling – [Tsourakakis et.  al  KDD’09]    
— Graph  Sparsification with  probability  p  
— Chance  of  sampling  a  subgraph (e.g.,  triangle)  is  very  low
— Estimates  suffer  from  high  variance  
• Wedge  Sampling – [Seshadhri et.  al  SDM’13]  
— Sample  vertices,  then  sample  pairs  of  incident  edges  (wedges)
— Output  the  estimate  of  the  closed  wedges  (triangles)  
§ Random  Sampling
• Uniform  random  sampling – [Tsourakakis et.  al  KDD’09]    
— Graph  Sparsification with  probability  p  
— Chance  of  sampling  a  subgraph (e.g.,  triangle)  is  very  low
— Estimates  suffer  from  high  variance  
• Wedge  Sampling – [Seshadhri et.  al  SDM’13]  
— Sample  vertices,  then  sample  pairs  of  incident  edges  (wedges)
— Output  the  estimate  of  the  closed  wedges  (triangles)  
Assume  graph  fits  into  memory
or  
Need  to  store  a  large  fraction  of  data
§ Assume  specific  order  of  the  stream  – [Buriol et.  al  2006]  
• Incidence  stream  model– neighbors  of  a  vertex  arrive  together  in  the  
stream
§ Use  multiple  passes  over  the  stream  – [Becchetti et.  al  KDD‘08]
§ Single-­‐pass  Algorithms
• Streaming-­‐Triangles  – [Jha et.  al  KDD’13]
— Sample  edges  using  reservoir  sampling,  then,  sample  pairs  of  
incident  edges  (wedges)
— Scan  for  closed  wedges  (triangles)
§ Assume  specific  order  of  the  stream  – [Buriol et.  al  2006]  
• Incidence  stream  model– neighbors  of  a  vertex  arrive  together  in  the  
stream
§ Use  multiple  passes  over  the  stream  – [Becchetti et.  al  KDD‘08]
§ Single-­‐pass  Algorithms
• Streaming-­‐Triangles  -­‐-­‐ [Jha et.  al  KDD’13]
— Sample  edges  using  reservoir  sampling,  then,  sample  pairs  of  
incident  edges  (wedges)
— Scan  for  closed  wedges  (triangles)
Sampling  designs  used  for  specific  graph  properties  (triangles)  
and  
Not  generally  applicable  to  other  properties
Input
Graph-Sample-and-Hold Framework
gSH(p,q)Output
Input
Graph-Sample-and-Hold Framework
gSH(p,q)Output
Input
Graph-Sample-and-Hold Framework
gSH(p,q)Output
Input
Graph-Sample-and-Hold Framework
gSH(p,q)Output
Sample
Input
Graph-Sample-and-Hold Framework
gSH(p,q)Output
HoldSample
Input
Graph-Sample-and-Hold Framework
gSH(p,q)Output
HoldSample
Input
Graph-Sample-and-Hold Framework
gSH(p,q)Output
Sample Hold
Chance  to  sample  a  subgraph (e.g.,  triangle)  is  very  low
Estimates  suffer  from  a  high  variance
Graph-Sample-and-Hold Framework
gSH(p,q)
① Sampling ② Estimation
Input
Input
HoldSample
Input
HoldSample
Input
HoldSample
Input
HoldSample
Input
HoldSample
Input
HoldSample
Input
Multiple  dependencies  
Graph-Sample-and-Hold Framework
gSH(p,q)
① Sampling ② Estimation
Input
We  use  Horvitz-­‐Thompson  statistical  estimation  framework
• Used  for  unequal  probability  sampling  
[Horvitz  and  Thompson-­1952]
Journal  of  American  Stat.  Assoc.
§ We  define  the  selection  estimator  for  an  edge  
§ We  define  the  selection  estimator  for  an  edge  
§ We  define  the  selection  estimator  for  an  edge  
§ We  define  the  selection  estimator  for  an  edge  
Unbiasedness
[Theorem  1  (i)]
§ We  define  the  selection  estimator  for  an  edge
§ We  derive  the  estimate  of  edge  count
Unbiasedness
[Theorem  1  (i)]
§ We  generalize the  equations  to  any  subset  of  edges
§ We  generalize the  equations  to  any  subset  of  edges
Unbiasedness
[Theorem  1  (ii)]
§ We  generalize the  equations  to  any  subset  of  edges
§ Ex: The  estimator  of  a  sampled  triangle  is:
Unbiasedness
[Theorem  1  (ii)]
§ The  estimate  of  triangle  counts  is
§ The  estimate  of  wedge  counts  is
§ Using  the  gSH framework,  we  obtain  estimators  for:
• Edge  Count
• Triangle  Count
• Wedge  Count
• Global  clustering  Coeff.  =  3  *                        /
§ The  estimate  of  triangle  counts  is
§ The  estimate  of  wedge  counts  is
§ Using  the  gSH framework,  we  obtain  estimators  for:
• Edge  Count
• Triangle  Count
• Wedge  Count
• Global  clustering  Coeff.  =  3  *                        /
We  also  compute  HT  
unbiased  variance  estimators
Social	
  facebook
Graphs
Graphs	
  
Web
Graphs Links	
  to
6	
  graphs	
  with 7K	
  – 685K	
  nodes,	
  250K– 7M	
  edges
CMU,	
  UCLA,	
  and	
  Wisconsin
Google,	
   Berkeley-­‐Stanford,	
  Stanford	
  
friends
o K with probability r
datasets. n is the number of nodes, NK is the
is the number of triangles, NΛ is the number
length 2, α is the global clustering coefficient,
NK NT NΛ α D
249.9K 2.3M 37.4M 0.18526 0.0114
747.6K 5.1M 107.1M 0.14314 0.0036
835.9K 4.8M 121.4M 0.12013 0.0029
1.9M 11.3M 3.9T 0.00862 5.01 × 10−5
4.3M 13.3M 727.4M 0.05523 1.15 × 10−5
6.6M 64.6M 27.9T 0.00694 2.83 × 10−5
1, while the corresponding sums of the de-
he the true node degree.
socfb-CMU 249.9K 249.6K 0.0013
socfb-UCLA 747.6K 751.3K 0.0050
socfb-Wisconsin 835.9K 835.7K 0.0003
web-Stanford 1.9M 1.9M 0.0004 1
web-Google 4.3M 4.3M 0.0007 2
web-BerkStan 6.6M 6.6M 0.0006 3
Triangles NT
NT NT
|NT −NT |
NT
S
socfb-CMU 2.3M 2.3M 0.0003
socfb-UCLA 5.1M 5.1M 0.0095
socfb-Wisconsin 4.8M 4.8M 0.0058
web-Stanford 11.3M 11.3M 0.0023 1
web-Google 13.3M 13.4M 0.0029 2
web-BerkStan 64.6M 65M 0.0063 3
Path. Length two N
NΛ NΛ
|NΛ−NΛ|
NΛ
S
socfb-CMU 37.4M 37.3M 0.0018
socfb-UCLA 107.1M 107.8M 0.0060
socfb-Wisconsin 121.4M 121.2M 0.0018
web-Stanford 3.9T 3.9T 0.0004 1
web-Google 727.4M 724.3M 0.0042 2
web-BerkStan 27.9T 27.9T 0.0002 3
Global Clustering α
α α |α−α|
α
S
socfb-CMU 0.18526 0.18574 0.00260
socfb-UCLA 0.14314 0.14363 0.00340
NK NK
|NK −NK |
NK
SSize LB UB
socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K
socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K
socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K
web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M
web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M
web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M
Triangles NT
NT NT
|NT −NT |
NT
SSize LB UB
socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M
socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M
socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M
web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M
web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M
web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M
Path. Length two NΛ
NΛ NΛ
|NΛ−NΛ|
NΛ
SSize LB UB
socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M
socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M
socfb-Wisconsin 121.4M 121.2M 0.0018 5.5K 108.9M 133.4M
web-Stanford 3.9T 3.9T 0.0004 14.8K 3.6T 4.2T
web-Google 727.4M 724.3M 0.0042 25.2K 677.1M 771.5M
web-BerkStan 27.9T 27.9T 0.0002 39.8K 26.5T 29.3T
Global Clustering α
α α |α−α|
α
SSize LB UB
datasets. n is the number of nodes, NK is the
is the number of triangles, NΛ is the number
length 2, α is the global clustering coefficient,
NK NT NΛ α D
249.9K 2.3M 37.4M 0.18526 0.0114
747.6K 5.1M 107.1M 0.14314 0.0036
835.9K 4.8M 121.4M 0.12013 0.0029
1.9M 11.3M 3.9T 0.00862 5.01 × 10−5
4.3M 13.3M 727.4M 0.05523 1.15 × 10−5
6.6M 64.6M 27.9T 0.00694 2.83 × 10−5
1, while the corresponding sums of the de-
he the true node degree.
ENTS AND EVALUATION
mance of our proposed framework (gSH) as
m 2 (with r = 1 for edges that are closing
ocial and information networks with 250K–
he following networks, we consider an undi-
NT NT
|NT −NT |
NT
SS
socfb-CMU 2.3M 2.3M 0.0003 1
socfb-UCLA 5.1M 5.1M 0.0095 5
socfb-Wisconsin 4.8M 4.8M 0.0058 5
web-Stanford 11.3M 11.3M 0.0023 14
web-Google 13.3M 13.4M 0.0029 25
web-BerkStan 64.6M 65M 0.0063 39
Path. Length two NΛ
NΛ NΛ
|NΛ−NΛ|
NΛ
SS
socfb-CMU 37.4M 37.3M 0.0018 1
socfb-UCLA 107.1M 107.8M 0.0060 5
socfb-Wisconsin 121.4M 121.2M 0.0018 5
web-Stanford 3.9T 3.9T 0.0004 14
web-Google 727.4M 724.3M 0.0042 25
web-BerkStan 27.9T 27.9T 0.0002 39
Global Clustering α
α α |α−α|
α
SS
socfb-CMU 0.18526 0.18574 0.00260 1
socfb-UCLA 0.14314 0.14363 0.00340 5
socfb-Wisconsin 0.12013 0.12101 0.00730 5
web-Stanford 0.00862 0.00862 0.00020 14
web-Google 0.05523 0.05565 0.00760 25
web-BerkStan 0.00694 0.00698 0.00680 39
e
r
,
lower bounds, and upper bounds when sample size ≤ 40K edges,
with sampling probability p, q = 0.005 for web-BerkStan, and
p = 0.005, q = 0.008 otherwise. SSize is the number of sam-
pled edges, and LB, UB are the 95% lower, and upper bound
respectively.
Edges NK
NK NK
|NK −NK |
NK
SSize LB UB
socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K
socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K
socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K
web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M
web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M
web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M
Triangles NT
NT NT
|NT −NT |
NT
SSize LB UB
socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M
socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M
socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M
web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M
web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M
web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M
Path. Length two NΛ
NΛ NΛ
|NΛ−NΛ|
NΛ
SSize LB UB
socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M
socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M
Edge  Count Triangle  Count
Wedge  Count Global  Clustering
Sample  size  <=  40K  edges
o K with probability r
datasets. n is the number of nodes, NK is the
is the number of triangles, NΛ is the number
length 2, α is the global clustering coefficient,
NK NT NΛ α D
249.9K 2.3M 37.4M 0.18526 0.0114
747.6K 5.1M 107.1M 0.14314 0.0036
835.9K 4.8M 121.4M 0.12013 0.0029
1.9M 11.3M 3.9T 0.00862 5.01 × 10−5
4.3M 13.3M 727.4M 0.05523 1.15 × 10−5
6.6M 64.6M 27.9T 0.00694 2.83 × 10−5
1, while the corresponding sums of the de-
he the true node degree.
socfb-CMU 249.9K 249.6K 0.0013
socfb-UCLA 747.6K 751.3K 0.0050
socfb-Wisconsin 835.9K 835.7K 0.0003
web-Stanford 1.9M 1.9M 0.0004 1
web-Google 4.3M 4.3M 0.0007 2
web-BerkStan 6.6M 6.6M 0.0006 3
Triangles NT
NT NT
|NT −NT |
NT
S
socfb-CMU 2.3M 2.3M 0.0003
socfb-UCLA 5.1M 5.1M 0.0095
socfb-Wisconsin 4.8M 4.8M 0.0058
web-Stanford 11.3M 11.3M 0.0023 1
web-Google 13.3M 13.4M 0.0029 2
web-BerkStan 64.6M 65M 0.0063 3
Path. Length two N
NΛ NΛ
|NΛ−NΛ|
NΛ
S
socfb-CMU 37.4M 37.3M 0.0018
socfb-UCLA 107.1M 107.8M 0.0060
socfb-Wisconsin 121.4M 121.2M 0.0018
web-Stanford 3.9T 3.9T 0.0004 1
web-Google 727.4M 724.3M 0.0042 2
web-BerkStan 27.9T 27.9T 0.0002 3
Global Clustering α
α α |α−α|
α
S
socfb-CMU 0.18526 0.18574 0.00260
socfb-UCLA 0.14314 0.14363 0.00340
NK NK
|NK −NK |
NK
SSize LB UB
socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K
socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K
socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K
web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M
web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M
web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M
Triangles NT
NT NT
|NT −NT |
NT
SSize LB UB
socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M
socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M
socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M
web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M
web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M
web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M
Path. Length two NΛ
NΛ NΛ
|NΛ−NΛ|
NΛ
SSize LB UB
socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M
socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M
socfb-Wisconsin 121.4M 121.2M 0.0018 5.5K 108.9M 133.4M
web-Stanford 3.9T 3.9T 0.0004 14.8K 3.6T 4.2T
web-Google 727.4M 724.3M 0.0042 25.2K 677.1M 771.5M
web-BerkStan 27.9T 27.9T 0.0002 39.8K 26.5T 29.3T
Global Clustering α
α α |α−α|
α
SSize LB UB
datasets. n is the number of nodes, NK is the
is the number of triangles, NΛ is the number
length 2, α is the global clustering coefficient,
NK NT NΛ α D
249.9K 2.3M 37.4M 0.18526 0.0114
747.6K 5.1M 107.1M 0.14314 0.0036
835.9K 4.8M 121.4M 0.12013 0.0029
1.9M 11.3M 3.9T 0.00862 5.01 × 10−5
4.3M 13.3M 727.4M 0.05523 1.15 × 10−5
6.6M 64.6M 27.9T 0.00694 2.83 × 10−5
1, while the corresponding sums of the de-
he the true node degree.
ENTS AND EVALUATION
mance of our proposed framework (gSH) as
m 2 (with r = 1 for edges that are closing
ocial and information networks with 250K–
he following networks, we consider an undi-
NT NT
|NT −NT |
NT
SS
socfb-CMU 2.3M 2.3M 0.0003 1
socfb-UCLA 5.1M 5.1M 0.0095 5
socfb-Wisconsin 4.8M 4.8M 0.0058 5
web-Stanford 11.3M 11.3M 0.0023 14
web-Google 13.3M 13.4M 0.0029 25
web-BerkStan 64.6M 65M 0.0063 39
Path. Length two NΛ
NΛ NΛ
|NΛ−NΛ|
NΛ
SS
socfb-CMU 37.4M 37.3M 0.0018 1
socfb-UCLA 107.1M 107.8M 0.0060 5
socfb-Wisconsin 121.4M 121.2M 0.0018 5
web-Stanford 3.9T 3.9T 0.0004 14
web-Google 727.4M 724.3M 0.0042 25
web-BerkStan 27.9T 27.9T 0.0002 39
Global Clustering α
α α |α−α|
α
SS
socfb-CMU 0.18526 0.18574 0.00260 1
socfb-UCLA 0.14314 0.14363 0.00340 5
socfb-Wisconsin 0.12013 0.12101 0.00730 5
web-Stanford 0.00862 0.00862 0.00020 14
web-Google 0.05523 0.05565 0.00760 25
web-BerkStan 0.00694 0.00698 0.00680 39
e
r
,
lower bounds, and upper bounds when sample size ≤ 40K edges,
with sampling probability p, q = 0.005 for web-BerkStan, and
p = 0.005, q = 0.008 otherwise. SSize is the number of sam-
pled edges, and LB, UB are the 95% lower, and upper bound
respectively.
Edges NK
NK NK
|NK −NK |
NK
SSize LB UB
socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K
socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K
socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K
web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M
web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M
web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M
Triangles NT
NT NT
|NT −NT |
NT
SSize LB UB
socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M
socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M
socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M
web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M
web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M
web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M
Path. Length two NΛ
NΛ NΛ
|NΛ−NΛ|
NΛ
SSize LB UB
socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M
socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M
Edge  Count Triangle  Count
Wedge  Count Global  Clustering
Sample  size  <=  40K  edges
<1%
Over  all  graphs,  all  properties
10
4
10
5
0.95
1
1.05
socfb−UCLA
ˆNK/NK
Sample Size (Edges)
(a) Edges
10
4
10
5
0.85
0.9
0.95
1
1.05
1.1
1.15
socfb−UCLA
ˆNT/NT
Sample Size (Edges)
(b) Triangles
0
1
ˆNΛ/NΛ
1
1.05
socfb−Wisconsin
ˆNK/NK
0.9
0.95
1
1.05
1.1
1.15
socfb−Wisconsin
ˆNT/NT
0
1
ˆNΛ/NΛ
Edge  Count
10
4
10
5
0.95
1
1.05
socfb−UCLA
ˆNK/NK
Sample Size (Edges)
(a) Edges
10
4
10
5
0.85
0.9
0.95
1
1.05
1.1
1.15
socfb−UCLA
ˆNT/NT
Sample Size (Edges)
(b) Triangles
1
1.05
socfb−Wisconsin
ˆNK/NK
0.9
0.95
1
1.05
1.1
1.15
socfb−Wisconsin
ˆNT/NT
Triangle  Count
10
5
socfb−UCLA
Sample Size (Edges)
10
4
10
5
0.9
0.95
1
1.05
1.1
socfb−UCLA
ˆNΛ/NΛ
Sample Size (Edges)
Wedge  Count
10
4
10
5
0.95
Sample Size (Edges)
(a) Edges
10
4
0.85
Sample
(b) Tria
10
4
10
5
0.95
1
1.05
socfb−Wisconsin
ˆNK/NK
Sample Size (Edges)
(d) Edges
10
4
0.85
0.9
0.95
1
1.05
1.1
1.15
socfb−
ˆNT/NT
Sample
(e) Tria
10
4
10
5
0.85
0.9
0.95
1
1.05
1.1
1.15
socfb−UCLA
ˆα/α
Sample Size (Edges)
Global  Clustering
Actual
Estimated/Actual
Confidence  Upper  &  Lower  Bounds  
Sample  Size  =  40K  edges
Dataset:
facebook friendship	
   graph	
  at	
  UCLA	
  
§ Comparison  to  Streaming-­‐Triangles  [Jha et.  al-­‐KDD’13]
socfb-Wisconsin 0.95 0.95 0.96 0.95
web-Stanford 0.97 0.92 0.95 0.92
web-Google 0.95 0.93 0.95 0.95
web-BerkStan 0.96 0.94 0.93 0.93
Table 5: The relative error and sample size of
Jha [25] in comparison to our framework for triangle
count estimation
Jha et al. [25] gSH
graph |NT −NT |
NT
SSize |NT −NT |
NT
SSize
web-Stanford ≈ 0.07 40K 0.0023 14.8K
web-Google ≈ 0.04 40K 0.0029 25.2K
web-BerkStan ≈ 0.12 40K 0.0063 39.8K
For each p = pi, q = qi, we compute the proportion of samples
in which the actual statistic lies in the confidence interval across
100 independent sampling experiments gSHT (pi, qi). We vary
p, q in the range of 0.005–0.01, and for each possible combination
of p, q (e.g., p = 0.005, q = 0.008), we compute the exact cover-
age probability γ. Table 4 provides the mean coverage probability
web-B
a very large e
5.4 Effec
While Figu
posed framew
tion of what i
still needs to b
choice of para
the graph.
Figure 2 sh
the range of 0.
graphs. Note
Table 2) goin
observe that w
of sampled ed
of edges in th
0.03, the fract
– 5%. These o
On the oth
sampled edge
For example,
fraction of sa
Datasets:
Web	
  graphs
WWW
Page2
links	
  toWWW
Page1
5%	
  of	
  graph	
  edges
0.6%	
  of	
  graph	
  edges
0.57%	
  of	
  graph	
  edges
Sample	
  size
Our	
  approach
§ Comparison  to  Streaming-­‐Triangles  [Jha et.  al-­‐KDD’13]
socfb-Wisconsin 0.95 0.95 0.96 0.95
web-Stanford 0.97 0.92 0.95 0.92
web-Google 0.95 0.93 0.95 0.95
web-BerkStan 0.96 0.94 0.93 0.93
Table 5: The relative error and sample size of
Jha [25] in comparison to our framework for triangle
count estimation
Jha et al. [25] gSH
graph |NT −NT |
NT
SSize |NT −NT |
NT
SSize
web-Stanford ≈ 0.07 40K 0.0023 14.8K
web-Google ≈ 0.04 40K 0.0029 25.2K
web-BerkStan ≈ 0.12 40K 0.0063 39.8K
For each p = pi, q = qi, we compute the proportion of samples
in which the actual statistic lies in the confidence interval across
100 independent sampling experiments gSHT (pi, qi). We vary
p, q in the range of 0.005–0.01, and for each possible combination
of p, q (e.g., p = 0.005, q = 0.008), we compute the exact cover-
age probability γ. Table 4 provides the mean coverage probability
web-B
a very large e
5.4 Effec
While Figu
posed framew
tion of what i
still needs to b
choice of para
the graph.
Figure 2 sh
the range of 0.
graphs. Note
Table 2) goin
observe that w
of sampled ed
of edges in th
0.03, the fract
– 5%. These o
On the oth
sampled edge
For example,
fraction of sa
Datasets:
Web	
  graphs
WWW
Page2
links	
  toWWW
Page1
5%	
  of	
  graph	
  edges
0.6%	
  of	
  graph	
  edges
0.57%	
  of	
  graph	
  edges
92%	
  -­‐ 96%	
  Improvement	
  
Sample	
  size
Our	
  approach
§ A  sample  is  representative if  graph  properties  of  interest  can  be  
estimated  with  a  known  degree  of  accuracy  
§ We  proposed  a  generic  framework  Graph  Sample  and  Hold  (gSH)
-­‐ gSH is  an  efficient single-­‐pass  streaming  framework
-­‐ gSH selects  a  representative sample  and  computes  unbiased estimates  of  
counts  of  connected  subsets  of  edges  (e.g.,  triangles,  wedges  …)    
-­‐ Theoretical  properties  of  gSH are  supported  by  empirical  analysis      
§ gSH admits  generalizations  by  allowing  the  dependence of  the  
sampling  process  to  vary based  on  the  stored  state
§ gSH has  a  relative  estimation  error  <  1%  for  a  sample  size  <40K  
edges
§ Extend  gSH to  adaptively  change  the  parameters  (p,q),  to  
maintain  a  fixed  size  stored  state
§ Extend  gSH to  estimate  graphlets and  induced  subgraphs with  
more  than  three  nodes
Thank  you!
Questions?
Nesreen  Ahmed,  Nick  Duffield,  Jennifer  Neville,  Ramana Kompella
{nkahmed,  neville,  kompella}  @  cs.purdue.edu,  duffieldng@tamu.edu

Más contenido relacionado

La actualidad más candente

Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Grigory Yaroslavtsev
 
Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmKatsuki Ohto
 
Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]Kentaro Minami
 
Modeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQL
Modeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQLModeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQL
Modeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQLKostis Kyzirakos
 
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...T. E. BOGALE
 
Improving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowImproving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowTatsuya Shirakawa
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVictor Pillac
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientFabian Pedregosa
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learningJeremy Nixon
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEHONGJOO LEE
 
Queue length estimation on urban corridors
Queue length estimation on urban corridorsQueue length estimation on urban corridors
Queue length estimation on urban corridorsGuillaume Costeseque
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker DiarizationHONGJOO LEE
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the WeightsMark Chang
 
Leveraging Multiple GPUs and CPUs for Graphlet Counting in Large Networks
Leveraging Multiple GPUs and CPUs for  Graphlet Counting in Large Networks Leveraging Multiple GPUs and CPUs for  Graphlet Counting in Large Networks
Leveraging Multiple GPUs and CPUs for Graphlet Counting in Large Networks Ryan Rossi
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the WeightsMark Chang
 
Time series data mining techniques
Time series data mining techniquesTime series data mining techniques
Time series data mining techniquesShanmukha S. Potti
 

La actualidad más candente (20)

Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)
 
Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithm
 
Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]
 
Modeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQL
Modeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQLModeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQL
Modeling and Querying Metadata in the Semantic Sensor Web: stRDF and stSPARQL
 
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
 
Improving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowImproving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive Flow
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRP
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learning
 
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
 
Encoding survey
Encoding surveyEncoding survey
Encoding survey
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
 
Queue length estimation on urban corridors
Queue length estimation on urban corridorsQueue length estimation on urban corridors
Queue length estimation on urban corridors
 
Scalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven ApplicationsScalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven Applications
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker Diarization
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the Weights
 
Leveraging Multiple GPUs and CPUs for Graphlet Counting in Large Networks
Leveraging Multiple GPUs and CPUs for  Graphlet Counting in Large Networks Leveraging Multiple GPUs and CPUs for  Graphlet Counting in Large Networks
Leveraging Multiple GPUs and CPUs for Graphlet Counting in Large Networks
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the Weights
 
Time series data mining techniques
Time series data mining techniquesTime series data mining techniques
Time series data mining techniques
 

Destacado

Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...Mani kandan
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNDataWorks Summit
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphDataWorks Summit
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - HortonworksAvery Ching
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDataWorks Summit
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Community detection in graphs
Community detection in graphsCommunity detection in graphs
Community detection in graphsNicola Barbieri
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesData Ninja API
 
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkMartin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkFlink Forward
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
 
Graph theory
Graph theoryGraph theory
Graph theoryKumar
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 

Destacado (20)

Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...
 
Apache giraph
Apache giraphApache giraph
Apache giraph
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARN
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache Giraph
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Community detection in graphs
Community detection in graphsCommunity detection in graphs
Community detection in graphs
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databases
 
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkMartin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Graph theory
Graph theoryGraph theory
Graph theory
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 

Similar a Graph Sample and Hold: A Framework for Big Graph Analytics

R Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceR Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceWork-Bench
 
Evolution of Graph Algorithms – Benefits and Challenges
Evolution of Graph Algorithms – Benefits and ChallengesEvolution of Graph Algorithms – Benefits and Challenges
Evolution of Graph Algorithms – Benefits and ChallengesEbru Cucen Çüçen
 
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiAutomatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiSri Ambati
 
Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Maria Perkins
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphsStanka Dalekova
 
Automatic Visualization
Automatic VisualizationAutomatic Visualization
Automatic VisualizationSri Ambati
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresDavid Gleich
 
Scalable membership management
Scalable membership management Scalable membership management
Scalable membership management Vinay Setty
 
Locally densest subgraph discovery
Locally densest subgraph discoveryLocally densest subgraph discovery
Locally densest subgraph discoveryaftab alam
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesTwo Sigma
 
Linked science presentation 25
Linked science presentation 25Linked science presentation 25
Linked science presentation 25Francesco Osborne
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Saad Liaqat
 
Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.
Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.
Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.Oka Danil
 
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...Waqas Nawaz
 
08 Exponential Random Graph Models (ERGM)
08 Exponential Random Graph Models (ERGM)08 Exponential Random Graph Models (ERGM)
08 Exponential Random Graph Models (ERGM)dnac
 
(121013) #fitalk locating the source of diffusion in large-scale network
(121013) #fitalk   locating the source of diffusion in large-scale network(121013) #fitalk   locating the source of diffusion in large-scale network
(121013) #fitalk locating the source of diffusion in large-scale networkINSIGHT FORENSIC
 
(121013) #fitalk locating the source of diffusion in large-scale network
(121013) #fitalk   locating the source of diffusion in large-scale network(121013) #fitalk   locating the source of diffusion in large-scale network
(121013) #fitalk locating the source of diffusion in large-scale networkINSIGHT FORENSIC
 

Similar a Graph Sample and Hold: A Framework for Big Graph Analytics (20)

R Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceR Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal Dependence
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Evolution of Graph Algorithms – Benefits and Challenges
Evolution of Graph Algorithms – Benefits and ChallengesEvolution of Graph Algorithms – Benefits and Challenges
Evolution of Graph Algorithms – Benefits and Challenges
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiAutomatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
 
Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Automatic Visualization
Automatic VisualizationAutomatic Visualization
Automatic Visualization
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
Scalable membership management
Scalable membership management Scalable membership management
Scalable membership management
 
Locally densest subgraph discovery
Locally densest subgraph discoveryLocally densest subgraph discovery
Locally densest subgraph discovery
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality Guarantees
 
Linked science presentation 25
Linked science presentation 25Linked science presentation 25
Linked science presentation 25
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01
 
Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.
Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.
Modeling and Roll, Pitch and Yaw Simulation of Quadrotor.
 
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
ICDE-2015 Shortest Path Traversal Optimization and Analysis for Large Graph C...
 
08 Exponential Random Graph Models (2016)
08 Exponential Random Graph Models (2016)08 Exponential Random Graph Models (2016)
08 Exponential Random Graph Models (2016)
 
08 Exponential Random Graph Models (ERGM)
08 Exponential Random Graph Models (ERGM)08 Exponential Random Graph Models (ERGM)
08 Exponential Random Graph Models (ERGM)
 
(121013) #fitalk locating the source of diffusion in large-scale network
(121013) #fitalk   locating the source of diffusion in large-scale network(121013) #fitalk   locating the source of diffusion in large-scale network
(121013) #fitalk locating the source of diffusion in large-scale network
 
(121013) #fitalk locating the source of diffusion in large-scale network
(121013) #fitalk   locating the source of diffusion in large-scale network(121013) #fitalk   locating the source of diffusion in large-scale network
(121013) #fitalk locating the source of diffusion in large-scale network
 

Último

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 

Último (20)

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 

Graph Sample and Hold: A Framework for Big Graph Analytics

  • 1. Nesreen  K.  Ahmed Department  of  Computer  Science,  Purdue  University     Joint  work  with Nick  Duffield1,  Jennifer  Neville2,  Ramana Kompella2 1Texas  A&M  University,  2Purdue  University KDD’14,  New  York August  25th,  2014
  • 2. Social   Network Internet  (AS) BiologicalPolitical  Blogs Graph Analytics
  • 3. Studying  and  analyzing  complex  networks is  a  challenging and  computational  intensive task Studying  and  analyzing  complex  networks is  a  challenging and  computational  intensive task Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications   Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications  
  • 4. Studying  and  analyzing  complex  networks is  a  challenging and  computational  intensive task Studying  and  analyzing  complex  networks is  a  challenging and  computational  intensive task Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications   Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications   Due  to  these  challenges,  we  usually  need  to  sampleDue  to  these  challenges,  we  usually  need  to  sample Statistical   Sampling Graph  G Sample  S e.g. Uniform Random Sampling
  • 5. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
  • 6. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
  • 7. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties No.  Edges No.  TrianglesNo.  Wedges Frequent  connected  subsets  of  edges
  • 8. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties No.  Edges No.  TrianglesNo.  Wedges Frequent  connected  subsets  of  edges Transitivity
  • 9. § Random  Sampling • Uniform  random  sampling – [Tsourakakis et.  al  KDD’09]     — Graph  Sparsification with  probability  p   — Chance  of  sampling  a  subgraph (e.g.,  triangle)  is  very  low — Estimates  suffer  from  high  variance   • Wedge  Sampling – [Seshadhri et.  al  SDM’13]   — Sample  vertices,  then  sample  pairs  of  incident  edges  (wedges) — Output  the  estimate  of  the  closed  wedges  (triangles)  
  • 10. § Random  Sampling • Uniform  random  sampling – [Tsourakakis et.  al  KDD’09]     — Graph  Sparsification with  probability  p   — Chance  of  sampling  a  subgraph (e.g.,  triangle)  is  very  low — Estimates  suffer  from  high  variance   • Wedge  Sampling – [Seshadhri et.  al  SDM’13]   — Sample  vertices,  then  sample  pairs  of  incident  edges  (wedges) — Output  the  estimate  of  the  closed  wedges  (triangles)   Assume  graph  fits  into  memory or   Need  to  store  a  large  fraction  of  data
  • 11. § Assume  specific  order  of  the  stream  – [Buriol et.  al  2006]   • Incidence  stream  model– neighbors  of  a  vertex  arrive  together  in  the   stream § Use  multiple  passes  over  the  stream  – [Becchetti et.  al  KDD‘08] § Single-­‐pass  Algorithms • Streaming-­‐Triangles  – [Jha et.  al  KDD’13] — Sample  edges  using  reservoir  sampling,  then,  sample  pairs  of   incident  edges  (wedges) — Scan  for  closed  wedges  (triangles)
  • 12. § Assume  specific  order  of  the  stream  – [Buriol et.  al  2006]   • Incidence  stream  model– neighbors  of  a  vertex  arrive  together  in  the   stream § Use  multiple  passes  over  the  stream  – [Becchetti et.  al  KDD‘08] § Single-­‐pass  Algorithms • Streaming-­‐Triangles  -­‐-­‐ [Jha et.  al  KDD’13] — Sample  edges  using  reservoir  sampling,  then,  sample  pairs  of   incident  edges  (wedges) — Scan  for  closed  wedges  (triangles) Sampling  designs  used  for  specific  graph  properties  (triangles)   and   Not  generally  applicable  to  other  properties
  • 19. Input Graph-Sample-and-Hold Framework gSH(p,q)Output Sample Hold Chance  to  sample  a  subgraph (e.g.,  triangle)  is  very  low Estimates  suffer  from  a  high  variance
  • 21.
  • 22. Input
  • 30. We  use  Horvitz-­‐Thompson  statistical  estimation  framework • Used  for  unequal  probability  sampling   [Horvitz  and  Thompson-­1952] Journal  of  American  Stat.  Assoc.
  • 31. § We  define  the  selection  estimator  for  an  edge  
  • 32. § We  define  the  selection  estimator  for  an  edge  
  • 33. § We  define  the  selection  estimator  for  an  edge  
  • 34. § We  define  the  selection  estimator  for  an  edge   Unbiasedness [Theorem  1  (i)]
  • 35. § We  define  the  selection  estimator  for  an  edge § We  derive  the  estimate  of  edge  count Unbiasedness [Theorem  1  (i)]
  • 36. § We  generalize the  equations  to  any  subset  of  edges
  • 37. § We  generalize the  equations  to  any  subset  of  edges Unbiasedness [Theorem  1  (ii)]
  • 38. § We  generalize the  equations  to  any  subset  of  edges § Ex: The  estimator  of  a  sampled  triangle  is: Unbiasedness [Theorem  1  (ii)]
  • 39. § The  estimate  of  triangle  counts  is § The  estimate  of  wedge  counts  is § Using  the  gSH framework,  we  obtain  estimators  for: • Edge  Count • Triangle  Count • Wedge  Count • Global  clustering  Coeff.  =  3  *                        /
  • 40. § The  estimate  of  triangle  counts  is § The  estimate  of  wedge  counts  is § Using  the  gSH framework,  we  obtain  estimators  for: • Edge  Count • Triangle  Count • Wedge  Count • Global  clustering  Coeff.  =  3  *                        / We  also  compute  HT   unbiased  variance  estimators
  • 41. Social  facebook Graphs Graphs   Web Graphs Links  to 6  graphs  with 7K  – 685K  nodes,  250K– 7M  edges CMU,  UCLA,  and  Wisconsin Google,   Berkeley-­‐Stanford,  Stanford   friends
  • 42. o K with probability r datasets. n is the number of nodes, NK is the is the number of triangles, NΛ is the number length 2, α is the global clustering coefficient, NK NT NΛ α D 249.9K 2.3M 37.4M 0.18526 0.0114 747.6K 5.1M 107.1M 0.14314 0.0036 835.9K 4.8M 121.4M 0.12013 0.0029 1.9M 11.3M 3.9T 0.00862 5.01 × 10−5 4.3M 13.3M 727.4M 0.05523 1.15 × 10−5 6.6M 64.6M 27.9T 0.00694 2.83 × 10−5 1, while the corresponding sums of the de- he the true node degree. socfb-CMU 249.9K 249.6K 0.0013 socfb-UCLA 747.6K 751.3K 0.0050 socfb-Wisconsin 835.9K 835.7K 0.0003 web-Stanford 1.9M 1.9M 0.0004 1 web-Google 4.3M 4.3M 0.0007 2 web-BerkStan 6.6M 6.6M 0.0006 3 Triangles NT NT NT |NT −NT | NT S socfb-CMU 2.3M 2.3M 0.0003 socfb-UCLA 5.1M 5.1M 0.0095 socfb-Wisconsin 4.8M 4.8M 0.0058 web-Stanford 11.3M 11.3M 0.0023 1 web-Google 13.3M 13.4M 0.0029 2 web-BerkStan 64.6M 65M 0.0063 3 Path. Length two N NΛ NΛ |NΛ−NΛ| NΛ S socfb-CMU 37.4M 37.3M 0.0018 socfb-UCLA 107.1M 107.8M 0.0060 socfb-Wisconsin 121.4M 121.2M 0.0018 web-Stanford 3.9T 3.9T 0.0004 1 web-Google 727.4M 724.3M 0.0042 2 web-BerkStan 27.9T 27.9T 0.0002 3 Global Clustering α α α |α−α| α S socfb-CMU 0.18526 0.18574 0.00260 socfb-UCLA 0.14314 0.14363 0.00340 NK NK |NK −NK | NK SSize LB UB socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M Triangles NT NT NT |NT −NT | NT SSize LB UB socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M Path. Length two NΛ NΛ NΛ |NΛ−NΛ| NΛ SSize LB UB socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M socfb-Wisconsin 121.4M 121.2M 0.0018 5.5K 108.9M 133.4M web-Stanford 3.9T 3.9T 0.0004 14.8K 3.6T 4.2T web-Google 727.4M 724.3M 0.0042 25.2K 677.1M 771.5M web-BerkStan 27.9T 27.9T 0.0002 39.8K 26.5T 29.3T Global Clustering α α α |α−α| α SSize LB UB datasets. n is the number of nodes, NK is the is the number of triangles, NΛ is the number length 2, α is the global clustering coefficient, NK NT NΛ α D 249.9K 2.3M 37.4M 0.18526 0.0114 747.6K 5.1M 107.1M 0.14314 0.0036 835.9K 4.8M 121.4M 0.12013 0.0029 1.9M 11.3M 3.9T 0.00862 5.01 × 10−5 4.3M 13.3M 727.4M 0.05523 1.15 × 10−5 6.6M 64.6M 27.9T 0.00694 2.83 × 10−5 1, while the corresponding sums of the de- he the true node degree. ENTS AND EVALUATION mance of our proposed framework (gSH) as m 2 (with r = 1 for edges that are closing ocial and information networks with 250K– he following networks, we consider an undi- NT NT |NT −NT | NT SS socfb-CMU 2.3M 2.3M 0.0003 1 socfb-UCLA 5.1M 5.1M 0.0095 5 socfb-Wisconsin 4.8M 4.8M 0.0058 5 web-Stanford 11.3M 11.3M 0.0023 14 web-Google 13.3M 13.4M 0.0029 25 web-BerkStan 64.6M 65M 0.0063 39 Path. Length two NΛ NΛ NΛ |NΛ−NΛ| NΛ SS socfb-CMU 37.4M 37.3M 0.0018 1 socfb-UCLA 107.1M 107.8M 0.0060 5 socfb-Wisconsin 121.4M 121.2M 0.0018 5 web-Stanford 3.9T 3.9T 0.0004 14 web-Google 727.4M 724.3M 0.0042 25 web-BerkStan 27.9T 27.9T 0.0002 39 Global Clustering α α α |α−α| α SS socfb-CMU 0.18526 0.18574 0.00260 1 socfb-UCLA 0.14314 0.14363 0.00340 5 socfb-Wisconsin 0.12013 0.12101 0.00730 5 web-Stanford 0.00862 0.00862 0.00020 14 web-Google 0.05523 0.05565 0.00760 25 web-BerkStan 0.00694 0.00698 0.00680 39 e r , lower bounds, and upper bounds when sample size ≤ 40K edges, with sampling probability p, q = 0.005 for web-BerkStan, and p = 0.005, q = 0.008 otherwise. SSize is the number of sam- pled edges, and LB, UB are the 95% lower, and upper bound respectively. Edges NK NK NK |NK −NK | NK SSize LB UB socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M Triangles NT NT NT |NT −NT | NT SSize LB UB socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M Path. Length two NΛ NΛ NΛ |NΛ−NΛ| NΛ SSize LB UB socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M Edge  Count Triangle  Count Wedge  Count Global  Clustering Sample  size  <=  40K  edges
  • 43. o K with probability r datasets. n is the number of nodes, NK is the is the number of triangles, NΛ is the number length 2, α is the global clustering coefficient, NK NT NΛ α D 249.9K 2.3M 37.4M 0.18526 0.0114 747.6K 5.1M 107.1M 0.14314 0.0036 835.9K 4.8M 121.4M 0.12013 0.0029 1.9M 11.3M 3.9T 0.00862 5.01 × 10−5 4.3M 13.3M 727.4M 0.05523 1.15 × 10−5 6.6M 64.6M 27.9T 0.00694 2.83 × 10−5 1, while the corresponding sums of the de- he the true node degree. socfb-CMU 249.9K 249.6K 0.0013 socfb-UCLA 747.6K 751.3K 0.0050 socfb-Wisconsin 835.9K 835.7K 0.0003 web-Stanford 1.9M 1.9M 0.0004 1 web-Google 4.3M 4.3M 0.0007 2 web-BerkStan 6.6M 6.6M 0.0006 3 Triangles NT NT NT |NT −NT | NT S socfb-CMU 2.3M 2.3M 0.0003 socfb-UCLA 5.1M 5.1M 0.0095 socfb-Wisconsin 4.8M 4.8M 0.0058 web-Stanford 11.3M 11.3M 0.0023 1 web-Google 13.3M 13.4M 0.0029 2 web-BerkStan 64.6M 65M 0.0063 3 Path. Length two N NΛ NΛ |NΛ−NΛ| NΛ S socfb-CMU 37.4M 37.3M 0.0018 socfb-UCLA 107.1M 107.8M 0.0060 socfb-Wisconsin 121.4M 121.2M 0.0018 web-Stanford 3.9T 3.9T 0.0004 1 web-Google 727.4M 724.3M 0.0042 2 web-BerkStan 27.9T 27.9T 0.0002 3 Global Clustering α α α |α−α| α S socfb-CMU 0.18526 0.18574 0.00260 socfb-UCLA 0.14314 0.14363 0.00340 NK NK |NK −NK | NK SSize LB UB socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M Triangles NT NT NT |NT −NT | NT SSize LB UB socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M Path. Length two NΛ NΛ NΛ |NΛ−NΛ| NΛ SSize LB UB socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M socfb-Wisconsin 121.4M 121.2M 0.0018 5.5K 108.9M 133.4M web-Stanford 3.9T 3.9T 0.0004 14.8K 3.6T 4.2T web-Google 727.4M 724.3M 0.0042 25.2K 677.1M 771.5M web-BerkStan 27.9T 27.9T 0.0002 39.8K 26.5T 29.3T Global Clustering α α α |α−α| α SSize LB UB datasets. n is the number of nodes, NK is the is the number of triangles, NΛ is the number length 2, α is the global clustering coefficient, NK NT NΛ α D 249.9K 2.3M 37.4M 0.18526 0.0114 747.6K 5.1M 107.1M 0.14314 0.0036 835.9K 4.8M 121.4M 0.12013 0.0029 1.9M 11.3M 3.9T 0.00862 5.01 × 10−5 4.3M 13.3M 727.4M 0.05523 1.15 × 10−5 6.6M 64.6M 27.9T 0.00694 2.83 × 10−5 1, while the corresponding sums of the de- he the true node degree. ENTS AND EVALUATION mance of our proposed framework (gSH) as m 2 (with r = 1 for edges that are closing ocial and information networks with 250K– he following networks, we consider an undi- NT NT |NT −NT | NT SS socfb-CMU 2.3M 2.3M 0.0003 1 socfb-UCLA 5.1M 5.1M 0.0095 5 socfb-Wisconsin 4.8M 4.8M 0.0058 5 web-Stanford 11.3M 11.3M 0.0023 14 web-Google 13.3M 13.4M 0.0029 25 web-BerkStan 64.6M 65M 0.0063 39 Path. Length two NΛ NΛ NΛ |NΛ−NΛ| NΛ SS socfb-CMU 37.4M 37.3M 0.0018 1 socfb-UCLA 107.1M 107.8M 0.0060 5 socfb-Wisconsin 121.4M 121.2M 0.0018 5 web-Stanford 3.9T 3.9T 0.0004 14 web-Google 727.4M 724.3M 0.0042 25 web-BerkStan 27.9T 27.9T 0.0002 39 Global Clustering α α α |α−α| α SS socfb-CMU 0.18526 0.18574 0.00260 1 socfb-UCLA 0.14314 0.14363 0.00340 5 socfb-Wisconsin 0.12013 0.12101 0.00730 5 web-Stanford 0.00862 0.00862 0.00020 14 web-Google 0.05523 0.05565 0.00760 25 web-BerkStan 0.00694 0.00698 0.00680 39 e r , lower bounds, and upper bounds when sample size ≤ 40K edges, with sampling probability p, q = 0.005 for web-BerkStan, and p = 0.005, q = 0.008 otherwise. SSize is the number of sam- pled edges, and LB, UB are the 95% lower, and upper bound respectively. Edges NK NK NK |NK −NK | NK SSize LB UB socfb-CMU 249.9K 249.6K 0.0013 1.7K 236.8K 262.4K socfb-UCLA 747.6K 751.3K 0.0050 5K 729.3K 773.34K socfb-Wisconsin 835.9K 835.7K 0.0003 5.5K 812.2K 859.1K web-Stanford 1.9M 1.9M 0.0004 14.8K 1.9M 2M web-Google 4.3M 4.3M 0.0007 25.2K 4.2M 4.3M web-BerkStan 6.6M 6.6M 0.0006 39.8K 6.5M 6.7M Triangles NT NT NT |NT −NT | NT SSize LB UB socfb-CMU 2.3M 2.3M 0.0003 1.7K 1.6M 2.9M socfb-UCLA 5.1M 5.1M 0.0095 5K 4.2M 6.03M socfb-Wisconsin 4.8M 4.8M 0.0058 5.5K 4M 5.7M web-Stanford 11.3M 11.3M 0.0023 14.8K 3.7M 18.8M web-Google 13.3M 13.4M 0.0029 25.2K 11.7M 15M web-BerkStan 64.6M 65M 0.0063 39.8K 45.5M 84.6M Path. Length two NΛ NΛ NΛ |NΛ−NΛ| NΛ SSize LB UB socfb-CMU 37.4M 37.3M 0.0018 1.7K 32.6M 42M socfb-UCLA 107.1M 107.8M 0.0060 5K 100.1M 115.42M Edge  Count Triangle  Count Wedge  Count Global  Clustering Sample  size  <=  40K  edges <1% Over  all  graphs,  all  properties
  • 44. 10 4 10 5 0.95 1 1.05 socfb−UCLA ˆNK/NK Sample Size (Edges) (a) Edges 10 4 10 5 0.85 0.9 0.95 1 1.05 1.1 1.15 socfb−UCLA ˆNT/NT Sample Size (Edges) (b) Triangles 0 1 ˆNΛ/NΛ 1 1.05 socfb−Wisconsin ˆNK/NK 0.9 0.95 1 1.05 1.1 1.15 socfb−Wisconsin ˆNT/NT 0 1 ˆNΛ/NΛ Edge  Count 10 4 10 5 0.95 1 1.05 socfb−UCLA ˆNK/NK Sample Size (Edges) (a) Edges 10 4 10 5 0.85 0.9 0.95 1 1.05 1.1 1.15 socfb−UCLA ˆNT/NT Sample Size (Edges) (b) Triangles 1 1.05 socfb−Wisconsin ˆNK/NK 0.9 0.95 1 1.05 1.1 1.15 socfb−Wisconsin ˆNT/NT Triangle  Count 10 5 socfb−UCLA Sample Size (Edges) 10 4 10 5 0.9 0.95 1 1.05 1.1 socfb−UCLA ˆNΛ/NΛ Sample Size (Edges) Wedge  Count 10 4 10 5 0.95 Sample Size (Edges) (a) Edges 10 4 0.85 Sample (b) Tria 10 4 10 5 0.95 1 1.05 socfb−Wisconsin ˆNK/NK Sample Size (Edges) (d) Edges 10 4 0.85 0.9 0.95 1 1.05 1.1 1.15 socfb− ˆNT/NT Sample (e) Tria 10 4 10 5 0.85 0.9 0.95 1 1.05 1.1 1.15 socfb−UCLA ˆα/α Sample Size (Edges) Global  Clustering Actual Estimated/Actual Confidence  Upper  &  Lower  Bounds   Sample  Size  =  40K  edges Dataset: facebook friendship   graph  at  UCLA  
  • 45. § Comparison  to  Streaming-­‐Triangles  [Jha et.  al-­‐KDD’13] socfb-Wisconsin 0.95 0.95 0.96 0.95 web-Stanford 0.97 0.92 0.95 0.92 web-Google 0.95 0.93 0.95 0.95 web-BerkStan 0.96 0.94 0.93 0.93 Table 5: The relative error and sample size of Jha [25] in comparison to our framework for triangle count estimation Jha et al. [25] gSH graph |NT −NT | NT SSize |NT −NT | NT SSize web-Stanford ≈ 0.07 40K 0.0023 14.8K web-Google ≈ 0.04 40K 0.0029 25.2K web-BerkStan ≈ 0.12 40K 0.0063 39.8K For each p = pi, q = qi, we compute the proportion of samples in which the actual statistic lies in the confidence interval across 100 independent sampling experiments gSHT (pi, qi). We vary p, q in the range of 0.005–0.01, and for each possible combination of p, q (e.g., p = 0.005, q = 0.008), we compute the exact cover- age probability γ. Table 4 provides the mean coverage probability web-B a very large e 5.4 Effec While Figu posed framew tion of what i still needs to b choice of para the graph. Figure 2 sh the range of 0. graphs. Note Table 2) goin observe that w of sampled ed of edges in th 0.03, the fract – 5%. These o On the oth sampled edge For example, fraction of sa Datasets: Web  graphs WWW Page2 links  toWWW Page1 5%  of  graph  edges 0.6%  of  graph  edges 0.57%  of  graph  edges Sample  size Our  approach
  • 46. § Comparison  to  Streaming-­‐Triangles  [Jha et.  al-­‐KDD’13] socfb-Wisconsin 0.95 0.95 0.96 0.95 web-Stanford 0.97 0.92 0.95 0.92 web-Google 0.95 0.93 0.95 0.95 web-BerkStan 0.96 0.94 0.93 0.93 Table 5: The relative error and sample size of Jha [25] in comparison to our framework for triangle count estimation Jha et al. [25] gSH graph |NT −NT | NT SSize |NT −NT | NT SSize web-Stanford ≈ 0.07 40K 0.0023 14.8K web-Google ≈ 0.04 40K 0.0029 25.2K web-BerkStan ≈ 0.12 40K 0.0063 39.8K For each p = pi, q = qi, we compute the proportion of samples in which the actual statistic lies in the confidence interval across 100 independent sampling experiments gSHT (pi, qi). We vary p, q in the range of 0.005–0.01, and for each possible combination of p, q (e.g., p = 0.005, q = 0.008), we compute the exact cover- age probability γ. Table 4 provides the mean coverage probability web-B a very large e 5.4 Effec While Figu posed framew tion of what i still needs to b choice of para the graph. Figure 2 sh the range of 0. graphs. Note Table 2) goin observe that w of sampled ed of edges in th 0.03, the fract – 5%. These o On the oth sampled edge For example, fraction of sa Datasets: Web  graphs WWW Page2 links  toWWW Page1 5%  of  graph  edges 0.6%  of  graph  edges 0.57%  of  graph  edges 92%  -­‐ 96%  Improvement   Sample  size Our  approach
  • 47. § A  sample  is  representative if  graph  properties  of  interest  can  be   estimated  with  a  known  degree  of  accuracy   § We  proposed  a  generic  framework  Graph  Sample  and  Hold  (gSH) -­‐ gSH is  an  efficient single-­‐pass  streaming  framework -­‐ gSH selects  a  representative sample  and  computes  unbiased estimates  of   counts  of  connected  subsets  of  edges  (e.g.,  triangles,  wedges  …)     -­‐ Theoretical  properties  of  gSH are  supported  by  empirical  analysis       § gSH admits  generalizations  by  allowing  the  dependence of  the   sampling  process  to  vary based  on  the  stored  state § gSH has  a  relative  estimation  error  <  1%  for  a  sample  size  <40K   edges
  • 48. § Extend  gSH to  adaptively  change  the  parameters  (p,q),  to   maintain  a  fixed  size  stored  state § Extend  gSH to  estimate  graphlets and  induced  subgraphs with   more  than  three  nodes
  • 49. Thank  you! Questions? Nesreen  Ahmed,  Nick  Duffield,  Jennifer  Neville,  Ramana Kompella {nkahmed,  neville,  kompella}  @  cs.purdue.edu,  duffieldng@tamu.edu