Applications in many areas analyze an ever-changing environment. On billion-vertex graphs, providing snapshots imposes a large performance cost. We propose the first formal model for graph analysis running concurrently with streaming data updates. We consider an algorithm valid if its output is correct for the initial graph plus some implicit subset of concurrent changes. We show theoretical properties of the model, demonstrate the model on various algorithms, and extend it to updating results incrementally.
4. (insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating markets, smart & sustainable cities
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
Changes are important. Cannot stop the world...
New Model for Streaming Graphs — ICIAM 2019 2/20
5. Potential Applications
• Social Networks
  • Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  • Potential to help social sciences, city planning, and others with large-scale data.
• Cybersecurity
  • Determine if new connections can access a device or represent a new threat in < 5 ms...
  • Is the transfer by a virus / persistent threat?
• Bioinformatics, health
  • Construct gene sequences, analyze protein interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
  • Real-time integration of all the customer’s data
6. Streaming graph data
Network data rates:
• Gigabit ethernet: 81k – 1.5M packets per second
• Over 130 000 flows per second on 10 GigE (< 7.7 µs)
Person-level data rates:
• 500M posts per day on Twitter (6k / sec)¹
• 3M posts per minute on Facebook (50k / sec)²
Should analyze only changes and not entire graph.
Throughput & latency trade off and expose different
levels of concurrency.
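The rates quoted above can be reproduced with a few lines of arithmetic. This is only a rough check: the Ethernet framing numbers (64 to 1518 byte frames plus 20 bytes of preamble and inter-frame gap) are an assumption about how the packet-rate range was derived and are not stated on the slide.

```python
# Back-of-envelope check of the quoted data rates.
# Assumption (not from the slide): standard Ethernet framing, i.e. frames of
# 64 to 1518 bytes plus 20 bytes of preamble and inter-frame gap on the wire.
LINK_BITS_PER_SEC = 1_000_000_000  # gigabit Ethernet
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(frame_bytes):
    """Packets per second on a saturated link for a given frame size."""
    return LINK_BITS_PER_SEC / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


min_pps = packets_per_second(1518)  # largest frames give the fewest packets
max_pps = packets_per_second(64)    # smallest frames give the most packets

flow_budget_us = 1e6 / 130_000             # microseconds available per flow
twitter_per_sec = 500e6 / (24 * 60 * 60)   # 500M posts per day
facebook_per_sec = 3e6 / 60                # 3M posts per minute

print(f"GigE packets/s: {min_pps:,.0f} to {max_pps:,.0f}")
print(f"Per-flow budget at 130k flows/s: {flow_budget_us:.2f} us")
print(f"Twitter: {twitter_per_sec:,.0f} posts/s")
print(f"Facebook: {facebook_per_sec:,.0f} posts/s")
```

The per-flow budget is the figure that matters for latency: any analysis kept in the update path has to finish within a few microseconds per change, or it has to run concurrently with ingest.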
¹ www.internetlivestats.com/twitter-statistics/
² www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
7. Streaming graph analysis
Terminology (not universal):
• Streaming changes into a massive, evolving graph
• Need to handle deletions as well as insertions
Previous STINGER performance results (x86-64):
Data ingest: >2M upd/sec [Ediger, McColl, Poovey, Campbell, & B 2014]
Clustering coefficients: >100K upd/sec [Riedy, Meyerhenke, B, E, & Mattson 2012]
Connected comp.: >1M upd/sec [McColl, Green, & B 2013]
Community clustering: >100K upd/sec∗ [R & B 2013]
PageRank: Up to 40× latency improvement [R 2016]
9. Starting incremental / streaming algorithms
• Incremental and streaming algorithms start somewhere.
• Initial, static computation can take a rather long time...
• During which the graph cannot change?
• What about supporting many simultaneous analyses?
[Figure: Data ingest rates, R-MAT into R-MAT, scales 24 & 30. x-axis: batch size (1 to 1e+05, log scale); y-axis: update rate (upd/s, 1e+02 to 1e+06, log scale); platforms: Power8, Haswell, Haswell-30.]
What can we run while the graph changes?
10. What if we don’t hold up changes?
When is an algorithm valid?
Analyze concurrently with the graph changes, and
produce a result correct for the starting graph and
some subset of concurrent changes.
• No locking beyond atomic operations.
• No versioned data structure.
• No stopping.
11. Sample of other execution models
• Put in a query, wait for sufficient data [Phillips, et al. at Sandia]
  • Different but very interesting model.
• Evolving: Sample, accurate w/high-prob.
  • Difficult to generalize into graph results (e.g. shortest path tree).
• Classical: dynamic algorithms, versioned data
  • Can require drastically more storage, possibly a copy of the graph per property, or more overhead for techniques like read-copy-update.
Generally do not address the latency of computing the “static” starting point.
12. Algorithm validity in our model: Example.
Can you compute degrees in an undirected graph (no self
loops) concurrently with changes?
Algorithm: Iterate over vertices, count the number of
neighbors.
[Figure: two vertices v1 and v2 joined by one edge. Compute deg(v1) = 1; the edge is deleted; compute deg(v2) = 0. Reported degrees: (1, 0).]
Cannot correspond to an undirected graph at all!
Valid for our model? No!
Not incorrect, just not valid for our model.
13. Algorithm validity in our model: Example.
Can you compute degrees in an undirected graph (no self
loops) concurrently with changes?
Algorithm: Iterate over edges, increment the degrees of
the endpoints.
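[Figure: two vertices v1 and v2 joined by one edge. Increment deg(v1) and deg(v2) together, giving (1, 1); the edge is deleted later and the reported degrees stay (1, 1).]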
Corresponds to the beginning graph plus a subset of
concurrent changes.
Valid for our model? Yes!
Undirected stored as directed: skip edges with v1 ≥ v2.
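The two degree examples can be replayed with a small sequential simulation. This is a sketch under stated assumptions, not a concurrent implementation: the "concurrent" deletion is injected at a chosen step of the scan, and the helper names (`make_graph`, `delete_edge`, `degrees_by_vertex`, `degrees_by_edge`) are hypothetical.

```python
def make_graph():
    # One undirected edge {1, 2}, stored as two directed adjacency entries.
    return {1: {2}, 2: {1}}


def delete_edge(adj, u, v):
    adj[u].discard(v)
    adj[v].discard(u)


def degrees_by_vertex(adj, change_before_step=None, change=None):
    """Vertex-centric variant: visit each vertex and count its neighbors."""
    deg = {}
    for step, v in enumerate(sorted(adj)):
        if step == change_before_step:
            change(adj)  # simulated concurrent change lands mid-scan
        deg[v] = len(adj[v])
    return deg


def degrees_by_edge(adj, change_before_step=None, change=None):
    """Edge-centric variant: visit each edge once, increment both endpoints."""
    deg = {v: 0 for v in adj}
    for step, u in enumerate(sorted(adj)):
        if step == change_before_step:
            change(adj)
        for v in sorted(adj[u]):
            if u >= v:  # undirected stored as directed: skip v1 >= v2
                continue
            deg[u] += 1
            deg[v] += 1
    return deg


def drop(adj):
    delete_edge(adj, 1, 2)


by_vertex = degrees_by_vertex(make_graph(), change_before_step=1, change=drop)
by_edge_late = degrees_by_edge(make_graph(), change_before_step=1, change=drop)
by_edge_early = degrees_by_edge(make_graph(), change_before_step=0, change=drop)

print(by_vertex)      # degree sum is odd: no undirected graph has these degrees
print(by_edge_late)   # matches the starting graph (deletion not included)
print(by_edge_early)  # matches the starting graph plus the deletion
```

In this simulation the vertex-centric scan reports a degree sum that is odd, which no undirected graph can have, while the edge-centric scan always matches either the starting graph or the starting graph with the deletion applied.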
14. Algorithm validity in our model
[Figure: source s with two edges; w(e1) = 10, w(e2) decreases from 5 to 1, ∆ = 4.]
• What is valid?
  • Typical BFS
  • Shiloach-Vishkin connected components
  • PageRank, Katz via Jacobi
  • Making a copy! (Vertex-induced subgraph)
• What is invalid?
  • Making a decision twice in implementations
    • ∆-stepping SSSP: Decrease a weight below ∆
    • Degree optimization: Cross threshold, miss vertex
  • Applying old or different information
    • Multiple searches: Betweenness centrality
    • Labeling in S. Kahan’s components algorithm
15. Example: PageRank, Katz Centrality
PageRank: distribution of random walks
  $(I - \alpha D^{-1} A^T) x = \mathbf{1}/|V|$
Katz centrality: count of the number of walks
  $(I - \alpha A^T) x = \mathbf{1}$
A: row → col adjacency matrix
D: diagonal matrix of out-degrees
|V|: number of vertices, $\mathbf{1}$: all-ones vector
Both can be solved by Jacobi iteration, e.g. for Katz:
  $(I - \alpha A^T) x = \mathbf{1} \Rightarrow x^{(k+1)} = \alpha A^T x^{(k)} + \mathbf{1}$
16. Jacobi can be valid for our model
Core loop of Jacobi iteration for Katz centrality:
while $\|r^{(k)}\| \ge \epsilon$
1. $x^{(k+1)} = \alpha A^T x^{(k)} + \mathbf{1}$
2. $r^{(k+1)} = \mathbf{1} - (I - \alpha A^T) x^{(k+1)}$
3. $k = k + 1$
Except this is not valid. Residual $r^{(k+1)}$ may use a different graph / adjacency matrix A.
17. Jacobi can be valid for our model
Core loop of Jacobi iteration for Katz centrality:
do
1. $x^{(k+1)} = \alpha A^T x^{(k)} + \mathbf{1}$ and $r^{(k)} = \mathbf{1} - (I - \alpha A^T) x^{(k)}$
2. $k = k + 1$
until $\|r^{(k-1)}\| < \epsilon$
Must use the same graph for all requirements.
Will need $r^{(k-1)}$ later!
This also affects convergence speed.
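When the new iterate and the residual are computed from the same A, the residual reduces to the difference of successive iterates: r^(k) = 1 − (I − αA^T)x^(k) = (αA^T x^(k) + 1) − x^(k) = x^(k+1) − x^(k). The loop above can therefore be sketched with a single pass over the edges per iteration. This is a minimal sequential sketch, assuming an edge list of (src, dst) pairs, an α small enough for Jacobi to converge, and a max-norm stopping test; the function name `katz_jacobi` is hypothetical.

```python
def katz_jacobi(edges, n, alpha, eps=1e-12, max_iter=1000):
    """Jacobi iteration for (I - alpha*A^T) x = 1 with one edge sweep per step.

    edges: list of (src, dst) pairs, so (A^T x)[dst] sums x[src] over in-edges.
    Returns the final iterate and the residual of the previous iterate.
    """
    x = [1.0] * n  # x^(0); the all-ones start is an arbitrary choice
    r = [0.0] * n
    for _ in range(max_iter):
        # Single pass over the graph: y = alpha * A^T x + 1.
        y = [1.0] * n
        for src, dst in edges:
            y[dst] += alpha * x[src]
        # r^(k) = 1 - (I - alpha*A^T) x^(k) = y - x, so no second graph read.
        r = [yi - xi for yi, xi in zip(y, x)]
        x = y
        if max(abs(ri) for ri in r) < eps:
            break
    return x, r


# Directed path 0 -> 1 -> 2 with alpha = 0.5.
scores, residual = katz_jacobi([(0, 1), (1, 2)], n=3, alpha=0.5)
print(scores)
```

For the path 0 → 1 → 2 the exact solution is x = (1, 1 + α, 1 + α + α²). Because the residual is derived from the same sweep that produced the new iterate, both values refer to one view of the graph, which is the requirement stated above.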
18. Fun properties for one-shot queries
Due to Chunxing Yin³, under sensible assumptions:
1. You can produce a single-change stream to demonstrate invalidity.
• Proof idea: Start with a graph that incorporates all the visible changes, introduce the one change at the right time.
2. Algorithms producing a subgraph of the input cannot be guaranteed to run concurrently with changes and always produce moment-in-time outputs.
• Proof idea: Any time a snapshot result could happen, delete then re-insert an edge from the output.
³ Yin, Riedy, et al. A New Algorithmic Model for Graph Analysis of Streaming Data. 14th International Workshop on Mining and Learning with Graphs (MLG), May 2018.
22. On to streaming...
Can we update graph metrics as new data arrives without
just re-running?
• Track what changed during the one-shot query.
• Update locally around those changes, while other changes are occurring.
• If the update is valid, can repeat to follow a
streaming graph.
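[Figure: timeline. Initial computation while changes ∆0 arrive → update with ∆0 while changes ∆1 arrive → update with ∆1 while changes ∆2 arrive → ...]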
Examples: PageRank & Katz, iterative refinement.
Connected components, maintain a spanning tree.
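For Katz centrality, the iterative-refinement idea can be sketched as follows. If x solves the system on the old graph and edges ∆A are inserted, giving A' = A + ∆A, then the correction ∆x satisfies (I − αA'^T)∆x = 1 − (I − αA'^T)x, and the right-hand side is nonzero only near the changes when x was already accurate. This is a sequential sketch that handles insertions only; it does not model concurrent changes, and the function names `solve_jacobi` and `katz_update` are hypothetical.

```python
def solve_jacobi(edges, n, alpha, rhs, eps=1e-12, max_iter=1000):
    """Jacobi iteration for (I - alpha*A^T) z = rhs, starting from z = 0."""
    z = [0.0] * n
    for _ in range(max_iter):
        nxt = list(rhs)
        for src, dst in edges:
            nxt[dst] += alpha * z[src]
        done = max(abs(a - b) for a, b in zip(nxt, z)) < eps
        z = nxt
        if done:
            break
    return z


def katz_update(old_edges, new_edges, n, alpha, x):
    """Correct Katz scores x (computed on old_edges) after inserting new_edges."""
    all_edges = old_edges + new_edges
    # Right-hand side: residual of the old x on the updated graph,
    # 1 - (I - alpha*A'^T) x = 1 - x + alpha * A'^T x.
    rhs = [1.0 - xi for xi in x]
    for src, dst in all_edges:
        rhs[dst] += alpha * x[src]
    dx = solve_jacobi(all_edges, n, alpha, rhs)
    return [xi + di for xi, di in zip(x, dx)]


# Katz scores for the path 0 -> 1 -> 2 with alpha = 0.5, then insert 0 -> 2.
old_scores = [1.0, 1.5, 1.75]
updated = katz_update([(0, 1), (1, 2)], [(0, 2)], n=3, alpha=0.5, x=old_scores)
print(updated)
```

In this example only vertex 2 receives a nonzero correction. That locality is what makes an update cheaper than re-running the full solve, although this sketch still sweeps every edge rather than restricting work to the affected region.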
23. Early results with PageRank
(Will explain the algorithm in a moment.)
Synchronous: Ingest delays will increase with # kernels.
Red dot: ingested batch. Blue dot: PR kernel begins.
Vertical: # of iterations
24. Early results with PageRank
(Will explain the algorithm in a moment.)
Concurrent: Constant-rate ingest!
Detects sudden structural change?
Red dot: ingested batch. Blue dot: PR kernel begins.
Vertical: # of iterations
28. Open issues
Difficult problems: Updating triangle counts efficiently!
• Option: re-counting a region around changes,
stopping once counts do not change.
• Possibly incorrect on the region’s border,
but only at changes.
• Next run can fix those... A looser model?
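A sequential sketch of the region re-counting option follows. It recounts per-vertex triangle counts for the endpoints of an inserted edge and their neighbors, which covers every vertex whose count can change when nothing else is changing. It does not implement the "stop once counts do not change" rule or any concurrency, so it shows only the region idea. The function names `triangles_at` and `insert_edge_and_recount` are hypothetical.

```python
def triangles_at(adj, v):
    """Number of triangles containing v in a simple undirected graph."""
    nbrs = adj[v]
    # Each triangle (v, u, w) is seen twice: once via u and once via w.
    return sum(len(nbrs & adj[u]) for u in nbrs) // 2


def insert_edge_and_recount(adj, tri, u, v):
    """Insert edge {u, v}, then recount triangles in the region around it."""
    adj[u].add(v)
    adj[v].add(u)
    region = {u, v} | adj[u] | adj[v]
    for w in region:
        tri[w] = triangles_at(adj, w)
    return region


# Path 0 - 1 - 2 - 3, no triangles yet.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
tri = {v: 0 for v in adj}

region = insert_edge_and_recount(adj, tri, 0, 2)  # closes triangle 0-1-2
print(tri)
```

With concurrent changes, the neighbor sets read during the recount may not all belong to one graph, which is why counts on the region's border can be wrong until a later run corrects them.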
Some algorithms essentially copy subgraphs.
• What are the size bounds?
• Can they characterize algorithms / properties?
• Can we formalize what needs to be kept for updating results?
29. Closing
• Summary
  • Analysis concurrent with graph change can work.
  • But not all methods are valid.
  • Avoid evaluating conditions or exploring the graph more than once.
  • Save information necessary for updates.
• Future work
  • Track subgraphs / communities for “slow” analyses
  • Develop more valid updating methods.
  • Explore approximation results related to concurrent analysis.