Applications in computer network security, social media analysis, and other areas rely on analyzing a changing environment. The data is rich in relationships and lends itself to graph analysis. Traditional static graph analysis cannot keep pace with network security applications analyzing nearly one million events per second and social networks like Facebook collecting 500 thousand comments per second. Streaming frameworks like STINGER support ingesting up to three million edge changes per second, but there are few streaming analysis kernels that keep up with these rates. Here we present a new algorithm model for applying complex metrics to a changing graph. In this model, many more algorithms can be applied without having to stop the world.
A New Algorithm Model for Massive-Scale Streaming Graph Analysis
E. Jason Riedy, Chunxing Yin, and David A. Bader
Georgia Institute of Technology
SIAM Workshop on Network Science, 14 July 2017
4. (insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating markets, smart & sustainable cities
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
Changes are important. Cannot stop the world...
Streaming Graphs — SIAM NS, 14 July 2017 2/19
5. Potential Applications
• Social Networks
• Identify communities, influences, bridges, trends,
anomalies (trends before they happen)...
• Potential to help social sciences, city planning, and
others with large-scale data.
• Cybersecurity
• Determine if new connections can access a device or
represent a new threat in < 5 ms...
• Is the transfer by a virus / persistent threat?
• Bioinformatics, health
• Construct gene sequences, analyze protein
interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
• Real-time integration of all the customer’s data
6. Streaming graph data
Network data rates:
• Gigabit ethernet: 81k – 1.5M packets per second
• Over 130 000 flows per second on 10 GigE (< 7.7 µs)
Person-level data rates:
• 500M posts per day on Twitter (6k / sec) [1]
• 3M posts per minute on Facebook (50k / sec) [2]
Should analyze only changes and not entire graph.
Throughput & latency trade off and expose different
levels of concurrency.
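The per-second figures quoted above follow from simple unit conversions. A minimal sketch of the arithmetic, using only the numbers on this slide (the variable names are illustrative):

```python
# Unit conversions behind the quoted data rates (inputs are the slide's figures).
SECONDS_PER_DAY = 24 * 60 * 60

twitter_posts_per_sec = 500e6 / SECONDS_PER_DAY   # 500M posts per day
facebook_posts_per_sec = 3e6 / 60                 # 3M posts per minute
flow_budget_us = 1e6 / 130_000                    # microseconds per flow at 130 000 flows/s

print(f"Twitter:  about {twitter_posts_per_sec:,.0f} posts per second")
print(f"Facebook: about {facebook_posts_per_sec:,.0f} posts per second")
print(f"10 GigE:  under {flow_budget_us:.2f} microseconds per flow")
```

The last figure is the time budget per flow if every flow must be handled as it arrives, which is why analyzing only the changes matters.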
[1] www.internetlivestats.com/twitter-statistics/
[2] www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
7. Streaming graph analysis
Terminology, will go into more details:
• Streaming changes into a massive, evolving graph
• Will compare models later...
• Need to handle deletions as well as insertions
Previous STINGER performance results (x86-64):
• Data ingest: >2M upd/sec [Ediger, McColl, Poovey, Campbell, & Bader 2014]
• Clustering coefficients: >100K upd/sec [R, Meyerhenke, B, E, & Mattson 2012]
• Connected comp.: >1M upd/sec [McColl, Green, & B 2013]
• Community clustering: >100K upd/sec∗ [R & B 2013]
• PageRank: Up to 40× latency improvement [R 2016]
9. STINGER: Framework for streaming graphs
Slide credit: Rob McColl and David Ediger
• OpenMP + sufficiently POSIX-ish
• Multiple processes for resilience
10. Current STINGER model
Input: batch of insertions / deletions, applied to the STINGER graph.
1. Pre-process batch: sort by source vertex, reconcile ins/del.
2. Pre-change hook
3. Alter graph (may “age off” old edges)
4. Post-change hook
Diagram labels: affected vertices, change in metric.
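The batch pre-processing step can be sketched as follows. This is an illustration only, not STINGER's code: the tuple layout, the function name `preprocess_batch`, and the "last operation on an edge wins" reconciliation rule are all assumptions made for the example.

```python
def preprocess_batch(batch):
    """Reconcile and sort a batch of edge changes.

    batch: list of (op, src, dst) tuples in arrival order, op is "ins" or "del".
    Assumed reconciliation rule: the last operation on an edge wins.
    Returns (src, dst, op) tuples sorted by source vertex, then destination.
    """
    last_op = {}
    for op, src, dst in batch:
        last_op[(src, dst)] = op          # later operations overwrite earlier ones
    return sorted((src, dst, op) for (src, dst), op in last_op.items())


batch = [("ins", 3, 1), ("ins", 1, 2), ("del", 3, 1), ("ins", 1, 4)]
for change in preprocess_batch(batch):
    print(change)
```

Sorting by source vertex groups all changes to one vertex's edge list together, so each list is touched once per batch.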
11. Is STINGER’s current model good enough?
Data ingest rates, R-MAT into R-MAT, scales 24 & 30
[Figure, two panels: update rate (upd/s) vs. batch size, and avg. update time (s) vs. batch size; platforms: Power8, Haswell, Haswell-30.]
Want to add analysis clients without slowing data ingest!
Note that scale 30 starts with 1.1B vertices, 17B edges...
(Different STINGER internal parameters.)
12. What if we don’t hold up changes?
When is an algorithm valid?
Analyze concurrently with the graph changes, and
produce a result correct for the starting graph and
some subset of concurrent changes. [3]
• No locking beyond atomic operations.
• No versioned data structure.
• No stopping.
[3] Chunxing Yin, Riedy, Bader. “Validity of Graph Algorithms on Streaming Data.” 2017. (in submission)
13. Sample of other execution models
• Put in a query, wait for sufficient data [Phillips, et al.
at Sandia]
• Different but very interesting model.
• Evolving: Sample, accurate w/high-prob.
• Difficult to generalize into graph results (e.g.
shortest path tree).
• Classical: dynamic algorithms, versioned data
• Can require drastically more storage, possibly a copy
of the graph per property, or more overhead for
techniques like read-copy-update.
We are assuming we cannot “re-run” the world and must
keep up.
14. Algorithm validity in our model: Example.
Can you compute degrees in an undirected graph (no self
loops) concurrently with changes?
Algorithm: Iterate over vertices, count the number of
neighbors.
Trace: compute deg(v1) = 1; the edge is deleted; compute deg(v2) = 0.
Cannot correspond to an undirected graph at all!
Valid for our model? No!
Not incorrect, just not valid for our model.
15. Algorithm validity in our model: Example.
Can you compute degrees in an undirected graph (no self
loops) concurrently with changes?
Algorithm: Iterate over edges, increment the degrees of
the endpoints.
Trace: visit the edge and increment deg(v1), deg(v2), giving 1 and 1; (later...) the edge is deleted, and the counts stay 1 and 1.
Corresponds to the beginning graph plus a subset of
concurrent changes.
Valid for our model? Yes!
Undirected stored as directed: skip edges with v1 ≥ v2.
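The two degree examples can be replayed in a small single-threaded simulation. This is a sketch under stated assumptions: the concurrent deletion is injected at a fixed point rather than by a real second thread, and `is_valid`, `vertex_scan`, and `edge_scan` are hypothetical helper names. The validity check simply enumerates every subset of the concurrent changes.

```python
from itertools import combinations


def degrees(edges, vertices):
    """Degrees in an undirected graph given as a set of (u, v) pairs with u < v."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_valid(result, start_edges, changes, vertices):
    """True if result matches the start graph plus some subset of the changes."""
    for k in range(len(changes) + 1):
        for subset in combinations(changes, k):
            edges = set(start_edges)
            for op, edge in subset:
                if op == "del":
                    edges.discard(edge)
                else:
                    edges.add(edge)
            if degrees(edges, vertices) == result:
                return True
    return False


def vertex_scan(vertices, start_edges, deleted_edge):
    """Iterate over vertices, counting neighbors; deletion lands after the first vertex."""
    adj = {v: set() for v in vertices}
    for u, v in start_edges:
        adj[u].add(v)
        adj[v].add(u)
    result = {}
    for i, v in enumerate(vertices):
        result[v] = len(adj[v])
        if i == 0:                        # simulated concurrent deletion
            u, w = deleted_edge
            adj[u].discard(w)
            adj[w].discard(u)
    return result


def edge_scan(vertices, start_edges, deleted_edge):
    """Iterate over edges, incrementing both endpoints; deletion lands after the scan."""
    edges = set(start_edges)
    result = {v: 0 for v in vertices}
    for u, v in sorted(edges):
        result[u] += 1
        result[v] += 1
    edges.discard(deleted_edge)           # simulated concurrent deletion, "(later...)"
    return result


vertices = [1, 2]
start = {(1, 2)}
changes = [("del", (1, 2))]

by_vertex = vertex_scan(vertices, start, (1, 2))
by_edge = edge_scan(vertices, start, (1, 2))
print(by_vertex, is_valid(by_vertex, start, changes, vertices))
print(by_edge, is_valid(by_edge, start, changes, vertices))
```

The vertex scan reports degrees 1 and 0, which match neither the starting graph nor the graph after the deletion. The edge scan updates both endpoints from a single observation of the edge, so its result always matches one of the two.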
16. Algorithm validity in our model
[Figure: source s with edge weights w(e1) = 10 and w(e2) = 5 → 1; ∆ = 4.]
• What is valid?
• Typical BFS
• Shiloach-Vishkin connected components
• PageRank (will describe...)
• Saved decisions...
• What is invalid?
• Making a decision twice in implementations
• ∆-stepping SSSP: Decrease a weight below ∆
• Degree optimization: Cross threshold, miss vertex
• Applying old or different information
• Multiply counting triangles: Counts match no graph
• Multiple searches: Betweenness centrality
• Labeling in S. Kahan’s components alg
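One way to read "making a decision twice" against the ∆-stepping example is sketched below. This is an interpretation, not the authors' implementation: it assumes a ∆-stepping-style split into a light-edge phase (weight ≤ ∆) and a heavy-edge phase (weight > ∆), and injects the weight change 5 → 1 between the two phases. The function names are hypothetical, and only the light/heavy decision is modeled, not the full shortest-path computation.

```python
DELTA = 4


def relax_rechecking(weights, change):
    """Evaluate the light/heavy test separately in each phase (decision made twice)."""
    weights = dict(weights)               # work on a copy
    relaxed = []
    for edge, w in weights.items():       # light phase
        if w <= DELTA:
            relaxed.append(edge)
    weights.update(change)                # simulated concurrent weight change
    for edge, w in weights.items():       # heavy phase
        if w > DELTA:
            relaxed.append(edge)
    return relaxed


def relax_saved(weights, change):
    """Classify each edge once and reuse the saved decision in both phases."""
    weights = dict(weights)
    is_light = {edge: w <= DELTA for edge, w in weights.items()}
    relaxed = [edge for edge in weights if is_light[edge]]
    weights.update(change)                # same concurrent change
    relaxed += [edge for edge in weights if not is_light[edge]]
    return relaxed


weights = {"e1": 10, "e2": 5}
change = {"e2": 1}                        # w(e2) drops below DELTA mid-run

print(relax_rechecking(weights, change))  # e2 is heavy, then light: never relaxed
print(relax_saved(weights, change))       # every edge relaxed exactly once
```

With the re-evaluated test, e2 fails the light check before the change and the heavy check after it, so it is skipped entirely. Saving the first decision avoids that.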
17. PageRank without stopping
Apply Jacobi iteration to the linear system form of PageRank:
$$x^{(k+1)} = \alpha A^T D^{-1} x^{(k)} + (1 - \alpha) v.$$
Amusingly, the residual
$$r^{(k)} = (1 - \alpha) v - (I - \alpha A^T D^{-1}) x^{(k)} = x^{(k+1)} - x^{(k)}.$$
So if $r^{(k)}$ is small, the iteration has converged to a solution of a system near
the graph in the most recent iteration, hence to a graph
containing the original plus some subset of changes.
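The iteration and its residual can be sketched in a few lines of plain Python. This is a minimal single-threaded illustration, not the STINGER kernel. It assumes a uniform vector v, α = 0.85, and that every vertex has at least one out-edge, so D is invertible. The adjacency-dict layout and function names are made up for the example.

```python
def jacobi_step(out_edges, x, v, alpha):
    """One step: x_next = alpha * A^T D^{-1} x + (1 - alpha) * v."""
    x_next = [(1.0 - alpha) * vi for vi in v]
    for src, targets in out_edges.items():
        share = alpha * x[src] / len(targets)   # D^{-1} divides by out-degree
        for dst in targets:
            x_next[dst] += share                # A^T sends the share to out-neighbors
    return x_next


def pagerank(out_edges, n, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate until the 1-norm of the residual x_next - x drops below tol."""
    v = [1.0 / n] * n
    x = v[:]
    residual_norm = float("inf")
    for _ in range(max_iter):
        x_next = jacobi_step(out_edges, x, v, alpha)
        residual_norm = sum(abs(a - b) for a, b in zip(x_next, x))
        x = x_next
        if residual_norm < tol:
            break
    return x, residual_norm


# Small example graph: every vertex has at least one out-edge (assumption).
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
ranks, residual_norm = pagerank(graph, n=4)
print(ranks, residual_norm)
```

The residual costs nothing extra: it is the difference between successive iterates, so the convergence test reuses vectors the iteration already holds.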
18. Fun properties for one-shot queries
Due to Chunxing Yin, under sensible assumptions:
1. You can produce a single-change stream to
demonstrate invalidity.
• Idea: Start with a graph that incorporates all the
visible changes, introduce the one change at the
right time.
2. Algorithms that produce a subgraph of their input
cannot be guaranteed to run concurrently with
changes and always produce moment-in-time
outputs.
• Idea: Any time a snapshot result could happen,
delete then re-insert an edge from the output.
19. On to streaming...
Can we update graph metrics as new data arrives?
• Track what changed during the one-shot query.
• Update locally around those changes, while other
changes are occurring.
• If the update is valid, can repeat to follow a
streaming graph.
[Timeline: Initial, collecting ∆0 → Upd. w/∆0, collecting ∆1 → Upd. w/∆1, collecting ∆2 → ...]
Example: PageRank. Treat only the changed portions as
unconverged.
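The repeat-to-follow loop can be sketched with a deliberately simplified update: each batch is applied to the graph, and the Jacobi iteration is warm-started from the previous PageRank vector instead of from scratch. This is an assumption-laden stand-in for the local update described above, since it still sweeps the whole graph rather than only the changed region. The names `apply_batch` and `jacobi_from` are hypothetical, and every vertex is assumed to keep at least one out-edge.

```python
def apply_batch(out_edges, batch):
    """Return a new adjacency dict with (op, src, dst) changes applied."""
    updated = {src: list(dsts) for src, dsts in out_edges.items()}
    for op, src, dst in batch:
        if op == "ins" and dst not in updated[src]:
            updated[src].append(dst)
        elif op == "del" and dst in updated[src]:
            updated[src].remove(dst)
    return updated


def jacobi_from(out_edges, x, alpha=0.85, tol=1e-10, max_iter=1000):
    """Run Jacobi PageRank iterations starting from x; return (x, iterations used)."""
    n = len(x)
    for iteration in range(1, max_iter + 1):
        x_next = [(1.0 - alpha) / n] * n
        for src, targets in out_edges.items():
            share = alpha * x[src] / len(targets)
            for dst in targets:
                x_next[dst] += share
        residual_norm = sum(abs(a - b) for a, b in zip(x_next, x))
        x = x_next
        if residual_norm < tol:
            return x, iteration
    return x, max_iter


n = 4
uniform = [1.0 / n] * n
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
batches = [[("ins", 2, 3)], [("ins", 1, 0)]]     # stand-ins for the deltas in the stream

x, _ = jacobi_from(graph, uniform)               # initial one-shot query
for batch in batches:
    graph = apply_batch(graph, batch)
    x, warm_iters = jacobi_from(graph, x)        # update: start from the old result
    print(f"batch {batch}: {warm_iters} iterations from the previous vector")

cold, _ = jacobi_from(graph, uniform)            # from-scratch result for comparison
```

Both routes reach the same vector for the final graph. The warm start only changes where the iteration begins.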
20. Then what?
• Many analyses do not scale in
performance to graphs with
billions of vertices.
• But we can extract
subgraphs...
• without stopping data ingest,
and...
• update the results!
Work in progress, based on PageRank and Katz.
22. Closing
• Summary
• Analysis concurrent with graph change can work.
• But not all methods are valid. Avoid evaluating
conditions or exploring the graph more than once.
• Valid updating methods can continue
• Future work
• Track subgraphs / communities for “slow” analyses
• Develop more valid updating methods,
approximation results
• Consider the debugging problem...
• And metadata...
Non-stop validity is only one approach! There are others.
23. STINGER: Where do you get it?
Home: www.cc.gatech.edu/stinger/
Code: git.cc.gatech.edu/git/project/stinger.git/
Gateway to
• code,
• development,
• documentation,
• presentations...
Remember: Academic code, but maturing
with contributions.
Users / contributors / questioners:
Georgia Tech, PNNL, CMU, Berkeley, Intel,
Cray, NVIDIA, IBM, Federal Government,
Ionic Security, Citi, Accenture, ...