1. Graph Processing and Mining
in the Era of Big Data
Chengqi Zhang
Centre for Quantum Computation & Intelligent Systems (QCIS)
University of Technology, Sydney (UTS)
4. Big Data Characteristics
Volume: Petabytes, Records, Transactions
Velocity: Batch, Real time, Streaming
Variety: Structured, Unstructured, Semi-structured
5. Graph in Big Data: Volume
• 1.23 billion active users in 2013
• 190 friends/user on average
• 500 TB data/day in 2012
• 2.1 billion webpages in 2000
• 15 billion edges in 2000
• 20 PB data/day in 2008
• 180-200 PB data in 2011
• 6.5 PB data + 50 TB/day in 2009
6. Graph in Big Data: Velocity
• Fast flowing data
• Evolving data structures and relationships
7. Graph in Big Data: Variety
• Directed vs Undirected
• Labeled vs Unlabeled
• Weighted vs Unweighted
• Heterogeneous vs Homogeneous
9. Challenges and Opportunities
New Graph Semantics (Variety)
New Query Processing Algorithms (Volume & Velocity)
New Indexing Techniques (Volume & Velocity)
New Computing Models (Volume)
New Graph Mining Tasks (Variety)
10. New Graph Semantics
Traditional (Google):
• Input: keywords
• Output: webpages containing keywords
• Ranked by PageRank
New (Google):
• Input: keywords
• Output: knowledge graph/subgraph
• Ranking should consider both structural and content information
11. New Graph Mining Tasks
Chemical Compound Database → Chemical Features
• Team of Experts: Several Years
• Graph Mining: Several Hours
12. New Query Processing Algorithms
• Location: spatial query processing, nearest neighbor search …
• Relationship: link analysis, shortest path search, community detection …
• Text: text processing, string matching, semantic analysis …
All of these should be processed in milliseconds.
13. New Indexing Techniques
Traditional: webpages, files → hash table, B-tree, inverted index …
New: subgraphs, trees, paths → ?
What’s more: the graph is frequently changing …
14. New Computing Models
Single Machine vs Multiple Machines
Internal Algorithms vs External Algorithms
Single Core vs Multiple Cores
16. Structural Keyword Search
[Figure: keyword query “Jim, data mining” over a database graph, with matching tuples highlighted]
Traditional: Content Keyword Search
New: Structural Keyword Search
Our Work:
• ICDE’07: Finding Top-K Min-Cost Connected Trees in Databases
• SIGMOD’09: Keyword Search in Databases: The Power of RDBMS
• Morgan & Claypool 2009 (Book): Keyword Search in Databases
• VLDBJ’11: Scalable Keyword Search on Large Data Streams
• ICDE’11 & TKDE’12: Computing Structural Statistics by Keywords in Databases
17. Graph Matching
[Figure: matching a pattern graph against Graph 1 and Graph 2, nodes labelled 1–7]
Our Work:
• EDBT’12: Finding Top-K Similar Graphs in Graph Databases
• CIKM’11 & VLDBJ’13: High Efficiency and Quality: Large Graphs Matching
• VLDB’14: Leveraging Graph Dimensions in Online Graph Search
18. Community Detection
What is a community in a graph?
A cohesive subgraph?
A dense subgraph?
Everyone is highly connected to others?
Everyone is within a small distance of others?
An Example: k-core (1-core, 2-core, 3-core)
19. Community Detection
[Figure: the same graph decomposed as a 3-core, a 4-clique, a 3-edge-cc, and a 4-truss]
Other semantics?
Our Work:
• SIGMOD’13: Efficiently Computing k-Edge Connected Components via Graph
Decomposition
• SIGMOD’14: Querying k-truss Community in Large and Dynamic Graphs
• VLDB’15: Influential Community Search in Large Networks
• KDD’15: Locally Densest Subgraph Discovery
25. Polynomial Delay
Enumeration Problems in Graphs
• Structural keyword search
• Community detection
• Graph pattern matching
• Similar graph search
Polynomial Time w.r.t. Input?
Output can be exponential
Impossible!
So…
Polynomial Total: Polynomial to Input+Output
Possible, but…
26. Polynomial Delay
[Figure: timeline of answer output — polynomial total emits all answers in one burst at the end (“Can’t you be faster?”), polynomial delay spaces them evenly]
Polynomial Total vs Polynomial Delay
New solution — Polynomial Delay: delay time polynomial to the input
Total time is still large, but…
Our Work:
• ICDE’09: Querying Communities in Relational Databases
• Algorithmica’13: Fast Maximal Cliques Enumeration in Sparse Graphs
• EDBT’15: Efficiently Computing Top-K Shortest Path Join
• VLDB’15: Optimal Enumeration - Efficient Top-k Tree Matching
28. Diversified Top-K Cliques (ICDE’15)
[Figure: example graph with nodes A–K]
Maximum Clique → Top-2 Maximum Cliques: too much overlap!
Diversified Top-2 Maximum Cliques: cover all nodes!
Problem Statement:
Compute k Cliques to Cover Maximum
Number of Nodes
30. Shortest Path Computation
Dijkstra’s Algorithm? A* Algorithm? → traverse the whole graph in the worst case
Precompute all-pair shortest paths? → impractical!
Our approach (VLDBJ’12): compute a subset of pairs
Our Work:
• VLDBJ’12: The Exact Distance to Destination in Undirected World
• VLDB’13: Top-K Nearest Keyword Search on Large Graphs
• VLDBJ’13: Computing Weight Constraint Reachability in Large Networks
• SIGMOD’15: Index-based Optimal Algorithms for Computing Steiner
Components with Maximum Connectivity
39. Future Developments
Social Network Recommendation
Location Based Social Network
Big Graph Processing in Cloud
Massive Graph Matching
Graph Summary
Graph Stream
Personalized Community Search
High Influence Community Search
Graph Clustering in Cloud
Massive Uncertain Graph
40. Conclusion
Mining and Query
Processing
The Era of Big Data
Indexing
Semantics
Computing Model
Big Graph: Larger, More Complex
More Challenges!
More Opportunities to Explore the
Unknown World!
42. References
1. Jeffrey Xu Yu, Lu Qin, and Lijun Chang: Keyword Search in Databases, published by
Morgan & Claypool, 2009.
2. Xin Huang, Hong Cheng, Rong-Hua Li, Lu Qin, and Jeffrey Xu Yu: Top-K Structural
Diversity Search in Large Networks, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 24, No. 3, Pages 319-343, 2015.
3. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin: I/O Efficient:
Computing SCCs in Massive Graphs, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 24, No. 2, Pages 245-270, 2014.
4. Yuanyuan Zhu, Lu Qin, Jeffrey Xu Yu, Yiping Ke, and Xuemin Lin: High Efficiency and
Quality: Large Graphs Matching, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 22, No. 3, Pages 345-368, 2013.
5. Miao Qiao, Hong Cheng, Lu Qin, Jeffrey Xu Yu, Philip S. Yu, and Lijun Chang: Computing
Weight Constraint Reachability in Large Networks, in the International Journal on Very
Large Data Bases (VLDBJ), Vol. 22, No. 3, Pages 275-294, 2013.
6. Lijun Chang, Jeffrey Xu Yu, and Lu Qin: Fast Maximal Cliques Enumeration in Sparse
Graphs, in Algorithmica, Vol. 66, No. 1, Pages 173-186, 2013.
7. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Computing Structural Statistics by Keywords in
Databases. Invited paper by IEEE Transactions on Knowledge and Data Engineering
(TKDE), Vol. 24, No. 10, Pages 1731-1746, 2012.
8. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Hong Cheng, and Miao Qiao: The Exact Distance to
Destination in Undirected World, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 21, No. 6, Pages 869-888, 2012.
9. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Scalable Keyword Search on Large Data Streams, in the International Journal on Very Large Data Bases (VLDBJ), Vol. 20, No. 1, Pages 35-57, 2011.
10. Lu Qin, Rong-Hua Li, Lijun Chang, and Chengqi Zhang: Locally Densest Subgraph Discovery,
to appear in Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and
Data Mining (KDD'15), 2015.
43. References
11. Longbin Lai, Lu Qin, Xuemin Lin, and Lijun Chang: Scalable Subgraph Enumeration in
MapReduce, to appear in Proceedings of the Very Large Database Endowment (VLDB), 2015.
12. Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, Wenjie Zhang: Index-based Optimal
Algorithms for Computing Steiner Components with Maximum Connectivity, to appear in
Proceedings of ACM Conference on Management of Data (SIGMOD'15), 2015.
13. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao Shang: Divide & Conquer: I/O Efficient
Depth First Search, to appear in Proceedings of ACM Conference on Management of Data
(SIGMOD'15), 2015.
14. Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, and Jian Pei: Efficiently Computing Top-K
Shortest Path Join, in Proceedings of the 18th International Conference on Extending
Database Technology (EDBT'15), 2015.
15. Rong-Hua Li, Jeffrey Xu Yu, Lu Qin, Rui Mao, and Tan Jin: On Random Walk Based Graph
Sampling, in the 31st IEEE International Conference on Data Engineering (ICDE'15), 2015.
16. Long Yuan, Lu Qin, Xuemin Lin, Lijun Chang, and Wenjie Zhang: Diversified Top-K Clique Search, in the 31st IEEE International Conference on Data Engineering (ICDE'15), 2015.
17. Lijun Chang, Xuemin Lin, Wenjie Zhang, Jeffrey Xu Yu, Ying Zhang, and Lu Qin: Optimal
Enumeration: Efficient Top-k Tree Matching, in Proceedings of the Very Large Database
Endowment (VLDB), Vol. 8, No. 5, Pages 533-544, 2015.
18. Rong-Hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao: Influential Community Search in Large
Networks, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 8, No. 5,
Pages 509-520, 2015.
19. Yuanyuan Zhu, Jeffrey Xu Yu, and Lu Qin: Leveraging Graph Dimensions in Online Graph
Search, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 8, No. 1, Pages
85-96, 2015.
20. Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu: Querying K-Truss
Community in Large and Dynamic Graphs, in Proceedings of ACM Conference on Management
of Data (SIGMOD'14), Pages 1311-1322, 2014.
44. References
21. Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, and Xuemin Lin: Scalable
Big Graph Processing in MapReduce, in Proceedings of ACM Conference on Management of
Data (SIGMOD'14), Pages 827-838, 2014.
22. Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu: Contract & Expand: I/O Efficient SCCs
Computing, in the 30th IEEE International Conference on Data Engineering (ICDE'14),
Pages 208-219, 2014.
23. Xin Huang, Hong Cheng, Rong-Hua Li, Lu Qin, and Jeffrey Xu Yu: Top-K Structural Diversity
Search in Large Networks, in Proceedings of the Very Large Database Endowment (VLDB),
Vol. 6, No. 13, Pages 1618-1629, 2013.
24. Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, and Wentao Tian: Top-K Nearest Keyword
Search on Large Graphs, in Proceedings of the Very Large Database Endowment (VLDB),
Vol. 6, No. 10, Pages 901-912, 2013.
25. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Xuemin Lin, Chengfei Liu, and Weifa Liang: Efficiently
Computing k-Edge Connected Components via Graph Decomposition, in Proceedings of ACM
Conference on Management of Data (SIGMOD'13), Pages 205-216, 2013.
26. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin: I/O Efficient:
Computing SCCs in Massive Graphs, in Proceedings of ACM Conference on Management of
Data (SIGMOD'13), Pages 181-192, 2013.
27. Yuanyuan Zhu, Jeffrey Xu Yu, Hong Cheng, and Lu Qin: Graph Classification: A Diversified
Discriminative Feature Selection Approach, in Proceedings of 2012 ACM International
Conference on Information and Knowledge Management (CIKM'12), Pages 205-214, 2012.
28. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Diversifying Top-K Results, in Proceedings of the
Very Large Database Endowment (VLDB), Vol. 5, No. 11, Pages 1124-1135, 2012.
29. Yuanyuan Zhu, Lu Qin, and Jeffrey Xu Yu: Finding Top-K Similar Graphs in Graph
Databases, in Proceedings of the 15th International Conference on Extending Database
Technology (EDBT'12), Pages 456-467, 2012.
30. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Qing Zhu, and Xiaofang Zhou: I/O Cost Minimization:
Reachability Queries Processing over Massive Graphs, in Proceedings of the 15th
International Conference on Extending Database Technology (EDBT'12), Pages 468-479,
2012.
45. References
31. Yuanyuan Zhu, Lu Qin, Jeffrey Xu Yu, Yiping Ke, and Xuemin Lin: High Efficiency and Quality:
Large Graphs Matching, in Proceedings of 2011 ACM International Conference on Information and
Knowledge Management (CIKM'11), Pages 1755-1764, 2011.
32. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Yuanyuan Zhu, and Haixun Wang: Finding Information Nebula
over Large Networks, in Proceedings of 2011 ACM International Conference on Information and
Knowledge Management (CIKM'11), Pages 1465-1474, 2011.
33. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Computing Structural Statistics by Keywords in
Databases, in Proceedings of the 27th IEEE International Conference on Data Engineering
(ICDE'11), Pages 363-374, 2011.
34. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Ten Thousand SQLs: Parallel Keyword Queries Computing,
in Proceedings of the Very Large Database Endowment (VLDB), Vol. 3, No. 1, Pages 58-69, 2010.
35. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Keyword Search in Databases: The Power of RDBMS, in
Proceedings of ACM Conference on Management of Data (SIGMOD'09), Pages 681-694, 2009.
36. Lu Qin, Jeffrey Xu Yu, Lijun Chang, and Yufei Tao: Querying Communities in Relational Databases,
in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE'09), Pages
724-735, 2009.
37. Lu Qin, Jeffrey Xu Yu, Lijun Chang, and Yufei Tao: Scalable Keyword Search on Large Data
Streams, in Proceedings of the 25th IEEE International Conference on Data Engineering
(ICDE'09), Short Paper, Pages 1199-1202, 2009.
38. Lu Qin, Jeffrey Xu Yu, Bolin Ding, and Yoshiharu Ishikawa: Monitoring Aggregate k-NN Objects in
Road Networks, in Proceedings of the 20th International Conference on Scientific and Statistical
Database Management (SSDBM’08), Pages 168-186, 2008.
39. Bolin Ding, Jeffrey Xu Yu, and Lu Qin: Finding Time-Dependent Shortest Paths over Large Graphs,
in Proceedings of the 11th International Conference on Extending Database Technology (EDBT'08),
Pages 205-216, 2008.
40. Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin, Xiao Zhang, and Xuemin Lin: Finding Top-k Min-Cost
Connected Trees in Databases, in Proceedings of the 23rd IEEE International Conference on Data
Engineering (ICDE'07), Pages 836-845, 2007. (Best Student Paper)
46. References
41. Jia Wu, Xingquan Zhu, Chengqi Zhang, Philip S. Yu. Bag Constrained Structure Pattern
Mining for Multi-Graph Classification. IEEE Transactions on Knowledge and Data
Engineering (TKDE), Vol 26, No 10, pp.2382-2396, 2014.
42. Jia Wu, Zhibin Hong, Shirui Pan, Xingquan Zhu, Chengqi Zhang, Zhihua Cai. Multi-Graph
Learning with Positive and Unlabeled Bags. SDM 2014: 217-225.
43. Jia Wu, Xingquan Zhu, Chengqi Zhang, Zhihua Cai: Multi-instance Multi-graph Dual
Embedding Learning. ICDM’13, 2013: 827-836.
44. Jia Wu, Shirui Pan, Xingquan Zhu, Chengqi Zhang. Multi-Graph-View Learning for
Complicated Object Classification. International Joint Conference on Artificial Intelligence
(IJCAI’15), 2015
45. Shirui Pan, Jia Wu, and Xingquan Zhu, "CogBoost: Boosting for Fast Cost-sensitive Graph
Classification", IEEE Transactions on Knowledge and Data Engineering (TKDE), Accepted,
2015.
46. Shirui Pan, Xingquan Zhu, Chengqi Zhang, and Philip S. Yu. "Graph Stream Classification
using Labeled and Unlabeled Graphs", International Conference on Data Engineering
(ICDE’13), 2013
47. Shirui Pan and Xingquan Zhu. "CGStream: Continuous Correlated Graph Query for Data
Streams". 21st ACM International Conference on Information and Knowledge Management
(CIKM), 2012.
48. Shirui Pan and Xingquan Zhu. "Graph Classification with Imbalanced Class Distributions and
Noise", 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013
49. Jia Wu, Zhibin Hong, Shirui Pan, Xingquan Zhu, Chengqi Zhang, Zhihua Cai. "Multi-graph-
view Learning for Graph Classification", Proceedings of the 2014 IEEE International
Conference on Data Mining (ICDM), 2014
50. Shirui Pan, Jia Wu, Xingquan Zhu, Guodong Long, Chengqi Zhang, “Finding the Best not the Most: Regularized Loss Minimization Subgraph Selection for Graph Classification”, to appear in Pattern Recognition (PR), 2015
Graph is a powerful data structure to model relationships among entities in the real world
Web Graph: nodes are webpages, edges are hyperlinks
Road Network: nodes are road intersections, edges are road segments
Social Network: nodes are users, edges are friendships
The Internet of Things: nodes are objects, edges are the relationships among objects
Our research mainly focuses on the Three Vs of Big Data
The statistics of some real graph datasets
What happens in one minute?
Fast flowing data: new data are streaming in rapidly
Evolving data structures and relationships: new relationships among people/entities are established or destroyed every second
A large variety of graph data
Directed: Twitter; Undirected: Facebook
Labeled: Chemical Compound; Unlabeled: Web Graph
Weighted: Social Network; Unweighted: Computer Network
Heterogeneous: The Internet of Things; Homogeneous: Paper Reference Network
Different challenges relate to different big data Vs
Unlike traditional Google search, which answers a user query with individual webpages, Google Knowledge Graph Search aims to answer a user question using a collection of correlated webpages (modeled as subgraphs).
Identifying discriminative chemical features (modeled as subgraphs) in a chemical compound database (modeled as a database of graphs) is a critical task in Bioinformatics and Chemistry.
Traditionally, this task relies on the experience of domain experts and takes a very long time. With the help of graph mining, however, we can greatly reduce the search space by providing a list of the most promising features. This can largely shorten the time needed to identify useful chemical features, and thus reduce the cost.
Facebook introduced Graph Search in 2013. To support graph search efficiently, we need techniques that combine various types of information in graph algorithms. For example:
1. When combined with location information, we may need techniques such as spatial query processing and nearest neighbor search.
An example: search for all male users aged 20-30 who are within 100m of my current location.
2. When handling relationships, we may need techniques such as link analysis, shortest path search, and community detection.
An example: search for all potential friends who share at least three common friends with me.
3. When combined with text information, we may need techniques such as text processing, string matching, and semantic analysis.
An example: search for all my friends who like “hiking” and “swimming”.
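The three example queries above are essentially predicate filters combining different information types. The sketch below illustrates the first one; the records, names, and coordinates are all made up for illustration, and a real system would back each predicate with an index rather than a linear scan:

```python
import math

# Toy user records; all names, coordinates, and attributes are invented.
users = [
    {"name": "Alice", "sex": "F", "age": 25, "lat": 0.0010, "lon": 0.0, "likes": {"hiking"}},
    {"name": "Bob",   "sex": "M", "age": 27, "lat": 0.0005, "lon": 0.0, "likes": {"hiking", "swimming"}},
    {"name": "Carol", "sex": "F", "age": 40, "lat": 0.5,    "lon": 0.5, "likes": {"swimming"}},
]

def meters(lat1, lon1, lat2, lon2):
    """Rough equirectangular distance in meters (fine at this scale)."""
    dx = (lon2 - lon1) * 111_320 * math.cos(math.radians((lat1 + lat2) / 2))
    dy = (lat2 - lat1) * 111_320
    return math.hypot(dx, dy)

# "Male users aged 20-30 within 100m of my location": a spatial predicate
# combined with attribute predicates, as in the first example above.
me = (0.0, 0.0)
hits = [u["name"] for u in users
        if u["sex"] == "M" and 20 <= u["age"] <= 30
        and meters(me[0], me[1], u["lat"], u["lon"]) <= 100]
print(hits)
```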
To index a set of documents/webpages, we can use a traditional hash table, B-tree, or inverted index, built in linear time and space.
In graphs, however, the answers are usually subgraphs, trees, and paths, whose number can be exponential in the size of the graph, so traditional indexing structures cannot be used directly.
In addition, when the graph changes, we should be able to maintain the index structure incrementally without recomputing it from scratch.
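As a contrast to the graph case, here is a minimal sketch of the traditional inverted index mentioned above, including the cheap incremental update that subgraph answers lack (the document texts are invented):

```python
from collections import defaultdict

# Minimal inverted index over documents: term -> set of doc ids.
index = defaultdict(set)

def add_doc(doc_id, text):
    # Insertion is incremental: each new document only touches its own terms.
    for term in text.lower().split():
        index[term].add(doc_id)

add_doc(1, "graph processing in big data")
add_doc(2, "keyword search in graph databases")

def search(*terms):
    """Conjunctive keyword query: docs containing all query terms."""
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("graph", "in"))   # both documents match
add_doc(3, "graph streams")    # incremental update: no index rebuild needed
print(search("graph"))
```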
Traditionally, we use a single machine to store the graph. Now when the size of the graph is large, we may need to use multiple machines and derive distributed algorithms to process the graph.
Traditionally, we keep the whole graph in the main memory of the machine. Now we need to consider external algorithms since the graph may be too large to fit in the main memory of a machine.
Traditionally, we use a single core to process a graph. Now we need to consider multi-core programming to improve the efficiency of query processing.
A number of graph processing systems have been established, such as Hadoop@Apache, Pegasus@CMU, SNAP@Stanford, GraphLab@CMU, Hama@Apache, and Giraph@Apache.
The traditional keyword search semantics in relational databases is content-based search: given a list of keywords, the answers are individual tuples that contain all or part of the query keywords.
By modeling the relational database as a graph, we proposed structural keyword search: given a list of keywords, the answers are a set of subgraphs, each containing the tuples that match the keywords as well as the relationships among those tuples.
Our work focuses on how to define proper result semantics and how to answer queries efficiently under each semantics.
Problem 1: Given two large graphs, how do we find the largest common part of the two graphs? This is the Maximum Common Subgraph (MCS) problem, which is computationally intractable.
Problem 2: Given a large data graph and a small pattern graph, find all subgraphs of the data graph that are isomorphic to the pattern graph. This is the subgraph isomorphism problem, which is NP-hard.
Our work mainly focuses on finding approximate solutions for graph matching, and on improving the quality and efficiency of the matching.
How to define a community in a graph is an open problem, but there is some common intuition. Generally, a community should be (1) a cohesive subgraph, (2) a subgraph with high density, (3) a subgraph with low diameter, and (4) a subgraph with high connectivity.
Here is an example of k-core. A k-core is a subgraph such that every node has at least k neighbors in the subgraph.
When k is small, the k-core is large but sparse; as k grows, the k-core becomes small but dense.
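The k-core definition above leads to a simple peeling procedure: repeatedly delete any node with fewer than k neighbors. This is the standard textbook algorithm sketched below, not any particular paper's implementation:

```python
from collections import deque

def k_core(adj, k):
    """Return the node set of the k-core by iteratively peeling
    nodes of degree < k (adj: node -> set of neighbours)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    queue = deque(v for v in alive if deg[v] < k)
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    return alive

# A triangle (a, b, c) with a pendant node d: the 2-core drops d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(sorted(k_core(adj, 2)))
```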
Sometimes k-core produces undesired subgraphs. In this example, the second result is undesirable because it is a loose combination of two dense subgraphs. Therefore, many other semantics have been proposed. For example:
k-clique: a subgraph in which every two nodes are connected by an edge.
k-edge connected component (k-edge-cc): a subgraph that remains connected after removing any k-1 edges.
k-truss: a subgraph in which every edge is contained in at least k-2 triangles.
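The k-truss definition can also be computed by peeling, this time over edges: repeatedly delete any edge contained in fewer than k-2 triangles. The toy sketch below (with a made-up example graph) illustrates the idea, not an optimized algorithm:

```python
def truss_edges(adj, k):
    """Edges whose triangle support stays >= k-2 after iteratively
    deleting under-supported edges (adj: node -> set of neighbours)."""
    edges = {frozenset((v, u)) for v in adj for u in adj[v]}
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # local copy we can shrink
    changed = True
    while changed:
        changed = False
        for e in list(edges):
            u, v = tuple(e)
            support = len(adj[u] & adj[v])            # common neighbours
            if support < k - 2:
                edges.discard(e)
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return edges

# Two triangles sharing edge (b, c), plus a dangling edge (d, e).
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
       "d": {"b", "c", "e"}, "e": {"d"}}
print(len(truss_edges(adj, 3)))   # the 3-truss keeps the 5 triangle edges
```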
Our work mainly focuses on defining an appropriate community semantics for a specific real-world application. We also focus on efficiency and dynamic updating issues.
A new researcher in the database area may want to find the most influential research groups in the database collaboration network, and follow their publications. This is the focus of our VLDB 2015 paper: how to find the most influential communities in a large network?
In community detection, if we simply focus on the density of the returned subgraphs, all of them may come from the densest region of the graph while other regions are omitted. In this work, we focus on finding dense subgraphs by considering the density of their local regions. In this way, not only can the globally large dense regions be identified, but dense subgraphs in other regions can also be found. This can help us find emerging, but not necessarily large, communities in the graph.
Given a graph database with a set of graphs, graph classification aims to train a classifier that distinguishes graphs of different classes based on discriminative features (subgraphs) in the graph database.
A traditional method has three phases: (1) compute the frequent subgraphs using frequent subgraph mining techniques; (2) select the optimal subgraphs (features) from the set of frequent subgraphs; and (3) train the classifier based on the optimal subgraphs.
In our CIKM 2012 work, we propose combining phases (1) and (2) into one phase, so that more structural information is involved when selecting the set of optimal subgraphs (features).
In our PR 2015 work, we use only one phase to compute the classifier: we integrate classifier training into the optimal-subgraph selection process, and allow iterative refinement to further optimize the algorithm.
Our recent work in this direction mainly focuses on different learning models for graph classification.
Let’s first consider some efficiency issues in graphs.
A large number of graph problems are enumeration problems (e.g., to enumerate a list of subgraphs that satisfy a certain property).
In algorithms, we say an algorithm is efficient if it terminates in polynomial time w.r.t. the size of the input (e.g., the size of the graph and query). For enumeration problems, however, the number of answers can be exponential in the size of the graph, so we need new terminology to measure the efficiency of an enumeration algorithm.
A first attempt: instead of requiring the time complexity to be polynomial in the size of the input, we require it to be polynomial in the size of the input and output. This is called polynomial total. Polynomial total is achievable for enumeration problems, but it can still cause problems (see next slide).
Suppose for a certain enumeration problem, there are 3600 answers.
Under polynomial total, the user might wait for an hour and then receive all answers at once. Obviously, such a scenario is undesirable for the user.
Therefore, in the new solution, instead of bounding the total time, we require the delay between consecutive answers to be polynomial in the input only. This is called polynomial delay. For the example above, with polynomial delay the user sees a new result every second and can decide whether to continue after receiving a certain number of answers. Compared to polynomial total, although the total time may not decrease, the user experience is obviously much better.
Our work in this direction focuses on deriving polynomial-delay algorithms for different graph semantics, e.g., multi-center community (ICDE’09), clique (Algorithmica’13), top-k shortest paths (EDBT’15), and top-k tree matching (VLDB’15).
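A generator makes the streaming behaviour behind polynomial delay concrete. Below is plain Bron–Kerbosch maximal-clique enumeration written as a Python generator: answers come out one at a time instead of in a batch at the end. This toy version omits the sparse-graph machinery of the Algorithmica’13 paper that actually bounds the delay:

```python
def maximal_cliques(adj, R=frozenset(), P=None, X=frozenset()):
    """Yield maximal cliques one by one (classic Bron-Kerbosch:
    R = current clique, P = candidates, X = already-processed nodes)."""
    if P is None:
        P = frozenset(adj)
    if not P and not X:
        yield R                      # R is maximal: nothing can extend it
        return
    for v in list(P):
        yield from maximal_cliques(adj, R | {v}, P & adj[v], X & adj[v])
        P = P - {v}
        X = X | {v}

# Triangle 1-2-3 plus a pendant edge 2-4.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
for clique in maximal_cliques(adj):
    print(sorted(clique))            # a user can stop after the first answer
```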
We now consider effectiveness issues in graphs.
We still consider enumeration problems in graphs. One common property of most of them is that the answers are subgraphs that can overlap with each other, so some results end up very similar to each other.
Consider the top-k densest-subgraph enumeration problem. In the example shown, if we want the top-6 answers, it is possible that all of them come from the densest region of the graph. Such top-6 answers are undesirable because they are too similar and carry little information as a whole.
This motivates us to consider diversity when answering graph problems. The definition of diversity varies across graph semantics, but the intuition is to enlarge the information contained in the returned answers. In the example above, if we consider diversity in the top-6 answers, the results may not be as large individually, but the new subgraphs cover most of the graph and are thus more desirable.
In this direction, our work mainly focuses on defining diversity for different graph problems and deriving efficient solutions to compute the diversified answers. An example is shown on the next slide.
A clique is a subgraph in which every two nodes are connected by an edge.
Identifying large cliques in a graph is a useful graph operation, widely applied in many applications. However, if we simply compute the top-k largest cliques, the results can largely overlap with each other. Therefore, we consider diversity: instead of maximizing the size of each individual clique, we aim to compute k cliques that together cover the maximum number of nodes in the graph.
More technical details on how to solve the problem efficiently are in the paper.
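The coverage objective can be approximated with the textbook greedy for max-coverage, sketched below over a hypothetical candidate set of cliques. This illustrates the problem statement only, not the ICDE’15 algorithm, which is far more efficient:

```python
def diversified_cliques(cliques, k):
    """Greedily pick k cliques that together cover the most nodes.
    The standard greedy for max-coverage gives a (1 - 1/e)-approximation."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(cliques, key=lambda c: len(set(c) - covered))
        if not set(best) - covered:
            break                     # no remaining clique adds new nodes
        chosen.append(best)
        covered |= set(best)
    return chosen, covered

# Hypothetical candidate cliques: two overlap heavily, one is disjoint.
cliques = [{"a", "b", "c", "d"}, {"b", "c", "d", "e"}, {"x", "y", "z"}]
chosen, covered = diversified_cliques(cliques, 2)
print(len(covered))   # greedy picks the disjoint clique second: 7 nodes covered
```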
To illustrate the indexing techniques, let us consider a fundamental graph problem: compute the shortest path between two nodes in a graph.
Given a source node (red) and a target node (blue), a straightforward solution is the classic Dijkstra’s algorithm, which computes the shortest path between them online. The A* algorithm can make this more efficient, but both may traverse the whole graph in the worst case. When the graph is large, these online algorithms are slow because no index is used.
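For reference, the online baseline looks like this: a textbook Dijkstra over an adjacency list with toy weights.

```python
import heapq

def dijkstra(adj, src, dst):
    """Textbook online Dijkstra: may settle every node in the worst case."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == dst:
            return d
        if d > dist.get(v, float("inf")):
            continue                  # stale heap entry, skip it
        for u, w in adj[v]:
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return float("inf")

# Toy weighted digraph: the shortest s-t path is s -> a -> b -> t.
adj = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("t", 6)],
       "b": [("t", 1)], "t": []}
print(dijkstra(adj, "s", "t"))
```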
Another solution is to precompute the shortest paths between every pair of nodes as an index, so a query can be answered directly from the precomputed results. Obviously, for a large graph the precomputation cost is too high and impractical.
In our VLDBJ work, we propose a new algorithm. The basic idea is as follows: for each node, instead of computing its shortest paths to all other nodes in the graph, we precompute only a small portion of them. At query time, given a source and a target node, we join their precomputed shortest paths; if the join succeeds, we concatenate the two shortest paths into one, and we can guarantee that this path is the shortest path from the source to the target.
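The join step can be illustrated in the spirit of 2-hop/hub labelling: each node stores distances to a few hub nodes only, and a query joins the two label sets on their common hubs. This is a toy with invented labels and hub names, not the exact VLDBJ’12 algorithm, which also guarantees that every query pair is covered:

```python
# node -> {hub: distance}; hubs (h1, h2, h3) and distances are made up.
labels = {
    "s": {"h1": 2, "h2": 5},
    "t": {"h1": 3, "h3": 1},
}

def query(u, v):
    """Join the two label sets on common hubs and take the best sum."""
    common = labels[u].keys() & labels[v].keys()
    return min((labels[u][h] + labels[v][h] for h in common), default=None)

print(query("s", "t"))   # join succeeds on hub h1: 2 + 3 = 5
```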
Our other work in this category focuses on deriving indexing techniques for various graph query semantics.
The figure shows a typical memory hierarchy of a computer, as introduced in every operating systems textbook.
Devices on the left have high processing speed but small capacity, whereas those on the right have low processing speed but large capacity.
Here, we mainly focus on main memory (DRAM) and secondary storage (disk). A large graph is usually hard to fit into the main memory of a single machine, but the disk is usually large enough to hold it. Therefore, the aim is to derive I/O-efficient algorithms for graph problems.
There are four issues:
1. We need to decide which part of the graph is loaded into main memory and which part stays on disk.
2. When accessing data on disk, we need to maximize sequential I/Os and minimize random I/Os, because random disk accesses are much slower than sequential ones.
3. For dense graphs such as social networks, we can usually keep all nodes in main memory while edges stay on disk. This is called a semi-external algorithm, and in this setting some graph problems can be solved efficiently.
4. Two popular approaches for I/O-efficient graph computation are: (1) partition-based — partition the graph into parts that each fit in main memory, and compute the result by divide and conquer; and (2) iterative — keep only partial answers in memory and scan the graph on disk repeatedly until the result converges.
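The iterative approach can be sketched for connected components: one label per node stays in memory, while the edge list (a Python list here stands in for a file on disk) is scanned sequentially until the labels stop changing. This is the well-known "hash-min" label propagation, shown as a semi-external toy rather than a real disk-based implementation:

```python
# Two components: {1, 2, 3} and {4, 5}. In a real semi-external algorithm
# this edge list would be streamed from disk, one sequential pass per round.
edges = [(1, 2), (2, 3), (4, 5)]
label = {v: v for e in edges for v in e}   # in-memory state: one int per node

changed = True
while changed:                             # each iteration = one sequential scan
    changed = False
    for u, v in edges:
        m = min(label[u], label[v])
        if label[u] != m or label[v] != m:
            label[u] = label[v] = m        # propagate the smaller label
            changed = True

print(sorted(set(label.values())))         # one minimum label per component
```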
In this direction, we have derived I/O efficient algorithms for a number of fundamental graph problems. For example, reachability queries (EDBT’12), Strongly Connected Components (SIGMOD’13, VLDBJ’14, and ICDE’14), and Depth-First Search (SIGMOD’15).
We consider two types of parallel computation models.
The first is multicore programming; the second is distributed computing, such as MapReduce and BSP.
We provide a simple comparison:
Multicore programming is usually computation-sensitive: given a problem, we focus on how to divide the computation among cores in a balanced manner. Distributed computing is usually data-sensitive: given a problem, we focus on how to divide the data across machines.
Multicore programming is based on a shared-memory paradigm: every core can access any part of main memory, and each core has a separate L1 cache in which computation can be fully parallelized. Distributed computing is based on a shared-nothing paradigm: every machine can only access its own CPU, memory, and disk; computation on different machines is fully parallel, and machines exchange data only over the network.
Multicore programming mainly focuses on reducing cache misses to maximize parallelism; distributed computing mainly focuses on reducing communication cost to maximize parallelism.
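A toy shared-memory sketch of the multicore side: each worker handles a share of the nodes against the full in-memory adjacency, here counting triangles. It ignores the cache-locality and load-balancing concerns mentioned above, and uses Python threads purely to show the work division:

```python
from concurrent.futures import ThreadPoolExecutor

# Small example graph: one triangle (1, 2, 3) plus a pendant node 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

def triangles_at(v):
    # Triangles incident to v: pairs of v's neighbours that are adjacent.
    return sum(1 for u in adj[v] for w in adj[v]
               if u < w and w in adj[u])

# Each worker processes a share of the nodes; adj is shared memory.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_node = list(pool.map(triangles_at, adj))

print(sum(per_node) // 3)     # each triangle is counted once at each of its 3 nodes
```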
In this direction, we have derived an efficient multicore algorithm for answering keyword queries (VLDB’10) and MapReduce algorithms for several fundamental graph tasks (SIGMOD’14 and VLDB’15).
Our final aim is to build a graph processing system with the following three objectives:
First, we want to extract primitive operators from graph processing and mining; however, ensuring both the completeness and the minimality of the operator set is challenging.
Second, we want to achieve high scalability in graph processing; however, it is not easy to also guarantee the optimality of the algorithms.
Third, we aim to make query processing tasks tractable in real time; however, since many graph algorithms are NP-hard, this is not easy to guarantee.
Our system structure consists of five layers:
In the bottom data environment layer, we aim to handle different data environments, e.g., streaming, static, probabilistic, etc.
In the computing model layer, we aim to support different computing models, e.g., in-memory, distributed, multi-core, external, etc.
In the computing paradigms layer, we target implementing the most primitive operators used on graphs, such as joins, breadth-first search, depth-first search, topological sort, spanning tree computation, etc.
In the query/mining primitives layer, we focus on designing primitive operators for query processing and graph mining based on different semantics. For example, in query processing, we can design primitive operators depending on whether the query is a pattern, a set of nodes, or a set of keywords. In graph mining, we can design primitive operators depending on whether the results are subgraphs or some aggregated information.
In the topmost application layer, we aim to combine different primitive operators to design algorithms that can be used in various application scenarios, e.g., social network, chemical, web search, etc.
For each layer we have published a large number of research papers in top database/data mining conference/journals.
Following the same framework, our future developments aim to enrich each component of our system from the data layer to the application layer, so that we can finally deliver a general-purpose graph processing system that integrates all graph semantics, data environments, and computing models for various applications.
In conclusion, the emergence of the era of big data brings a number of new challenges to traditional graph processing techniques, including new graph semantics, new mining tasks, new query processing algorithms, new indexing techniques, and new computing models. Meanwhile, graphs keep getting larger and more complex. Big graph processing is still at an early stage, with many challenges unsolved. There are more opportunities for us to explore the unknown world!