IncQuery-D: Incremental Queries in the Cloud
1. Budapest University of Technology and Economics
Department of Measurement and Information Systems
Fault Tolerant Systems Research Group
Gábor Szárnyas, Benedek Izsó,
István Ráth, Dániel Varró
2. Overview
Introduction
MDE scalability challenges for model queries
Overview: scaling out in the cloud
Evaluation: a feasibility study
Conclusions and future work
4. Scalability challenges in MDE
Complex instance models and queries
Instance model complexity
o Size
o Structure
Query complexity
o MDE workloads involve much more complex queries
than typical data-driven applications (e.g. model
validation, transformations, …)
Scalability challenges arise due to their
combination
5. Model sizes
Instance models with several million elements
o AUTOSAR models [1]
o Source code models
o Sensor data
Source: Markus Scheidgen, How Big are Models – An Estimation, 2012. [2]
application           model size
software models       0 – 10^9
sensor data           10^9
geo-spatial models    10^9 – 10^12
[1] http://wiki.eclipse.org/Auto_IWG_WP2
[2] http://hwl.hu-berlin.de/fileadmin/user_upload/documents/howbig_techreport.pdf
6. EMF-IncQuery
State-of-the-art incremental graph query engine
Open source Eclipse project by BUTE and others
Typical use cases
o Validation
o Incremental model transformation
o Model synchronization, view maintenance
7. Single workstation limitations
Most tools handle fewer than 1M model
elements because of algorithmic complexity
The best tools handle fewer than 10M model elements
because of the JVM’s limitations
o A JVM cannot handle 15+ GB heap memory efficiently
o Long GC pauses
o Specialized JVMs (e.g. Azul Systems’ Zing)
• Commercial, experimental
• May require special hardware
Proposed solution
o Scale out: distributed system
10. Architecture
[Figure: EMF-IncQuery and IncQuery-D architectures side by side. Diagram labels: Transaction; In-memory storage; Indexer layer; Rete net; IncQuery-D middleware; Server 0–3, each with one DB shard (DB shard 0–3)]
Layers of IncQuery-D
o Distributed production network (Rete net)
• Each intermediate node can be allocated
to a different host
• Remote internode communication
o Distributed indexing, notification
o Distributed persistent storage
11. Rete net
Asynchronous communication
Consistency guaranteed by a termination protocol
[Figure: DB shards 0–3, one indexer per shard, and a Rete net ending in a production node]
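The slide gives no detail on the Rete nodes or the termination protocol, so the following is only a minimal single-process sketch of the idea. A hypothetical `JoinNode` keeps a memory for each input and updates its match set as change messages arrive. A hypothetical `run_until_quiescent` helper stands in for the termination protocol: results are read only once no message is in flight. All names here are illustrative assumptions and not the IncQuery-D API; a `deque` replaces the asynchronous mailboxes.

```python
from collections import deque


class JoinNode:
    """Hypothetical Rete join node: joins (a, b) tuples with (b, c) tuples on b."""

    def __init__(self):
        self.left = set()     # memory for tuples arriving on the left input
        self.right = set()    # memory for tuples arriving on the right input
        self.results = set()  # current matches (a, b, c)

    def receive(self, side, sign, tup):
        """Apply one change message: sign '+' inserts tup, '-' deletes it."""
        memory, other = (self.left, self.right) if side == "left" else (self.right, self.left)
        if sign == "+":
            memory.add(tup)
        else:
            memory.discard(tup)
        # Only the changed tuple is joined against the opposite memory.
        for o in other:
            l, r = (tup, o) if side == "left" else (o, tup)
            if l[1] == r[0]:
                match = (l[0], l[1], r[1])
                if sign == "+":
                    self.results.add(match)
                else:
                    self.results.discard(match)


def run_until_quiescent(node, mailbox):
    """Stand-in for a termination protocol: deliver messages until none are in flight."""
    delivered = 0
    while mailbox:
        side, sign, tup = mailbox.popleft()
        node.receive(side, sign, tup)
        delivered += 1
    return delivered


node = JoinNode()
run_until_quiescent(node, deque([
    ("left", "+", (1, 2)),
    ("right", "+", (2, 3)),
    ("right", "+", (2, 4)),
]))
first_results = set(node.results)

# A later deletion only touches the matches that depended on the deleted tuple.
run_until_quiescent(node, deque([("right", "-", (2, 3))]))
second_results = set(node.results)
```

In a distributed setting the mailbox would be replaced by remote messages between hosts, and deciding that the network is quiescent is exactly the part this sketch simplifies away.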
12. IncQuery-D
Scaling out by…
o Sharding the data
o Sharding the pattern matcher network →
Avoid memory bottleneck
Further advantages
o Agnostic to the representation of the graph
• Property graph, (EMF, RDF)
• Information from the metamodel is only used for indexing
o Query layer decoupled from the data storage
• Storage layer freely exchangeable
• Indexing is independent of storage features
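The slides do not state which sharding strategy is used, so the following sketch assumes simple hash-based sharding of edges by their source vertex. The function names and the edge format `(source, label, target)` are hypothetical. A stable hash (`zlib.crc32`) is used so that the same vertex always lands on the same shard across runs.

```python
import zlib


def shard_of(vertex_id: str, num_shards: int = 4) -> int:
    """Map a vertex id to a shard index with a hash that is stable across runs."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_shards


def shard_edges(edges, num_shards: int = 4):
    """Distribute (source, label, target) edges to shards by their source vertex."""
    shards = [[] for _ in range(num_shards)]
    for src, label, dst in edges:
        shards[shard_of(src, num_shards)].append((src, label, dst))
    return shards


edges = [
    ("segment1", "connectsTo", "segment2"),
    ("segment1", "sensor", "sensor7"),
    ("switch3", "connectsTo", "segment1"),
    ("route9", "follows", "switch3"),
]
shards = shard_edges(edges)
```

Because the query layer only sees the change notifications from the indexers, a different placement rule could be swapped in without touching the Rete net.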
13. Scalability considerations
Construction process
1. Shard the data in the storage layer
2. Derive a Rete net layout from the query
3. Allocate the middleware indexers
4. Allocate the Rete nodes in the cloud
Design aspects for scalability
o Local resource limitations
o Load balancing
o Minimize remote communication
• Given problem characteristics, global resource requirements can
be calculated
• Approach intrinsically supports dynamic scaling
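Step 4 of the construction process can be illustrated with a small allocation sketch. It assumes that a memory estimate per Rete node is already known and uses a greedy heuristic (largest node first, placed on the host with the most free memory). This addresses only the local resource limit and load balancing aspects; it does not try to minimize remote communication. The function, node names and numbers are hypothetical, and the slides do not say which allocation algorithm IncQuery-D uses.

```python
def allocate(node_memory_gb, host_capacity_gb, num_hosts):
    """Greedy placement of Rete nodes on hosts under a per-host memory limit."""
    free = [host_capacity_gb] * num_hosts
    placement = {}
    # Place the largest nodes first; they are the hardest to fit.
    for name, need in sorted(node_memory_gb.items(), key=lambda kv: kv[1], reverse=True):
        host = max(range(num_hosts), key=lambda h: free[h])  # least loaded host
        if free[host] < need:
            raise ValueError(f"node {name!r} ({need} GB) does not fit on any host")
        free[host] -= need
        placement[name] = host
    return placement


estimates = {"join1": 6, "join2": 5, "antijoin": 4, "production": 1}
placement = allocate(estimates, host_capacity_gb=8, num_hosts=3)
```

If the estimates change at run time, rerunning the same function gives a new layout, which is one simple way to think about dynamic scaling.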
15. Evaluation of IncQuery-D
Benchmark goal
o Evaluate the feasibility of the concept
o Measure the scalability characteristics
o Workload profile similar to real world model validation
Scenarios
o Batch – “traditional” batch graph search
o Incremental – Rete network
Operations
o Simulates a user’s interaction with a model
o Load and first validation; transformation; revalidation
16. Batch graph scenario
Load and first validation: load the graph into the databases
and execute the query
Transformation: query the graph and delete some
elements
Revalidation: execute the query again
[Figure: batch graph scenario next to the incremental scenario (IncQuery-D). GraphML is loaded into the DB shards; each phase (load and first validation, transformation, revalidation) reads its result set from the DB shards]
17. Incremental scenario – IncQuery-D
Load and first validation: load the graph into the databases,
initialize the Rete net and retrieve the results
Transformation: incrementally query the graph,
delete some elements and propagate the changes
Revalidation: retrieve the results from the Rete net
[Figure: batch graph scenario next to the incremental scenario (IncQuery-D). GraphML is loaded into the DB shards; each phase (load and first validation, transformation, revalidation) reads its result set from the Rete net built on top of the DB shards]
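The difference between the two scenarios can be sketched on a toy model. The constraint below (an element is invalid if its `length` is not positive) and all names are hypothetical. The incremental class only caches a single filter condition, so it is a strong simplification of a Rete net, but it shows why revalidation becomes a lookup instead of a full scan.

```python
def is_violation(element):
    """Hypothetical well-formedness constraint used only for illustration."""
    return element["length"] <= 0


def batch_validate(model):
    """Batch scenario: every validation scans the whole model."""
    return {eid for eid, e in model.items() if is_violation(e)}


class IncrementalValidator:
    """Incremental scenario: the result set is cached and kept up to date."""

    def __init__(self, model):
        self.model = model
        self.matches = batch_validate(model)  # load and first validation

    def delete(self, eid):
        """Transformation step: delete an element and propagate the change."""
        del self.model[eid]
        self.matches.discard(eid)

    def revalidate(self):
        """Revalidation: return the maintained result set without rescanning."""
        return self.matches


model = {1: {"length": 5}, 2: {"length": 0}, 3: {"length": -1}}
validator = IncrementalValidator(dict(model))
before = set(validator.revalidate())
validator.delete(2)
after = set(validator.revalidate())
```

Both approaches return the same result set after the deletion; the batch version pays for a full scan on every revalidation, while the incremental version pays a small cost at each change.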
18. Implementation
[Figure: deployment of the prototype. Diagram labels: Transaction; In-memory EMF model; Indexer layer; Rete net; IncQuery-D middleware; Server 0–3, each with one DB shard (DB shard 0–3)]
o Hardware: 4 Ubuntu Linux servers, 16 GB RAM, 2×2.5 GHz Intel Xeon CPU
o Storage: Neo4j, accessed with Cypher through REST
o Middleware and Rete net: Akka (asynchronous communication)
Detailed benchmark description: http://incquery.net/publications/incquery-d
19. Load and first validation phase
[Chart: time [s] on a logarithmic axis (1 to 4096) against model size [million elements / file size in GB]: 0.1 / 0.008, 0.2 / 0.015, 0.5 / 0.03, 0.9 / 0.06, 1.7 / 0.114, 3.5 / 0.231, 7.1 / 0.47, 14.1 / 0.945, 28.0 / 1.907, 55.8 / 3.853. Series: Neo4j/Cypher (batch), IncQuery-D (incremental)]
o Small overhead for the Rete network’s construction
o 50M+: approx. 30 minutes
o Parallel loading of the graph from a GraphML representation
20. Transformation phase
1. Elementary model query
2. Model manipulation
[Chart: time [s] on a logarithmic axis (1 to 4096) against model size [million elements / file size in GB], from 0.1 / 0.008 to 55.8 / 3.853. Series: Neo4j/Cypher (batch), IncQuery-D (incremental)]
Neo4j/Cypher (batch)
• Both steps implemented with Cypher
• The query evaluation time dominates
IncQuery-D (incremental)
• The query is supported by the Rete net
• Only the manipulation is implemented with Cypher
• Overhead due to change propagation is negligible
• About 1.5 orders of magnitude faster
• Performs a transformation over a 55M model in one minute
21. Revalidation phase
[Chart: time [s] on a logarithmic axis (0.25 to 4096) against model size [million elements / file size in GB], from 0.1 / 0.008 to 55.8 / 3.853. Series: Neo4j/Cypher (batch), IncQuery-D (incremental)]
o Near-instant response time for very large models
o Different characteristics: 4 orders of magnitude difference for the largest model
o Revalidation time is independent of node size
23. Conclusions
Novel approach for the distributed execution of
incremental graph queries
Distributed Rete network
o Middleware for change propagation and indexing
o Incremental query layer decoupled from a sharded
graph database
Results
o Working proof of concept
o Near-instantaneous query evaluation for models with
50M+ elements
o Improves scalability of transformations significantly
24. Future work
Tooling and automation
o Evolve the prototype into a developer tool
Explore optimization possibilities
o Allocation of Rete nodes
o Dynamic reallocation of Rete nodes
o Sharding strategy, resource usage, network
communication overhead
Cloud readiness
Experiment with distributed EMF model stores
o CDO, MongoEMF, Morsa, …