Extending Complex Event Processing to Graph-structured Information

Gala Barquero (1), Loli Burgueño (2), Javier Troya (3), Antonio Vallecillo (1)
(1) Universidad de Málaga, Spain
(2) Universitat Oberta de Catalunya, Spain
(3) Universidad de Sevilla, Spain

Complex Event Processing
1. CEP is a stream-processing method that analyzes and correlates streams of
information about real-time events in order to derive conclusions from them.
2. CEP allows complex events to be defined on top of other events (primitive or complex).
3. CEP programs are composed of rules, which are in charge of processing the events.
[Figure: queries (patterns), data, and results]

4. CEP programs define (size or temporal) windows on the stream of events
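
The two kinds of window mentioned above can be illustrated with a minimal sketch. It is written in plain Python for brevity and is not taken from any CEP engine; the `Event` shape, the function names, and the sample stream are assumptions made for this example, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload); hypothetical shape


def size_windows(events: Iterable[Event], size: int) -> Iterator[List[Event]]:
    """Yield the last `size` events each time a new event arrives."""
    window: Deque[Event] = deque(maxlen=size)
    for event in events:
        window.append(event)  # the oldest event is dropped automatically
        yield list(window)


def temporal_windows(events: Iterable[Event], span: float) -> Iterator[List[Event]]:
    """Yield the events at most `span` seconds older than the newest one."""
    window: Deque[Event] = deque()
    for event in events:
        window.append(event)
        now = event[0]
        # Evict events that fell out of the time span (assumes ordered timestamps).
        while window and now - window[0][0] > span:
            window.popleft()
        yield list(window)


stream = [(0.0, "a"), (1.0, "b"), (5.0, "c"), (6.0, "d")]
last_size = list(size_windows(stream, size=2))[-1]
last_time = list(temporal_windows(stream, span=2.0))[-1]
```

A rule would then be evaluated against the content of the window rather than against the whole stream.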

Current CEP technologies
1. Efficient languages and technologies for processing huge streams of data
  • 6.5 zettabytes (1 ZB = 10^21 bytes) in 2016
  • 15.3 zettabytes expected in 2020
2. Increasingly used (and useful) in applications such as critical infrastructure
monitoring, real-time market trend analysis, and prediction of plagues and natural disasters

However, real information is normally structured in more complex ways
1. Data is structured not only as a sequence of timed events, but also as graphs that
combine transient (stream) and persistent (database) information
  • Queries about social trends based on Twitter feeds and shared Flickr photos
  • Monitoring tendencies via Twitter and Facebook posts

Our contribution
1. Extend CEP systems and languages to deal with graph-based information
  • Deal with both streams of timed events and graphs of persistent data
  • Extend the concept of a CEP “sequential window” to a “spatial window”
  • Keep up with the stringent performance and scalability requirements of CEP systems
2. To do this, we decided to:
  • Generalize the structure of a CEP stream from a sequence of time-ordered events to a
Model (i.e., a graph of interrelated elements, with time being just one dimension)
  • Consider the behavior of a CEP system as a particular kind of in-place Model
Transformation
  • Use the concept of “vicinity graphs” to define and implement spatial windows in
models (a generalization of CEP’s sequential windows)
  • Use recent graph-parallel computational technologies to provide the supporting
storage and access infrastructure for the models, and graph-processing systems to
implement the corresponding in-place model transformations
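
One possible reading of a spatial window is sketched below: the set of model elements within a bounded number of hops of a given element. The slides do not define vicinity graphs precisely, so the hop-based definition, the `vicinity` function, and the sample model are assumptions made for illustration only (plain Python, not the authors' Spark/GraphX implementation).

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # adjacency list; node names are made up


def vicinity(graph: Graph, center: str, max_hops: int) -> Set[str]:
    """Return the nodes reachable from `center` in at most `max_hops` edges."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the window radius
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return seen


# Hypothetical model fragment: a tweet, its author, a hashtag, and a photo.
model: Graph = {
    "tweet1": ["userA", "#sunset"],
    "#sunset": ["photo7"],
    "photo7": ["userB"],
    "userA": [],
    "userB": [],
}
one_hop = vicinity(model, "tweet1", 1)
two_hops = vicinity(model, "tweet1", 2)
```

Under this reading, a rule is matched only against the elements inside the vicinity, in the same way that a sequential window restricts a rule to recent events.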

Case study: Twitter and Flickr

Q1: A HotTopic event is generated every time a hashtag has been used by both
Twitter and Flickr users at least 100 times in the last hour.
Q2: A PopularTwitterPhoto element is created when the hashtag of a photo is
mentioned in a tweet that receives more than 30 likes in the last hour.
Q3: A PopularFlickrPhoto element is created when a photo is favored by more
than 50 Flickr users who have more than 50 followers.
Q4: A NiceTwitterPhoto event is generated when a user with an h-index higher
than 50 posts three tweets in a row in the last hour containing a hashtag
that describes a photo.
Q5: An InfluencerTweeted event is generated, considering the 10K most recent
tweets, when a user with an h-index higher than 70 and more than 50K
followers sends a tweet.
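
As an illustration of how a temporal window interacts with a query such as Q1, here is a minimal plain-Python sketch. It assumes that "by both Twitter and Flickr users at least 100 times" means at least 100 uses on each platform, which is only one reading of the query; the `Use` record, the `hot_topics` function, and the sample data are hypothetical and do not reproduce the authors' Scala rule.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Use = Tuple[float, str, str]  # (timestamp in seconds, source, hashtag); hypothetical record


def hot_topics(uses: Iterable[Use], now: float,
               window: float = 3600.0, threshold: int = 100) -> List[str]:
    """Hashtags used at least `threshold` times on each source within the window."""
    counts = {"twitter": Counter(), "flickr": Counter()}
    for timestamp, source, hashtag in uses:
        if 0 <= now - timestamp <= window:  # keep only uses from the last `window` seconds
            counts[source][hashtag] += 1
    return sorted(
        tag for tag, n in counts["twitter"].items()
        if n >= threshold and counts["flickr"][tag] >= threshold
    )


# Small sample with a lowered threshold so the effect is visible.
uses = [
    (10.0, "twitter", "#sun"), (20.0, "twitter", "#sun"),
    (30.0, "flickr", "#sun"), (40.0, "flickr", "#sun"),
    (50.0, "twitter", "#rain"), (60.0, "twitter", "#rain"),
    (70.0, "flickr", "#rain"),
]
hot_now = hot_topics(uses, now=100.0, threshold=2)
hot_later = hot_topics(uses, now=3625.0, threshold=2)
```

In the sample, `#sun` qualifies at time 100 but no longer qualifies at time 3625, because its Twitter uses have left the one-hour window.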

Current Implementation
1. Models implemented with Apache Spark
  • RDDs (resilient distributed datasets) used to store both model elements (graph vertices)
and their relations (edges)
  • Models populated using the sources’ APIs to obtain the data
  • One thread for each stream of events in the case of streaming data
2. Model transformation rules (modeling the corresponding CEP rules) implemented in
Scala
  • Implemented in terms of Spark and GraphX functions
  • One dedicated running thread for each rule
  • Produced events are also stored in RDDs
3. Data lifecycle
  • Transient data (and their relationships) have an “expiration date” (ED)
  • The ED is determined by the largest window of the rules that deal with the event
  • Once the ED of an element has passed, the element is removed from the system
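
The data lifecycle above can be sketched as follows. This is a simplified plain-Python illustration covering temporal windows only; the `RULE_WINDOWS` table, the event type names, and both functions are assumptions made for this example and are not the authors' Spark code.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: event type -> window lengths (seconds) of the rules that read it.
RULE_WINDOWS: Dict[str, List[float]] = {
    "Tweet": [3600.0, 600.0],
    "FlickrFavorite": [1800.0],
}


def expiration_date(event_type: str, created_at: float) -> float:
    """Creation time plus the largest window among the rules that use this event type."""
    return created_at + max(RULE_WINDOWS[event_type])


def purge(elements: List[Tuple[str, float]], now: float) -> List[Tuple[str, float]]:
    """Keep only the (event_type, created_at) elements whose ED has not passed."""
    return [e for e in elements if expiration_date(e[0], e[1]) >= now]


elements = [("Tweet", 0.0), ("FlickrFavorite", 0.0), ("Tweet", 3000.0)]
alive = purge(elements, now=3600.5)
```

Using the largest window guarantees that an element is never removed while some rule could still match it.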

Scala code for the “HotTopic” Rule

Analyses
1. Performance
  • How fast are we?
  • Is the performance of our proposal acceptable for dealing with large systems?
  • How do we compare with CEP systems (when only one-dimensional streams are used)?
2. Expressiveness
  • Are we as expressive as CEP languages?
  • Can we write all CEP patterns with GraphX?
  • How easy is it to write rules with our proposal?

Performance analysis
1. Performance Figures for the Twitter and Flickr case study (in milliseconds)
2. Comparison figures with other solutions (127K/6500K):

Performance analysis: comparison with streaming CEP systems
1. A different case study (Motorbike) implemented using both our solution and Esper

Expressiveness
1. We have been able to express all queries using Scala and GraphX
2. However, expressing the queries is not simple
Scala code for the “DriverLeftSeat” rule:

Esper code for the “DriverLeftSeat” rule:
Cypher code for the “DriverLeftSeat” rule:

Technology (and its rapid evolution) is an issue in this context
Technology | In memory | Query language | Pros | Cons
Neo4j | No | Cypher | Expressiveness and usability of Cypher; easy to install and use | Scalability; very slow disk access (R/W); no in-memory implementation available
Spark + GraphX | Yes | Scala | Versatile and very expressive language; easy to install; implements cluster mode (distributed) | Cumbersome as a query language for graphs; uses lazy evaluation; complex configuration in cluster mode
Viatra | Yes | Viatra | Speed and general performance; good language for querying models; very expressive | Difficult to install and configure; documentation is scarce
TinkerGraph | Yes | Gremlin | Graph-native language and tools; in-memory implementation | Learning curve of Gremlin
CrateDB | No | SQL | Uses disk, but very efficiently (scalability); SQL is well known and widely used; implements cluster mode (distributed) | Writing graph queries in SQL is not easy (especially queries involving hops)

Conclusions and future work
Contribution: an extension of CEP systems to deal with graph-structured information:
  • Deal with both streams of timed events and graphs of persistent data
  • Represent the information to be managed as a Model
  • Consider the behavior of a CEP system as an in-place Model Transformation
  • Extend the concept of CEP windows to spatial windows on models
  • Use graph-parallel computational technologies to provide the supporting storage
and access infrastructure
  • Use graph-processing languages and systems to implement the corresponding
model transformations

Future work
1. Performance
  • Experiment with other technologies beyond Spark + GraphX
  • Each one has pros and cons (expressiveness, performance, scalability, distribution)
  • Volatility is an issue: these technologies change too rapidly
2. Expressiveness
  • Compilers from query languages to storage technologies could be a solution
  • For example, from Cypher to Gremlin or to Scala + GraphX
3. Correctness/Accuracy
  • What is the error introduced by the use of spatial windows?
  • Here we need to trade accuracy for performance
  • Approximate queries and model transformations…
Q: A YoungInfluencer is a TwitterUser younger than 25 years old who has more
than 30 followers older than 25 years old.
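
To make the accuracy question concrete, the sketch below evaluates the YoungInfluencer query twice: once over all followers, and once over only the followers that fall inside a window. Treating a spatial window as a set of visible nodes is an assumption made for this example, as are the function name, the `visible` parameter, the lowered threshold, and the sample data (plain Python, for illustration only).

```python
from typing import Dict, List, Optional, Set


def is_young_influencer(user: str, ages: Dict[str, int],
                        followers: Dict[str, List[str]],
                        min_followers: int = 30,
                        visible: Optional[Set[str]] = None) -> bool:
    """True if `user` is under 25 and has more than `min_followers` followers over 25.

    If `visible` is given, only followers inside that set are counted, which
    mimics evaluating the query inside a window instead of the whole model.
    """
    if ages[user] >= 25:
        return False
    older = [
        f for f in followers.get(user, [])
        if ages[f] > 25 and (visible is None or f in visible)
    ]
    return len(older) > min_followers


# Hypothetical data; the threshold is lowered to 2 to keep the sample small.
ages = {"ana": 22, "f1": 30, "f2": 40, "f3": 26, "f4": 20}
followers = {"ana": ["f1", "f2", "f3", "f4"]}

exact = is_young_influencer("ana", ages, followers, min_followers=2)
windowed = is_young_influencer("ana", ages, followers, min_followers=2,
                               visible={"f1", "f2", "f4"})
```

Here the exact evaluation matches, while the windowed one misses the match because one qualifying follower lies outside the window. This is the kind of error that would need to be quantified.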
