The requirements of many modern data platforms develop along two directions: (1) low latency, i.e. the shift from batch-oriented to event-driven processes, which facilitates much more timely and reactive insights; and (2) complex analytics, i.e. the ability to efficiently apply analytic functions or models to incoming data streams. However, many companies do not start from scratch: they already have well-established data infrastructure and processes with varying degrees of affinity and compatibility to these novel paradigms. Based on extensive experience building data platforms with customers, this talk describes some key challenges and aspects of introducing streaming-based approaches in real-world production environments. These include integrating existing batch-oriented data sources and APIs, checking consistency when using event sourcing to exchange data, and building realtime analytical visualizations. For each case, architectural options are discussed and the final solution is explained, covering technologies such as Apache Nifi, Airflow, Phoenix, Druid and the Confluent Platform. The talk closes with non-technical aspects such as building up an event-driven mindset among analysts.
Speaker: Dr. Dominik Benz, inovex
Event: Confluent Meetup, 08.10.2018
More tech talks: https://www.inovex.de/de/content-pool/vortraege/
More tech articles: https://www.inovex.de/blog/
Down the event-driven road: Experiences of integrating streaming into analytic data platforms
1. Down the event-driven road: Experiences of integrating streaming into analytic data platforms
Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH
Confluent Meetup Munich, 8.10.2018
3. Down the event-driven road ..
› Analytic (Streaming) Data Platforms
› Integrating existing (batch) data sources
› Checking consistency
› Building realtime visualizations
› Wrap up & Summary
4. A typical analytic data platform
[Architecture diagram: ingress → raw → processed → datahub → analysis → egress; scheduling, orchestration, metadata, user access, system integration and development as cross-cutting concerns]
› Ingress: flat files, databases, APIs, ...
› Storage: (Hive) tables
› Processing: batch processing (Spark, Hive, ..)
› Scheduling / metadata: Airflow, Hive Metastore
› Analysis: SQL, notebooks (Zeppelin, ..)
5. A typical (?) streaming data platform
[Architecture diagram: same ingress → raw → processed → datahub → analysis → egress layout as the batch platform]
› Ingress: input data (streams), Kafka Connect
› Storage: (Kafka) topics, KTables, ..
› Processing: stream processing (Kafka Streams, ..)
› Metadata: (Confluent) Schema Registry
› Analysis: KSQL
6. Down the event-driven road ..
8. Integrating web tracking: setup / constraints
› Hortonworks-based platform, including Nifi and Confluent Platform
› Apache Airflow established as the scheduling / workflow tool, integrated into monitoring, alerting, ..
› Tracking service: currently a batch-oriented API (request data, get download links, ..), but a click event stream is planned
› Developers / analysts with mixed backgrounds w.r.t. programming skills
9. Apache Nifi in a nutshell
› drag-and-drop visual definition of data pipelines
› various built-in connectors (file, stream, database, service, ...)
› event-based processing paradigm
› built-in queues, data provenance, backpressure handling, registry, ...
› focus: ingest & lightweight (!) transformations
› not a complex event processor (like Kafka Streams, Flink, Spark Streaming, ...)
› integrated into the HDP stack
10. Apache Airflow in a nutshell
› Python library to define & schedule batch workflows
› programmatic specification of a „DAG“ (= tasks + dependencies)
› clean handling of job run metadata (success, duration, ..)
› developed by Airbnb, open-sourced 2015
› built-in standard operators (bash, hive, spark, kubernetes, ..)
› easily extensible (custom operators, ..)
› once used -> never Oozie again :)
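The slide's core idea, a DAG as tasks plus dependencies, can be sketched without Airflow itself. A minimal, self-contained stand-in (task and dependency names are illustrative, not from the talk; a real Airflow DAG would use `airflow.DAG` and operator classes instead):

```python
# Minimal stand-in for Airflow's "DAG = tasks + dependencies" idea:
# each task runs only after all of its upstream dependencies have run.
from graphlib import TopologicalSorter

def run_dag(tasks, dependencies):
    """tasks: {name: callable}; dependencies: {name: set of upstream names}."""
    order = TopologicalSorter(dependencies)
    results = {}
    for name in order.static_order():  # raises CycleError on cyclic deps
        results[name] = tasks[name]()
    return results

# Hypothetical tracking-ingest workflow resembling the talk's use case
tasks = {
    "fetch_links": lambda: ["link1", "link2"],
    "download": lambda: "downloaded",
    "store": lambda: "stored",
}
deps = {"fetch_links": set(), "download": {"fetch_links"}, "store": {"download"}}
print(run_dag(tasks, deps))
```

A scheduler on top of this (Airflow's role) would then run the whole DAG per interval and record per-task status, which is exactly the job-run metadata the slide mentions.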
11. Integrating web tracking: options
[Diagram: tracking service → ingest]

Option: Airflow only
  + integrated into monitoring, ..
  + job status handling, reloading
  - not prepared for future streaming API
  - handling file content complicated

Option: Unified abstraction (e.g. Apache Beam)
  + one model for batch / stream ingest
  - comparatively high entry barrier

Option: Nifi only
  + visual pipeline definition
  + easy handling of file content
  + event-based paradigm
  + operators available
  - custom status handling, reloading

Option: Kafka-Connect
  + fault-tolerant
  + scalable setup
  - custom connector coding
  - custom status handling, reloading
12. Integrating web tracking: chosen solution – Airflow + Nifi
› Combines advantages of Airflow & Nifi
› Prepared for future streaming API
› Integrated into monitoring, alerting, ..
› Status handling / reloading easy
[Diagram: Airflow triggers hourly and checks status via sensors; Nifi is triggered, fetches the download links from the tracking service, then downloads, processes and stores the data]
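The trigger / check-status / download flow can be sketched as a sensor-style polling loop. This is a simplification: in the real setup Airflow sensors do the polling and Nifi does the fetching, and the function names here are hypothetical:

```python
# Sensor-style polling: wait until the batch API reports the export ready,
# then fetch the download links (mirrors the Airflow-sensor + Nifi split).
import time

def wait_for_status(check, timeout_s=60.0, poke_interval_s=1.0):
    """Poll `check()` until it returns truthy, like an Airflow sensor."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poke_interval_s)
    raise TimeoutError("export not ready within timeout")

# Fake tracking-service API: reports ready on the third status call
calls = {"n": 0}
def status_ready():
    calls["n"] += 1
    return calls["n"] >= 3

wait_for_status(status_ready, timeout_s=10.0, poke_interval_s=0.01)
print("ready after", calls["n"], "status checks")
```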
13. Down the event-driven road ..
15. Checking consistency: setup / constraints
› Analysts need up-to-date version of customer consent information in platform
› Hard correctness requirements (especially regarding revoked consent)
› Continuous monitoring of correctness
› Alerting in case of differences
16. Checking consistency: statistics events
[Diagram: customer portal → kafka → platform]
› use existing channel (kafka)
› source injects periodic „statistics events“ into the stream with a defined measure point (in time):

{type:GRANT, cid:12, ts:2018-10-01 11:00:00 ..}
{type:GRANT, cid:10, ts:2018-10-01 11:01:00 ..}
{type:REVOKE, cid:09, ts:2018-10-01 11:01:05 ..}
{type=STAT, measure_ts=2018-10-01 11:01:20,
 stats={num_consent_v1:72625, num_consent_v2:6252, ..}}
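A sketch of the source-side statistics-event injection, in plain Python without a Kafka client. The field names follow the slide; the exact counting logic (grants only, keyed per consent version) is an assumption:

```python
# Source-side sketch: given the consent events emitted so far, build a STAT
# event carrying counts up to a defined measure point in time.
from collections import Counter

def make_stat_event(events, measure_ts):
    """Count GRANT events up to measure_ts, keyed per consent version."""
    stats = Counter()
    for e in events:
        if e["ts"] <= measure_ts and e["type"] == "GRANT":
            stats[f"num_consent_{e['version']}"] += 1
    return {"type": "STAT", "measure_ts": measure_ts, "stats": dict(stats)}

events = [
    {"type": "GRANT", "cid": 12, "version": "v1", "ts": "2018-10-01 11:00:00"},
    {"type": "GRANT", "cid": 10, "version": "v2", "ts": "2018-10-01 11:01:00"},
    {"type": "REVOKE", "cid": 9, "version": "v1", "ts": "2018-10-01 11:01:05"},
]
stat = make_stat_event(events, measure_ts="2018-10-01 11:01:20")
print(stat)  # STAT event shaped like the slide's example
```

In production this event would be produced to the same Kafka topic as the consent events, so it travels through exactly the channel being checked.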
17. Checking consistency: evaluate statistics event
› perform count on the target side (Hive) up to $measurePoint
› compare counts
› counts = simple plausibility check, but more elaborate checks (hashes) thinkable

Source (statistics event):
{type=STAT, measure_ts=2018-10-01 11:01:20,
 stats={num_consent_v1:72625, num_consent_v2:6252, ..}}

in sync?

Target (Hive counts):
{measure_ts=2018-10-01 11:01:20,
 hive_stats={num_consent_v1:72625, num_consent_v2:6252, ..}}
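The comparison step can be sketched as follows; in the real setup the target counts come from a Hive query bounded by `$measurePoint`, while here both sides are plain dicts:

```python
# Consistency check: compare the source's STAT event against counts computed
# on the target (Hive) side up to the same measure point; alert on mismatch.
def check_in_sync(stat_event, hive_stats):
    """Return (in_sync, differences) for one measure point."""
    diffs = {}
    for key, source_count in stat_event["stats"].items():
        target_count = hive_stats.get(key)
        if target_count != source_count:
            diffs[key] = {"source": source_count, "target": target_count}
    return (not diffs, diffs)

stat_event = {"type": "STAT", "measure_ts": "2018-10-01 11:01:20",
              "stats": {"num_consent_v1": 72625, "num_consent_v2": 6252}}
hive_stats = {"num_consent_v1": 72625, "num_consent_v2": 6252}

in_sync, diffs = check_in_sync(stat_event, hive_stats)
print("in sync?", in_sync)
```

A non-empty `diffs` is what would feed the alerting mentioned on slide 15; swapping counts for hashes of the row contents would give the stronger check the slide hints at.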
18. Down the event-driven road ..
20. Realtime visualizations: setup / constraints
› Goal: timely insights into various purchase aspects (items bought last 5min, ..)
› flexible / configurable frontend (time window, aggregation dimension, ..)
› scalable to 100s / 1000s of dashboard users
› low latency of dashboard backend
21. Realtime visualizations: components / options
[Diagram: three candidate stacks, each spanning transport layer → processing → service backend → service API]
› Option 1 (aggregation during processing): JMS → Kafka-connect → Kafka → Kafka-streams → Kafka-connect → HBase → Phoenix / JDBC → Spring Boot
› Option 2 (aggregation at query-time): JMS → Nifi → Kafka → Kafka-connect → HBase → Phoenix / JDBC → Spring Boot
› Option 3 (built-in configurable aggregation): JMS → Nifi → Kafka → Tranquility → Druid → Spring Boot
22. Realtime visualizations: chosen solution
[Diagram: JMS → Nifi → Kafka → Tranquility → Druid → Spring Boot]
› Druid: time series database with focus on
  › realtime ingestion, good Kafka integration
  › „slice-and-dice“ queries
  › distributed scale-out architecture
› Event processing kept simple in Nifi
  › mainly cleaning, transformation
  › aggregation is pushed down to Druid
› But: yet another distributed system .. :(
› Experiences good so far, but needs work / skills
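The "aggregation pushed down to Druid" idea, i.e. query-time aggregation over a configurable time window (items bought in the last 5 minutes, grouped by a chosen dimension), can be sketched in plain Python. In production Druid answers this via a timeseries or groupBy query; the event fields here are assumptions:

```python
# Query-time aggregation sketch: sum purchased quantities per dimension value
# within a sliding time window, the kind of question the dashboard asks Druid.
from collections import defaultdict

def window_aggregate(events, now_ts, window_s, dimension):
    """Aggregate quantities per `dimension` over the last `window_s` seconds."""
    counts = defaultdict(int)
    for e in events:
        if now_ts - window_s <= e["ts"] <= now_ts:
            counts[e[dimension]] += e["quantity"]
    return dict(counts)

events = [
    {"ts": 100, "category": "books", "quantity": 2},
    {"ts": 350, "category": "games", "quantity": 1},
    {"ts": 380, "category": "books", "quantity": 3},
]
# Items bought in the last 5 minutes (300 s), grouped by category
print(window_aggregate(events, now_ts=400, window_s=300, dimension="category"))
```

Because window and dimension are query parameters rather than baked into the processing step, the frontend stays flexible, which is exactly why aggregation was not done in Nifi.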
23. Down the event-driven road ..
24. The human factor ..
› Technology moves from batch to stream – what about people?
› Analysts' world = often batch world
  › tooling centered around static datasets
  › can (and must) be generated from streams
  › but: education towards stream / event-based thinking necessary!
› Incremental / stream-based data exchange = paradigm shift
  › efforts / commitment „from both ends“ necessary
https://flic.kr/p/f2W
25. Stream me up, Scotty ..
The future is event-based, but on the way:
› Existing batch-oriented APIs
  › use (scheduled) event-based tools for easier later migration
› Checking consistency
  › inject plausibility checks into the data stream
› Realtime visualizations
  › Druid + Kafka: powerful and flexible combination
› Don't forget the human in the loop!
26. Thank you
Dr. Dominik Benz
dominik.benz@inovex.de
inovex GmbH
Park Plaza
Ludwig-Erhard-Allee 6
76131 Karlsruhe