StreamSets Data Collector is an open source data integration tool that ingests data from a wide range of sources in both batch and streaming modes. Its record-oriented approach to data processing avoids the combinatorial explosion of point-to-point integrations. Pipelines are built visually in an IDE-style interface, so even non-technical users can develop integrations. StreamSets was founded by former Cloudera and Informatica employees and follows a continuous open source development model.
1. Streaming Data Ingestion in Big Data and IoT Applications
Guido Schmutz – 27.9.2018
@gschmutz guidoschmutz.wordpress.com
2. Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, and Software Architect for Java, Oracle, SOA, and Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
4. Agenda
1. Big Data and IoT Reference Architecture
2. Event Hub
3. Stream Data Integration
Apache NiFi
StreamSets Data Collector
Kafka Connect
4. Summary
6. Big Data solves Volume and Variety – not Velocity
Introduction to Stream Processing
[Diagram: bulk sources (DB, File) and enterprise apps feed a Big Data platform via file import / SQL import; raw and refined storage plus parallel processing deliver results to BI tools, the enterprise data warehouse (SQL), and search/explore – at high latency]
7. Big Data solves Volume and Variety – not Velocity
Introduction to Stream Processing
[Diagram: the same batch architecture, now extended with event sources – location, telemetry, IoT data, mobile apps, social – emitting an event stream]
8. Big Data solves Volume and Variety – not Velocity
Introduction to Stream Processing
[Diagram: the event sources now feed an Event Hub in front of the Big Data platform; parallel processing covers machine learning, graph algorithms, and natural language processing]
9. Stream Processing Architecture solves Velocity
Introduction to Stream Processing
[Diagram: event and bulk sources feed event hubs; a stream analytics platform (stream analytics, reference/models, dashboard) serves search, BI tools, the enterprise data warehouse, and enterprise apps – low(est) latency, but no history]
10. Big Data for all historical data analysis
Introduction to Stream Processing
[Diagram: the stream analytics platform and the Big Data platform side by side; the event hub feeds both, so streaming covers low latency while the Big Data platform keeps the full history]
11. Integrate existing systems through CDC
Introduction to Stream Processing
• Capture changes directly on the database
• Change Data Capture (CDC) => think of it like a global database trigger
• Transforms existing systems into event producers
[Diagram: a traditional silo-based system (user interface, logic, state) is tapped by a CDC connector, which publishes change events through the event hub to consuming systems]
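As a concrete illustration (my example, not on the original slide), a CDC connector such as Debezium can be registered with Kafka Connect to turn a MySQL database into an event producer. All hostnames, credentials, and topic names below are placeholders:

# Hedged sketch: register a Debezium MySQL CDC connector via the
# Kafka Connect REST API; values are placeholders, not from the talk.
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "inventory-cdc",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "mysql",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "dbz",
      "database.server.id": "184054",
      "database.server.name": "inventory",
      "database.history.kafka.bootstrap.servers": "kafka:9092",
      "database.history.kafka.topic": "schema-changes.inventory"
    }
  }'

Once registered, every insert, update, and delete on the captured tables appears as an event on a Kafka topic, without touching the silo system's application code.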
12. Integrate existing systems with lower latency through CDC
Introduction to Stream Processing
[Diagram: the unified picture again, with existing databases now feeding the event hub through CDC instead of only batch file import / SQL import]
13. New systems participate in event-oriented fashion
Introduction to Stream Processing
[Diagram: a microservice platform (microservices with state and APIs) and a stream analytics platform (stream processors with state) both consume from and produce to the event hub, alongside the Big Data platform]
14. Edge computing allows processing close to data sources
Introduction to Stream Processing
[Diagram: an edge node with rules, a local event hub, and storage sits between the IoT/event sources and the central event hub, so events can be filtered and processed close to where they originate]
15. Unified Architecture for Modern Data Analytics Solutions
[Diagram: the complete picture – bulk and event sources, edge nodes, the event hub, the Big Data platform, stream analytics, and microservices, serving BI tools, the enterprise data warehouse, and search/explore]
16. Two Types of Stream Processing (from Gartner)
Introduction to Stream Processing
Stream Data Integration
• primarily focuses on the ingestion and processing of data sources, targeting real-time extract-transform-load (ETL) and data integration use cases
• filters and enriches the data
• optionally calculates time-windowed aggregations before storing the results in a database or file system
Stream Analytics
• targets analytics use cases
• calculates aggregates and detects patterns to generate higher-level, more relevant summary information (complex events)
• complex events may signify threats or opportunities that require a response from the business through real-time dashboards, alerts, or decision automation
18. Implementing "Event Hub"
Introduction to Stream Processing
[Diagram: the event hub at the center of the unified architecture, feeding Big Data batch analytics, stream analytics, and modern applications, and supporting replay of stored event streams]
19. Apache Kafka – A Streaming Platform
Introduction to Stream Processing
• High-level architecture
• Distributed log at the core
• Scale-out architecture
• Logs do not (necessarily) forget
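To make the "distributed log" idea concrete, here is a minimal sketch (my addition, not part of the slide) using the standard Kafka command-line tools; broker address and topic name are placeholders:

# Write an event to a topic...
echo '{"sensor":"s1","temp":21.5}' | kafka-console-producer.sh \
  --broker-list localhost:9092 --topic sensor-events

# ...and read it back from the beginning of the log. Because the log
# retains data, another consumer can replay the same events later.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic sensor-events --from-beginning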
20. Hold Data for Long-Term – Data Retention
Introduction to Stream Processing
[Diagram: a producer writing to a topic replicated across three brokers]
Retention options:
1. Never
2. Time based (TTL): log.retention.{ms | minutes | hours}
3. Size based: log.retention.bytes
4. Log compaction based (entries with the same key are removed):

kafka-topics.sh --zookeeper zk:2181 \
  --create --topic customers \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact
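As a complementary sketch (not on the slide), time-based retention can also be changed on an existing topic with kafka-configs.sh; the seven-day value and topic name are just examples:

# Set time-based retention on an existing topic to 7 days
# (604800000 ms); at topic level the property is retention.ms.
kafka-configs.sh --zookeeper zk:2181 --alter \
  --entity-type topics --entity-name customers \
  --add-config retention.ms=604800000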
23. Implementing "Stream Data Integration"
Introduction to Stream Processing
[Diagram: the same unified architecture, highlighting the ingestion paths into the event hub – file import / SQL import, CDC, and event streams from edge nodes]
24. Integrating (Streaming) Data Sources
Introduction to Stream Processing
• SQL Polling (see the sketch after this list)
• Change Data Capture (CDC)
• File Polling
• File Stream (File Tailing)
• File Stream (Appender)
• Sensor Stream
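As a hedged illustration of the SQL-polling pattern (my example, assuming Confluent's JDBC source connector is installed), a Kafka Connect JDBC source can poll a table by an incrementing id column and publish new rows to Kafka. All names and the connection URL are placeholders:

# Hypothetical sketch: a JDBC source connector polling the "orders"
# table every 5 seconds and writing new rows to topic "shop-orders".
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders-jdbc-source",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:postgresql://db:5432/shop",
      "connection.user": "connect",
      "connection.password": "secret",
      "mode": "incrementing",
      "incrementing.column.name": "id",
      "table.whitelist": "orders",
      "topic.prefix": "shop-",
      "poll.interval.ms": "5000"
    }
  }'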
25. IoT devices will often not be able to talk to Kafka directly
[Diagram: IoT sensors publish to an IoT gateway / MQTT broker, which a dataflow or messaging gateway bridges into event hub topics; database sources reach the hub via CDC gateways or Kafka Connect, and social/REST sources via dataflow gateways, before stream processing and the Big Data platform consume the topics]
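One common bridge (my illustration, not the slide's) is an MQTT source connector that subscribes to the broker and republishes messages into Kafka. The configuration keys below follow Confluent's MQTT source connector; verify them against the version you run, as all values here are placeholders:

# Hedged sketch: bridge MQTT telemetry into a Kafka topic via
# a Kafka Connect MQTT source connector (assumed to be installed).
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mqtt-bridge",
    "config": {
      "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
      "mqtt.server.uri": "tcp://mqtt-broker:1883",
      "mqtt.topics": "sensors/+/telemetry",
      "kafka.topic": "iot-telemetry"
    }
  }'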
26. Why is Data Ingestion Difficult? (Source: StreamSets)
Physical and logical infrastructure changes rapidly
• Key challenges: Infrastructure Automation, Edge Deployment, Infrastructure Drift
Data structures and formats evolve and change unexpectedly
• Key challenges: Consumption Readiness, Corruption and Loss, Structure Drift
Data semantics change with evolving applications
• Key challenges: Timely Intervention, System Consistency, Semantic Drift
27. Integration with or without Transformation?
Introduction to Stream Processing
Zero Transformation
• No transformation, plain ingest, no schema validation
• Keeps the original format – Text, CSV, …
• Allows storing data that may have schema errors
Format Transformation
• Better described as format translation
• Simply changes the format, e.g. from Text to Avro
• Performs schema validation
Enrichment Transformation
• Adds new data to the message
• Does not change existing values
• Converts a value from one system to another and adds it to the message
Value Transformation
• Replaces values in the message
• Converts a value from one system to another and changes the value in place
• Destroys the raw data!
30. Apache NiFi
• Originated at the NSA as "Niagarafiles" – developed behind closed doors for 8 years
• Open sourced December 2014, Apache Top-Level Project July 2015
• Look and feel modernized in 2016
• Opaque, "file-oriented" payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data provenance and data lineage
• Web-based user interface
39. StreamSets Data Collector
• Founded by ex-Cloudera and ex-Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Runs standalone, on a Spark cluster, or on a MapReduce cluster
• IDE for pipeline development by 'civilians'
• Relatively new – first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
41. StreamSets Processors
A processor stage represents a type of data processing that you want to perform. Use as many processors in a pipeline as you need.
Supported programming languages:
• Java
• JavaScript
• Jython
• Groovy
• Java Expression Language (EL)
• Spark
Some of the processors available out of the box:
• Expression Evaluator
• Field Flattener
• Field Hasher
• Field Masker
• Field Merger
• Field Order
• Field Splitter
• Field Zip
• Groovy Evaluator
• JDBC Lookup
• JSON Parser
• Spark Evaluator
• …
42. StreamSets Destinations
A destination stage represents the target for a pipeline. You can use one or more destinations in a pipeline.
The destinations available out of the box are listed at the source link below.
API for writing custom destinations
Source: https://streamsets.com/connectors
50. StreamSets Dataflow Performance Manager
• Map dataflows to topologies, manage releases & track changes
• Measure KPIs and establish baselines for data availability and accuracy
• Master dataflow operations through Data SLAs
Source: https://streamsets.com/connectors
52. Kafka Connect – Overview
Introduction to Stream Processing
[Diagram: source connectors pull data from external systems into Kafka topics; sink connectors push data from Kafka topics out to external systems]
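A minimal sketch of running a connector (my example, not from the slide): Kafka Connect ships with a standalone mode that takes a worker config and one or more connector property files. File paths and names below are placeholders:

# Hypothetical example: run the file source connector that ships
# with Kafka in standalone mode.
cat > file-source.properties <<'EOF'
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=file-lines
EOF

connect-standalone.sh config/connect-standalone.properties file-source.properties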
53. Kafka Connect – Single Message Transforms (SMT)
Simple transformations for a single message, defined as part of Kafka Connect:
• some useful transforms provided out-of-the-box
• easily implement your own
Optionally deploy one or more transforms with each connector:
• modify messages produced by a source connector
• modify messages sent to sink connectors
Makes it much easier to mix and match connectors.
Some of the currently available transforms:
• InsertField
• ReplaceField
• MaskField
• ValueToKey
• ExtractField
• TimestampRouter
• RegexRouter
• SetSchemaMetaData
• Flatten
• TimestampConverter
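For illustration (the transform alias and field names are my placeholders), an InsertField transform can be attached to the file source sketched earlier so that every record is stamped with a static field:

# Hypothetical sketch: extend the earlier file-source config with an
# InsertField SMT so every record carries a static "source_system" field.
cat >> file-source.properties <<'EOF'
transforms=addSource
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=source_system
transforms.addSource.static.value=file-import
EOF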
54. Kafka Connect – Many Connectors
60+ connectors since the first release (Kafka 0.9+), 20+ from Confluent and partners; they span Confluent-supported, certified, and community connectors.
Source: http://www.confluent.io/product/connectors
58. Summary
Apache NiFi
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for edge computing
• data lineage and data provenance
• support for backpressure
• no transport mechanism (DEV/TST/PROD)
• custom processors
• supported by Hortonworks
StreamSets
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for edge computing
• data lineage and data provenance
• no transport mechanism
• custom sources, sinks, processors
• supported by StreamSets
Kafka Connect
• declarative-style data flows
• simplicity – "simple things done simple"
• very well integrated with Kafka – comes with Kafka
• Single Message Transforms (SMT)
• use Kafka Streams for complex data flows
• custom connectors
• supported by Confluent
59. Technology on its own won't help you.
You need to know how to use it properly.