StreamSets Data Collector is an open source data integration tool that ingests data from a wide range of sources in both batch and streaming modes. Its record-oriented approach to data processing avoids the combinatorial explosion of point-to-point integrations. Pipelines are built visually in an IDE-style interface, so even non-technical users can develop integrations. StreamSets was founded by former Cloudera and Informatica employees and follows a continuous open source development model.
1. Streaming Data Ingestion in Big Data and IoT Applications
Guido Schmutz – 27.9.2018
@gschmutz guidoschmutz.wordpress.com
2. Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, and Software Architect for Java, Oracle, SOA, and Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
4. Agenda
1. Big Data and IoT Reference Architecture
2. Event Hub
3. Stream Data Integration
Apache NiFi
StreamSets Data Collector
Kafka Connect
4. Summary
6. Big Data solves Volume and Variety – not Velocity
Introduction to Stream Processing
[Diagram: bulk sources (DB, File) and enterprise apps feed a Big Data platform via file import / SQL import; raw and refined storage plus parallel processing deliver results to BI tools, the enterprise data warehouse (SQL), and search/explore – at high latency]
7. Big Data solves Volume and Variety – not Velocity
Introduction to Stream Processing
[Diagram: the same batch architecture, now extended with event sources – location, telemetry, IoT data, mobile apps, social – emitting an event stream]
8. Big Data solves Volume and Variety – not Velocity
Introduction to Stream Processing
[Diagram: the event sources now feed an Event Hub in front of the Big Data platform; parallel processing covers machine learning, graph algorithms, and natural language processing]
9. Stream Processing Architecture solves Velocity
Introduction to Stream Processing
[Diagram: event and bulk sources feed event hubs; a stream analytics platform (stream analytics, reference/models, dashboard) serves search, BI tools, the enterprise data warehouse, and enterprise apps – low(est) latency, but no history]
10. Big Data for all historical data analysis
Introduction to Stream Processing
[Diagram: the stream analytics platform and the Big Data platform side by side; the event hub feeds both, so streaming covers low latency while the Big Data platform keeps the full history]
11. Integrate existing systems through CDC
Introduction to Stream Processing
• Capture changes directly on the database
• Change Data Capture (CDC) => think of it like a global database trigger
• Transforms existing systems into event producers
[Diagram: a traditional silo-based system (user interface, logic, state) is tapped by a CDC connector, which publishes change events through the event hub to consuming systems]
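As a concrete illustration (my example, not on the original slide), a CDC connector such as Debezium can be registered with Kafka Connect to turn a MySQL database into an event producer. All hostnames, credentials, and topic names below are placeholders:

# Hedged sketch: register a Debezium MySQL CDC connector via the
# Kafka Connect REST API; values are placeholders, not from the talk.
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "inventory-cdc",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "mysql",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "dbz",
      "database.server.id": "184054",
      "database.server.name": "inventory",
      "database.history.kafka.bootstrap.servers": "kafka:9092",
      "database.history.kafka.topic": "schema-changes.inventory"
    }
  }'

Once registered, every insert, update, and delete on the captured tables appears as an event on a Kafka topic, without touching the silo system's application code.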
12. Integrate existing systems with lower latency through CDC
Introduction to Stream Processing
[Diagram: the unified picture again, with existing databases now feeding the event hub through CDC instead of only batch file import / SQL import]
13. New systems participate in event-oriented fashion
Introduction to Stream Processing
[Diagram: a microservice platform (microservices with state and APIs) and a stream analytics platform (stream processors with state) both consume from and produce to the event hub, alongside the Big Data platform]
14. Edge computing allows processing close to data sources
Introduction to Stream Processing
[Diagram: an edge node with rules, a local event hub, and storage sits between the IoT/event sources and the central event hub, so events can be filtered and processed close to where they originate]
15. Unified Architecture for Modern Data Analytics Solutions
[Diagram: the complete picture – bulk and event sources, edge nodes, the event hub, the Big Data platform, stream analytics, and microservices, serving BI tools, the enterprise data warehouse, and search/explore]
16. Two Types of Stream Processing (from Gartner)
Introduction to Stream Processing
Stream Data Integration
• primarily focuses on the ingestion and processing of data sources, targeting real-time extract-transform-load (ETL) and data integration use cases
• filters and enriches the data
• optionally calculates time-windowed aggregations before storing the results in a database or file system
Stream Analytics
• targets analytics use cases
• calculates aggregates and detects patterns to generate higher-level, more relevant summary information (complex events)
• complex events may signify threats or opportunities that require a response from the business through real-time dashboards, alerts, or decision automation
18. Implementing "Event Hub"
Introduction to Stream Processing
[Diagram: the event hub at the center of the unified architecture, feeding Big Data batch analytics, stream analytics, and modern applications, and supporting replay of stored event streams]
19. Apache Kafka – A Streaming Platform
Introduction to Stream Processing
• High-level architecture
• Distributed log at the core
• Scale-out architecture
• Logs do not (necessarily) forget
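To make the "distributed log" idea concrete, here is a minimal sketch (my addition, not part of the slide) using the standard Kafka command-line tools; broker address and topic name are placeholders:

# Write an event to a topic...
echo '{"sensor":"s1","temp":21.5}' | kafka-console-producer.sh \
  --broker-list localhost:9092 --topic sensor-events

# ...and read it back from the beginning of the log. Because the log
# retains data, another consumer can replay the same events later.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic sensor-events --from-beginning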
20. Hold Data for Long-Term – Data Retention
Introduction to Stream Processing
[Diagram: a producer writing to a topic replicated across three brokers]
Retention options:
1. Never
2. Time based (TTL): log.retention.{ms | minutes | hours}
3. Size based: log.retention.bytes
4. Log compaction based (entries with the same key are removed):

kafka-topics.sh --zookeeper zk:2181 \
  --create --topic customers \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact
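As a complementary sketch (not on the slide), time-based retention can also be changed on an existing topic with kafka-configs.sh; the seven-day value and topic name are just examples:

# Set time-based retention on an existing topic to 7 days
# (604800000 ms); at topic level the property is retention.ms.
kafka-configs.sh --zookeeper zk:2181 --alter \
  --entity-type topics --entity-name customers \
  --add-config retention.ms=604800000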
23. Implementing "Stream Data Integration"
Introduction to Stream Processing
[Diagram: the same unified architecture, highlighting the ingestion paths into the event hub – file import / SQL import, CDC, and event streams from edge nodes]
24. Integrating (Streaming) Data Sources
Introduction to Stream Processing
• SQL Polling (see the sketch after this list)
• Change Data Capture (CDC)
• File Polling
• File Stream (File Tailing)
• File Stream (Appender)
• Sensor Stream
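As a hedged illustration of the SQL-polling pattern (my example, assuming Confluent's JDBC source connector is installed), a Kafka Connect JDBC source can poll a table by an incrementing id column and publish new rows to Kafka. All names and the connection URL are placeholders:

# Hypothetical sketch: a JDBC source connector polling the "orders"
# table every 5 seconds and writing new rows to topic "shop-orders".
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders-jdbc-source",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:postgresql://db:5432/shop",
      "connection.user": "connect",
      "connection.password": "secret",
      "mode": "incrementing",
      "incrementing.column.name": "id",
      "table.whitelist": "orders",
      "topic.prefix": "shop-",
      "poll.interval.ms": "5000"
    }
  }'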
25. IoT devices will often not be able to talk to Kafka directly
[Diagram: IoT sensors publish to an IoT gateway / MQTT broker, which a dataflow or messaging gateway bridges into event hub topics; database sources reach the hub via CDC gateways or Kafka Connect, and social/REST sources via dataflow gateways, before stream processing and the Big Data platform consume the topics]
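One common bridge (my illustration, not the slide's) is an MQTT source connector that subscribes to the broker and republishes messages into Kafka. The configuration keys below follow Confluent's MQTT source connector; verify them against the version you run, as all values here are placeholders:

# Hedged sketch: bridge MQTT telemetry into a Kafka topic via
# a Kafka Connect MQTT source connector (assumed to be installed).
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mqtt-bridge",
    "config": {
      "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
      "mqtt.server.uri": "tcp://mqtt-broker:1883",
      "mqtt.topics": "sensors/+/telemetry",
      "kafka.topic": "iot-telemetry"
    }
  }'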
26. Why is Data Ingestion Difficult? (Source: StreamSets)
Physical and logical infrastructure changes rapidly
• Key challenges: Infrastructure Automation, Edge Deployment, Infrastructure Drift
Data structures and formats evolve and change unexpectedly
• Key challenges: Consumption Readiness, Corruption and Loss, Structure Drift
Data semantics change with evolving applications
• Key challenges: Timely Intervention, System Consistency, Semantic Drift
27. Integration with or without Transformation?
Introduction to Stream Processing
Zero Transformation
• No transformation, plain ingest, no schema validation
• Keeps the original format – Text, CSV, …
• Allows storing data that may have schema errors
Format Transformation
• Better described as format translation
• Simply changes the format, e.g. from Text to Avro
• Performs schema validation
Enrichment Transformation
• Adds new data to the message
• Does not change existing values
• Converts a value from one system to another and adds it to the message
Value Transformation
• Replaces values in the message
• Converts a value from one system to another and changes the value in place
• Destroys the raw data!
30. Apache NiFi
• Originated at the NSA as "Niagarafiles" – developed behind closed doors for 8 years
• Open sourced December 2014, Apache Top-Level Project July 2015
• Look and feel modernized in 2016
• Opaque, "file-oriented" payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data provenance and data lineage
• Web-based user interface
39. StreamSets Data Collector
• Founded by ex-Cloudera and ex-Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Runs standalone, on a Spark cluster, or on a MapReduce cluster
• IDE for pipeline development by 'civilians'
• Relatively new – first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
41. StreamSets Processors
A processor stage represents a type of data processing that you want to perform. Use as many processors in a pipeline as you need.
Supported programming languages:
• Java
• JavaScript
• Jython
• Groovy
• Java Expression Language (EL)
• Spark
Some of the processors available out of the box:
• Expression Evaluator
• Field Flattener
• Field Hasher
• Field Masker
• Field Merger
• Field Order
• Field Splitter
• Field Zip
• Groovy Evaluator
• JDBC Lookup
• JSON Parser
• Spark Evaluator
• …
42. StreamSets Destinations
A destination stage represents the target for a pipeline. You can use one or more destinations in a pipeline.
The destinations available out of the box are listed at the source link below.
API for writing custom destinations
Source: https://streamsets.com/connectors
50. StreamSets Dataflow Performance Manager
• Map dataflows to topologies, manage releases & track changes
• Measure KPIs and establish baselines for data availability and accuracy
• Master dataflow operations through Data SLAs
Source: https://streamsets.com/connectors
52. Kafka Connect – Overview
Introduction to Stream Processing
[Diagram: source connectors pull data from external systems into Kafka topics; sink connectors push data from Kafka topics out to external systems]
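A minimal sketch of running a connector (my example, not from the slide): Kafka Connect ships with a standalone mode that takes a worker config and one or more connector property files. File paths and names below are placeholders:

# Hypothetical example: run the file source connector that ships
# with Kafka in standalone mode.
cat > file-source.properties <<'EOF'
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=file-lines
EOF

connect-standalone.sh config/connect-standalone.properties file-source.properties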
53. Kafka Connect – Single Message Transforms (SMT)
Simple transformations for a single message, defined as part of Kafka Connect:
• some useful transforms provided out-of-the-box
• easily implement your own
Optionally deploy one or more transforms with each connector:
• modify messages produced by a source connector
• modify messages sent to sink connectors
Makes it much easier to mix and match connectors.
Some of the currently available transforms:
• InsertField
• ReplaceField
• MaskField
• ValueToKey
• ExtractField
• TimestampRouter
• RegexRouter
• SetSchemaMetaData
• Flatten
• TimestampConverter
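For illustration (the transform alias and field names are my placeholders), an InsertField transform can be attached to the file source sketched earlier so that every record is stamped with a static field:

# Hypothetical sketch: extend the earlier file-source config with an
# InsertField SMT so every record carries a static "source_system" field.
cat >> file-source.properties <<'EOF'
transforms=addSource
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=source_system
transforms.addSource.static.value=file-import
EOF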
54. Kafka Connect – Many Connectors
60+ connectors since the first release (Kafka 0.9+), 20+ from Confluent and partners; they span Confluent-supported, certified, and community connectors.
Source: http://www.confluent.io/product/connectors
58. Summary
Apache NiFi
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for edge computing
• data lineage and data provenance
• support for backpressure
• no transport mechanism (DEV/TST/PROD)
• custom processors
• supported by Hortonworks
StreamSets
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for edge computing
• data lineage and data provenance
• no transport mechanism
• custom sources, sinks, processors
• supported by StreamSets
Kafka Connect
• declarative-style data flows
• simplicity – "simple things done simple"
• very well integrated with Kafka – comes with Kafka
• Single Message Transforms (SMT)
• use Kafka Streams for complex data flows
• custom connectors
• supported by Confluent
59. Technology on its own won't help you.
You need to know how to use it properly.