ITPC Building Modern Data Streaming Apps

© 2023 Cloudera, Inc. All rights reserved.
Building Modern Data Streaming
Apps
Tim Spann
Principal Developer Advocate
17-March-2023

© 2023 Cloudera, Inc. All rights reserved. 3
Building Modern Data Streaming Apps
In this session, I will show you some best practices I have
discovered over the last seven years in building data streaming
applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several Apache frameworks to
maximize the best features of all. We often start with Apache NiFi
as the orchestrator of streams flowing into Apache Kafka. From
there we build streaming ETL with Spark. We build continuous
queries against our topics with Flink SQL.

FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink x Java

FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark, Java and Open Source friends.
https://bit.ly/32dAJft

Welcome to Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...

Largest Java Conference in the US!
12 tracks on Java, Cloud, Frameworks, Streaming, etc…
Devnexus.com
Join me! Save with SEEMESPEAK
https://devnexus.com/presentations/apache-pulsar-development-101-with-java

STREAMING

WHAT IS REAL-TIME?

ENABLING ANALYTICS AND INSIGHTS ANYWHERE
Driving enterprise business value
REAL-TIME
STREAMING
ENGINE
ANALYTICS &
DATA WAREHOUSE
DATA SCIENCE/
MACHINE LEARNING
CENTRALIZED DATA
PLATFORM
STORAGE & PROCESSING
ANALYTICS & INSIGHTS
Stream
Ingest
Ingest – Data
at Rest
Deploy
Models
BI
Solutions
SQL Predictive
Analytics
• Model Building
• Model Training
• Model Scoring
Actions &
Alerts
[SQL]
Real-Time
Apps
STREAMING DATA
SOURCES
Clickstream Market data
Machine logs Social
ENTERPRISE DATA
SOURCES
CRM
Customer
history
Research
Compliance
Data
Risk Data
Lending

STREAMING FROM … TO .. WHILE ..
Data distribution as a ﬁrst class citizen
IOT
Devices
LOG DATA
SOURCES
ON-PREM
DATA SOURCES
BIG DATA CLOUD
SERVICES
CLOUD BUSINESS
PROCESS SERVICES *
CLOUD DATA*
ANALYTICS /SERVICE
(Cloudera DW)
App
Logs
Laptops
/Servers Mobile
Apps
Security
Agents
CLOUD
WAREHOUSE
UNIVERSAL
DATA DISTRIBUTION
(Ingest, Transform, Deliver)
Ingest
Processors
Ingest
Gateway
Router, Filter &
Transform
Processors
Destination
Processors

EVENT-DRIVEN ORGANIZATION
Modernize your data and applications
CDF Event Streaming Platform
Integration - Processing - Management - Cloud
Stream
ETL
Cloud
Storage
Application
Data Lake Data Stores
Make
Payment
µServices
Streams
Edge - IoT Dashboard

Building Real-Time Requires a Team

End to End Streaming Pipeline Example
Enterprise
sources
Weather
Errors
Aggregates
Alerts
Stocks
ETL
Analytics
Machine logs Social
SQL

Aggregate all data from sensors, drones, logs, geo-location devices, images
from cameras, results from running predictions on pre-trained models.
Collect: Bring Together
Mediate point-to-point and bi-directional data flows, distribute, delivering
data reliably to Apache Iceberg, S3, SnowFlake, Slack and Email.
Conduct: Mediate the Data Flow
Orchestrate, parse, merge, aggregate, filter, join, transform, fork, query,
sort, dissect, store, enrich with weather, location, sentiment analysis,
image analysis, object detection, image recognition and more with Apache
Tika, Apache OpenNLP and Machine Learning.
Curate: Gain Insights

APACHE PULSAR

Apache Pulsar
● Serverless computing framework.
● Unbounded storage, multi-tiered
architecture, and tiered-storage.
● Streaming & Pub/Sub messaging
semantics.
● Multi-protocol support.
● Open Source
● Cloud-Native

Streaming
Consumer
Consumer
Consumer
Subscription
Shared
Failover
Consumer
Consumer
Subscription
In case of failure in
Consumer B-0
Consumer
Consumer
Subscription
Exclusive
X
Consumer
Consumer
Key-Shared
Subscription
Pulsar
Topic/Partition
Messaging

Messages - the Basic Unit of Apache Pulsar
Component Description
Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although message data
can also conform to data schemas.
Key Messages are optionally tagged with keys, used in partitioning and also is useful for things like
topic compaction.
Properties An optional key/value map of user-defined properties.
Producer name The name of the producer who produces the message. If you do not specify a producer name, the
default name is used.
Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the
message is its order in that sequence.

Integrated Schema Registry
Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2
(value=Avro/Protobuf/JSON)
schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers

The FLiPN Kitten crosses the stream
4 ways with Apache Pulsar

Kafka on Pulsar (KoP)

Data Offloaders
(Tiered Storage)
Client Libraries
Apache Pulsar Ecosystem
hub.streamnative.io
Connectors
(Sources & Sinks)
Protocol Handlers
Pulsar Functions
(Lightweight Stream
Processing)
Processing Engines
… and more!
… and more!

Pulsar Functions
● Consume messages from one or
more Pulsar topics.
● Apply user-supplied processing logic
to each message.
● Publish the results of the
computation to another topic.
● Support multiple programming
languages (Java, Python, Go)
● Can leverage 3rd-party libraries to
support the execution of ML
models on the edge.

APACHE KAFKA

What is Apache Kafka?
– Distributed: horizontally scalable
– Partitioned: the data is split-up and distributed across the brokers
– Replicated: allows for automatic failover
– Unique: Kafka does not track the consumption of messages (the consumers
do)
– Fast: designed from the ground up with a focus on performance and
throughput
– Kafka was built at Linkedin in 2011
– Open sourced as an Apache project

Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a German-speaking
Bohemian novelist and short-story writer,
widely regarded as one of the major figures of
20th-century literature. His work fuses
elements of realism and the fantastic.
Wikipedia

What is Can You Do With Apache Kafka?
• Web site activity: track page views, searches, etc. in real time
• Events & log aggregation: particularly in distributed systems where messages
come from multiple sources
• Monitoring and metrics: aggregate statistics from distributed applications and
build a dashboard application
• Stream processing: process raw data, clean it up, and forward it on to another
topic or messaging system
• Real-time data ingestion: fast processing of a very large volume of messages

KAFKA TERMINOLOGY
• Kafka is a publish/subscribe messaging system comprised of the
following components:
– Topic: a message feed
– Producer: a process that publishes messages to a topic
– Consumer: a process that subscribes to a topic and processes its messages
– Broker: a server in a Kafka cluster

Apache Kafka
• Highly reliable distributed
messaging system
• Decouple applications, enables
many-to-many patterns
• Publish-Subscribe semantics
• Horizontal scalability
• Eﬃcient implementation to
operate at speed with big data
volumes
• Organized by topic to support
several use cases
Source
System
Source
System
Source
System
Kafka
Fraud
Detection
Security
Systems
Real-Time
Monitoring
Many-To-Many
Publish-Subscribe
EVENTS

APACHE FLINK

Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite

Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors
-- use event time timestamp from kafka
-- exactly once compatible
SELECT eventTimestamp FROM sensors
-- nested structures access
SELECT foo.’bar’ FROM table; -- must quote nested
column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP-interval
'10' second;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem)
-- aggregations and windows
SELECT card,
MAX(amount) as theamount,
TUMBLE_END(eventTimestamp, interval '5' minute) as
ts
FROM payments
WHERE lat IS NOT NULL
AND lon IS NOT NULL
GROUP BY card,
TUMBLE(eventTimestamp, interval '5' minute)
HAVING COUNT(*) > 4 -- >4==fraud
-- try to do this ksql!
SELECT us_west.user_score+ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools

34
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simpliﬁes access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL

DATAFLOW
APACHE NIFI

CLOUDERA FLOW AND EDGE MANAGEMENT
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
Advanced tooling to industrialize
ﬂow development (Flow Development
Life Cycle)
ACQUIRE
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merger & Route
• Throttle & Backpressure
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
PROCESS
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ENCRYPT
TALL
EVALUATE
EXECUTE
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
ROUTE RATE
DISTRIBUTE LOAD
DELIVER
• Guaranteed Delivery
• Full data provenance from
acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG

Cloudera DataFlow: Universal Data Distribution Service
Process
Route
Filter
Enrich
Transform
Distribute
Connectors
Any
destination
Deliver
Ingest
Active
Passive
Connectors
Gateway
Endpoint
Connect & Pull
Send
Data born in
the cloud
Data born
outside the
cloud
UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination

38
WHAT IS APACHE NIFI?
Apache NiFi is a scalable, real-time streaming data
platform that collects, curates, and analyzes data so
customers gain key insights for immediate
actionable intelligence.

APACHE NIFI
Enable easy ingestion, routing, management and delivery of
any data anywhere (Edge, cloud, data center) to any
downstream system with built in end-to-end security and
provenance
ACQUIRE PROCESS DELIVER
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merger & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to
delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
Advanced tooling to industrialize ﬂow development
(Flow Development Life Cycle)
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLO
G
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLO
G
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
CONTROL RATE
DISTRIBUTE LOAD
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ENCRYPT
TALL
EVALUATE
EXECUTE

FDLC: FLOW DEVELOPMENT LIFECYCLE
NiFi Cluster
Node 1 Node x
Tests cluster
Registry
..
NiFi Cluster
Node 1 Node x
Production cluster
Registry
..
NiFi Cluster
Node 1 Node x
Dev cluster
Registry
..
Dev Ops chain
NiFi CLI, NiPyAPI,
Jenkins
1- push
new flow
version
2- get new
version of
the flow
3 & 6 - automated testing /
promotion (naming
convention, scheduling,
variable replacement)
4- push
new flow
version
5- get new
version of
the flow
7- push
new flow
version

PROVENANCE

EXTENSIBILITY
• Built from the ground up with extensions in mind
• Service-loader pattern for…
– Processors
– Controller Services
– Reporting Tasks
– Prioritizers
• Extensions packaged as NiFi Archives (NARs)
– Deploy NiFi lib directory and restart
– Same model as standard components

STATELESS
ENGINE
• Granular containers per flow
• Flows From NiFi Registry
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
bin/nifi.sh stateless RunFromRegistry Continuous --file kafka.json
https://github.com/apache/nifi/blob/ea1becac4fc519c54b8b4d21773e68f8da364755/nifi-nar-bundles/nifi-framework-bundle/nifi-
framework/nifi-stateless/README.md

STATELESS
ENGINE
• See also Parameters
• Docker
• YARN
• Kubernetes (K8)
• Stateful NiFi clusters
• Apache OpenWhisk (FaaS)
{"registryUrl": "http://tspann-mbp15-hw14277:18080",
"bucketId": "140b30f0-5a47-4747-9021-19d4fde7f993",
"flowId": "0540e1fd-c7ca-46fb-9296-e37632021945",
"ssl": {
"keystoreFile": "","keystorePass": "","keyPass": "","keystoreType": "",
"truststoreFile":
"/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/lib/sec
urity/cacerts",
"truststorePass": "changeit", "truststoreType": "JKS"
},
"parameters": {
"broker" : "4.317.852.100:9092",
"topic" : "iot",
"group_id" : "nifi-stateless-kafka-consumer",
"DestinationDirectory" : "/tmp/nifistateless/output2/",
"output_dir": "/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/output"
}
}
https://github.com/tspannhw/stateless-examples

PARQUETREADER
/
PARQUETWRITER
RECORDS
• Native Record Processors for
Apache Parquet Files!
• CVS <-> Parquet
• XML <-> Parquet
• AVRO <-> Parquet
• JSON <-> Parquet
• More...
https://www.datainmotion.dev/2019/10/migrating-apache-ﬂume-ﬂows-to-apac
he_7.html

POSTSLACK
• Post Images to Slack
https://www.datainmotion.dev/2019/11/niﬁ-110-postslack-easy-image-upload.html

REMOTE INPUT
PORT IN A
PROCESS GROUP
• Put Remote Connections for
Site-To-Site (S2S) Anywhere!
• Not only top level
• Drop down simplicity

48
READYFLOW
GALLERY
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Optimized to work with CDP
sources/destinations
• Can be deployed and adjusted
as needed

49
FLOW CATALOG
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments

Apache NiFi with Python Custom Processors
Python as a 1st class citizen

Processing one billion events per second with NiFi
https://blog.cloudera.com/benchmarking-niﬁ-performance-and-scalability/

FREE LEARNING ENVIRONMENT

CSP Community
Edition
• Kafka, KConnect, SMM, SR,
Flink, and SSB in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker compose ﬁle of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry
○ $>docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications

DEMO

End to End Streaming Demo Pipeline
Enterprise
sources
Weather
Errors
Aggregates
Alerts
Stocks
ETL
Analytics
Streaming SQL
Machine logs Social
https://github.com/tspannhw/CloudDemo2021

AI BASED ENHANCEMENTS
SERVE
SOURCES
Data Warehouse
Report
Sensorid
Sensor conditions
Machine Learning
Predict, Automate
Control System
REPORT
Visualize
CLOUD
Collect
COLLECT
Message Broker
Data Flow
Distribute
Sensor id
Temperature
COLLECT and DISTRIBUTE DATA
1. Data is collected from sensors that use mqtt protocol via
CEM and sent to CDP Public Cloud
2. Two CDF flows run in the cloud to accomplish our two
goals: streaming analytics and batch analytics
ENRICH, REPORT
Report, Automate
Real time alerting
SQL
Stream Builder
Data Visualization
Edge
Management
Humidity
Timestamp
Visualize
Data Visualization
USE DATA
3. Streaming Use Case: some conditions of our greenhouse
must be avoided and have to be controlled in real time.
Some warnings have been defined to alert us in case
alerting conditions are met, control system is
automatically activated to adjust environmental
variables.
4. Batch analytics: to ensure the optimal growth of our
plants the ideal conditions have to be met for each
plant. Each 6 hours the plant conditions are monitored
and in case some control adjustment is required a ML
model gives a suggestion about getting the optimal
point minimizing the cost.
EDGE
Control System

RESOURCES AND WRAP-UP

Streaming
Solutions
When to use what?
• Routing vs Analytics
– Listeners
– Joins
– In-Memory
• Operational Load
• Current Skills
• Use NiFi
– Doing more than just Syndication
– Not just small Kafka sized events
– Edge Management is needed
– Listener Type use cases that bind to ports
– Single Product for light ETL, Lineage, Provenance,
Message Replay
• Use Flink
– Joining Streams
– Windowing
– Late Data Handling
– Streaming Analytics
• Use KConnect
– Kafka Centric
• Multiple-Tool Objections
– In-Memory Stateless

STREAMING RESOURCES
• https://dzone.com/articles/real-time-stream-processing-with-hazelcast-an
d-streamnative
• https://ﬂipstackweekly.com/
• https://www.datainmotion.dev/
• https://www.ﬂankstack.dev/
• https://github.com/tspannhw
• https://medium.com/@tspann
• https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d739
5d714
• https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Str
eaming_Engineer.pdf

https://github.com/tspannhw https://www.datainmotion.dev/

April 4 - Atlanta
April 24 - San Francisco
May 10 - Virtual
Upcoming Events

Resources

ITPC Building Modern Data Streaming Apps

Recomendados

Recomendados

Más contenido relacionado

Similar a ITPC Building Modern Data Streaming Apps

Similar a ITPC Building Modern Data Streaming Apps (20)

Más de Timothy Spann

Más de Timothy Spann (18)

Último

Último (20)

ITPC Building Modern Data Streaming Apps