Más contenido relacionado Similar a ITPC Building Modern Data Streaming Apps (20) Más de Timothy Spann (18) ITPC Building Modern Data Streaming Apps1. © 2023 Cloudera, Inc. All rights reserved.
Building Modern Data Streaming
Apps
Tim Spann
Principal Developer Advocate
17-March-2023
3. © 2023 Cloudera, Inc. All rights reserved. 3
Building Modern Data Streaming Apps
In this session, I will show you some best practices I have
discovered over the last seven years in building data streaming
applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several Apache frameworks to
maximize the best features of all. We often start with Apache NiFi
as the orchestrator of streams flowing into Apache Kafka. From
there we build streaming ETL with Spark. We build continuous
queries against our topics with Flink SQL.
4. © 2023 Cloudera, Inc. All rights reserved.
FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink x Java
5. © 2023 Cloudera, Inc. All rights reserved. 5
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark, Java and Open Source friends.
https://bit.ly/32dAJft
6. © 2023 Cloudera, Inc. All rights reserved. 6
Welcome to Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
7. © 2023 Cloudera, Inc. All rights reserved. 7
Largest Java Conference in the US!
12 tracks on Java, Cloud, Frameworks, Streaming, etc…
Devnexus.com
Join me! Save with SEEMESPEAK
https://devnexus.com/presentations/apache-pulsar-development-101-with-java
10. © 2023 Cloudera, Inc. All rights reserved. 10
ENABLING ANALYTICS AND INSIGHTS ANYWHERE
Driving enterprise business value
REAL-TIME
STREAMING
ENGINE
ANALYTICS &
DATA WAREHOUSE
DATA SCIENCE/
MACHINE LEARNING
CENTRALIZED DATA
PLATFORM
STORAGE & PROCESSING
ANALYTICS & INSIGHTS
Stream
Ingest
Ingest – Data
at Rest
Deploy
Models
BI
Solutions
SQL Predictive
Analytics
• Model Building
• Model Training
• Model Scoring
Actions &
Alerts
[SQL]
Real-Time
Apps
STREAMING DATA
SOURCES
Clickstream Market data
Machine logs Social
ENTERPRISE DATA
SOURCES
CRM
Customer
history
Research
Compliance
Data
Risk Data
Lending
11. © 2023 Cloudera, Inc. All rights reserved. 11
STREAMING FROM … TO .. WHILE ..
Data distribution as a first class citizen
IOT
Devices
LOG DATA
SOURCES
ON-PREM
DATA SOURCES
BIG DATA CLOUD
SERVICES
CLOUD BUSINESS
PROCESS SERVICES *
CLOUD DATA*
ANALYTICS /SERVICE
(Cloudera DW)
App
Logs
Laptops
/Servers Mobile
Apps
Security
Agents
CLOUD
WAREHOUSE
UNIVERSAL
DATA DISTRIBUTION
(Ingest, Transform, Deliver)
Ingest
Processors
Ingest
Gateway
Router, Filter &
Transform
Processors
Destination
Processors
12. © 2023 Cloudera, Inc. All rights reserved.
© 2019 Cloudera, Inc. All rights reserved. 12
EVENT-DRIVEN ORGANIZATION
Modernize your data and applications
CDF Event Streaming Platform
Integration - Processing - Management - Cloud
Stream
ETL
Cloud
Storage
Application
Data Lake Data Stores
Make
Payment
µServices
Streams
Edge - IoT Dashboard
14. © 2023 Cloudera, Inc. All rights reserved. 14
End to End Streaming Pipeline Example
Enterprise
sources
Weather
Errors
Aggregates
Alerts
Stocks
ETL
Analytics
Clickstream Market data
Machine logs Social
SQL
15. © 2023 Cloudera, Inc. All rights reserved.
Aggregate all data from sensors, drones, logs, geo-location devices, images
from cameras, results from running predictions on pre-trained models.
Collect: Bring Together
Mediate point-to-point and bi-directional data flows, distribute, delivering
data reliably to Apache Iceberg, S3, SnowFlake, Slack and Email.
Conduct: Mediate the Data Flow
Orchestrate, parse, merge, aggregate, filter, join, transform, fork, query,
sort, dissect, store, enrich with weather, location, sentiment analysis,
image analysis, object detection, image recognition and more with Apache
Tika, Apache OpenNLP and Machine Learning.
Curate: Gain Insights
17. © 2023 Cloudera, Inc. All rights reserved.
Apache Pulsar
● Serverless computing framework.
● Unbounded storage, multi-tiered
architecture, and tiered-storage.
● Streaming & Pub/Sub messaging
semantics.
● Multi-protocol support.
● Open Source
● Cloud-Native
19. © 2023 Cloudera, Inc. All rights reserved. 19
Messages - the Basic Unit of Apache Pulsar
Component Description
Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although message data
can also conform to data schemas.
Key Messages are optionally tagged with keys, used in partitioning and also is useful for things like
topic compaction.
Properties An optional key/value map of user-defined properties.
Producer name The name of the producer who produces the message. If you do not specify a producer name, the
default name is used.
Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the
message is its order in that sequence.
20. © 2023 Cloudera, Inc. All rights reserved. 20
Integrated Schema Registry
Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2
(value=Avro/Protobuf/JSON)
schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers
21. © 2023 Cloudera, Inc. All rights reserved. 21
The FLiPN Kitten crosses the stream
4 ways with Apache Pulsar
23. © 2023 Cloudera, Inc. All rights reserved. 23
Data Offloaders
(Tiered Storage)
Client Libraries
Apache Pulsar Ecosystem
hub.streamnative.io
Connectors
(Sources & Sinks)
Protocol Handlers
Pulsar Functions
(Lightweight Stream
Processing)
Processing Engines
… and more!
… and more!
24. © 2023 Cloudera, Inc. All rights reserved. 24
Pulsar Functions
● Consume messages from one or
more Pulsar topics.
● Apply user-supplied processing logic
to each message.
● Publish the results of the
computation to another topic.
● Support multiple programming
languages (Java, Python, Go)
● Can leverage 3rd-party libraries to
support the execution of ML
models on the edge.
26. © 2023 Cloudera, Inc. All rights reserved. 26
What is Apache Kafka?
– Distributed: horizontally scalable
– Partitioned: the data is split-up and distributed across the brokers
– Replicated: allows for automatic failover
– Unique: Kafka does not track the consumption of messages (the consumers
do)
– Fast: designed from the ground up with a focus on performance and
throughput
– Kafka was built at Linkedin in 2011
– Open sourced as an Apache project
27. © 2023 Cloudera, Inc. All rights reserved. 27
Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a German-speaking
Bohemian novelist and short-story writer,
widely regarded as one of the major figures of
20th-century literature. His work fuses
elements of realism and the fantastic.
Wikipedia
28. © 2023 Cloudera, Inc. All rights reserved. 28
What is Can You Do With Apache Kafka?
• Web site activity: track page views, searches, etc. in real time
• Events & log aggregation: particularly in distributed systems where messages
come from multiple sources
• Monitoring and metrics: aggregate statistics from distributed applications and
build a dashboard application
• Stream processing: process raw data, clean it up, and forward it on to another
topic or messaging system
• Real-time data ingestion: fast processing of a very large volume of messages
29. © 2023 Cloudera, Inc. All rights reserved. 29
KAFKA TERMINOLOGY
• Kafka is a publish/subscribe messaging system comprised of the
following components:
– Topic: a message feed
– Producer: a process that publishes messages to a topic
– Consumer: a process that subscribes to a topic and processes its messages
– Broker: a server in a Kafka cluster
30. © 2021 Cloudera, Inc. All rights reserved. 30
Apache Kafka
• Highly reliable distributed
messaging system
• Decouple applications, enables
many-to-many patterns
• Publish-Subscribe semantics
• Horizontal scalability
• Efficient implementation to
operate at speed with big data
volumes
• Organized by topic to support
several use cases
Source
System
Source
System
Source
System
Kafka
Fraud
Detection
Security
Systems
Real-Time
Monitoring
Many-To-Many
Publish-Subscribe
EVENTS
32. © 2023 Cloudera, Inc. All rights reserved. 32
Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
33. © 2023 Cloudera, Inc. All rights reserved. 33
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors
-- use event time timestamp from kafka
-- exactly once compatible
SELECT eventTimestamp FROM sensors
-- nested structures access
SELECT foo.’bar’ FROM table; -- must quote nested
column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP-interval
'10' second;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem)
-- aggregations and windows
SELECT card,
MAX(amount) as theamount,
TUMBLE_END(eventTimestamp, interval '5' minute) as
ts
FROM payments
WHERE lat IS NOT NULL
AND lon IS NOT NULL
GROUP BY card,
TUMBLE(eventTimestamp, interval '5' minute)
HAVING COUNT(*) > 4 -- >4==fraud
-- try to do this ksql!
SELECT us_west.user_score+ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
34. 34
© 2022 Cloudera, Inc. All rights reserved.
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simplifies access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL
36. © 2023 Cloudera, Inc. All rights reserved. 36
CLOUDERA FLOW AND EDGE MANAGEMENT
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
Advanced tooling to industrialize
flow development (Flow Development
Life Cycle)
ACQUIRE
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merger & Route
• Throttle & Backpressure
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
PROCESS
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ENCRYPT
TALL
EVALUATE
EXECUTE
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
ROUTE RATE
DISTRIBUTE LOAD
DELIVER
• Guaranteed Delivery
• Full data provenance from
acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
37. © 2023 Cloudera, Inc. All rights reserved. 37
Cloudera DataFlow: Universal Data Distribution Service
Process
Route
Filter
Enrich
Transform
Distribute
Connectors
Any
destination
Deliver
Ingest
Active
Passive
Connectors
Gateway
Endpoint
Connect & Pull
Send
Data born in
the cloud
Data born
outside the
cloud
UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
38. 38
© 2023 Cloudera, Inc. All rights reserved.
WHAT IS APACHE NIFI?
Apache NiFi is a scalable, real-time streaming data
platform that collects, curates, and analyzes data so
customers gain key insights for immediate
actionable intelligence.
39. © 2023 Cloudera, Inc. All rights reserved. 39
APACHE NIFI
Enable easy ingestion, routing, management and delivery of
any data anywhere (Edge, cloud, data center) to any
downstream system with built in end-to-end security and
provenance
ACQUIRE PROCESS DELIVER
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merger & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to
delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
Advanced tooling to industrialize flow development
(Flow Development Life Cycle)
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLO
G
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLO
G
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
CONTROL RATE
DISTRIBUTE LOAD
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ENCRYPT
TALL
EVALUATE
EXECUTE
40. © 2023 Cloudera, Inc. All rights reserved. 40
FDLC: FLOW DEVELOPMENT LIFECYCLE
NiFi Cluster
Node 1 Node x
Tests cluster
Registry
..
NiFi Cluster
Node 1 Node x
Production cluster
Registry
..
NiFi Cluster
Node 1 Node x
Dev cluster
Registry
..
Dev Ops chain
NiFi CLI, NiPyAPI,
Jenkins
1- push
new flow
version
2- get new
version of
the flow
3 & 6 - automated testing /
promotion (naming
convention, scheduling,
variable replacement)
4- push
new flow
version
5- get new
version of
the flow
7- push
new flow
version
42. © 2023 Cloudera, Inc. All rights reserved. 42
EXTENSIBILITY
• Built from the ground up with extensions in mind
• Service-loader pattern for…
– Processors
– Controller Services
– Reporting Tasks
– Prioritizers
• Extensions packaged as NiFi Archives (NARs)
– Deploy NiFi lib directory and restart
– Same model as standard components
43. © 2020 Cloudera, Inc. All rights reserved. 43
STATELESS
ENGINE
• Granular containers per flow
• Flows From NiFi Registry
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
bin/nifi.sh stateless RunFromRegistry Continuous --file kafka.json
https://github.com/apache/nifi/blob/ea1becac4fc519c54b8b4d21773e68f8da364755/nifi-nar-bundles/nifi-framework-bundle/nifi-
framework/nifi-stateless/README.md
44. © 2020 Cloudera, Inc. All rights reserved. 44
STATELESS
ENGINE
• See also Parameters
• Docker
• YARN
• Kubernetes (K8)
• Stateful NiFi clusters
• Apache OpenWhisk (FaaS)
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
{"registryUrl": "http://tspann-mbp15-hw14277:18080",
"bucketId": "140b30f0-5a47-4747-9021-19d4fde7f993",
"flowId": "0540e1fd-c7ca-46fb-9296-e37632021945",
"ssl": {
"keystoreFile": "","keystorePass": "","keyPass": "","keystoreType": "",
"truststoreFile":
"/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/lib/sec
urity/cacerts",
"truststorePass": "changeit", "truststoreType": "JKS"
},
"parameters": {
"broker" : "4.317.852.100:9092",
"topic" : "iot",
"group_id" : "nifi-stateless-kafka-consumer",
"DestinationDirectory" : "/tmp/nifistateless/output2/",
"output_dir": "/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/output"
}
}
https://github.com/tspannhw/stateless-examples
45. © 2020 Cloudera, Inc. All rights reserved. 45
PARQUETREADER
/
PARQUETWRITER
RECORDS
• Native Record Processors for
Apache Parquet Files!
• CVS <-> Parquet
• XML <-> Parquet
• AVRO <-> Parquet
• JSON <-> Parquet
• More...
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apac
he_7.html
46. © 2020 Cloudera, Inc. All rights reserved. 46
POSTSLACK
• Post Images to Slack
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
https://www.datainmotion.dev/2019/11/nifi-110-postslack-easy-image-upload.html
47. © 2020 Cloudera, Inc. All rights reserved. 47
REMOTE INPUT
PORT IN A
PROCESS GROUP
• Put Remote Connections for
Site-To-Site (S2S) Anywhere!
• Not only top level
• Drop down simplicity
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
48. 48
© 2023 Cloudera, Inc. All rights reserved.
READYFLOW
GALLERY
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Optimized to work with CDP
sources/destinations
• Can be deployed and adjusted
as needed
49. 49
© 2023 Cloudera, Inc. All rights reserved.
FLOW CATALOG
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
51. © 2023 Cloudera, Inc. All rights reserved. 51
Processing one billion events per second with NiFi
https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/
53. © 2023 Cloudera, Inc. All rights reserved. 53
CSP Community
Edition
• Kafka, KConnect, SMM, SR,
Flink, and SSB in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker compose file of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry
○ $>docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
55. © 2023 Cloudera, Inc. All rights reserved. 55
End to End Streaming Demo Pipeline
Enterprise
sources
Weather
Errors
Aggregates
Alerts
Stocks
ETL
Analytics
Streaming SQL
Clickstream Market data
Machine logs Social
https://github.com/tspannhw/CloudDemo2021
56. AI BASED ENHANCEMENTS
SERVE
SOURCES
Data Warehouse
Report
Sensorid
Sensor conditions
Machine Learning
Predict, Automate
Control System
REPORT
Visualize
CLOUD
Collect
COLLECT
Message Broker
Data Flow
Distribute
Sensor id
Temperature
COLLECT and DISTRIBUTE DATA
1. Data is collected from sensors that use mqtt protocol via
CEM and sent to CDP Public Cloud
2. Two CDF flows run in the cloud to accomplish our two
goals: streaming analytics and batch analytics
ENRICH, REPORT
Report, Automate
Real time alerting
SQL
Stream Builder
Data Visualization
Edge
Management
Humidity
Timestamp
Visualize
Data Visualization
USE DATA
3. Streaming Use Case: some conditions of our greenhouse
must be avoided and have to be controlled in real time.
Some warnings have been defined to alert us in case
alerting conditions are met, control system is
automatically activated to adjust environmental
variables.
4. Batch analytics: to ensure the optimal growth of our
plants the ideal conditions have to be met for each
plant. Each 6 hours the plant conditions are monitored
and in case some control adjustment is required a ML
model gives a suggestion about getting the optimal
point minimizing the cost.
EDGE
Control System
58. © 2023 Cloudera, Inc. All rights reserved. 58
Streaming
Solutions
When to use what?
• Routing vs Analytics
– Listeners
– Joins
– In-Memory
• Operational Load
• Current Skills
• Use NiFi
– Doing more than just Syndication
– Not just small Kafka sized events
– Edge Management is needed
– Listener Type use cases that bind to ports
– Single Product for light ETL, Lineage, Provenance,
Message Replay
• Use Flink
– Joining Streams
– Windowing
– Late Data Handling
– Streaming Analytics
• Use KConnect
– Kafka Centric
• Multiple-Tool Objections
– In-Memory Stateless
59. © 2023 Cloudera, Inc. All rights reserved. 59
STREAMING RESOURCES
• https://dzone.com/articles/real-time-stream-processing-with-hazelcast-an
d-streamnative
• https://flipstackweekly.com/
• https://www.datainmotion.dev/
• https://www.flankstack.dev/
• https://github.com/tspannhw
• https://medium.com/@tspann
• https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d739
5d714
• https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Str
eaming_Engineer.pdf
60. © 2023 Cloudera, Inc. All rights reserved. 60
https://github.com/tspannhw https://www.datainmotion.dev/
61. © 2023 Cloudera, Inc. All rights reserved. 61
April 4 - Atlanta
April 24 - San Francisco
May 10 - Virtual
Upcoming Events
63. © 2023 Cloudera, Inc. All rights reserved.
© 2021 Cloudera, Inc. All rights reserved. 63