Building Modern Data Streaming
Apps
Tim Spann
Principal Developer Advocate
25-May-2023
4
FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink x Java
5
FLaNK Stack Weekly
This week in Apache NiFi, Apache Flink, Apache Pulsar, Apache Spark, Apache Iceberg, Python, Java and Open Source friends.
https://bit.ly/32dAJft
© 2023 Cloudera, Inc. All rights reserved. 6
Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
https://openeyes.org.ua/en/donate
STREAMING
© 2019 Cloudera, Inc. All rights reserved. 9
Speed Matters
What does “real-time” response mean to your business?
Business event
TIME
Data latency
Analysis latency
Decision latency
Opportunity
Data captured
Information delivered
Action Taken
Data Freshness
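The latency breakdown on this slide can be made concrete with a little arithmetic. The sketch below assumes four hypothetical timestamps (business event, data captured, information delivered, action taken); the function name and the sample values are illustrative and do not come from the talk.

```python
from datetime import datetime, timedelta

def latency_breakdown(event, captured, delivered, acted):
    """Split the time between a business event and the action taken
    into the three latencies named on the slide."""
    return {
        "data_latency": captured - event,          # event -> data captured
        "analysis_latency": delivered - captured,  # captured -> information delivered
        "decision_latency": acted - delivered,     # delivered -> action taken
        "total": acted - event,
    }

# Hypothetical timestamps for a single business event.
t0 = datetime(2023, 5, 25, 12, 0, 0)
breakdown = latency_breakdown(
    event=t0,
    captured=t0 + timedelta(seconds=2),
    delivered=t0 + timedelta(seconds=7),
    acted=t0 + timedelta(seconds=37),
)
for name, value in breakdown.items():
    print(f"{name}: {value.total_seconds():.0f}s")
```

In this made-up case the decision step dominates, which is the point of the slide: faster ingestion alone does not make the response "real-time".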
10
What is Real-Time?
11
Streaming From … To …
Data distribution as a first class citizen
IOT DEVICES
LOG DATA SOURCES
ON-PREM DATA SOURCES
BIG DATA CLOUD SERVICES
CLOUD BUSINESS PROCESS SERVICES *
CLOUD DATA* ANALYTICS/SERVICE (Cloudera DW)
App Logs
Laptops/Servers
Mobile Apps
Security Agents
CLOUD WAREHOUSE
UNIVERSAL DATA DISTRIBUTION (Ingest, Transform, Deliver)
Ingest Processors
Ingest Gateway
Router, Filter & Transform Processors
Destination Processors
12
BUILDING REAL-TIME REQUIRES A TEAM
13
CDP: AN OPEN DATA LAKEHOUSE
METADATA AND DATA CATALOG
OBSERVABILITY REPLICATION
SECURITY & GOVERNANCE
Private Cloud
APACHE KAFKA
15
What is Apache Kafka?
Distributed: horizontally scalable
Partitioned: the data is split-up and distributed across the brokers
Replicated: allows for automatic failover
Unique: Kafka does not track the consumption of messages (the consumers do)
Fast: designed from the ground up with a focus on performance and throughput
Kafka was built at LinkedIn in 2011
Open sourced as an Apache project
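Two of the properties above, partitioning and consumer-side tracking of consumption, can be illustrated with a toy in-memory log. This is a minimal sketch and not a Kafka client: `ToyTopic` and `ToyConsumer` are hypothetical names, and the CRC32 partitioner is an assumption chosen for determinism (Kafka's default partitioner uses a different hash).

```python
import zlib
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for a partitioned topic (not a real Kafka client)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 modulo partition count. Kafka's default
        # partitioner uses a different hash; only the idea is the same.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_records=10):
        # The log never records who has read what; callers pass an offset.
        return self.partitions[partition][offset:offset + max_records]


class ToyConsumer:
    """Each consumer keeps its own offsets, one per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        records = []
        for p in range(len(self.topic.partitions)):
            batch = self.topic.read(p, self.offsets[p])
            self.offsets[p] += len(batch)
            records.extend(batch)
        return records


topic = ToyTopic(num_partitions=3)
for i in range(6):
    topic.append(f"sensor-{i % 2}", f"reading-{i}")

fast, slow = ToyConsumer(topic), ToyConsumer(topic)
first = fast.poll()                      # fast consumer reads everything so far
topic.append("sensor-0", "reading-6")
second = fast.poll()                     # only the new record
late = slow.poll()                       # slow consumer still sees all records
print(len(first), len(second), len(late))
```

Because the log itself stores no read positions, a slow or newly started consumer can replay from any offset without affecting the others. Records with the same key land in the same partition, so their relative order is preserved.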
16
Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream data.
Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic.
Wikipedia
17
What Can You Do With Apache Kafka?
Web site activity: track page views, searches, etc. in real time
Events & log aggregation: particularly in distributed systems where messages come from multiple sources
Monitoring and metrics: aggregate statistics from distributed applications and build a dashboard application
Stream processing: process raw data, clean it up, and forward it on to another topic or messaging system
Real-time data ingestion: fast processing of a very large volume of messages
18
Kafka Terms
● Kafka is a publish/subscribe messaging system composed of the following components:
○ Topic: a message feed
○ Producer: a process that publishes messages to a topic
○ Consumer: a process that subscribes to a topic and processes its messages
○ Broker: a server in a Kafka cluster
19
• Highly reliable distributed messaging system
• Decouples applications and enables many-to-many patterns
• Publish-subscribe semantics
• Horizontal scalability
• Efficient implementation to operate at speed with big data volumes
• Organized by topic to support several use cases
Source System
Source System
Source System
Kafka
Fraud Detection
Security Systems
Real-Time Monitoring
Many-To-Many Publish-Subscribe
EVENTS
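The many-to-many pattern in this diagram can be sketched with a tiny in-process hub. This is an illustration under stated assumptions, not Kafka code: `ToyBroker`, the topic names and the event fields are hypothetical, and the hub pushes events synchronously whereas Kafka consumers pull from the brokers.

```python
from collections import defaultdict

class ToyBroker:
    """Minimal in-process publish/subscribe hub (illustrative only)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Producers never name a consumer; the topic is the only contract.
        for callback in self.subscribers[topic]:
            callback(event)


broker = ToyBroker()
fraud_seen, security_seen, monitoring_seen = [], [], []

# Three downstream systems pick the topics they care about.
broker.subscribe("payments", fraud_seen.append)
broker.subscribe("logins", security_seen.append)
broker.subscribe("payments", monitoring_seen.append)
broker.subscribe("logins", monitoring_seen.append)

# Three source systems publish without knowing who is listening.
broker.publish("payments", {"source": "pos", "amount": 42})
broker.publish("payments", {"source": "web", "amount": 7})
broker.publish("logins", {"source": "mobile", "user": "alice"})

print(len(fraud_seen), len(security_seen), len(monitoring_seen))
```

Adding a fourth downstream system is one more `subscribe` call; no source system changes, which is the decoupling the slide describes.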
APACHE FLINK
21
Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
22
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors;
-- use event time timestamp from kafka
-- exactly once compatible
SELECT eventTimestamp FROM sensors;
-- nested structures access
SELECT foo.`bar` FROM `table`; -- must quote nested column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
  UNNEST(b.path) AS u(pathitem);
-- aggregations and windows
SELECT card,
  MAX(amount) AS theamount,
  TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
  AND lon IS NOT NULL
GROUP BY card,
  TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4; -- >4==fraud
-- try to do this in ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
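To show what the windowed fraud query above computes, here is a batch emulation of its logic in plain Python. It is a sketch under assumptions: the tuple layout, the function name `suspicious_cards` and the sample events are hypothetical, timestamps are plain seconds, and a real Flink job would emit each window's result incrementally as watermarks advance rather than over a finished list.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # matches the 5-minute interval in the query

def suspicious_cards(payments, min_count=5):
    """payments: iterable of (card, amount, event_ts_seconds, lat, lon).
    Returns rows of (card, max_amount, window_end_seconds)."""
    groups = defaultdict(list)
    for card, amount, ts, lat, lon in payments:
        if lat is None or lon is None:             # WHERE lat/lon IS NOT NULL
            continue
        window_start = ts - (ts % WINDOW_SECONDS)  # TUMBLE: fixed, non-overlapping
        groups[(card, window_start)].append(amount)

    rows = []
    for (card, window_start), amounts in sorted(groups.items()):
        if len(amounts) >= min_count:              # HAVING COUNT(*) > 4
            rows.append((card, max(amounts), window_start + WINDOW_SECONDS))
    return rows

# Hypothetical events: card "A" is used five times inside one window.
events = [("A", 10.0 * i, 60 * i, 40.3, -74.6) for i in range(5)]
events += [("B", 99.0, 30, 40.3, -74.6)]    # only one payment in the window
events += [("A", 500.0, 100, None, None)]   # dropped by the WHERE clause
result = suspicious_cards(events)
print(result)
```

Each event belongs to exactly one window because the window start is the timestamp rounded down to a multiple of the window size; that is what distinguishes a tumbling window from a sliding one.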
DATAFLOW
APACHE NIFI
24
Apache NiFi
Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system, with built-in end-to-end security and provenance
ACQUIRE PROCESS DELIVER
• Over 450 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
Advanced tooling to industrialize flow development (Flow Development Life Cycle)
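The "Throttle & Backpressure" item can be illustrated with a small simulation. This is a toy model under assumptions, not NiFi code: the `Connection` class, the tick-based scheduler and all numbers are hypothetical. It mimics the general behaviour where an upstream processor is no longer scheduled once the queue between two processors reaches its object threshold.

```python
from collections import deque

class Connection:
    """Toy queue between two processors with an object-count threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.queue = deque()

    def is_full(self):
        return len(self.queue) >= self.threshold


def run(ticks, produce_per_tick, consume_per_tick, threshold):
    conn = Connection(threshold)
    produced = consumed = skipped = 0
    for _tick in range(ticks):
        # Upstream processor: not scheduled while the connection is full.
        if conn.is_full():
            skipped += 1
        else:
            for _item in range(produce_per_tick):
                conn.queue.append(produced)
                produced += 1
        # Downstream processor drains at its own, slower pace.
        for _item in range(min(consume_per_tick, len(conn.queue))):
            conn.queue.popleft()
            consumed += 1
    return produced, consumed, skipped, len(conn.queue)


produced, consumed, skipped, queued = run(
    ticks=20, produce_per_tick=5, consume_per_tick=2, threshold=10
)
print(produced, consumed, skipped, queued)
```

The threshold here is soft: a batch that starts below the limit may push the queue slightly past it. What it guarantees is that the queue stops growing without bound, so a slow destination slows the source instead of exhausting memory or disk.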
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
CONTROL RATE
DISTRIBUTE LOAD
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ENCRYPT
TAIL
EVALUATE
EXECUTE
25
Provenance
26
Extensibility
● Built from the ground up with extensions in mind
● Service-loader pattern for…
○ Processors
○ Controller Services
○ Reporting Tasks
○ Prioritizers
● Extensions packaged as NiFi Archives (NARs)
○ Deploy to the NiFi lib directory and restart
○ Same model as standard components
27
Custom Processors
https://github.com/tspannhw/nifi-extracttext-processor
https://github.com/tspannhw/nifi-tensorflow-processor
https://github.com/tspannhw/nifi-nlp-processor
https://github.com/tspannhw/nifi-convertjsontoddl-processor
https://github.com/tspannhw/nifi-corenlp-processor
https://github.com/tspannhw/nifi-imageextractor-processor
https://github.com/tspannhw/nifi-attributecleaner-processor
https://github.com/tspannhw/linkextractorprocessor
https://github.com/tspannhw/GetWebCamera
https://github.com/tspannhw/nifi-langdetect-processor
https://github.com/tspannhw/nifi-postimage-processor
28
Parquet Reader/Writers
● Native Record Processors for Apache Parquet Files!
● CSV <-> Parquet
● XML <-> Parquet
● AVRO <-> Parquet
● JSON <-> Parquet
● More...
29
NiFi Load Balancing
• Improve NiFi cluster throughput
• Defined at connection level
• Configurable balancing strategies
• Critical for the scale up paradigm in Kubernetes
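Two of the balancing strategies NiFi offers on a connection are round robin and partition by attribute. The sketch below shows the idea of each in plain Python; it is not NiFi's implementation. The node names, the FlowFile dictionaries and the CRC32 hash are assumptions made for the example.

```python
import zlib
from itertools import cycle

def round_robin(flowfiles, nodes):
    """Spread FlowFiles evenly across nodes in arrival order."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment

def partition_by_attribute(flowfiles, nodes, attribute):
    """Send FlowFiles that share an attribute value to the same node."""
    assignment = {node: [] for node in nodes}
    for flowfile in flowfiles:
        value = str(flowfile[attribute])
        # Assumption: CRC32 as a stable hash; NiFi's internal hashing may differ.
        node = nodes[zlib.crc32(value.encode("utf-8")) % len(nodes)]
        assignment[node].append(flowfile)
    return assignment

nodes = ["nifi-0", "nifi-1", "nifi-2"]   # hypothetical node names
flowfiles = [{"id": i, "customer": f"c{i % 4}"} for i in range(12)]

rr = round_robin(flowfiles, nodes)
by_customer = partition_by_attribute(flowfiles, nodes, "customer")
print({node: len(ffs) for node, ffs in rr.items()})
```

Round robin gives the most even spread, while partitioning by attribute trades some evenness for the guarantee that related FlowFiles are processed on the same node.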
30
ReadyFlow Gallery
• Cloudera provided flow definitions
• Cover most common data flow use cases
• Optimized to work with CDP sources/destinations
• Can be deployed and adjusted as needed
31
Flow Catalog
• Central repository for flow definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
Apache NiFi with Python Custom Processors
Python as a 1st class citizen
33
Processing millions of events with NiFi
SOURCES AND SINKS
35
© 2022 Cloudera, Inc. All rights reserved.
APACHE ICEBERG
A Flexible, Performant & Scalable Table Format
• Donated by Netflix to the Apache Software Foundation in 2018
• Flexibility
– Hidden partitioning
– Full schema evolution
• Data Warehouse Operations
– Atomic, Consistent, Isolated, Durable (ACID) transactions
– Time travel and rollback
• Supports best in class SQL performance
– High performance at Petabyte scale
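Time travel and rollback follow from one idea: every commit publishes a new immutable snapshot and readers pick which snapshot to scan. The toy table below illustrates that idea only. `ToyTable` is a hypothetical class, and it copies rows for simplicity, whereas Iceberg tracks data files through metadata and manifests rather than copying data.

```python
class ToyTable:
    """Snapshot-per-commit table; a sketch of the idea, not Iceberg's format."""

    def __init__(self):
        self.snapshots = [()]      # snapshot 0 is the empty table
        self.current = 0

    def append(self, rows):
        # A commit never edits old data; it publishes a new immutable snapshot.
        new_rows = self.snapshots[self.current] + tuple(rows)
        self.snapshots.append(new_rows)
        self.current = len(self.snapshots) - 1
        return self.current

    def scan(self, snapshot_id=None):
        # Time travel: read any earlier snapshot by id.
        sid = self.current if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid])

    def rollback(self, snapshot_id):
        # Rollback just moves the "current" pointer back.
        self.current = snapshot_id


table = ToyTable()
s1 = table.append([{"id": 1}, {"id": 2}])
s2 = table.append([{"id": 3}])
as_of_s1 = table.scan(snapshot_id=s1)   # read the table as of the first commit
table.rollback(s1)                      # undo the second commit for new readers
after_rollback = table.scan()
print(len(as_of_s1), len(table.scan(snapshot_id=s2)), len(after_rollback))
```

Because a reader only ever sees one complete snapshot, a scan never observes a half-finished write, which is how the same mechanism supports the atomicity and isolation listed above.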
DEMO AND CODE
https://github.com/tspannhw/FLaNK-TravelAdvisory
https://github.com/tspannhw/FLaNK-Edge
CREATE TABLE `sr1`.`default_database`.`traveladvisory` (
`title` VARCHAR(2147483647),
`pubdate` VARCHAR(2147483647),
`link` VARCHAR(2147483647),
`guid` VARCHAR(2147483647),
`advisoryId` VARCHAR(2147483647),
`domain` VARCHAR(2147483647),
`category` VARCHAR(2147483647),
`description` VARCHAR(2147483647),
`uuid` VARCHAR(2147483647),
`ts` BIGINT NOT NULL
) COMMENT 'traveladvisory'
WITH (
'properties.bootstrap.servers' = 'kafka:9092',
'avro-cloudera.properties.schema.registry.url' = 'http://schema-registry:7788/api/v1',
'connector' = 'kafka',
'avro-cloudera.schema-name' = 'traveladvisory',
'format' = 'avro-cloudera',
'topic' = 'traveladvisory',
'scan.startup.mode' = 'latest-offset'
)
https://github.com/tspannhw/FLaNK-MTA
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
RESOURCES AND WRAP-UP
48
Streaming Tech Debt Tips
● Version Control All Assets
● Operationalize with K8s
● Use DevOps and APIs
● Latest Java and Python
● Stream Sizing (NiFi, Kafka, Flink)
● Unit and Integration Test
● Backup everything
● Scale in 3s
49
Streaming Resources
● https://dzone.com/articles/real-time-stream-processing-with-hazelcast-and-streamnative
● https://flipstackweekly.com/
● https://www.datainmotion.dev/
● https://www.flankstack.dev/
● https://github.com/tspannhw
● https://medium.com/@tspann
● https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d7395d714
● https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Streaming_Engineer.pdf
FREE LEARNING ENVIRONMENT
51
CSP Community Edition
● Gets developers zero to Flink in less than an hour
○ Experiment with features
○ Develop apps locally
● One docker compose file of CSP which includes:
○ All dependencies required to run
○ Kafka, Kafka Connect and Flink
○ Streams Messaging Manager
○ Schema Registry
○ SQL Stream Builder Projects
● Licensed under the Cloudera Community License
● Unsupported
https://www.cloudera.com/downloads/cdf/csp-community-edition.html
● Community Group Hub (Discussion Forum) for CSP
● Find it on docs.cloudera.com under Applications
Open Source Edition
• Apache NiFi in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker NiFi
○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
● Licensed under the ASF License
● Unsupported
https://hub.docker.com/r/apache/nifi
53
Resources
54
55
THANK YOU

BigDataFest_ Building Modern Data Streaming Apps

  • 1. Building Modern Data Streaming Apps Tim Spann Principal Developer Advocate 25-May-2023
  • 2.
  • 3.
  • 4. FLaNK Stack: Tim Spann, @PaasDev // Blog: www.datainmotion.dev. Principal Developer Advocate. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC. https://github.com/tspannhw/EverythingApacheNiFi https://medium.com/@tspann Apache NiFi x Apache Kafka x Apache Flink x Java
  • 5. FLaNK Stack Weekly: This week in Apache NiFi, Apache Flink, Apache Pulsar, Apache Spark, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft
  • 6. © 2023 Cloudera, Inc. All rights reserved. 6 Future of Data - Princeton + Virtual @PaasDev https://www.meetup.com/futureofdata-princeton From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  • 9. Speed Matters: What does “real-time” response mean to your business?
      Diagram labels: Business event, Time, Data latency, Analysis latency, Decision latency, Opportunity, Data captured, Information delivered, Action taken, Data freshness
  • 11. Streaming From … To …: Data distribution as a first class citizen
      Sources: IoT devices, log data sources, on-prem data sources, app logs, laptops/servers, mobile apps, security agents
      Universal data distribution (Ingest, Transform, Deliver): ingest processors; ingest gateway; router, filter & transform processors; destination processors
      Destinations: big data cloud services, cloud business process services*, cloud data* analytics/service (Cloudera DW), cloud warehouse
  • 12. BUILDING REAL-TIME REQUIRES A TEAM
  • 13. CDP: AN OPEN DATA LAKEHOUSE. Metadata and data catalog, observability, replication, security & governance, private cloud
  • 15. What is Apache Kafka?
      ● Distributed: horizontally scalable
      ● Partitioned: the data is split up and distributed across the brokers
      ● Replicated: allows for automatic failover
      ● Unique: Kafka does not track the consumption of messages (the consumers do)
      ● Fast: designed from the ground up with a focus on performance and throughput
      ● Kafka was built at LinkedIn in 2011
      ● Open sourced as an Apache project
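To make "partitioned" and "the consumers track consumption" concrete, here is a minimal in-memory sketch. It is not the Kafka client API: `ToyTopic` and `ToyConsumer` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever hash a real partitioner uses.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a partitioned topic (hypothetical, not the Kafka API)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> int:
        # Hash the key so records with the same key land in the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def read(self, partition: int, offset: int) -> list:
        # The log is append-only; reading never removes records.
        return self.partitions[partition][offset:]


class ToyConsumer:
    """Keeps its own read position per partition; the topic stores no consumer state."""

    def __init__(self, topic: ToyTopic) -> None:
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self) -> list:
        records = []
        for partition in range(len(self.topic.partitions)):
            batch = self.topic.read(partition, self.offsets[partition])
            self.offsets[partition] += len(batch)
            records.extend(batch)
        return records
```

Because the position lives in the consumer, two `ToyConsumer` instances can read the same topic independently, and a new consumer sees every record from the start.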
  • 16. Yes, Franz, It’s Kafka. Let’s do a metamorphosis on your data. Don’t fear changing data. You don’t need to be a brilliant writer to stream data. "Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic." (Wikipedia)
  • 17. What Can You Do With Apache Kafka?
      ● Web site activity: track page views, searches, etc. in real time
      ● Events & log aggregation: particularly in distributed systems where messages come from multiple sources
      ● Monitoring and metrics: aggregate statistics from distributed applications and build a dashboard application
      ● Stream processing: process raw data, clean it up, and forward it on to another topic or messaging system
      ● Real-time data ingestion: fast processing of a very large volume of messages
  • 18. Kafka Terms
      ● Kafka is a publish/subscribe messaging system composed of the following components:
          ○ Topic: a message feed
          ○ Producer: a process that publishes messages to a topic
          ○ Consumer: a process that subscribes to a topic and processes its messages
          ○ Broker: a server in a Kafka cluster
  • 19.
      • Highly reliable distributed messaging system
      • Decouples applications and enables many-to-many patterns
      • Publish-subscribe semantics
      • Horizontal scalability
      • Efficient implementation to operate at speed with big data volumes
      • Organized by topic to support several use cases
      Diagram: several source systems publish events to Kafka; Fraud Detection, Security Systems, and Real-Time Monitoring subscribe (many-to-many publish-subscribe)
  • 21. Flink SQL
      https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
      ● Streaming Analytics
      ● Continuous SQL
      ● Continuous ETL
      ● Complex Event Processing
      ● Standard SQL Powered by Apache Calcite
  • 22. Flink SQL
      -- specify Kafka partition key on output
      SELECT foo AS _eventKey FROM sensors

      -- use event time timestamp from kafka
      -- exactly once compatible
      SELECT eventTimestamp FROM sensors

      -- nested structures access
      SELECT foo.`bar` FROM table; -- must quote nested column

      -- timestamps
      SELECT * FROM payments
      WHERE eventTimestamp > CURRENT_TIMESTAMP - interval '10' second;

      -- unnest
      SELECT b.*, u.* FROM bgp_avro b, UNNEST(b.path) AS u(pathitem)

      -- aggregations and windows
      SELECT card,
             MAX(amount) AS theamount,
             TUMBLE_END(eventTimestamp, interval '5' minute) AS ts
      FROM payments
      WHERE lat IS NOT NULL AND lon IS NOT NULL
      GROUP BY card, TUMBLE(eventTimestamp, interval '5' minute)
      HAVING COUNT(*) > 4 -- >4 == fraud

      -- try to do this in ksql!
      SELECT us_west.user_score + ap_south.user_score
      FROM kafka_in_zone_us_west us_west
      FULL OUTER JOIN kafka_in_zone_ap_south ap_south
      ON us_west.user_id = ap_south.user_id;

      Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
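The tumbling-window aggregation on this slide can be hard to picture from the SQL alone. The sketch below reproduces its logic over a finite list of payments in plain Python. It is an illustration only: it assumes each payment is a dict with `card`, `amount`, `lat`, `lon` and a `ts` field holding the event time as epoch seconds, and it ignores watermarks, late data, and everything else Flink does for an unbounded stream. The function name is hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # corresponds to interval '5' minute in the query


def tumbling_fraud_candidates(payments, min_count=5):
    """Group payments by (card, 5-minute window) and keep the busy groups.

    payments: iterable of dicts with card, amount, ts (epoch seconds), lat, lon.
    Returns a dict keyed by (card, window_end) whose value is the maximum amount.
    """
    groups = defaultdict(list)
    for payment in payments:
        # WHERE lat IS NOT NULL AND lon IS NOT NULL
        if payment.get("lat") is None or payment.get("lon") is None:
            continue
        # Tumbling windows are fixed-size and non-overlapping, so each
        # timestamp maps to exactly one window.
        window_start = payment["ts"] - (payment["ts"] % WINDOW_SECONDS)
        window_end = window_start + WINDOW_SECONDS  # like TUMBLE_END
        groups[(payment["card"], window_end)].append(payment["amount"])
    # HAVING COUNT(*) > 4, then MAX(amount) per group
    return {
        key: max(amounts)
        for key, amounts in groups.items()
        if len(amounts) >= min_count
    }
```

A card only appears in the result when it has at least five located payments inside the same five-minute window, which is the condition the slide labels as fraud.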
  • 24. Apache NiFi
      Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance.
      ACQUIRE / PROCESS / DELIVER
      • Over 450 Prebuilt Processors
      • Easy to build your own
      • Parse, Enrich & Apply Schema
      • Filter, Split, Merge & Route
      • Throttle & Backpressure
      • Guaranteed Delivery
      • Full data provenance from acquisition to delivery
      • Diverse, Non-Traditional Sources
      • Eco-system integration
      Advanced tooling to industrialize flow development (Flow Development Life Cycle)
      Diagram labels (sources and destinations): FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
      Diagram labels (processing): HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE
  • 25. Provenance
  • 26. Extensibility
      ● Built from the ground up with extensions in mind
      ● Service-loader pattern for…
          ○ Processors
          ○ Controller Services
          ○ Reporting Tasks
          ○ Prioritizers
      ● Extensions packaged as NiFi Archives (NARs)
          ○ Deploy to the NiFi lib directory and restart
          ○ Same model as standard components
  • 28. Parquet Readers/Writers
      ● Native Record Processors for Apache Parquet Files!
      ● CSV <-> Parquet
      ● XML <-> Parquet
      ● AVRO <-> Parquet
      ● JSON <-> Parquet
      ● More...
  • 29. NiFi Load Balancing
      • Improve NiFi cluster throughput
      • Defined at connection level
      • Configurable balancing strategies
      • Critical for scale-up paradigm in Kubernetes
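Two of the balancing strategies NiFi offers on a connection are round robin and partition by attribute. The sketch below shows the idea behind each in plain Python. It is a simplified model, not NiFi code: FlowFiles are represented as dicts of attributes, the function names are hypothetical, and `zlib.crc32` is an assumed stand-in for NiFi's own hashing.

```python
import zlib
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Assign FlowFiles to nodes in turn, spreading load evenly."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


def partition_by_attribute(flowfiles, nodes, attribute):
    """Send FlowFiles that share an attribute value to the same node."""
    assignment = {node: [] for node in nodes}
    for flowfile in flowfiles:
        value = str(flowfile.get(attribute, ""))
        # Same attribute value -> same hash -> same node.
        node = nodes[zlib.crc32(value.encode("utf-8")) % len(nodes)]
        assignment[node].append(flowfile)
    return assignment
```

Round robin gives the most even spread, while partitioning by attribute trades some evenness for keeping related data (for example, all FlowFiles for one customer) on one node.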
  • 30. ReadyFlow Gallery
      • Cloudera-provided flow definitions
      • Cover most common data flow use cases
      • Optimized to work with CDP sources/destinations
      • Can be deployed and adjusted as needed
  • 31. Flow Catalog
      • Central repository for flow definitions
      • Import existing NiFi flows
      • Manage flow definitions
      • Initiate flow deployments
  • 32. Apache NiFi with Python Custom Processors Python as a 1st class citizen
  • 33. Processing millions of events with NiFi
  • 35. APACHE ICEBERG: A Flexible, Performant & Scalable Table Format
      • Donated by Netflix to the Apache Software Foundation in 2018
      • Flexibility
          – Hidden partitioning
          – Full schema evolution
      • Data Warehouse Operations
          – Atomic, Consistent, Isolated, Durable (ACID) transactions
          – Time travel and rollback
      • Supports best-in-class SQL performance
          – High performance at petabyte scale
  • 41. CREATE TABLE `sr1`.`default_database`.`traveladvisory` (
        `title` VARCHAR(2147483647),
        `pubdate` VARCHAR(2147483647),
        `link` VARCHAR(2147483647),
        `guid` VARCHAR(2147483647),
        `advisoryId` VARCHAR(2147483647),
        `domain` VARCHAR(2147483647),
        `category` VARCHAR(2147483647),
        `description` VARCHAR(2147483647),
        `uuid` VARCHAR(2147483647),
        `ts` BIGINT NOT NULL
      ) COMMENT 'traveladvisory'
      WITH (
        'properties.bootstrap.servers' = 'kafka:9092',
        'avro-cloudera.properties.schema.registry.url' = 'http://schema-registry:7788/api/v1',
        'connector' = 'kafka',
        'avro-cloudera.schema-name' = 'traveladvisory',
        'format' = 'avro-cloudera',
        'topic' = 'traveladvisory',
        'scan.startup.mode' = 'latest-offset'
      )
  • 48. Streaming Tech Debt Tips
      ● Version Control All Assets
      ● Operationalize with K8s
      ● Use DevOps and APIs
      ● Latest Java and Python
      ● Stream Sizing (NiFi, Kafka, Flink)
      ● Unit and Integration Test
      ● Backup everything
      ● Scale in 3s
  • 49. Streaming Resources
      ● https://dzone.com/articles/real-time-stream-processing-with-hazelcast-and-streamnative
      ● https://flipstackweekly.com/
      ● https://www.datainmotion.dev/
      ● https://www.flankstack.dev/
      ● https://github.com/tspannhw
      ● https://medium.com/@tspann
      ● https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d7395d714
      ● https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Streaming_Engineer.pdf
  • 51. CSP Community Edition
      ● Gets developers from zero to Flink in less than an hour
          ○ Experiment with features
          ○ Develop apps locally
      ● One docker compose file of CSP which includes:
          ○ All dependencies required to run
          ○ Kafka, Kafka Connect and Flink
          ○ Streams Messaging Manager
          ○ Schema Registry
          ○ SQL Stream Builder Projects
      ● Licensed under the Cloudera Community License
      ● Unsupported
        https://www.cloudera.com/downloads/cdf/csp-community-edition.html
      ● Community Group Hub (Discussion Forum) for CSP
      ● Find it on docs.cloudera.com under Applications
  • 52. Open Source Edition
      • Apache NiFi in Docker
      • Try new features quickly
      • Develop applications locally
      ● Docker NiFi
          ○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
      ● Licensed under the ASF License
      ● Unsupported
        https://hub.docker.com/r/apache/nifi