The Never Landing Stream with HTAP and Streaming

Timothy Spann
Principal Developer Advocate
Introduction
The Never Landing Stream with HTAP and Streaming
FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
Apache NiFi x Apache Kafka x Apache Flink
Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
CDP IS THE ONLY HYBRID DATA PLATFORM
Hybrid. Open. Portable. Secure.
CLOUDERA DATA PLATFORM
OPEN DATA LAKEHOUSE
Storage layers shown: S3, GCS, ADLS, Ozone
Apache NiFi
CLOUDERA FLOW MANAGEMENT - POWERED BY APACHE NiFi
Ingest and manage data from edge-to-cloud using a no-code interface
● #1 data ingestion/movement engine
● Strong community
● Product maturity over 11 years
● Deploy on-premises or in the cloud
● 400+ pre-built processors
● Built-in data provenance
● Guaranteed delivery
● Throttling and backpressure
Cloudera Flow Management
Ingest and manage data from edge-to-cloud using a no-code interface
• Over 300 pre-built processors
• Easy to build your own processors
• Parse, enrich & apply schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed delivery
• Full data provenance
• Ecosystem integration
Advanced tooling to industrialize flow development (Flow Development Life Cycle)
ACQUIRE: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
PROCESS: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TALL, EVALUATE, EXECUTE
DELIVER: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
SQL BASED ROUTING WITH NiFi’s QueryRecord Processor
● QueryRecord processor: executes a SQL statement against records and writes the results to the flow file content.
● CSVReader: looks up the schema from Schema Registry (SR) and converts CSV records into process records.
● SQL execution via Apache Calcite: executes the configured SQL against the process records for routing.
● CSVRecordSetWriter: converts the query result from process records back into CSV for the flow file content.
Route the geo and speed streams using standard SQL instead of complex regular expressions.
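The SQL-based routing described above can be sketched outside NiFi. The following Python example is an illustration only, not NiFi code: it loads CSV records into an in-memory SQLite table named FLOWFILE (the table name that QueryRecord queries select from) and runs one SQL statement per route. SQLite stands in for Apache Calcite, and the column names, sample records, and the two route names (speeding, in_region) are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical truck telemetry; in NiFi this would be the flow file content.
CSV_CONTENT = """driver_id,speed,latitude,longitude
11,72,40.35,-74.65
12,55,40.12,-73.91
13,91,41.22,-74.75
"""

# One SQL statement per route, similar in spirit to QueryRecord's
# dynamic properties (route name -> query against FLOWFILE).
ROUTES = {
    "speeding": "SELECT * FROM FLOWFILE WHERE speed > 70",
    "in_region": "SELECT * FROM FLOWFILE WHERE latitude BETWEEN 40.0 AND 40.5",
}


def route_records(csv_text, routes):
    """Return {route_name: CSV text of the rows matched by that route's SQL}."""
    # Reader step: parse CSV text into records (dicts keyed by column name).
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE FLOWFILE (driver_id INTEGER, speed INTEGER, "
        "latitude REAL, longitude REAL)"
    )
    conn.executemany(
        "INSERT INTO FLOWFILE VALUES (:driver_id, :speed, :latitude, :longitude)",
        rows,
    )
    routed = {}
    for name, sql in routes.items():
        # Query step: run the route's SQL against the loaded records.
        cursor = conn.execute(sql)
        header = [col[0] for col in cursor.description]
        # Writer step: serialize the matching records back to CSV.
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(cursor.fetchall())
        routed[name] = out.getvalue()
    conn.close()
    return routed


if __name__ == "__main__":
    for route, content in route_records(CSV_CONTENT, ROUTES).items():
        print(f"--- {route} ---")
        print(content, end="")
```

Each route is a plain WHERE clause, so adding a route means adding one query rather than writing another regular expression.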
Key Differentiators
Comprehensive streaming platform – Only vendor to offer an open and comprehensive streaming platform for real-time data ingestion and processing to produce prescriptive and predictive analytics
Stream to Cloud – Extend the same on-premises streaming capabilities to the cloud with full support for multi-cloud and hybrid cloud models
400+ pre-built processors – Only product to offer such comprehensive connectivity to a wide range of data sources from edge to cloud
Enterprise-Grade Security & Governance – Deploy your streaming applications with confidence and trust with Cloudera SDX offering unified security and governance across the entire platform
Democratize access to real-time data – Enable data analysts and other personas to quickly build streaming applications with just SQL
Development & Runtime of DataFlow Functions
Step 1. Develop functions on a local workstation or in CDP Public Cloud using the no-code UI designer
Step 2. Run functions on serverless compute services in AWS, Azure & GCP: AWS Lambda, Azure Functions, Google Cloud Functions
DataFlow Functions Use Cases
Trigger Based, Batch, Scheduled and Microservice Use Cases
Serverless Trigger-Based File Processing Pipeline
Develop & run data processing pipelines when files are created or updated in any of the cloud object stores.
Example: When a photo is uploaded to object storage, a data flow is triggered that runs image resizing code and delivers the resized image to different locations.
Serverless Workflows / Orchestration
Chain different low-code functions to build complex workflows.
Example: Automate the handling of support tickets in a call center or orchestrate data movement across different cloud services.
Serverless Scheduled Tasks
Develop and run scheduled tasks without any code at pre-defined time intervals.
Example: Offload an external database running on-premises into the cloud once a day, every morning at 4:00 a.m.
Serverless Microservices
Build and deploy independent serverless modules that power your application's microservices architecture.
Example: Event-driven functions for easy communication between thousands of decoupled services that power a ride-sharing application.
Serverless Web APIs
Easily build endpoints for your web applications with HTTP APIs without any code using DFF and any of the cloud providers' function triggers.
Example: Build high-performance, scalable web applications across multiple data centers.
Serverless Customized Triggers
With the DFF State feature, build flows to create customized triggers allowing access to on-premises or external services.
Example: Near-real-time offloading of files from a remote SFTP server.
Flow Catalog
• Central repository for flow definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
ReadyFlows
• Cloudera-provided flow definitions
• Cover most common data flow use cases
• Can be deployed and adjusted as needed
• Made available through docs during Tech Preview
Deployment Wizard
• Turns flow definitions into flow deployments
• Guides users through providing required configuration
• Pick from pre-defined NiFi node sizes
• Define KPIs for the deployment
Steps shown: Start Deployment Wizard, Provide Parameters, Configure Sizing & Scaling, Define KPIs
Key Performance Indicators
• Visibility into flow deployments
• Track high level flow performance
• Track in-depth NiFi component metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in Deployment Details
Screens shown: KPI Definition in Deployment Wizard, KPI Monitoring
Dashboard
• Central Monitoring View
• Monitors flow deployments across CDP environments
• Monitors flow deployment health & performance
• Drill into flow deployment to monitor system metrics and deployment events
Data Flow Design for Everyone
• Cloud-native data flow development
• Developers get their own sandbox
• Start developing flows without installing NiFi
• Redesigned visual canvas
• Optimized interaction patterns
• Integration into CDF-PC Catalog for versioning
https://docs.pingcap.com/tidb/dev/mysql-compatibility
Data Distribution and Sharing with TiDB
NiFi Ingesting REST API
● NiFi consumes streams (CDC, REST, sensors)
● Distributes in real time to Kafka and MySQL at the same time
● Flink SQL consumes from Kafka
● TiDB CDC -> Kafka
https://ossinsight.io/docs/api
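The fan-out on this slide, one incoming stream delivered to both a message topic and a SQL table at the same time, can be sketched as follows. This is a minimal illustration, not NiFi or Kafka code: a Python list stands in for the Kafka topic, an in-memory SQLite table stands in for the MySQL-compatible database, and the event fields and repository names are hypothetical.

```python
import json
import sqlite3

# Hypothetical events, as they might arrive from a REST API poll.
SAMPLE_EVENTS = [
    {"id": 1, "type": "PushEvent", "repo": "example/repo-a"},
    {"id": 2, "type": "WatchEvent", "repo": "example/repo-b"},
    {"id": 1, "type": "PushEvent", "repo": "example/repo-a"},  # redelivered duplicate
]


def open_table():
    """Create the stand-in SQL sink: an in-memory table keyed by event id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, type TEXT, repo TEXT)"
    )
    return conn


def distribute(events, topic, conn):
    """Deliver every event to both sinks: the topic (a list) and the table."""
    for event in events:
        # Sink 1: append the serialized event to the append-only log.
        topic.append(json.dumps(event, sort_keys=True))
        # Sink 2: upsert the same event, so a redelivery does not add a row.
        conn.execute(
            "INSERT OR REPLACE INTO events (id, type, repo) VALUES (?, ?, ?)",
            (event["id"], event["type"], event["repo"]),
        )
    conn.commit()


if __name__ == "__main__":
    topic = []
    conn = open_table()
    distribute(SAMPLE_EVENTS, topic, conn)
    print(len(topic), "messages in topic")
    print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0], "rows in table")
```

The duplicate event shows why the two sinks are complementary: the log keeps every delivery for downstream stream processing, while the table keeps one current row per key for queries.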
Apache Kafka
Data Distribution with Apache Kafka
Apache Kudu
Why Kudu?
A simultaneous combination of sequential and random reads and writes
Time Series Data: Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
Machine Data Analytics: Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
Online Reporting: How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?
Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP)
https://hpi.de/fileadmin/user_upload/hpi/navigation/10_forschung/20_future_soc_lab/Poster/2019-1/Tozun_FSOC-Poster_20191_150443.pdf
HTAP Options - Apache Kudu
HTAP Options - TiDB
Apache Flink SQL
SQL STREAM BUILDER (CLOUDERA SSB)
SQL Stream Builder allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL.
No Java or Scala code development required.
Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more.
Enrich streaming data with batch data in a single tool.
Democratize access to real-time data with just SQL
SSB MATERIALIZED VIEWS
Key takeaway: materialized views (MVs) let data scientists, analysts, and developers consume data from the firehose.
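One way to picture a materialized view over a stream is a keyed table that is updated as each event arrives, so readers query the latest state instead of replaying the firehose. The sketch below is a simplified in-memory illustration, not SSB's implementation; the class name, the key field, and the sample sensor events are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sensor readings arriving in stream order.
SAMPLE_EVENTS = [
    {"sensor_id": "a", "temp_f": 71.0},
    {"sensor_id": "b", "temp_f": 68.0},
    {"sensor_id": "a", "temp_f": 75.5},  # newer reading for sensor "a"
]


@dataclass
class MaterializedView:
    """Latest row per key, maintained incrementally from a stream of events."""

    key_field: str
    rows: dict = field(default_factory=dict)

    def apply(self, event):
        """Upsert: the newest event for a key replaces the previous row."""
        self.rows[event[self.key_field]] = dict(event)

    def query(self, predicate=lambda row: True):
        """Return the current rows matching the predicate, ordered by key."""
        return [self.rows[k] for k in sorted(self.rows) if predicate(self.rows[k])]


if __name__ == "__main__":
    view = MaterializedView(key_field="sensor_id")
    for event in SAMPLE_EVENTS:
        view.apply(event)
    # Consumers read the current state without touching the stream itself.
    print(view.query())
    print(view.query(lambda row: row["temp_f"] > 70))
```

Readers only ever see one row per sensor, which is what makes the view cheap to serve to dashboards or applications while the stream keeps flowing.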
Infer Tables from Kafka Topics with JSON or Avro
Demos
HTAP
INGEST OF ALL DATA
Diagram labels: Data Sources; Cloudera Data Flow; Cloudera Streaming Analytics; Cloudera Streams Processing; Kafka; Lake House
LLM USE CASE
Data in Motion on Cloudera Data Platform (CDP)
Capture, process & distribute any data, anywhere
Diagram labels: Vector DB; AI Model; Unstructured file types; Other enterprise data; Open Data Lakehouse; Materialized Views; Structured Sources; Applications/APIs; Streams
Live Q&A
HYBRID CLOUD
Data sources: Travel Advisories; Weather Reports; Documents; Social Media; Internal Data; Github Data; REST API
Stages: INTERACT; COLLECT; STORE; ENRICH, REPORT; Distribute; Collect; Report; Visualize; Report, Automate; AI BASED ENHANCEMENTS (Predict, Automate)
Components: VECTOR DATABASE; LLM; Machine Learning; Data Visualization; Data Flow; Data Warehouse; SQL Stream Builder; Messaging Broker
Data labels: Input Sentences; Generated Text; Timestamps; Enrichments; Real-time alerting; Aggregations
RUN AT HOME
CSP Community Edition
● Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
● Runs in Docker
● Try new features quickly
● Develop applications locally
● Docker compose file of CSP to run from the command line without any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
○ $> docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
Open Source Edition
● Apache NiFi in Docker
● Runs in Docker
● Try new features quickly
● Develop applications locally
● Docker NiFi
○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
● Licensed under the ASF License
● Unsupported
https://hub.docker.com/r/apache/nifi
Thank You
The Never Landing Stream with HTAP and Streaming

  • 1. 1 1 The Never Landing Stream with HTAP and Streaming Timothy Spann Principal Developer Advocate
  • 4. 4 4 FLaNK Stack Tim Spann @PaasDev // Blog: www.datainmotion.dev Principal Developer Advocate. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC https://medium.com/@tspann https://github.com/tspannhw Apache NiFi x Apache Kafka x Apache Flink
  • 5. 5 5 Future of Data - Princeton + Virtual @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  • 6. 6 6 CDP IS THE ONLY HYBRID DATA PLATFORM Hybrid. Open. Portable. Secure. S3 GCS OZONE ADLS OZONE S3 GCS ADLS CLOUDERA DATA PLATFORM OZONE S3 GCS ADLS OPEN DATA LAKEHOUSE
  • 8. 8 8 CLOUDERA FLOW MANAGEMENT - POWERED BY APACHE NiFi Ingest and manage data from edge-to-cloud using a no-code interface ● #1 data ingestion/movement engine ● Strong community ● Product maturity over 11 years ● Deploy on-premises or in the cloud ● Over 400+ pre-built processors ● Built-in data provenance ● Guaranteed delivery ● Throttling and Back pressure
  • 9. 9 9 Cloudera Flow Management Ingest and manage data from edge-to-cloud using a no-code interface ACQUIRE PROCESS DELIVER • Over 300 pre-built processors • Easy to build your own processors • Parse, enrich & apply schema • Filter, Split, Merge & Route • Throttle & Backpressure • Guaranteed delivery • Full data provenance • Eco-system integration Advanced tooling to industrialize flow development (Flow Development Life Cycle) FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLO G FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLO G HASH MERGE EXTRACT DUPLICATE SPLIT ROUTE TEXT ROUTE CONTENT ROUTE CONTEXT CONTROL RATE DISTRIBUTE LOAD GEOENRICH SCAN REPLACE TRANSLATE CONVERT ENCRYPT TALL EVALUATE EXECUTE
  • 10. 10 10 SQL BASED ROUTING WITH NiFi’s QueryRecord Processor ● QueryRecord Processor- Executes a SQL statement against records and writes the results to the flow file content. ● CSVReader: Looking up schema from SR, it will converts CSV Records into ProcessRecords ● SQL execution via Apache Calcite: execute configured SQL against the ProcessRecords for routing ● CSVRecordSetWriter: Converts the result of the query from Process records into CSV for the for the flow file content Do routing(routing geo and speed streams) using standard SQL as opposed to complex regular expressions.
  • 11. Key Differentiators
      Comprehensive streaming platform: Only vendor to offer an open and comprehensive streaming platform for real-time data ingestion and processing to produce prescriptive and predictive analytics
      Stream to Cloud: Extend the same on-premises streaming capabilities to the cloud with full support for multi-cloud and hybrid cloud models
      400+ pre-built processors: Only product to offer such comprehensive connectivity to a wide range of data sources from edge to cloud
      Enterprise-Grade Security & Governance: Deploy your streaming applications with confidence and trust with Cloudera SDX offering unified security and governance across the entire platform
      Democratize access to real-time data: Enable data analysts and other personas to quickly build streaming applications with just SQL
  • 12. Development & Runtime of DataFlow Functions
      Step 1. Develop functions on a local workstation or in CDP Public Cloud using the no-code UI designer
      Step 2. Run functions on serverless compute services in AWS, Azure & GCP: AWS Lambda, Azure Functions, Google Cloud Functions
  • 13. DataFlow Functions Use Cases
      Trigger Based, Batch, Scheduled and Microservice Use Cases
      • Serverless Trigger-Based File Processing Pipeline: Develop & run data processing pipelines when files are created or updated in any of the cloud object stores. Example: When a photo is uploaded to object storage, a data flow is triggered which runs image resizing code and delivers the resized image to different locations.
      • Serverless Workflows / Orchestration: Chain different low-code functions to build complex workflows. Example: Automate the handling of support tickets in a call center or orchestrate data movement across different cloud services.
      • Serverless Scheduled Tasks: Develop and run scheduled tasks without any code on pre-defined timed intervals. Example: Offload an external database running on-premises into the cloud once a day every morning at 4:00 a.m.
      • Serverless Microservices: Build and deploy serverless independent modules that power your application's microservices architecture. Example: Event-driven functions for easy communication between thousands of decoupled services that power a ride-sharing application.
      • Serverless Web APIs: Easily build endpoints for your web applications with HTTP APIs without any code using DFF and any of the cloud providers' function triggers. Example: Build high-performance, scalable web applications across multiple data centers.
      • Serverless Customized Triggers: With the DFF State feature, build flows to create customized triggers allowing access to on-premises or external services. Example: Near real-time offloading of files from a remote SFTP server.
  • 14. Flow Catalog
      • Central repository for flow definitions
      • Import existing NiFi flows
      • Manage flow definitions
      • Initiate flow deployments
  • 15. ReadyFlows
      • Cloudera provided flow definitions
      • Cover most common data flow use cases
      • Can be deployed and adjusted as needed
      • Made available through docs during Tech Preview
  • 16. Deployment Wizard
      • Turns flow definitions into flow deployments
      • Guides users through providing required configuration
      • Pick from pre-defined NiFi node sizes
      • Define KPIs for the deployment
      Steps shown: Start Deployment Wizard, Provide Parameters, Configure Sizing & Scaling, Define KPIs
  • 17. Key Performance Indicators
      • Visibility into flow deployments
      • Track high level flow performance
      • Track in-depth NiFi component metrics
      • Defined in Deployment Wizard
      • Monitoring & Alerts in Deployment Details
      Screenshots: KPI Definition in Deployment Wizard; KPI Monitoring
  • 18. Dashboard
      • Central Monitoring View
      • Monitors flow deployments across CDP environments
      • Monitors flow deployment health & performance
      • Drill into flow deployment to monitor system metrics and deployment events
  • 19. Data Flow Design for Everyone
      • Cloud-native data flow development
      • Developers get their own sandbox
      • Start developing flows without installing NiFi
      • Redesigned visual canvas
      • Optimized interaction patterns
      • Integration into CDF-PC Catalog for versioning
  • 21. NiFi Ingesting REST API
      ● NiFi consumes streams (CDC, REST, sensors)
      ● Distributes in real time to Kafka and MySQL at the same time
      ● Flink SQL consumes from Kafka
      ● TiDB CDC -> Kafka
      https://ossinsight.io/docs/api
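The "Kafka and MySQL at the same time" step on the slide above is a fan-out: every incoming record is handed to each destination. A minimal sketch of that pattern follows, assuming in-memory stand-ins for the real sinks. `KafkaLikeSink`, `TableLikeSink`, and the sample record fields are hypothetical names for illustration; no Kafka or MySQL client is used, and NiFi itself does this with connected processors rather than code.

```python
import json


class KafkaLikeSink:
    """Stand-in for a message broker: stores each record as JSON bytes."""

    def __init__(self):
        self.messages = []

    def write(self, record):
        self.messages.append(json.dumps(record, sort_keys=True).encode("utf-8"))


class TableLikeSink:
    """Stand-in for a relational table: stores each record as a row tuple."""

    def __init__(self, columns):
        self.columns = columns
        self.rows = []

    def write(self, record):
        # Missing fields become None, much like NULL columns in a table.
        self.rows.append(tuple(record.get(col) for col in self.columns))


def distribute(records, sinks):
    """Send every record to every sink and return the number of records seen."""
    count = 0
    for record in records:
        for sink in sinks:
            sink.write(record)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical records shaped like events fetched from a REST API.
    events = [
        {"id": 1, "type": "PushEvent", "repo": "example/one"},
        {"id": 2, "type": "IssuesEvent", "repo": "example/two"},
    ]
    broker = KafkaLikeSink()
    table = TableLikeSink(columns=["id", "type", "repo"])
    print(distribute(events, [broker, table]), "records sent to 2 sinks")
```

Keeping the sinks behind one `write` method is what lets the same stream feed a streaming consumer (Flink SQL reading from Kafka) and a queryable store without either one knowing about the other.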
  • 25. Why Kudu?
      A simultaneous combination of sequential and random reads and writes
      Time Series Data: Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
      Machine Data Analytics: Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
      Online Reporting: How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?
  • 26. Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP)
      https://hpi.de/fileadmin/user_upload/hpi/navigation/10_forschung/20_future_soc_lab/Poster/2019-1/Tozun_FSOC-Poster_20191_150443.pdf
      HTAP Options - Apache Kudu
  • 29. SQL STREAM BUILDER (CLOUDERA SSB)
      SQL Stream Builder allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL.
      No Java or Scala code development required.
      Simplifies access to data in Kafka & Flink.
      Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more.
      Enrich streaming data with batch data in a single tool.
      Democratize access to real-time data with just SQL.
  • 30. SSB MATERIALIZED VIEWS
      Key takeaway: materialized views (MVs) allow data scientists, analysts, and developers to consume data from the firehose.
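To make the takeaway above concrete, here is a minimal sketch of the general idea behind a materialized view over a stream: keep a continuously updated snapshot keyed by a primary key, so consumers query the snapshot instead of reading the whole firehose. This is not SSB's implementation. The class name, the `sensor_id` key, and the sample events are hypothetical.

```python
class MaterializedView:
    """Keeps the latest row per key from a stream of row dictionaries."""

    def __init__(self, key_field):
        self.key_field = key_field
        self._rows = {}

    def apply(self, row):
        """Upsert: the newest row for a key replaces the previous one."""
        self._rows[row[self.key_field]] = dict(row)

    def query(self, **filters):
        """Return copies of the current rows whose fields equal the given filters."""
        return [
            dict(row)
            for row in self._rows.values()
            if all(row.get(field) == value for field, value in filters.items())
        ]


if __name__ == "__main__":
    view = MaterializedView(key_field="sensor_id")
    # Hypothetical stream of sensor readings arriving over time.
    for event in [
        {"sensor_id": "a", "temp": 21},
        {"sensor_id": "b", "temp": 30},
        {"sensor_id": "a", "temp": 23},
    ]:
        view.apply(event)
    # The view holds one current row per sensor, not every event.
    print(view.query())
    print(view.query(sensor_id="a"))
```

The consumer only ever sees the small, current result set, which is why a view like this suits dashboards and applications that cannot read a raw topic directly.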
  • 31. Infer Tables from Kafka Topics with JSON or Avro
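As a rough illustration of what inferring a table from JSON messages involves, the sketch below samples a few messages, maps each field to a SQL type, and prints a `CREATE TABLE` statement. It is an assumption-laden sketch, not SSB's inference logic: the type mapping, the widening rules, the table name, and the sample messages are all hypothetical, and connector options (`WITH (...)`) are left out.

```python
import json

# Ordered so that bool is tested before int (bool is a subclass of int in Python).
SQL_TYPES = [(bool, "BOOLEAN"), (int, "BIGINT"), (float, "DOUBLE"), (str, "STRING")]


def infer_columns(sample_messages):
    """Infer a column-name to SQL-type mapping from sample JSON messages."""
    columns = {}
    for raw in sample_messages:
        for field, value in json.loads(raw).items():
            if value is None:
                continue  # a null carries no type information
            sql_type = next(
                (name for py_type, name in SQL_TYPES if isinstance(value, py_type)),
                "STRING",  # fallback for nested objects and arrays
            )
            previous = columns.get(field)
            if previous is None or previous == sql_type:
                columns[field] = sql_type
            elif {previous, sql_type} == {"BIGINT", "DOUBLE"}:
                columns[field] = "DOUBLE"  # widen integers to floating point
            else:
                columns[field] = "STRING"  # conflicting types: keep as text
    return columns


def to_ddl(table_name, columns):
    """Render the inferred columns as a CREATE TABLE statement."""
    body = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE `{table_name}` (\n{body}\n)"


if __name__ == "__main__":
    samples = [
        '{"id": 1, "temp": 20, "ok": true}',
        '{"id": 2, "temp": 21.5, "name": "sensor-a"}',
    ]
    print(to_ddl("sensor_readings", infer_columns(samples)))
```

Sampling more than one message matters: the `temp` field looks like an integer in the first message and only the second reveals that it needs a floating-point column.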
  • 33. HTAP INGEST OF ALL DATA
      Diagram labels: Data Sources, Cloudera Data Flow, Cloudera Streaming Analytics, Cloudera Streams Processing, Kafka, Lake House
  • 34. LLM USE CASE
      Data in Motion on Cloudera Data Platform (CDP): Capture, process & distribute any data, anywhere
      Diagram labels: Vector DB, AI Model, Unstructured file types, Other enterprise data, Open Data Lakehouse, Materialized Views, Structured Sources, Applications/APIs, Streams
  • 35. Diagram labels: Live Q&A; Travel Advisories; Weather Reports; Documents; Social Media; Internal Data; Github Data; REST API; HYBRID CLOUD; INTERACT; COLLECT; STORE; ENRICH, REPORT; Distribute; Collect; Report; Visualize; Report, Automate; AI BASED ENHANCEMENTS; Predict, Automate; VECTOR DATABASE; LLM; Machine Learning; Data Visualization; Data Flow; Data Warehouse; SQL Stream Builder; Input Sentences; Generated Text; Timestamp; Input Sentence; Timestamps; Enrichments; Messaging Broker; Real-time alerting; Aggregations
  • 37. CSP Community Edition
      ● Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
      ● Runs in Docker
      ● Try new features quickly
      ● Develop applications locally
      ● Docker compose file of CSP to run from the command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
          ○ $> docker compose up
      ● Licensed under the Cloudera Community License
      ● Unsupported
      ● Community Group Hub for CSP
      ● Find it on docs.cloudera.com under Applications
  • 38. Open Source Edition
      ● Apache NiFi in Docker
      ● Runs in Docker
      ● Try new features quickly
      ● Develop applications locally
      ● Docker NiFi
          ○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
      ● Licensed under the ASF License
      ● Unsupported
      https://hub.docker.com/r/apache/nifi