SlideShare una empresa de Scribd logo
1 de 54
Descargar para leer sin conexión
1 © Hortonworks Inc. 2011–2018. All rights reserved
Using Spark Streaming and NiFi
for the Next Generation of ETL
in the Enterprise
Andrew Psaltis
Regional CTO APAC
2 © Hortonworks Inc. 2011–2018. All rights reserved
Traditional ETL
1011010 DBseconds hoursFile
3 © Hortonworks Inc. 2011–2018. All rights reserved
Streaming ETL
1011010 DBseconds
4 © Hortonworks Inc. 2011–2018. All rights reserved
Reference Streaming Architecture
5 © Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi
6 © Hortonworks Inc. 2011–2018. All rights reserved
7 © Hortonworks Inc. 2011–2018. All rights reserved
Simplistic View of Dataflows: Easy, Definitive
Acquire
Data
Store
Data
Data
Flow
Process
Analyze
Data
8 © Hortonworks Inc. 2011–2018. All rights reserved
Standards: http://xkcd.com/927/
9 © Hortonworks Inc. 2011–2018. All rights reserved
Realistic View of Dataflows: Complex, Convoluted
Acquire
Data
Store
Data
Acquire
Data
Store
Data
Store
Data
Store
Data
Store
Data
Process
and
Analyze
Data
Data
Flow
Acquire
Data
Acquire
Data
10 © Hortonworks Inc. 2011–2018. All rights reserved
The National Security Agency Years
• Created in 2006
• Improved over eight years
• Simple initial vision – Visio for real-time dataflow management
• National Security Agency donated the codebase to the ASF in late 2014
11 © Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi
Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull
models
• Recovery/recording
a rolling log of fine-
grained history
• Visual command and
control
• Flow templates
• Pluggable/multi-role
security
• Designed for extension
• Clustering
12 © Hortonworks Inc. 2011–2018. All rights reserved
Visual Command and Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processor & connections
13 © Hortonworks Inc. 2011–2018. All rights reserved
Provenance/Lineage
14 © Hortonworks Inc. 2011–2018. All rights reserved
Prioritization
• Configure a prioritizer per
connection
• Determine what is important for
your data – time based, arrival
order, importance of a data set
• Funnel many connections down to
a single connection to prioritize
across data sets
• Develop your own prioritizer if
needed
15 © Hortonworks Inc. 2011–2018. All rights reserved
16 © Hortonworks Inc. 2011–2018. All rights reserved
Latency vs. Throughput
Choose between lower latency, or higher throughput on each processor
17 © Hortonworks Inc. 2011–2018. All rights reserved
NiFi Positioning
Apache
NiFi / MiNiFi
ETL
(Informatica, etc.)
Enterprise
Service Bus
(Fuse, Mule, etc.)
Messaging
Bus
(Kafka, MQ, etc.)
Processing
Framework
(Storm, Spark, etc.)
18 © Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi / Processing Frameworks
NiFi
Simple event processing
• Primarily feed data into processing
frameworks, can process data, with a focus on
simple event processing
• Operate on a single piece of data, or in
correlation with an enrichment dataset
(enrichment, parsing, splitting, and
transformations)
• Can scale out, but scale up better to take full
advantage of hardware resources, run
concurrent processing tasks/threads
(processing terabytes of data per day on a
single node)
Not another distributed processing framework,
but to feed data into those
Processing Frameworks (Storm, Spark, etc.)
Complex and distributed processing
• Complex processing from multiple streams (JOIN
operations)
• Analyzing data across time windows (rolling
window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed
Not designed to collect data or manage data flow
19 © Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi / Messaging Bus Services
NiFi
Provide dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance
starting when data is born
• Interactive command and control – real time
operational visibility
• Dataflow management, including prioritization,
back pressure, and edge intelligence
• Visual representation of global dataflow
Not a messaging bus, flow maintenance needed
when you have frequent consumer side updates
Messaging Bus (Kafka, JMS, etc.)
Provide messaging bus service
• Low latency
• Great data durability
• Decentralized management (producers &
consumers)
• Low broker maintenance for dynamic consumer
side updates
Not designed to solve dataflow problems
(prioritization, edge intelligence, etc.)
Traceability limited to in/out of topics, no lineage
Lack of global view of components/connectivities
20 © Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi / Integration, or Ingestion, Frameworks
NiFi
End user facing dataflow management tool
• Out of the box solution for dataflow
management
• Interactive command and control in the core,
design and deploy on the edge
• Flexible failure handling at each point of the flow
• Visual representation of global dataflow and
connectivities
• Native cross data center communication
• Data provenance for traceability
Not a library to be embedded in other
applications
Integration framework (Spring Integration,
Camel, etc), ingestion framework (Flume,
etc)
Developer facing integration tool with a
focus on data ingestion
• A set of tools to orchestrate workflow
• A fixed design and deploy pattern
• Leverage messaging bus across disconnected
networks
Developer facing, custom coding needed to
optimize
Pre-built failure handling, lack of flexibility
No holistic view of global dataflow
No built-in data traceability
21 © Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi / ETL Tools
NiFi
NOT schema dependent
• Dataflow management for both structured and
unstructured data, powered by separation of
metadata and payload
• Schema is not required, but you can have
schema
• Minimum modeling effort, just enough to
manage dataflows
• Do the plumbing job, maximize developers’
brainpower for creative work
Not designed to do heavy lifting transformation
work for DB tables (JOIN datasets, etc.). You can
create custom processors to do that, but long
way to go to catch up with existing ETL tools from
user experience perspective (GUI for data
wrangling, cleansing, etc.)
ETL (Informatica, etc.)
Schema dependent
• Tailored for Databases/WH
• ETL operations based on schema/data modeling
• Highly efficient, optimized performance
Must pre-prepare your data, time consuming to bui
data modeling, and maintain schemas
Not geared towards handling unstructured data, PD
Audio, Video, etc.
Not designed to solve dataflow problems
22 © Hortonworks Inc. 2011–2018. All rights reserved
NiFi Big Picture Pattern: Diverse Flows from One Tool
“Swiss Army Knife of
Data Movement”
23 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Kafka
24 © Hortonworks Inc. 2011–2018. All rights reserved
25 © Hortonworks Inc. 2011–2018. All rights reserved
What Is Apache Kafka?
• Distributed streaming platform that allows
publishing and subscribing to streams of
records
• Streams of records are organized into
categories called topics
• Topics can be partitioned and/or replicated
• Records consist of a key, value, and
timestamp
http://kafka.apache.org/intro
Kafka
Cluster
producer
producer
producer
consumer
consumer
consumer
APACHE KAFKA
High throughput, distributed system. Designed to operate at large scale.
26 © Hortonworks Inc. 2011–2018. All rights reserved
Why Kafka
Source
System
Source
System
Source
System
Source
System
Kafka
Hadoop Security
Systems
Real-Time
Monitoring
Data
Warehouse
Producers
Brokers
Consumers
Kafka decouples data pipelines
27 © Hortonworks Inc. 2011–2018. All rights reserved
Overview of Topics
• Topics are a partitioned ordered, immutable sequence of messages
• Messages are retained for a configurable amount of time (24 hours, 7
days, etc.)
• Each consumer retains its own offset in the partition
28 © Hortonworks Inc. 2011–2018. All rights reserved
Understanding Partitions
• Partitions are an ordered, immutable sequence of messages
• Each partition is replicated for fault tolerance
• A replicated partition has one broker that acts as the leader and the rest
as followers
29 © Hortonworks Inc. 2011–2018. All rights reserved
Consuming Messages
• Messages are consumed in Kafka by a consumer group
• Each individual consumer is labeled with a group name
• Each message in a topic is sent to one consumer in the group
30 © Hortonworks Inc. 2011–2018. All rights reserved
Spark Structured
Streaming
31 © Hortonworks Inc. 2011–2018. All rights reserved
Structured Streaming
Turns stream processing into SQL
fast, scalable, fault-tolerant
Unifies high level APIs with Spark
deal with complex data and complex workloads
32 © Hortonworks Inc. 2011–2018. All rights reserved
Thinking about time
33 © Hortonworks Inc. 2011–2018. All rights reserved
34 © Hortonworks Inc. 2011–2018. All rights reserved
Time Skew
35 © Hortonworks Inc. 2011–2018. All rights reserved
Windows
36 © Hortonworks Inc. 2011–2018. All rights reserved
Tumbling Windows
37 © Hortonworks Inc. 2011–2018. All rights reserved
Tumbling Time Windowing
38 © Hortonworks Inc. 2011–2018. All rights reserved
Tumbling Count Windowing
39 © Hortonworks Inc. 2011–2018. All rights reserved
Sliding Windows
40 © Hortonworks Inc. 2011–2018. All rights reserved
Sliding Time Window
41 © Hortonworks Inc. 2011–2018. All rights reserved
Sliding Count Window
42 © Hortonworks Inc. 2011–2018. All rights reserved
Watermarking
• The threshold of how late data is expected to be and when to drop old
state
• Trails behind max event time seen by the Spark engine
• Easiest to think of the Watermark delay as the trailing gap
43 © Hortonworks Inc. 2011–2018. All rights reserved
Watermarking
• Data that is newer than the watermark is late, but allowed to aggregate
• Data that is older than the watermark is dropped as it’s "too late"
• Any windows older than watermark are automatically deleted to limit
state
44 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
45 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
46 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
47 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
48 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
49 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
50 © Hortonworks Inc. 2011–2018. All rights reserved
The Do’s and Don’t’s
51 © Hortonworks Inc. 2011–2018. All rights reserved
The Good, the Bad, the Ugly
• Use NiFi for all ingestion and data preparation
• Be very cautious of forcing order in Kafka
• Chose your data store to match your query pattern
52 © Hortonworks Inc. 2011–2018. All rights reserved
The Good, the Bad, the Ugly
• Monitor, monitor, monitor
• Think about using schemas
• Plan for spike in traffic – this impacts Kafka portioning and consumers
53 © Hortonworks Inc. 2011–2018. All rights reserved
Questions?
54 © Hortonworks Inc. 2011–2018. All rights reserved
Thank you

Más contenido relacionado

La actualidad más candente

End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
DataWorks Summit
 
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFiIntelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
DataWorks Summit
 

La actualidad más candente (20)

Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
 
Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016
 
NiFi Best Practices for the Enterprise
NiFi Best Practices for the EnterpriseNiFi Best Practices for the Enterprise
NiFi Best Practices for the Enterprise
 
Nifi
NifiNifi
Nifi
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFi
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Introduction to data flow management using apache nifi
Introduction to data flow management using apache nifiIntroduction to data flow management using apache nifi
Introduction to data flow management using apache nifi
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
 
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFiIntelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
T24-TAFJ-Consultant_Sivashankar.R
T24-TAFJ-Consultant_Sivashankar.RT24-TAFJ-Consultant_Sivashankar.R
T24-TAFJ-Consultant_Sivashankar.R
 
CoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFiCoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFi
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Edge architecture ieee international conference on cloud engineering
Edge architecture   ieee international conference on cloud engineeringEdge architecture   ieee international conference on cloud engineering
Edge architecture ieee international conference on cloud engineering
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 

Similar a Using Spark Streaming and NiFi for the next generation of ETL in the enterprise

Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
DataWorks Summit
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
DataWorks Summit
 

Similar a Using Spark Streaming and NiFi for the next generation of ETL in the enterprise (20)

Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
 
Apache NiFi - Flow Based Programming Meetup
Apache NiFi - Flow Based Programming MeetupApache NiFi - Flow Based Programming Meetup
Apache NiFi - Flow Based Programming Meetup
 
State of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & CommunityState of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & Community
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
 
HDF 3.1 : An Introduction to New Features
HDF 3.1 : An Introduction to New FeaturesHDF 3.1 : An Introduction to New Features
HDF 3.1 : An Introduction to New Features
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
 
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFiTaking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
 
Beyond Messaging Enterprise Dataflow powered by Apache NiFi
Beyond Messaging Enterprise Dataflow powered by Apache NiFiBeyond Messaging Enterprise Dataflow powered by Apache NiFi
Beyond Messaging Enterprise Dataflow powered by Apache NiFi
 
BigData Techcon - Beyond Messaging with Apache NiFi
BigData Techcon - Beyond Messaging with Apache NiFiBigData Techcon - Beyond Messaging with Apache NiFi
BigData Techcon - Beyond Messaging with Apache NiFi
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
NJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep DiveNJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep Dive
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Enterprise IIoT Edge Processing with Apache NiFi
Enterprise IIoT Edge Processing with Apache NiFiEnterprise IIoT Edge Processing with Apache NiFi
Enterprise IIoT Edge Processing with Apache NiFi
 
Containers and Big Data
Containers and Big DataContainers and Big Data
Containers and Big Data
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 

Más de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise Andrew Psaltis Regional CTO APAC
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Traditional ETL 1011010 DBseconds hoursFile
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Streaming ETL 1011010 DBseconds
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved Reference Streaming Architecture
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Apache NiFi
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved Simplistic View of Dataflows: Easy, Definitive Acquire Data Store Data Data Flow Process Analyze Data
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved Standards: http://xkcd.com/927/
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved Realistic View of Dataflows: Complex, Convoluted Acquire Data Store Data Acquire Data Store Data Store Data Store Data Store Data Process and Analyze Data Data Flow Acquire Data Acquire Data
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved The National Security Agency Years • Created in 2006 • Improved over eight years • Simple initial vision – Visio for real-time dataflow management • National Security Agency donated the codebase to the ASF in late 2014
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved Apache NiFi Key Features • Guaranteed delivery • Data buffering - Backpressure - Pressure release • Prioritized queuing • Flow specific QoS - Latency vs. throughput - Loss tolerance • Data provenance • Supports push and pull models • Recovery/recording a rolling log of fine- grained history • Visual command and control • Flow templates • Pluggable/multi-role security • Designed for extension • Clustering
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved Visual Command and Control • Drag and drop processors to build a flow • Start, stop, and configure components in real time • View errors and corresponding error messages • View statistics and health of data flow • Create templates of common processor & connections
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Provenance/Lineage
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Prioritization • Configure a prioritizer per connection • Determine what is important for your data – time based, arrival order, importance of a data set • Funnel many connections down to a single connection to prioritize across data sets • Develop your own prioritizer if needed
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved Latency vs. Throughput Choose between lower latency, or higher throughput on each processor
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved NiFi Positioning Apache NiFi / MiNiFi ETL (Informatica, etc.) Enterprise Service Bus (Fuse, Mule, etc.) Messaging Bus (Kafka, MQ, etc.) Processing Framework (Storm, Spark, etc.)
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved Apache NiFi / Processing Frameworks NiFi Simple event processing • Primarily feed data into processing frameworks, can process data, with a focus on simple event processing • Operate on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations) • Can scale out, but scale up better to take full advantage of hardware resources, run concurrent processing tasks/threads (processing terabytes of data per day on a single node) Not another distributed processing framework, but to feed data into those Processing Frameworks (Storm, Spark, etc.) Complex and distributed processing • Complex processing from multiple streams (JOIN operations) • Analyzing data across time windows (rolling window aggregation, standard deviation, etc.) • Scale out to thousands of nodes if needed Not designed to collect data or manage data flow
  • 19. 19 © Hortonworks Inc. 2011–2018. All rights reserved Apache NiFi / Messaging Bus Services NiFi Provide dataflow solution • Centralized management, from edge to core • Great traceability, event level data provenance starting when data is born • Interactive command and control – real time operational visibility • Dataflow management, including prioritization, back pressure, and edge intelligence • Visual representation of global dataflow Not a messaging bus, flow maintenance needed when you have frequent consumer side updates Messaging Bus (Kafka, JMS, etc.) Provide messaging bus service • Low latency • Great data durability • Decentralized management (producers & consumers) • Low broker maintenance for dynamic consumer side updates Not designed to solve dataflow problems (prioritization, edge intelligence, etc.) Traceability limited to in/out of topics, no lineage Lack of global view of components/connectivities
  • 20. 20 © Hortonworks Inc. 2011–2018. All rights reserved Apache NiFi / Integration, or Ingestion, Frameworks NiFi End user facing dataflow management tool • Out of the box solution for dataflow management • Interactive command and control in the core, design and deploy on the edge • Flexible failure handling at each point of the flow • Visual representation of global dataflow and connectivities • Native cross data center communication • Data provenance for traceability Not a library to be embedded in other applications Integration framework (Spring Integration, Camel, etc), ingestion framework (Flume, etc) Developer facing integration tool with a focus on data ingestion • A set of tools to orchestrate workflow • A fixed design and deploy pattern • Leverage messaging bus across disconnected networks Developer facing, custom coding needed to optimize Pre-built failure handling, lack of flexibility No holistic view of global dataflow No built-in data traceability
  • 21. 21 © Hortonworks Inc. 2011–2018. All rights reserved Apache NiFi / ETL Tools NiFi NOT schema dependent • Dataflow management for both structured and unstructured data, powered by separation of metadata and payload • Schema is not required, but you can have schema • Minimum modeling effort, just enough to manage dataflows • Do the plumbing job, maximize developers’ brainpower for creative work Not designed to do heavy lifting transformation work for DB tables (JOIN datasets, etc.). You can create custom processors to do that, but long way to go to catch up with existing ETL tools from user experience perspective (GUI for data wrangling, cleansing, etc.) ETL (Informatica, etc.) Schema dependent • Tailored for Databases/WH • ETL operations based on schema/data modeling • Highly efficient, optimized performance Must pre-prepare your data, time consuming to bui data modeling, and maintain schemas Not geared towards handling unstructured data, PD Audio, Video, etc. Not designed to solve dataflow problems
  • 22. 22 © Hortonworks Inc. 2011–2018. All rights reserved NiFi Big Picture Pattern: Diverse Flows from One Tool “Swiss Army Knife of Data Movement”
  • 23. 23 © Hortonworks Inc. 2011–2018. All rights reserved Apache Kafka
  • 24. 24 © Hortonworks Inc. 2011–2018. All rights reserved
  • 25. 25 © Hortonworks Inc. 2011–2018. All rights reserved What Is Apache Kafka? • Distributed streaming platform that allows publishing and subscribing to streams of records • Streams of records are organized into categories called topics • Topics can be partitioned and/or replicated • Records consist of a key, value, and timestamp http://kafka.apache.org/intro Kafka Cluster producer producer producer consumer consumer consumer APACHE KAFKA High throughput, distributed system. Designed to operate at large scale.
  • 26. 26 © Hortonworks Inc. 2011–2018. All rights reserved Why Kafka Source System Source System Source System Source System Kafka Hadoop Security Systems Real-Time Monitoring Data Warehouse Producers Brokers Consumers Kafka decouples data pipelines
  • 27. 27 © Hortonworks Inc. 2011–2018. All rights reserved Overview of Topics • Topics are a partitioned ordered, immutable sequence of messages • Messages are retained for a configurable amount of time (24 hours, 7 days, etc.) • Each consumer retains its own offset in the partition
  • 28. 28 © Hortonworks Inc. 2011–2018. All rights reserved Understanding Partitions • Partitions are an ordered, immutable sequence of messages • Each partition is replicated for fault tolerance • A replicated partition has one broker that acts as the leader and the rest as followers
  • 29. 29 © Hortonworks Inc. 2011–2018. All rights reserved Consuming Messages • Messages are consumed in Kafka by a consumer group • Each individual consumer is labeled with a group name • Each message in a topic is sent to one consumer in the group
  • 30. 30 © Hortonworks Inc. 2011–2018. All rights reserved Spark Structured Streaming
  • 31. 31 © Hortonworks Inc. 2011–2018. All rights reserved Structured Streaming Turns stream processing into SQL fast, scalable, fault-tolerant Unifies high level APIs with Spark deal with complex data and complex workloads
  • 32. 32 © Hortonworks Inc. 2011–2018. All rights reserved Thinking about time
  • 33. 33 © Hortonworks Inc. 2011–2018. All rights reserved
  • 34. 34 © Hortonworks Inc. 2011–2018. All rights reserved Time Skew
  • 35. 35 © Hortonworks Inc. 2011–2018. All rights reserved Windows
  • 36. 36 © Hortonworks Inc. 2011–2018. All rights reserved Tumbling Windows
  • 37. 37 © Hortonworks Inc. 2011–2018. All rights reserved Tumbling Time Windowing
  • 38. 38 © Hortonworks Inc. 2011–2018. All rights reserved Tumbling Count Windowing
  • 39. 39 © Hortonworks Inc. 2011–2018. All rights reserved Sliding Windows
  • 40. 40 © Hortonworks Inc. 2011–2018. All rights reserved Sliding Time Window
  • 41. 41 © Hortonworks Inc. 2011–2018. All rights reserved Sliding Count Window
  • 42. 42 © Hortonworks Inc. 2011–2018. All rights reserved Watermarking • The threshold of how late data is expected to be and when to drop old state • Trails behind max event time seen by the Spark engine • Easiest to think of the Watermark delay as the trailing gap
  • 43. 43 © Hortonworks Inc. 2011–2018. All rights reserved Watermarking • Data that is newer than the watermark is late, but allowed to aggregate • Data that is older than the watermark is dropped as it’s "too late" • Any windows older than watermark are automatically deleted to limit state
  • 44. 44 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 45. 45 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 46. 46 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 47. 47 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 48. 48 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 49. 49 © Hortonworks Inc. 2011–2018. All rights reserved Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 50. 50 © Hortonworks Inc. 2011–2018. All rights reserved The Do’s and Don’t’s
  • 51. 51 © Hortonworks Inc. 2011–2018. All rights reserved The Good, the Bad, the Ugly • Use NiFi for all ingestion and data preparation • Be very cautious of forcing order in Kafka • Chose your data store to match your query pattern
  • 52. 52 © Hortonworks Inc. 2011–2018. All rights reserved The Good, the Bad, the Ugly • Monitor, monitor, monitor • Think about using schemas • Plan for spike in traffic – this impacts Kafka portioning and consumers
  • 53. 53 © Hortonworks Inc. 2011–2018. All rights reserved Questions?
  • 54. 54 © Hortonworks Inc. 2011–2018. All rights reserved Thank you