Stream, Stream, Stream:
Different Streaming methods with Spark and Kafka
Itai Yaffe
Nielsen
Introduction
Itai Yaffe
● Tech Lead, Big Data group
● Dealing with Big Data challenges since 2012
Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Attended our session yesterday about counting unique users with Druid?
● Working with Spark/Kafka? Planning to?
Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
● Data flow - past and present
● Spark Streaming
○ “Stateless” and “stateful” use-cases
● Spark Structured Streaming
● “Streaming” over our Data Lake
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
Nielsen Marketing Cloud - questions we try to answer
1. How many unique users of a certain profile can we reach?
E.g. a campaign for young women who love tech
2. How many impressions did a campaign receive?
Nielsen Marketing Cloud - high-level architecture
Data flow in the old days...
In-DB aggregation -> OLAP
Data flow in the old days… What’s wrong with that?
● CSV-related issues, e.g.:
○ Truncated lines in input files
○ Can’t enforce schema
● Scale-related issues, e.g.:
○ Had to “manually” scale the processes
That's one small step for [a] man… (2014)
“Apache Spark is the Taylor Swift of big data software” (Derrick Harris, Fortune.com, 2015)
In-DB aggregation -> OLAP
Why just a small step?
● Solved the scaling issues
● Still faced the CSV-related issues
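Spark's CSV reader can at least surface such problems explicitly instead of silently ingesting bad rows. A minimal sketch, assuming a hypothetical event schema and input path (the real field names are not in the slides):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object CsvWithSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CsvWithSchema").getOrCreate()

    // Hypothetical event schema - illustrative only
    val schema = StructType(Seq(
      StructField("eventDate", StringType, nullable = false),
      StructField("eventTime", StringType, nullable = false),
      StructField("userId",    LongType,   nullable = false),
      StructField("segments",  StringType, nullable = true)
    ))

    val events = spark.read
      .schema(schema)             // enforce a schema instead of inferring one
      .option("mode", "FAILFAST") // fail loudly on malformed/truncated lines
      .option("quote", "\"")      // handle quoted fields like "1619691,9995"
      .csv("s3://bucket/events/") // hypothetical input path

    events.show(10, truncate = false)
    spark.stop()
  }
}
```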
Data flow - the modern way (Spark + Kafka)
Photography Copyright: NBC
Read messages -> In-DB aggregation -> OLAP
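For illustration, a minimal Scala sketch of the "read messages" step as a stateless Spark Streaming app on the Kafka 0.10 direct-stream integration; the broker address, group id, topic and output path are assumptions, and the real app aggregated into the DB rather than dumping raw values:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StatelessKafkaStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatelessKafkaStream")
    val ssc  = new StreamingContext(conf, Seconds(60)) // micro-batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092", // hypothetical broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "nmc-etl",      // hypothetical group id
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Stateless: each micro-batch is transformed and written out independently
    stream.foreachRDD { rdd =>
      rdd.map(_.value).saveAsTextFile(s"s3://bucket/out/${System.currentTimeMillis}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```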
The need for stateful streaming
Fast forward a few months...
● New requirements were being raised
● Specific use-case:
○ To take the load off the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark Streaming app
Stateful streaming via “local” aggregations
1. Read messages
2. Aggregate current micro-batch
3. Write combined aggregated data
4. Read aggregated data from HDFS every X micro-batches
5. Upsert aggregated data to the OLAP DB (every X micro-batches)
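A minimal sketch of this "local" aggregation pattern, assuming simple key/count aggregates; the paths, the DB connection and the upsert helper are hypothetical, and the atomic swap of the combined output is deliberately left as a comment:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.streaming.dstream.InputDStream

object LocalAggregations {
  private var microBatchCount = 0
  private val upsertEveryXBatches = 10                       // hypothetical X
  private val combinedPath = "hdfs:///aggregations/combined" // hypothetical path

  // Hypothetical JDBC writer standing in for a real upsert into the OLAP DB
  private def upsertToDb(df: DataFrame): Unit =
    df.write
      .format("jdbc")
      .option("url", "jdbc:mysql://db:3306/olap") // hypothetical connection
      .option("dbtable", "aggregations")
      .mode("append") // a true upsert needs a custom writer
      .save()

  def attach(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
    stream.foreachRDD { rdd =>
      val spark = SparkSession.builder.getOrCreate()
      import spark.implicits._

      // 2. Aggregate the current micro-batch
      val current = rdd.map(_.value).toDF("key")
        .groupBy("key").agg(count("*").as("cnt"))

      // 3+4. Combine with the aggregated data accumulated so far on HDFS
      val previous = spark.read.parquet(combinedPath) // assumes it was initialized
      val combined = current.union(previous)
        .groupBy("key").agg(sum("cnt").as("cnt"))
      combined.write.mode("overwrite").parquet(combinedPath + ".next")
      // A real app must swap ".next" into place atomically - one of the ways
      // this pattern turns out to be error-prone, as the next slide points out

      // 5. Upsert the combined data every X micro-batches
      microBatchCount += 1
      if (microBatchCount % upsertEveryXBatches == 0)
        upsertToDb(spark.read.parquet(combinedPath + ".next"))
    }
}
```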
Stateful streaming via “local” aggregations
● Required us to manage the state on our own
● Error-prone
○ E.g. what if my cluster is terminated and data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag
Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
● Enables running continuous, incremental processes
○ Basically manages the state for you
● Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
● Many other features
● Was in ALPHA mode in 2.0 and 2.1
Structured Streaming - stateful app use-case
1. Read messages
2. Aggregate current window
3. Checkpoint (offsets and state) handled internally by Spark
4. Upsert aggregated data to the OLAP DB (on window end)
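A minimal sketch of such an app: a windowed aggregation whose offsets and state Spark checkpoints on its own. The broker, topic, checkpoint location and the console stand-in for the real upsert sink are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object StatefulStructuredStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StatefulStructuredStreaming").getOrCreate()
    import spark.implicits._

    // 1. Read messages from Kafka
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // 2. Aggregate the current window - Spark manages the state internally
    val counts = messages
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "1 hour"), $"value")
      .count()

    // 3. Offsets and state are checkpointed by Spark itself
    // 4. The console sink stands in for the real upsert into the OLAP DB
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "s3://bucket/checkpoints/app") // hypothetical
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```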
Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1):
○ https://issues.apache.org/jira/browse/SPARK-19517
○ https://issues.apache.org/jira/browse/SPARK-19677
○ https://issues.apache.org/jira/browse/SPARK-19407
● Checkpointing to S3 wasn’t straightforward
○ Tried using EMRFS consistent view
■ Worked for stateless apps
■ Encountered sporadic issues for stateful apps
Structured Streaming - strengths and weaknesses (IMO)
● Strengths include:
○ Running incremental, continuous processing
○ Increased performance (e.g. via the Catalyst SQL optimizer)
○ Massive efforts are invested in it
● Weaknesses were mostly related to maturity
Back to the future - Spark Streaming revived for “stateful” app use-case
1. Read messages
2. Aggregate current micro-batch
3. Write files
4. Load data into the OLAP DB
Cool, so… Why can’t we stop here?
● Significantly underutilized cluster resources = wasted $$$
Cool, so… Why can’t we stop here? (cont.)
● Extreme load of Kafka brokers’ disks
○ Each micro-batch needs to read ~300M messages, Kafka can’t store it all in memory
● ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration
○ Forced us to use 1 core per executor to avoid it (see the configuration sketch after this list)
○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it even less frequently
○ Remember - longer micro-batches result in a better aggregation ratio
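The single-core workaround is just a configuration setting. For illustration (the app name is hypothetical, and disabling the Kafka consumer cache is another known mitigation, not necessarily the one we used):

```scala
import org.apache.spark.SparkConf

// Limit each executor to a single core so the (non-thread-safe) Kafka 0.10
// consumer is never shared between concurrently running tasks
val conf = new SparkConf()
  .setAppName("StreamingApp") // hypothetical app name
  .set("spark.executor.cores", "1")
  // Disabling the consumer cache is another known mitigation
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
```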
Enter “streaming” over RDR
RDR (or Raw Data Repository) is our Data Lake
● Kafka topic messages are stored on S3 in Parquet format
● RDR Loaders - stateless Spark Streaming applications
● Applications can read data from RDR for various use-cases
○ E.g. analyzing data of the last 30 days
Can we leverage our Data Lake and use it as the data source (instead of Kafka)?
How do we “stream” RDR files - producer side
RDR Loaders:
1. Read messages
2. Write files to S3 (RDR)
3. Write files’ paths to a dedicated topic (with the files’ paths as the messages)
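The producer side reduces to a plain Kafka producer publishing each newly written file's S3 path; the topic name and broker address below are hypothetical:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object PathPublisher {
  // Hypothetical topic carrying the S3 paths of newly written Parquet files
  private val pathsTopic = "rdr-file-paths"

  private lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // hypothetical broker
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    new KafkaProducer[String, String](props)
  }

  // After the loader writes a batch of Parquet files to S3 (step 2),
  // publish each file's path as a message (step 3)
  def publish(paths: Seq[String]): Unit = {
    paths.foreach(p => producer.send(new ProducerRecord(pathsTopic, p)))
    producer.flush()
  }
}
```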
How do we “stream” RDR files - consumer side
1. Read files’ paths
2. Read RDR files from S3
3. Process files
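On the consumer side, each message is simply a Parquet path to read, as in this sketch (the grouping column and output path are hypothetical, and the paths stream is created as in the earlier direct-stream sketch):

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.InputDStream

object RdrConsumer {
  def attach(pathsStream: InputDStream[ConsumerRecord[String, String]]): Unit =
    pathsStream.foreachRDD { rdd =>
      // 1. Each message of this micro-batch is the S3 path of an RDR Parquet file
      val paths = rdd.map(_.value).collect()
      if (paths.nonEmpty) {
        val spark = SparkSession.builder.getOrCreate()
        // 2. Read the RDR files themselves...
        val data = spark.read.parquet(paths: _*)
        // 3. ...and process them (hypothetical aggregation shown)
        data.groupBy("someColumn").count()
          .write.mode("append").parquet("s3://bucket/output/")
      }
    }
}
```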
How do we use the new RDR “streaming” infrastructure?
1. Read files’ paths
2. Read RDR files
3. Write files
4. Load data into the OLAP DB
Did we solve the aforementioned problems?
● EMR clusters are now transient - no more idle clusters
Application type          | Day 1     | Day 2     | Day 3
Old Spark Streaming app   | $1,007.68 | $1,007.68 | $1,007.68
“Streaming” over RDR app  | $150.08   | $198.73   | $174.68
Did we solve the aforementioned problems? (cont.)
● No more extreme load of Kafka brokers’ disks
○ We still read old messages from Kafka, but now we only read about 1K messages per hour (rather than ~300M)
● The new infra doesn’t depend on the integration of Spark Streaming with Kafka
○ No more weird exceptions...
● We can run the Spark batch applications as (in)frequently as we’d like
Summary
● Initially replaced standalone Java with Spark & Scala
○ Still faced CSV-related issues
● Introduced Spark Streaming & Kafka for “stateless” use-cases
○ Quickly needed to handle stateful use-cases as well
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Required us to manage the state on our own
● Moved to Structured Streaming (for all use-cases)
○ Cons were mostly related to maturity
Summary (cont.)
● Went back to Spark Streaming (with Druid as OLAP)
○ Performance penalty in Kafka for long micro-batches
○ Under-utilized Spark clusters
○ Etc.
● Introduced “streaming” over our Data Lake
○ Eliminated Kafka performance penalty
○ Spark clusters are much better utilized = $$$ saved
○ And more...
Want to know more?
● Women in Big Data
○ A worldwide program that aims:
■ To inspire, connect, grow, and champion the success of women in Big Data
■ To grow women’s representation in the Big Data field to over 25% by 2020
○ Visit the website (https://www.womeninbigdata.org/)
● Counting Unique Users in Real-Time: Here’s a Challenge for You!
○ Presented yesterday, http://tinyurl.com/yxjc72af
● NMC Tech Blog - https://medium.com/nmc-techblog
QUESTIONS
https://www.linkedin.com/in/itaiy/
THANK YOU
Structured Streaming - additional slides
Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Data stream as an unbounded table:
new data in the data stream = new rows appended to an unbounded table
Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Structured Streaming - WordCount example
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
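For reference, a condensed version of that WordCount example, following the official programming guide: lines arrive on a local socket and the engine maintains a running word count:

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    // Lines arriving on a socket form the unbounded input table
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Running word count - the aggregation state is managed by Spark
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Complete mode re-emits the whole result table on every trigger
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```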
Structured Streaming - basic terms
● Input sources:
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes:
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes (see the sketch after this list)
■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all unaggregated data in the Result Table
● Output sinks:
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes
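To illustrate pairing an output mode with a sink, a sketch that streams a hypothetical aggregation (columns `key` and `cnt`) to a Kafka sink in Update mode; broker, topic and checkpoint path are assumptions:

```scala
import org.apache.spark.sql.DataFrame

// `counts` is assumed to be a streaming aggregation with columns `key` and `cnt`
def writeCountsToKafka(counts: DataFrame): Unit =
  counts
    // The Kafka sink expects string/binary `key` and `value` columns
    .selectExpr("CAST(key AS STRING) AS key", "CAST(cnt AS STRING) AS value")
    .writeStream
    .outputMode("update")          // emit only rows that changed this trigger
    .format("kafka")               // Kafka sink (Spark 2.2.0+)
    .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical
    .option("topic", "counts-out")                     // hypothetical
    .option("checkpointLocation", "s3://bucket/cp")    // hypothetical
    .start()
```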
Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means:
○ Trackable sources (i.e. offsets)
○ Checkpointing
○ Idempotent sinks
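Put together, the three ingredients look roughly like this in newer Spark versions (foreachBatch was added in 2.4); the brokers, topic, paths and the per-batch overwrite scheme are illustrative assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExactlyOnceSketch {
  // Hypothetical idempotent writer: a deterministic, batch-keyed location
  // means a batch replayed after a failure overwrites rather than duplicates
  def upsertByKey(batch: DataFrame, batchId: Long): Unit =
    batch.write
      .mode("overwrite")
      .parquet(s"s3://bucket/out/batch=$batchId") // hypothetical layout

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ExactlyOnceSketch").getOrCreate()

    val query = spark.readStream
      .format("kafka")                                   // trackable source (offsets)
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical
      .option("subscribe", "events")
      .load()
      .writeStream
      .foreachBatch(upsertByKey _)                              // idempotent sink
      .option("checkpointLocation", "s3://bucket/checkpoints")  // checkpointing
      .start()

    query.awaitTermination()
  }
}
```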
Monitoring
Structured Streaming in production
So we started moving to Structured Streaming
Use case: Existing Spark app
Previous architecture: Periodic Spark batch job
Old flow: Read Parquet from S3 -> Transform -> Write Parquet to S3
New architecture: Stateless Structured Streaming
New flow: Read from Kafka -> Transform -> Write Parquet to S3

Use case: Existing Java app
Previous architecture: Periodic standalone Java process (“manual” scaling)
Old flow: Read CSV -> Transform and aggregate -> Write to RDBMS
New architecture: Stateful Structured Streaming
New flow: Read from Kafka -> Transform and aggregate -> Write to RDBMS

Use case: New app
Previous architecture: N/A
Old flow: N/A
New architecture: Stateful Structured Streaming
New flow: Read from Kafka -> Transform and aggregate -> Write to RDBMS
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka

  • 1. Stream, Stream, Stream: Different Streaming methods with Spark and Kafka Itai Yaffe Nielsen
  • 2. Introduction Itai Yaffe ● Tech Lead, Big Data group ● Dealing with Big Data challenges since 2012
  • 3. Introduction - part 2 (or: “your turn…”) ● Data engineers? Data architects? Something else? ● Attended our session yesterday about counting unique users with Druid? ● Working with Spark/Kafka? Planning to?
  • 4. Agenda ● Nielsen Marketing Cloud (NMC) ○ About ○ High-level architecture ● Data flow - past and present ● Spark Streaming ○ “Stateless” and “stateful” use-cases ● Spark Structured Streaming ● “Streaming” over our Data Lake
  • 5. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen on March 2015 ● A Data company ● Machine learning models for insights ● Targeting ● Business decisions
  • 6. Nielsen Marketing Cloud - questions we try to answer 1. How many unique users of a certain profile can we reach? E.g campaign for young women who love tech 2. How many impressions a campaign received?
  • 7. Nielsen Marketing Cloud - high-level architecture
  • 8. Data flow in the old days... In-DB aggregation OLAP
  • 9. Data flow in the old days… What’s wrong with that? ● CSV-related issues, e.g: ○ Truncated lines in input files ○ Can’t enforce schema ● Scale-related issues, e.g: ○ Had to “manually” scale the processes
  • 10. That's one small step for [a] man… (2014) “Apache Spark is the Taylor Swift of big data software" (Derrick Harris, Fortune.com, 2015) In-DB aggregation OLAP
  • 11. Why just a small step? ● Solved the scaling issues ● Still faced the CSV-related issues
  • 12. Data flow - the modern way + Photography Copyright: NBC
  • 14. The need for stateful streaming Fast forward a few months... ●New requirements were being raised ●Specific use-case : ○ To take the load off of the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark Streaming app
  • 15. Stateful streaming via “local” aggregations 1. Read Messages 5. Upsert aggregated data (every X micro-batches) 2. Aggregate current micro-batch 3. Write combined aggregated data 4. Read aggregated data From HDFS every X micro-batches OLAP
  • 16. Stateful streaming via “local” aggregations ● Required us to manage the state on our own ● Error-prone ○ E.g what if my cluster is terminated and data on HDFS is lost? ● Complicates the code ○ Mixed input sources for the same app (Kafka + files) ● Possible performance impact ○ Might cause the Kafka consumer to lag
  • 17. Structured Streaming - to the rescue? Spark 2.0 introduced Structured Streaming ●Enables running continuous, incremental processes ○ Basically manages the state for you ●Built on Spark SQL ○ DataFrame/Dataset API ○ Catalyst Optimizer ●Many other features ●Was in ALPHA mode in 2.0 and 2.1 Structured Streaming
  • 18. Structured Streaming - stateful app use-case 2. Aggregate current window 3. Checkpoint (offsets and state) handled internally by Spark 1. Read Messages 4. Upsert aggregated data (on window end) Structured streaming OLAP
  • 19. Structured Streaming - known issues & tips ● 3 major issues we had in 2.1.0 (solved in 2.1.1) : ○ https://issues.apache.org/jira/browse/SPARK-19517 ○ https://issues.apache.org/jira/browse/SPARK-19677 ○ https://issues.apache.org/jira/browse/SPARK-19407 ● Checkpointing to S3 wasn’t straight-forward ○ Tried using EMRFS consistent view ■ Worked for stateless apps ■ Encountered sporadic issues for stateful apps
  • 20. Structured Streaming - strengths and weaknesses (IMO) ● Strengths include : ○ Running incremental, continuous processing ○ Increased performance (e.g via Catalyst SQL optimizer) ○ Massive efforts are invested in it ● Weaknesses were mostly related to maturity
  • 21. Back to the future - Spark Streaming revived for “stateful” app use-case 1. Read Messages 3. WriteFiles 2. Aggregate Current micro-batch 4. Load Data OLAP
  • 22. Cool, so… Why can’t we stop here? ● Significantly underutilized cluster resources = wasted $$$
  • 23. Cool, so… Why can’t we stop here? (cont.) ● Extreme load of Kafka brokers’ disks ○ Each micro-batch needs to read ~300M messages, Kafka can’t store it all in memory ● ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration ○ Forced us to use 1 core per executor to avoid it ○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well) ● We wish we could run it even less frequently ○ Remember - longer micro-batches result in a better aggregation ratio
  • 24. Enter “streaming” over RDR RDR (or Raw Data Repository) is our Data Lake ●Kafka topic messages are stored on S3 in Parquet format ●RDR Loaders - stateless Spark Streaming applications ●Applications can read data from RDR for various use-cases ○ E.g analyzing data of the last 30 days Can we leverage our Data Lake and use it as the data source (instead of Kafka)?
  • 25. How do we “stream” RDR files - producer side S3 RDRRDR Loaders 2. Write files 1. Read Messages 3. Write files’ paths Topic with the files’ paths as messages
  • 26. How do we “stream” RDR files - consumer side S3 RDR 3. Process files 1. Read files’ paths 2. Read RDR files
  • 27. How do we use the new RDR “streaming” infrastructure? 1. Read files’ paths 3. Write files 2. Read RDR files OLAP 4. Load Data
  • 28. Did we solve the aforementioned problems? ● EMR clusters are now transient - no more idle clusters Application type Day 1 Day 2 Day 3 Old Spark Streaming app 1007.68$ 1007.68$ 1007.68$ “Streaming” over RDR app 150.08$ 198.73$ 174.68$
  • 29. Did we solve the aforementioned problems? (cont.) ● No more extreme load of Kafka brokers’ disks ○ We still read old messages from Kafka, but now we only read about 1K messages per hour (rather than ~300M) ● The new infra doesn’t depend on the integration of Spark Streaming with Kafka ○ No more weird exceptions... ● We can run the Spark batch applications as (in)frequent as we’d like
  • 30. Summary ● Initially replaced standalone Java with Spark & Scala ○ Still faced CSV-related issues ● Introduced Spark Streaming & Kafka for “stateless” use-cases ○ Quickly needed to handle stateful use-cases as well ● Tried Spark Streaming for stateleful use-cases (via “local” aggregations) ○ Required us to manage the state on our own ● Moved to Structured Streaming (for all use-cases) ○ Cons were mostly related to maturity
  • 31. Summary (cont.) ● Went back to Spark Streaming (with Druid as OLAP) ○ Performance penalty in Kafka for long micro-batches ○ Under-utilized Spark clusters ○ Etc. ● Introduced “streaming” over our Data Lake ○ Eliminated Kafka performance penalty ○ Spark clusters are much better utilized = $$$ saved ○ And more...
  • 32. DRUID ES Want to know more? ● Women in Big Data ○ A world-wide program that aims : ■ To inspire, connect, grow, and champion success of women in Big Data. ■ To grow women representation in Big Data field > 25% by 2020 ○ Visit the website (https://www.womeninbigdata.org/) ● Counting Unique Users in Real-Time: Here’s a Challenge for You! ○ Presented yesterday, http://tinyurl.com/yxjc72af ● NMC Tech Blog - https://medium.com/nmc-techblog
  • 36. Structured Streaming - basic concepts https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts Data stream Unbounded Table New data in the data streamer = New rows appended to a unbounded table Data stream as an unbonded table
  • 37. Structured Streaming - basic concepts https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
  • 38. Structured Streaming - WordCount example https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
  • 39. Structured Streaming - basic terms ● Input sources: ○ File ○ Kafka ○ Socket, Rate (for testing) ● Output modes: ○ Append (default) ○ Complete ○ Update (added in Spark 2.1.1) ○ Different types of queries support different output modes ■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all unaggregated data in the Result Table ● Output sinks: ○ File ○ Kafka (added in Spark 2.2.0) ○ Foreach ○ Console, Memory (for debugging) ○ Different types of sinks support different output modes
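Putting a few of these terms together, a minimal sketch wiring a Kafka source to a file sink in Append mode (broker, topic and bucket names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

// Kafka input source
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

// File (Parquet) output sink, Append mode, with a checkpoint location
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "s3://bucket/output/")
  .option("checkpointLocation", "s3://bucket/checkpoints/")
  .outputMode("append")
  .start()
  .awaitTermination()
```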
  • 40. Fault tolerance ● The goal - end-to-end exactly-once semantics ● The means : ○ Trackable sources (i.e offsets) ○ Checkpointing ○ Idempotent sinks
  • 42. Structured Streaming in production So we started moving to Structured Streaming:
● Existing Spark app - was: periodic Spark batch job (read Parquet from S3 -> transform -> write Parquet to S3); now: stateless Structured Streaming (read from Kafka -> transform -> write Parquet to S3)
● Existing Java app - was: periodic standalone Java process with “manual” scaling (read CSV -> transform and aggregate -> write to RDBMS); now: stateful Structured Streaming (read from Kafka -> transform and aggregate -> write to RDBMS)
● New app - stateful Structured Streaming (read from Kafka -> transform and aggregate -> write to RDBMS)

Speaker notes

  1. Thank you for coming to hear about our different use-cases of streaming with Spark and Kafka. I will try to make it interesting and valuable for you.
  2. Questions - at the end of the session
  3. Nielsen Marketing Cloud, or NMC for short: a group inside Nielsen, born from the eXelate company that was acquired by Nielsen in March 2015. Nielsen is a data company, and so are we; we had a strong business relationship until, at some point, they decided to go for it and acquired eXelate. “Data company” means buying and onboarding data into NMC from data providers, customers and Nielsen data. We have a huge, high-quality dataset, and we enrich the data using machine learning models in order to create more relevant, quality insights, then categorize and sell according to a need. Helping brands make intelligent business decisions, e.g. targeting in the digital marketing world, meaning helping fit ads to viewers. For example, a street sign can fit only a very small % of the people who see it, vs. online ads that can fit the profile of the individual who sees them: more interesting to the user, more chances they will click the ad, better ROI for the marketer.
  4. What are the questions we try to answer in NMC that help our customers make business decisions? There are a lot of questions, but these lead to what Druid comes to solve. Translating from a human problem to a technical problem: UU (distinct) count and simple count.
  5. A few words on the NMC data pipeline architecture. Frontend layer: receives all the online and offline data traffic; bare metal in different data centers (3 in the US, 2 in the EU, 3 in APAC); near real time - high throughput/low latency challenges. Backend layer: AWS cloud-based; processes all the frontend layer outputs; ETLs - load data to data sources, aggregated and raw. Applications layer: also in the cloud; a variety of apps on top of all our data sources; web - NMC data configurations (segments, audiences etc.), campaign analysis, campaign management tools etc., visualized profile graphs, reports.
  6. We’ve used Clustrix (our operational DB) as both OLTP and OLAP. Events are flowing from our Serving system, and we need to ETL the data into our data stores (DB, DWH, etc.). Events were written to CSV files. Some fields had double quotes, e.g: 2014-07-17,12:55:38,2,2,0,"1619691,9995",1. Processing was done via a standalone Java process. We had many problems with this architecture: truncated lines in input files; can’t enforce schema; had to “manually” scale the processes.
  7. Around 2014, the standalone Java processes were transformed into Spark batch jobs written in Scala (but in this presentation we’re going to focus on streaming). This is a simplified version of what we built (simplified to make it clearer across the presentation). Spark: a distributed, scalable engine for large-scale data processing; a unified framework for batch, streaming, machine learning, etc.; was gaining a lot of popularity in the Big Data community; built on RDDs (Resilient Distributed Datasets) - a fault-tolerant collection of elements that can be operated on in parallel. Scala: combines object-oriented and functional programming; a first-class citizen in Spark.
  8. Kafka: an open-source stream-processing platform; highly scalable; publish/subscribe (A.K.A pub/sub); schema enforcement - using Schema Registry and relying on the Avro format; much more. Originally developed by LinkedIn; graduated from the Apache Incubator in late 2012; quickly became the de facto standard in the industry; today commercial development is led by Confluent. Spark Streaming: a natural evolution of our Spark batch jobs (unified framework - remember?); introduced the DStream concept - a continuous stream of data, represented by a continuous series of RDDs; works in micro-batches - each RDD in a DStream contains data from a certain interval (e.g. 5 minutes).
  9. We started with Spark Streaming over Kafka (in 2015). Our streaming apps were “stateless” (see below) and running 24/7: reading a batch of messages from Kafka; performing simple transformations on each message (no aggregations); writing the output of each batch to a persistent storage (DB, S3, etc.). Stateful operations (aggregations) were performed periodically in batch, either by Spark jobs or by ETLs in our DB/DWH.
  10. Looking back, Spark Streaming might have been able to perform stateful operations for us, but (as far as I recall) mapWithState wasn’t available yet, and updateStateByKey had some pending issues. The way to achieve it was: read messages from Kafka; aggregate the messages of the current micro-batch (we increased the micro-batch length to achieve a better aggregation ratio); combine the results with the results of the previous micro-batches (stored on the cluster’s HDFS); write the results back to HDFS (see the sketch below). Every X batches: update the DB with the aggregated data (some sort of UPSERT) and delete the aggregated files from HDFS. UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL). For example, given t1 with columns a (the key) and b (starting from an empty table): INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2 INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
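A simplified Scala sketch of that “local aggregation” flow (the state path, the element type of the DStream, and X are all hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.streaming.dstream.DStream
import scala.util.Try

// Aggregate each micro-batch, combine with previously accumulated state on
// HDFS, and every X micro-batches UPSERT into the DB (omitted here).
def aggregateLocally(stream: DStream[(String, Long)], statePath: String, x: Int): Unit = {
  var microBatchId = 0L
  stream.foreachRDD { rdd =>
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Aggregate the current micro-batch
    val current = rdd.toDF("key", "count").groupBy("key").agg(sum("count").as("count"))

    // Combine with the state accumulated so far (empty on the very first batch)
    val previous = Try(spark.read.parquet(s"$statePath/$microBatchId"))
      .getOrElse(current.limit(0))
    val combined = current.union(previous).groupBy("key").agg(sum("count").as("count"))

    // Write to a fresh location - never overwrite the state we just read
    microBatchId += 1
    combined.write.parquet(s"$statePath/$microBatchId")

    // Every X micro-batches: UPSERT into the DB and delete old state files
    if (microBatchId % x == 0) { /* upsert + cleanup */ }
  }
}
```

This mirrors the error-prone aspects the talk calls out: the state lives on cluster HDFS and is lost if the cluster is terminated.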
  11. In this specific use-case, the app was reading from a topic which had only small amounts of data Required us to manage the state on our own Error-prone E.g what if my cluster is terminated and data on HDFS is lost? Complicates the code Mixed input sources for the same app (Kafka + files) Possible performance impact Might cause the Kafka consumer to lag Obviously not the perfect way (but that’s what we had…)
  12. DataFrame/Dataset - rather than DStream’s RDD Catalyst Optimizer - extensible query optimizer which is “at the core of Spark SQL… designed with these key two purposes: Easily add new optimization techniques and features to Spark SQL Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc.)” (see https://databricks.com/glossary/catalyst-optimizer) Other features included : Handling event-time and late data End-to-end exactly-once fault-tolerance
  13. The checkpoint folder is the location where Spark stores: the offsets we already read from Kafka; the state of the stateful operations (e.g. aggregations). We’ve used S3 (via EMRFS) for checkpointing. We’ve deployed various use-cases to production using Structured Streaming: a periodic Spark batch job was converted to a stateless Structured Streaming app; a periodic standalone Java app was converted to a stateful Structured Streaming app; a brand new app was written as a stateful Structured Streaming app.
  14. EMRFS consistent view - an optional feature on AWS EMR that allows clusters to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS. Checkpointing to S3 wasn’t straight-forward. We tried using EMRFS consistent view: it worked for stateless apps, but for stateful apps we encountered sporadic issues, possibly related to the metadata store (i.e. DynamoDB).
  15. Strengths: running incremental, continuous processing; end-to-end exactly-once fault-tolerance (if you implement it correctly); increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation); massive efforts are invested in it. Weaknesses: maturity; inability to perform multiple actions on the exact same Dataset, e.g. http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-Avoiding-multiple-streaming-queries-tt30944.html - seems to be resolved by https://issues.apache.org/jira/browse/SPARK-24565 (in Spark 2.4, but then you get at-least-once semantics).
  16. Moved many apps (mostly the ones performing UU counts and “hit” counts) to rely on Druid, which is meant for OLAP, so now : Spark Streaming app Runs on a long-lived EMR cluster (cluster is on 24/7) Performs the “in-batch” aggregation per micro-batch (before writing to S3) Writes relevant metadata to RDS (e.g S3 path) This kind of “split” (i.e persisting the Dataset/DataFrame and iterating it a few times) is impossible with Structured Streaming (where every “branch” of processing is a separate query, at least until https://issues.apache.org/jira/browse/SPARK-24565) M/R ingestion job (loads data into Druid) : Reads relevant metadata from RDS performs the final aggregation (before data is loaded into Druid) Update state in RDS (e.g which files were handled)
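A rough sketch of that “split” with DStreams (names are hypothetical, and the println stands in for the RDS metadata write):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// Persist each micro-batch once, then run multiple actions on it - the
// pattern that a single Structured Streaming query (pre-SPARK-24565) disallows.
def processAndRecord(stream: DStream[(String, Long)]): Unit = {
  stream.foreachRDD { rdd =>
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._
    val df = rdd.toDF("key", "count").groupBy("key").sum("count")
    df.persist()
    val path = s"s3://bucket/aggregated/batch-${System.currentTimeMillis}"
    df.write.parquet(path) // action 1: write the in-batch aggregation to S3
    println(path)          // action 2: stand-in for writing the S3 path to RDS
    df.unpersist()
  }
}
```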
  17. Screenshot from Ganglia installed on our AWS EMR cluster running the Spark Streaming app Remember - longer micro-batches result in a better aggregation ratio Each such app runs on its own long-lived EMR cluster
  18. Extreme load of Kafka brokers’ disks Each micro-batch needs to read ~300M messages , Kafka can’t store it all in memory ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration Forced us to use 1 core per executor to avoid it https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well) We wish we could run it even less frequently Remember - longer micro-batches result in a better aggregation ratio
  19. Each Kafka topic has its own RDR loader, which stores the data in a separate bucket on S3 (partitioned by date). This means each topic has only 1 consumer (the appropriate RDR loader). RDR loaders use micro-batches of 4-6 minutes, writing about 100 files per micro-batch, each file ~0.5GB (to allow efficient reads). Only simple transformations are performed on each message (no aggregations), hence no need for long micro-batches.
  20. Once the RDR loader writes the files to S3, it also writes the files’ paths to a designated topic in Kafka
  21. How does that work? Spark batch applications are executed every X hours. On each execution: a transient EMR cluster is launched; an app consumes the next Y messages from the designated Kafka topic (containing Y paths). Since each such app consumes from Kafka in a standard way, the offsets (i.e. which messages the consumer already read) are committed (and maintained) the same way as for any other Kafka consumer. Then the app reads those Y paths from RDR and processes them. Once done, the EMR cluster is terminated.
  22. Applications are now batch rather than streaming. We no longer use the Spark Streaming-Kafka integration, but rather the Kafka API (from the driver) to read the files’ paths from the designated Kafka topic. Once we’ve got the paths from Kafka, we use the “regular” batch method of reading files, i.e. spark.read.parquet. After processing has ended, the offsets of the messages we read are committed (as we’d do for any Kafka consumer). We now use Airflow (the de facto standard in the industry) to schedule and monitor our batch jobs. All this, obviously, is not meant for apps that require actual real time (say, milliseconds).
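A hedged sketch of this consumer flow (broker, topic, group and column names are hypothetical):

```scala
import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.spark.sql.SparkSession

// Read the next batch of file paths from the designated topic via the Kafka
// API (on the driver), process them with a regular batch read, then commit.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "rdr-batch-app")
props.put("enable.auto.commit", "false")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("rdr-paths"))

val paths = consumer.poll(Duration.ofSeconds(10)).asScala.map(_.value).toSeq
if (paths.nonEmpty) {
  val spark = SparkSession.builder.appName("rdr-batch").getOrCreate()
  val df = spark.read.parquet(paths: _*) // "regular" batch read of the RDR files
  df.groupBy("someColumn").count()       // hypothetical processing
    .write.parquet("s3://bucket/output/")
  consumer.commitSync()                  // commit only after processing succeeded
}
```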
  23. EMR clusters are now transient, so the cluster is terminated as soon as the batch job has finished - no more idle clusters. Cost: the Spark Streaming cluster is on 24/7, so its cost is fixed; with the new infra, the daily cost varies based on the amount of data we processed that day.
  24. Initially replaced standalone Java with Spark & Scala: solved the scale-related issues but not the CSV-related issues. Introduced Spark Streaming & Kafka for “stateless” use-cases: replaced CSV files with Kafka (the de facto standard in the industry); already had Spark batch in production (Spark as a unified framework). Tried Spark Streaming for stateful use-cases (via “local” aggregations): not the optimal solution. Moved to Structured Streaming (for all use-cases). Pros include: enables running continuous, incremental processes; built on Spark SQL. Cons include: maturity; inability to perform multiple actions on the exact same Dataset.
  25. Went back to Spark Streaming Aggregations are done per micro-batch (in Spark) and daily (in Druid) Still not perfect Performance penalty in Kafka for long micro-batches Concurrency issue with Kafka 0.10 consumer in Spark Under-utilized Spark clusters Introduced “streaming” over our Data Lake Spark Streaming apps (A.K.A “RDR loaders”) write files to S3 and paths to Kafka Spark batch apps read S3 paths from Kafka (and the actual files from S3) Transient EMR clusters Airflow for scheduling and monitoring Pros : Eliminated the performance penalty we had in Kafka Spark clusters are much better utilized = $$$ saved
  26. “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended… You will express your streaming computation as standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table”
  27. “A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.”