Stream, Stream, Stream:
Different Streaming methods with Spark and Kafka
Itai Yaffe + Ron Tevel
Nielsen
Introduction
Ron Tevel
● Big Data developer
● Developing Big Data infrastructure solutions

Itai Yaffe
● Big Data Tech Lead
● Dealing with Big Data challenges since 2012
Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Working with Spark? Planning to?
● Working with Kafka? Planning to?
● Cloud deployments? On-prem?
Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
● Data flow - past and present
● Spark Streaming
○ “Stateless” and “stateful” use-cases
● Spark Structured Streaming
● “Streaming” over our Data Lake
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
Nielsen Marketing Cloud - questions we try to answer
● How many users of a certain profile can we reach?
○ E.g. a campaign for fancy women’s sneakers
● How many hits did a specific web page get in a date range?
NMC high-level architecture
Data flow in the old days...
Events flow from our Serving system, and we need to ETL the data
into our data stores (DB, DWH, etc.)
● In the past, events were written to CSV files
○ Some fields had double quotes, e.g.:
2014-07-17,12:55:38,204,400,US|FL|daytona beach|32114,cdde7b60a3117cc4c539b10faad665a9,"https%3A%2F%2Floadm.exelator.com%2Fload%2F%3Fp%3D204%26g%3D400%26buid%3D6989098507373987292%26j%3D0","http%3A%2F%2Fwww.vacationrentals.com%2Fvacation-rentals%2Fflorida%2Forlando.html",2,2,0,"1619691,9995","","","1",,"Windows 7","Chrome"
● Processing with standalone Java process
● Had many problems with this architecture
○ Truncated lines in input files
○ Can’t enforce schema
○ Had to “manually” scale the processes
That's one small step for [a] man...
Moved to Spark (and Scala) in 2014
● Spark
○ An engine for large-scale data processing
○ Distributed, scalable
○ Unified framework for batch, streaming, machine learning, etc.
○ Was gaining a lot of popularity in the Big Data community
○ Built on RDDs (Resilient Distributed Datasets)
■ A fault-tolerant collection of elements that can be operated on in parallel
● Scala
○ Combines object-oriented and functional programming
○ First-class citizen in Spark
● Converted the standalone Java processes to Spark batch jobs
○ Solved the scaling issues
○ Still faced the CSV-related issues
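Below is a minimal, era-appropriate sketch of such a converted batch job (RDD-based, since this migration predates DataFrames); the S3 paths and the naive comma split are illustrative assumptions, and the naive split is exactly where the remaining CSV issues bite.

import org.apache.spark.{SparkConf, SparkContext}

object CsvBatchEtl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-batch-etl"))

    // Distributed read of the raw CSV files - scaling is now handled by Spark
    val lines = sc.textFile("s3://my-bucket/raw-events/*.csv")

    // Naive comma split - quoted fields containing commas (see the sample
    // above) still break, which is why the CSV-related issues remained
    val parsed = lines.map(_.split(",", -1))

    parsed.map(_.mkString("\t")).saveAsTextFile("s3://my-bucket/parsed-events/")
  }
}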
Data flow - the modern way
Introducing Kafka
● Open-source stream-processing platform
○ Highly scalable
○ Publish/Subscribe (a.k.a. pub/sub)
○ Schema enforcement
○ Much more
● Originally developed by LinkedIn
● Graduated from the Apache Incubator in late 2012
● Quickly became the de facto standard in the industry
● Today commercial development is led by Confluent
Data flow - the modern way (cont.)
… Along with Spark Streaming
● A natural evolution of our Spark batch jobs (unified framework - remember?)
● Introduced the DStream concept
○ Continuous stream of data
○ Represented by a continuous series of RDDs
● Works in micro-batches
○ Each RDD in a DStream contains data from a certain interval (e.g. 5 minutes)
Spark Streaming - “stateless” app use-case
We started with Spark Streaming over Kafka (in 2015)
● Our Streaming apps were “stateless” (see the sketch after this list), i.e.:
○ Reading a batch of messages from Kafka
○ Performing simple transformations on each message (no aggregations)
○ Writing each batch to a persistent storage (S3)
● Stateful operations (aggregations) were performed in batch on files, either by
○ Spark jobs
○ ETLs in our DB/DWH
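A minimal sketch of the “stateless” pattern above, using the spark-streaming-kafka-0-10 direct stream; the broker address, topic name, and output path are illustrative assumptions.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010._

object StatelessStreamingApp {
  def main(args: Array[String]): Unit = {
    // 5-minute micro-batches, as in the DStream slide above
    val ssc = new StreamingContext(new SparkConf().setAppName("stateless-app"), Minutes(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "stateless-app")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    // Simple per-message transformation (no aggregation), then persist each
    // micro-batch to S3
    stream.map(_.value)
      .foreachRDD { (rdd, time) =>
        rdd.saveAsTextFile(s"s3://my-bucket/events/batch-${time.milliseconds}")
      }

    ssc.start()
    ssc.awaitTermination()
  }
}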
Spark Streaming - “stateless” app use-case
The need for stateful streaming
Fast forward a few months...
● New requirements were being raised
● Specific use-case:
○ To take the load off the operational DB (used both as OLTP and OLAP), we wanted to move most of
the aggregative operations to our Spark Streaming app
Stateful streaming via “local” aggregations
● The way to achieve it was (see the sketch below):
○ Read messages from Kafka
○ Aggregate the messages of the current micro-batch
○ Combine the results with those of the previous micro-batches (stored on the cluster’s HDFS)
○ Write the results back to HDFS
○ Every X batches:
■ Update the DB with the aggregated data (some sort of UPSERT)
■ Delete the aggregated files from HDFS
● UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL)
○ For example, given t1 with columns a (the key) and b (starting from an empty table)
■ INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2
■ INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
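To make the flow above concrete, here is a condensed sketch of the per-micro-batch logic; the paths, the column names (reusing a and b from the UPSERT example), the X=10 threshold, and the upsertIntoDb helper are all illustrative assumptions.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.sum

object LocalAggregations {
  val statePath = "hdfs:///state/aggregated"

  // Hypothetical helper that issues INSERT ... ON DUPLICATE KEY UPDATE per row
  def upsertIntoDb(df: DataFrame): Unit = ???

  def processMicroBatch(spark: SparkSession, microBatch: DataFrame, batchId: Long): Unit = {
    // Aggregate the current micro-batch and append the partial result to HDFS
    microBatch.groupBy("a").agg(sum("b").as("b"))
      .write.mode(SaveMode.Append).parquet(statePath)

    // Every X batches: combine all partial results, UPSERT them into the DB,
    // then delete the aggregated files from HDFS
    if (batchId % 10 == 0) {
      val combined = spark.read.parquet(statePath)
        .groupBy("a").agg(sum("b").as("b"))
      upsertIntoDb(combined)
      FileSystem.get(spark.sparkContext.hadoopConfiguration)
        .delete(new Path(statePath), true)
    }
  }
}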
Stateful streaming via “local” aggregations
Stateful streaming via “local” aggregations - cons
● Required us to manage the state on our own
● Error-prone
○ E.g. what if my cluster is terminated and data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag
● Obviously not the perfect way (but that’s what we had…)
Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
● Enables running continuous, incremental processes
○ Basically manages the state for you
● Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
● Allows handling event-time and late data
● Ensures end-to-end exactly-once fault-tolerance
● Was in ALPHA mode in 2.0 and 2.1
Structured Streaming - basic concepts
Structured Streaming - WordCount example
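The code on this slide was an image that didn’t survive the transcript; below is a minimal sketch in the spirit of the official Structured Streaming WordCount example, assuming a socket source on localhost:9999.

import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    // Unbounded input table: each new line from the socket is a new row
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split lines into words and count occurrences across the whole stream
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Complete mode: the full updated Result Table is emitted on every trigger
    wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}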
Structured Streaming - stateful app use-case
Structured Streaming in production
So we started moving to Structured Streaming
Use case | Previous architecture | Old flow | New architecture | New flow
Existing Spark app | Periodic Spark batch job | Read Parquet from S3 -> Transform -> Write Parquet to S3 | Stateless Structured Streaming | Read from Kafka -> Transform -> Write Parquet to S3
Existing Java app | Periodic standalone Java process (“manual” scaling) | Read CSV -> Transform and aggregate -> Write to RDBMS | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
New app | N/A | N/A | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
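To make the first row of the table concrete, a minimal sketch of the stateless flow (Read from Kafka -> Transform -> Write Parquet to S3); the broker, topic, and paths are illustrative assumptions.

import org.apache.spark.sql.SparkSession

object StatelessStructuredStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

    // The Kafka source exposes binary key/value columns plus topic/partition/offset
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Simple transformation - cast the payload to a string column
    val events = raw.selectExpr("CAST(value AS STRING) AS event", "timestamp")

    // The File sink supports Append mode only; checkpointing makes it restartable
    events.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/events/")
      .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
      .start()
      .awaitTermination()
  }
}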
Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1):
○ https://issues.apache.org/jira/browse/SPARK-19517
○ https://issues.apache.org/jira/browse/SPARK-19677
○ https://issues.apache.org/jira/browse/SPARK-19407
● Using EMRFS consistent view when checkpointing to S3
○ Recommended for stateless apps
○ For stateful apps, we encountered sporadic issues, possibly related to the metadata store (i.e.
DynamoDB)
Structured Streaming - strengths and weaknesses (IMO)
● Strengths:
○ Running incremental, continuous processing
○ End-to-end exactly-once fault-tolerance (if you implement it correctly)
○ Increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code
generation)
○ Massive efforts are invested in it
● Weaknesses:
○ Maturity
○ Inability to perform multiple actions on the exact same Dataset (by-design?)
■ Seems to be resolved by https://issues.apache.org/jira/browse/SPARK-24565 (in the upcoming
Spark 2.4, but then you get at-least-once); see the sketch below
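For illustration, a sketch of the foreachBatch API that SPARK-24565 adds in Spark 2.4: each micro-batch is exposed as a regular DataFrame, so multiple actions on the exact same data become possible, at the cost of at-least-once side effects. The broker, topic, and paths are assumptions.

import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiSinkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("multi-sink").getOrCreate()

    val streamingDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    streamingDF.writeStream
      .option("checkpointLocation", "s3://my-bucket/checkpoints/multi-sink/")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // A retried batch re-runs both writes, hence at-least-once semantics
        batch.persist()
        batch.write.mode("append").parquet("s3://my-bucket/sink-a/") // action #1
        batch.write.mode("append").parquet("s3://my-bucket/sink-b/") // action #2
        batch.unpersist()
      }
      .start()
      .awaitTermination()
  }
}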
Back to the future - DStream revived for “stateful” app use-case
So what’s wrong with our DStream to Druid application?
● Kafka needs to read 300M messages from disk.
● ConcurrentModificationException when working with Spark Streaming on Kafka 0.10
○ Forced us to use 1 core per executor to avoid it
○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving
https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it less frequently.
Enter “streaming” over RDR
What is RDR?
RDR is our Raw Data Repository, and it serves as our Data Lake
● Kafka topic messages saved to S3 in Parquet.
● RDR Loaders - Spark Streaming applications.
● Applications can read from RDR and do analytics on the data.
Can we leverage our Data Lake as the data source instead of Kafka?
The Idea of How to Stream RDR files
How do we “stream” RDR Files
How do we use the new RDR “streaming” infrastructure?
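The diagrams on these slides are not in the transcript. As the summary later spells out, the RDR loaders write the files to S3 and their paths to Kafka, and batch apps consume those paths; below is a minimal sketch of the batch-side consumer, with the topic name, offsets, and the aggregated column as assumptions.

import org.apache.spark.sql.SparkSession

object RdrBatchReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdr-batch-reader").getOrCreate()

    // Batch (non-streaming) read of the paths topic; in practice the offset
    // range would be handed over by the scheduler (Airflow)
    val paths = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "rdr-paths")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(value AS STRING) AS path")
      .collect()
      .map(_.getString(0))

    // Read the actual Parquet files from S3 and run the heavy aggregations
    val events = spark.read.parquet(paths: _*)
    events.groupBy("userId").count().show() // placeholder aggregation
  }
}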
The Scheduler - Apache Airflow
“A platform to programmatically author, schedule and monitor workflows”
Developed by Airbnb, and a part of the Apache Incubator
Did we solve the problems?
No longer a streaming application - no longer an idle cluster.

Name | Day 1 | Day 2 | Day 3
Old App to Druid | $1,007.68 | $1,007.68 | $1,007.68
New App to Druid | $150.08 | $198.73 | $174.68
Did we solve the problems?
● Still reads old messages from Kafka’s disk, but instead of 300M messages we now read
only ~1K messages per hour.
● Doesn’t depend on the integration of Spark Streaming with Kafka - no more weird
Kafka exceptions.
● We can run the Spark batch application as (in)frequently as we’d like.
Summary
● Started with Spark Streaming for “stateless” use-cases
○ Replaced CSV files with Kafka (de facto standard in the industry)
○ Already had Spark batch in production (Spark as a unified framework)
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Not the optimal solution
● Moved to Structured Streaming (for all use-cases)
○ Pros include:
■ Enables running continuous, incremental processes
■ Built on Spark SQL
○ Cons include:
■ Maturity
■ Inability to perform multiple actions on the exact same Dataset (by-design?)
Summary - cont.
● Moved (back) to Spark Streaming
○ Aggregations are done per micro-batch (in Spark) and daily (in Druid)
○ Still not perfect
■ Performance penalty in Kafka for long micro-batches
■ Concurrency issue with Kafka 0.10 consumer in Spark
■ Under-utilized Spark clusters
● Introduced “streaming” over our Data Lake
○ Spark Streaming apps (a.k.a. “RDR loaders”) write files to S3 and paths to Kafka
○ Spark batch apps read S3 paths from Kafka (and the actual files from S3)
■ Airflow for scheduling and monitoring
■ Meant for apps that don’t require real-time processing
○ Pros:
■ Eliminated the performance penalty we had in Kafka
■ Spark clusters are much better utilized
QUESTIONS?
Join us - https://www.comeet.co/jobs/nielsen/33.000
Big Data Group Leader
Big Data Team Leader
And more...
THANK YOU!
https://www.linkedin.com/in/rontevel/
https://www.linkedin.com/in/itaiy/
Structured Streaming - additional slides
Structured Streaming - basic concepts
Structured Streaming - basic terms
● Input sources:
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes:
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes
■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all
unaggregated data in the Result Table
● Output sinks:
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes (see the sketch below)
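A small sketch of the mode/sink pairing described above, using the Rate source (for testing): an aggregation query with Complete mode on the Console sink, versus a plain projection with the default Append mode on the File sink. Paths are assumptions.

import org.apache.spark.sql.SparkSession

object OutputModesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("output-modes").getOrCreate()

    // Rate source: emits (timestamp, value) rows at a fixed rate - for testing
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Aggregation query -> Complete mode re-emits the whole Result Table
    events.groupBy("value").count()
      .writeStream.outputMode("complete").format("console").start()

    // Non-aggregation query -> Append mode (the default); required by the File sink
    events.select("timestamp", "value").writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/rate-out/")
      .option("checkpointLocation", "s3://my-bucket/checkpoints/rate-out/")
      .start()

    spark.streams.awaitAnyTermination()
  }
}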
Handling event-time and late data
// Group the data by window and word and compute the count of each group
// (no watermark - state for all old windows is retained indefinitely)
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()

// The same aggregation with a watermark - events arriving more than
// 10 minutes late are dropped, so old window state can be cleaned up
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means:
○ Trackable sources (i.e. offsets)
○ Checkpointing
○ Idempotent sinks
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir") // stores offsets and state for recovery
  .format("memory")
  .start()
Monitoring
● Interactive APIs (usage sketch after the listener code below):
○ streamingQuery.lastProgress()/status()
○ Output example
● Asynchronous API:
○ val spark: SparkSession = ...
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
println("Query started: " + queryStarted.id)
}
override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
println("Query terminated: " + queryTerminated.id)
}
override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
println("Query made progress: " + queryProgress.progress)
}
})
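And a tiny usage sketch of the interactive APIs listed above, assuming query is the StreamingQuery returned by writeStream.start() (in Scala these are parameterless methods):

println(query.status)       // whether the trigger is active, data available, etc.
println(query.lastProgress) // metrics of the most recent completed micro-batch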
Más contenido relacionado

La actualidad más candente

Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache sparkdatamantra
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streamingdatamantra
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsTimothy Spann
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streamingdatamantra
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiCodemotion Dubai
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to datasetdatamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Container Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleContainer Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleMichael Mueller
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with SparkVincent GALOPIN
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Sparkdatamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 

La actualidad más candente (20)

Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Container Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleContainer Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycle
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 

Similar a Stream, stream, stream: Different streaming methods with Spark and Kafka

Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosOpenSistemas
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Apache Spark and Python: unified Big Data analytics
Apache Spark and Python: unified Big Data analyticsApache Spark and Python: unified Big Data analytics
Apache Spark and Python: unified Big Data analyticsJulien Anguenot
 
A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache SparkKnoldus Inc.
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusBoldRadius Solutions
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Getting Started with Spark Scala
Getting Started with Spark ScalaGetting Started with Spark Scala
Getting Started with Spark ScalaKnoldus Inc.
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelGarindra Prahandono
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...Codemotion
 

Similar a Stream, stream, stream: Different streaming methods with Spark and Kafka (20)

Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectos
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Apache Spark and Python: unified Big Data analytics
Apache Spark and Python: unified Big Data analyticsApache Spark and Python: unified Big Data analytics
Apache Spark and Python: unified Big Data analytics
 
A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache Spark
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadius
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Getting Started with Spark Scala
Getting Started with Spark ScalaGetting Started with Spark Scala
Getting Started with Spark Scala
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 

Más de Itai Yaffe

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingItai Yaffe
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationItai Yaffe
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsItai Yaffe
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Itai Yaffe
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Itai Yaffe
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesItai Yaffe
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsItai Yaffe
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your DataItai Yaffe
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesItai Yaffe
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Itai Yaffe
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidItai Yaffe
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Itai Yaffe
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsItai Yaffe
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapItai Yaffe
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for DruidItai Yaffe
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidItai Yaffe
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerItai Yaffe
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Itai Yaffe
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureItai Yaffe
 

Más de Itai Yaffe (20)

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening Notes
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom Connectors
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for Druid
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
 

Último

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 

Último (20)

Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Stream, stream, stream: Different streaming methods with Spark and Kafka

  • 1. Stream, Stream, Stream: Different Streaming methods with Spark and Kafka Itai Yaffe + Ron Tevel Nielsen
  • 2. Introduction Ron Tevel Itai Yaffe ● Big Data developer ● Developing Big Data infrastructure solutions ● Big Data Tech Lead ● Dealing with Big Data challenges since 2012
  • 3. Introduction - part 2 (or: “your turn…”) ● Data engineers? Data architects? Something else? ● Working with Spark? Planning to? ● Working with Kafka? Planning to? ● Cloud deployments? On-prem?
  • 4. Agenda ● Nielsen Marketing Cloud (NMC) ○ About ○ High-level architecture ● Data flow - past and present ● Spark Streaming ○ “Stateless” and “stateful” use-cases ● Spark Structured Streaming ● “Streaming” over our Data Lake
  • 5. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen on March 2015 ● A Data company ● Machine learning models for insights ● Targeting ● Business decisions
  • 6. Nielsen Marketing Cloud - questions we try to answer ● How many users of a certain profile can we reach Campaign for fancy women sneakers - ● How many hits for a specific web page in a date range
  • 10. Data flow in the old days... Events are flowing from our Serving system, need to ETL the data into our data stores (DB, DWH, etc.) ● In the past, events were written to CSV files ○ Some fields had double quotes, e.g : 2014-07-17,12:55:38,204,400,US|FL|daytona beach|32114,cdde7b60a3117cc4c539b10faad665a9,"https%3A%2F%2Floadm.exelator.com%2Fload% 2F%3Fp%3D204%26g%3D400%26buid%3D6989098507373987292%26j%3D0","http%3A%2F%2Fwww .vacationrentals.com%2Fvacation- rentals%2Fflorida%2Forlando.html",2,2,0,"1619691,9995","","","1",,"Windows 7","Chrome" ● Processing with standalone Java process ● Had many problems with this architecture ○ Truncated lines in input files ○ Can’t enforce schema ○ Had to “manually” scale the processes
  • 11. That's one small step for [a] man... Moved to Spark (and Scala) in 2014 ● Spark ○ An engine for large-scale data processing ○ Distributed, scalable ○ Unified framework for batch, streaming, machine learning, etc ○ Was gaining a lot of popularity in the Big Data community ○ Built on RDDs (Resilient distributed dataset) ■ A fault-tolerant collection of elements that can be operated on in parallel ● Scala ○ Combines object-oriented and functional programming ○ First-class citizen is Spark ● Converted the standalone Java processes to Spark batch jobs ○ Solved the scaling issues ○ Still faced the CSV-related issues
  • 12. Data flow - the modern way Introducing Kafka ● Open-source stream-processing platform ○ Highly scalable ○ Publish/Subscribe (A.K.A pub/sub) ○ Schema enforcement ○ Much more ● Originally developed by LinkedIn ● Graduated from Apache Incubator on late 2012 ● Quickly became the de facto standard in the industry ● Today commercial development is led by Confluent
  • 13. Data flow - the modern way (cont.) … Along with Spark Streaming ● A natural evolvement of our Spark batch job (unified framework - remember?) ● Introduced the DStream concept ○ Continuous stream of data ○ Represented by a continuous series of RDDs ● Works in micro-batches ○ Each RDD in a DStream contains data from a certain interval (e.g 5 minutes)
  • 14. Spark Streaming - “stateless” app use-case We started with Spark Streaming over Kafka (in 2015) ● Our Streaming apps were “stateless”, i.e : ○ Reading a batch of messages from Kafka ○ Performing simple transformations on each message (no aggregations) ○ Writing each batch to a persistent storage (S3) ● Stateful operations (aggregations) were performed in batch on files, either by ○ Spark jobs ○ ETLs in our DB/DWH
  • 15. Spark Streaming - “stateless” app use-case
  • 16. The need for stateful streaming Fast forward a few months... ● New requirements were being raised ● Specific use-case : ○ To take the load off of the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark streaming app
  • 17. Stateful streaming via “local” aggregations ● The way to achieve it was : ○ Read messages from Kafka ○ Aggregate the messages of the current micro-batch ○ Combine the results of the results of the previous micro-batches (stored on the cluster’s HDFS) ○ Write the results back to HDFS ○ Every X batches : ■ Update the DB with the aggregated data (some sort of UPSERT) ■ Delete the aggregated files from HDFS ● UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL) ○ For example, given t1 with columns a (the key) and b (starting from an empty table) ■ INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2 ■ INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
  • 18. Stateful streaming via “local” aggregations
  • 19. Stateful streaming via “local”aggregations - cons ● Required us to manage the state on our own ● Error-prone ○ E.g what if my cluster is terminated and data on HDFS is lost? ● Complicates the code ○ Mixed input sources for the same app (Kafka + files) ● Possible performance impact ○ Might cause the Kafka consumer to lag ● Obviously not the perfect way (but that’s what we had…)
  • 20. Structured Streaming - to the rescue? Spark 2.0 introduced Structured Streaming ● Enables running continuous, incremental processes ○ Basically manages the state for you ● Built on Spark SQL ○ DataFrame/Dataset API ○ Catalyst Optimizer ● Allows handling event-time and late data ● Ensures end-to-end exactly-once fault-tolerance ● Was in ALPHA mode in 2.0 and 2.1
  • 21. Structured Streaming - basic concepts
  • 22. Structured Streaming - WordCount example
  • 23. Structured Streaming - stateful app use-case
  • 24. Structured Streaming in production So we started moving to Structured Streaming Use case Previous architecture Old flow New architecture New flow Existing Spark app Periodic Spark batch job Read Parquet from S3 -> Transform -> Write Parquet to S3 Stateless Structured Streaming Read from Kafka -> Transform -> Write Parquet to S3 Existing Java app Periodic standalone Java process (“manual” scaling) Read CSV -> Transform and aggregate -> Write to RDBMS Stateful Structured Streaming Read from Kafka -> Transform and aggregate -> Write to RDBMS New app N/A N/A Stateful Structured Streaming Read from Kafka -> Transform and aggregate -> Write to RDBMS
  • 25. Structured Streaming - known issues & tips ● 3 major issues we had in 2.1.0 (solved in 2.1.1) : ○ https://issues.apache.org/jira/browse/SPARK-19517 ○ https://issues.apache.org/jira/browse/SPARK-19677 ○ https://issues.apache.org/jira/browse/SPARK-19407 ● Using EMRFS consistent view when checkpointing to S3 ○ Recommended for stateless apps ○ For stateful apps, we encountered sporadic issues possibly related to the metadata store (i.e DynamoDB)
  • 26. Structured Streaming - strengths and weaknesses (IMO) ● Strengths : ○ Running incremental, continuous processing ○ End-to-end exactly-once fault-tolerance (if you implement it correctly) ○ Increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation) ○ Massive efforts are invested in it ● Weaknesses : ○ Maturity ○ Inability to perform multiple actions on the exact same Dataset (by-design?) ■ Seems to be resolved by https://issues.apache.org/jira/browse/SPARK-24565 (in the upcoming Spark 2.4, but then you get at-least once)
  • 27. Back to the future - DStream revived for “stateful” app use-case
  • 28. So what’s wrong with our DStream to Druid application?
  • 29. So what’s wrong with our DStream to Druid application? ● Kafka needs to read 300M messages from disk. ● ConcurrentModificationException when working with Spark Streaming on Kafka 0.10 ○ Forced us to use 1 core per executor to avoid it ○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well) ● We wish we could run it less frequently.
  • 31. What is RDR? RDR is Raw Data Repository and it’s our Data Lake ● Kafka topic messages saved to S3 in Parquet. ● RDR Loaders - Spark streaming applications. ● Applications can read from RDR and do analytics on the data. Can we leverage our Data Lake as the data source instead of Kafka?
  • 32. The Idea of How to Stream RDR files
  • 33. How do we “stream” RDR Files
  • 34. How do we use the new RDR “streaming” infrastructure?
  • 35. “A platform to programmatically author, schedule and monitor workflows” Developed by Airbnb and is a part of the Apache Incubator The Scheduler
  • 36. Did we solve the problems? No longer streaming application - no longer idle cluster. Name Day 1 Day 2 Day 3 Old App to Druid 1007.68$ 1007.68$ 1007.68$ New App to Druid 150.08$ 198.73$ 174.68$
  • 37. Did we solve the problems? ● Still reads old messages from Kafka disk but unlike reading 300M messages we just read 1K messages per hour. ● Doesn’t depend on the integration of spark streaming with Kafka - no more weird Kafka exceptions. ● We can run the Spark batch application as (in)frequent as we’d like.
  • 38. Summary ● Started with Spark Streaming for “stateless” use-cases ○ Replaced CSV files with Kafka (de facto standard in the industry) ○ Already had Spark batch in production (Spark as a unified framework) ● Tried Spark Streaming for stateleful use-cases (via “local” aggregations) ○ Not the optimal solution ● Moved to Structured Streaming (for all use-cases) ○ Pros include : ■ Enables running continuous, incremental processes ■ Built on Spark SQL ○ Cons include : ■ Maturity ■ Inability to perform multiple actions on the exact same Dataset (by-design?)
  • 39. Summary - cont. ● Moved (back) to Spark Streaming ○ Aggregations are done per micro-batch (in Spark) and daily (in Druid) ○ Still not perfect ■ Performance penalty in Kafka for long micro-batches ■ Concurrency issue with Kafka 0.10 consumer in Spark ■ Under-utilized Spark clusters ● Introduced “streaming” over our Data Lake ○ Spark Streaming apps (A.K.A “RDR loaders”) write files to S3 and paths to Kafka ○ Spark batch apps read S3 paths from Kafka (and the actual files from S3) ■ Airflow for scheduling and monitoring ■ Meant for apps that don’t require real time ○ Pros : ■ Eliminated the performance penalty we had in Kafka ■ Spark clusters are much better utilized
  • 40. QUESTIONS? Join us - https://www.comeet.co/jobs/nielsen/33.000 Big Data Group Leader Big Data Team Leader And more...
  • 43. Structured Streaming - basic concepts
  • 44. Structured Streaming - basic terms ● Input sources : ○ File ○ Kafka ○ Socket, Rate (for testing) ● Output modes : ○ Append (default) ○ Complete ○ Update (added in Spark 2.1.1) ○ Different types of queries support different output modes ■ E.g for non-aggregation queries, Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table ● Output sinks : ○ File ○ Kafka (added in Spark 2.2.0) ○ Foreach ○ Console, Memory (for debugging) ○ Different types of sinks support different output modes
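To make the terms concrete, a minimal sketch (topic, brokers, and paths are hypothetical) wiring a Kafka input source to a File output sink in the default Append output mode:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("structured-streaming-terms").getOrCreate()

    // Input source: Kafka
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Output sink: File (Parquet) - note the File sink only supports Append mode
    input.writeStream
      .format("parquet")
      .outputMode("append")
      .option("path", "s3://bucket/out/")
      .option("checkpointLocation", "s3://bucket/checkpoints/terms-demo")
      .start()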
  • 48. Handling event-time and late data
// Requires: import spark.implicits._ (for the $"col" syntax)
//           import org.apache.spark.sql.functions.window
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()

// Same aggregation, but with a watermark that bounds how late data may arrive
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
  • 49. Fault tolerance ● The goal - end-to-end exactly-once semantics ● The means : ○ Trackable sources (i.e offsets) ○ Checkpointing ○ Idempotent sinks
// aggDF is an aggregated streaming DataFrame (its definition is omitted on the slide)
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()
  • 50. Monitoring ● Interactive APIs : ○ streamingQuery.lastProgress()/status() ○ Output example ● Asynchronous API :
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

val spark: SparkSession = ...
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
    println("Query started: " + queryStarted.id)
  }
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + queryTerminated.id)
  }
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }
})

Editor's notes

  1. Welcome to the Nielsen Marketing Cloud offices. Thank you for coming to hear about our different use-cases of streaming with Spark and Kafka. We will try to make it interesting and valuable for you.
  2. Present ourselves. Questions - pop in anytime; if we know the answer we’ll answer, and if we don’t we’ll say it’s a good question with a complicated solution - let’s talk after the session.
  3. Nielsen Marketing Cloud, or NMC in short, is a group inside Nielsen, born from exelate, a company acquired by Nielsen in March 2015. Nielsen is a data company and so are we; we had a strong business relationship until at some point they decided to go for it and acquired exelate. Being a data company means buying and onboarding data into NMC from data providers, customers and Nielsen data. We have a huge, high-quality dataset and enrich the data using machine learning models in order to create more relevant, quality insights, categorize and sell according to need, and help brands make intelligent business decisions. E.g. targeting in the digital marketing world, meaning helping fit ads to viewers: a street sign can fit only a very small % of the people who see it, vs. online ads that can fit the profile of the individual who sees them - more interesting to the user, more chances he will click the ad, better ROI for the marketer.
  4. What are the questions we try to answer in NMC that help our customers make business decisions? There are a lot of questions, but these lead to what Druid comes to solve. Translating from a human problem to a technical problem: UU (distinct) count, and simple count.
  5. A few words on the NMC data pipeline architecture: Frontend layer - receives all the online and offline data traffic; bare metal in different data centers (3 US, 2 EU, 2 Pacific); near-real-time, high-throughput/low-latency challenges. Backend layer - AWS cloud-based; processes all the frontend layer outputs; ETLs load data into data stores, aggregated and raw. Applications layer - also in the cloud; a variety of apps on top of all our data sources; web - NMC data configurations (segments, audiences etc.), campaign analysis and campaign management tools, visualized profile graphs, reports.
  6. In each data center the frontend layer is built out of a few systems: Data serving (in-house) - gets the external traffic, online and offline; analyzes the events; reads/writes the user repository; runs algorithms to identify relevant buyers; outputs to Kafka; a scale of 5B events/day with a 200ms SLA, while most events are handled in a few tens of ms. Modeling and scoring system (in-house) - scoring and learning; online machine-learning algorithms; look-alike models; cross-device models; 1.7T models a day; on average 1,100 model executions in a single event; less than 20ms. User repository - Aerospike; 8B active users in the US space, 3B active users in the EU space. Everything goes to Kafka in each DC and is replicated, using uReplicator, to the centralized Kafka in the cloud. uReplicator - Uber’s open-source tool (similar to MirrorMaker) that we modified a little to fit our needs.
  7. A centralized Kafka cluster in the cloud for all the data from all Kafka clusters in all DCs: 50 big brokers, 10 topics, dealing with 11B events per day. Kafka data processing is done mainly with Spark Streaming, loading sources such as: Druid - real-time analytics for the web applications, 10B events per day; Clustrix - relational DB, aggregated data; RDR - data lake, 15 TB per day of compressed Parquet; Snowflake - managed DB for DS and analytics purposes. Now, after presenting the data pipeline, Itai will go deeper into the Druid use-case and implementation details.
  8. Around 2014 these were transformed into Spark batch jobs written in Scala (but in this presentation we’re going to focus on streaming).
  9. Schema enforcement - using Schema Registry and relying on Avro format
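As an illustration of the schema-enforcement piece (broker, group id, and registry URL are hypothetical placeholders), consumer properties wiring Confluent's Avro deserializer to Schema Registry might look like:

    import org.apache.kafka.common.serialization.StringDeserializer
    import io.confluent.kafka.serializers.KafkaAvroDeserializer

    // Hypothetical Kafka consumer params: the Avro deserializer fetches the
    // writer's schema from Schema Registry, so records that don't match the
    // registered schema fail fast instead of flowing downstream
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"   -> "broker:9092",                // placeholder
      "group.id"            -> "nmc-etl",                    // placeholder
      "key.deserializer"    -> classOf[StringDeserializer],
      "value.deserializer"  -> classOf[KafkaAvroDeserializer],
      "schema.registry.url" -> "http://schema-registry:8081" // placeholder
    )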
  10. “Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem”
  11. Looking back, Spark Streaming might have been able to perform stateful operations for us, but (as far as I recall) mapWithState wasn’t available yet, and updateStateByKey had some pending issues.
  12. In this specific use-case, the app was reading from a topic which had only small amounts of data
  13. DataFrame/Dataset - rather than DStream’s RDD Catalyst Optimizer - extensible query optimizer which is “at the core of Spark SQL… designed with these key two purposes: Easily add new optimization techniques and features to Spark SQL Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc.)” (see https://databricks.com/glossary/catalyst-optimizer)
  14. “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended… You will express your streaming computation as standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table”
  15. “A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.”
  16. EMRFS consistent view - an optional feature on AWS EMR, allows clusters to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS
  17. E.g http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-Avoiding-multiple-streaming-queries-tt30944.html
  18. Moved many apps (mostly the ones performing UU counts and “hit” counts) to rely on Druid, so now : Spark Streaming app - performs the “in-batch” aggregation per micro-batch (before writing to S3) and writes relevant metadata to RDS (e.g S3 path). This kind of “split” (i.e persisting the Dataset/DataFrame and iterating it a few times) is impossible with Structured Streaming (where every “branch” of processing is a separate query, at least until https://issues.apache.org/jira/browse/SPARK-24565). M/R ingestion job - reads relevant metadata from RDS, performs the final aggregation (before data is loaded into Druid), and updates state in RDS (e.g which files were handled). Remember the RDS part - we’ll get back to it in Simona’s presentation about how we manage Kafka consumers’ offsets.
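A minimal sketch of the “split” described above (schema, names, and paths are hypothetical): persisting each micro-batch once, then running both the S3 write and the metadata-related action against it:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.dstream.DStream

    // 'stream' is a DStream of already-parsed (key, hits) events; parsing omitted
    def process(stream: DStream[(String, Long)], spark: SparkSession): Unit = {
      import spark.implicits._
      stream.foreachRDD { rdd =>
        val df = rdd.toDF("key", "hits")
        df.persist() // iterate the same micro-batch twice without recomputation
        val aggregated = df.groupBy("key").sum("hits")
        aggregated.write.mode("append").parquet("s3://bucket/druid-input/") // action #1: data to S3
        val rowCount = aggregated.count()                                   // action #2: stats for the RDS metadata
        // ... write the S3 path + rowCount to RDS here (omitted) ...
        df.unpersist()
      }
    }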
  19. So what’s actually the problem with the streaming applications that help us load data into Druid? These applications work in micro-batches of one hour, because that helps us aggregate over more data and load into Druid faster. Why is that a problem? For each such streaming application we spin up a dedicated cluster on Amazon (EMR), and the application may finish its micro-batch after half an hour, or after 45 minutes - it depends on various factors - and as a result our cluster does no work between micro-batches and we simply keep paying for it.
  20. In Kafka’s design messages are written to disk, and new messages are also kept in memory. Because it’s a one-hour batch, most of the messages we read sit only on disk rather than in memory, and reading them puts a heavy load on Kafka, especially when several such applications read every hour. After we upgraded Spark Streaming to work with Kafka 0.10, we started getting errors from Spark about using the same consumer concurrently (it’s not thread-safe), so the workaround was one core per executor to prevent that situation; supposedly solved in 2.4.0. For the streaming-to-Druid applications we’d like to allow batches of more than an hour - it would make loading the data into Druid easier. So reading the data we need from Kafka may not be such a good idea in this case - where should we read from instead?
  21. Streaming over RDR.
  22. So what is RDR? It stands for Raw Data Repository, and it’s our Data Lake. A few details about it: like any data lake we have storage - in our case S3 - and we write messages from different Kafka topics there as Parquet files; each topic gets its own bucket (think of it as a folder) and the messages are partitioned by day. How does the data get there? Behind the scenes we have what we call the RDR Loader - a Spark Streaming application that, unlike the streaming applications we’ve discussed so far, runs in small micro-batches of 4 to 6 minutes. Its main job is to read the messages of a certain topic and write them to RDR. On average, every 6 minutes a loader writes about 100 files, aiming for half a GB per file to allow other Spark applications to read the files efficiently. We have various other applications that read from RDR and process and analyze the data - for example, applications that read a whole day (the data is partitioned by day) and run over all the data that arrived, say, yesterday, or over the last 30 days. So we already have our data lake ready and we’re writing the data from Kafka into it - so maybe, instead of reading the data from Kafka, we can leverage our data lake? OK, that’s cool, but we need to read every hour and new files are added every few minutes - so how do we keep track of which files we’ve already read and which we still need to read?
  23. Before going over how we do it, let me explain the idea. We have the data lake with the files; suppose that the moment those files are created in the data lake, we push their paths into Kafka as messages to a dedicated topic. That is, say the RDR Loader reads from topic X and writes files to the data lake - then the paths of the files it wrote go into a Kafka topic called RDR-X. Once the file paths are in a Kafka topic, any application can subscribe to that topic, consume the file paths, and read the files those paths point to - and this way the application can read the files from the data lake in a “streaming” fashion, every hour.
  24. So how do we actually stream these files to applications? Despite the elaborate architecture it’s very simple: the RDR Loader reads topic X from Kafka, writes to the data lake, and once it’s done, writes the paths of the files to topic RDR-X in Kafka. Unfortunately Spark can’t tell you the paths of the files it wrote, but we can tell it to write the files into a folder in the data lake with a unique name that we create for each RDR Loader micro-batch, and send the path of that unique folder to Kafka. The result is that the application doesn’t receive file paths from Kafka, but rather the path of a folder that contains the files.
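A sketch of the loader side under the assumptions above (folder naming, bucket, topic, and producer config are all hypothetical): each micro-batch writes into a uniquely named folder, and the folder’s path is published to the companion Kafka topic:

    import java.util.{Properties, UUID}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer
    import org.apache.spark.sql.DataFrame

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // placeholder
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](props)

    // Called once per RDR-loader micro-batch
    def writeBatchAndPublishPath(batch: DataFrame): Unit = {
      // A unique folder per micro-batch, since Spark can't report which files it wrote
      val folder = s"s3://rdr-bucket/topic-x/batch-${UUID.randomUUID()}" // hypothetical layout
      batch.write.parquet(folder)
      // Publish the folder path (not individual file paths) to the companion topic
      producer.send(new ProducerRecord[String, String]("RDR-X", folder))
    }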
  25. Cool, so now we need to adapt the streaming-to-Druid application we discussed to use this infrastructure. The first thing we did was change it from a streaming application to a batch application. We no longer use the Spark-Kafka streaming integration; instead we use the Kafka API in the driver to read the file paths from Kafka. The real Spark work - creating the Dataset/DataFrame - starts after we’ve received the paths and pass them to Spark (spark.read.parquet, for those familiar with it). At the end, after Spark is done, the driver commits the file paths (the messages we received from Kafka), and that’s how our application advances to the next files in line. This is all well and good, but as the attentive may have noticed, one problem remains: I said we have a cluster per Spark Streaming application, but now it’s no longer Spark Streaming - Spark doesn’t schedule itself every hour - it’s Spark batch, and when the application ends the cluster goes down too, so we need something that will do the scheduling for us: spin up a cluster and run our new application every hour.
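And the batch-application side, sketched under the same assumptions (broker, group id, and topic are hypothetical): a plain Kafka consumer in the driver fetches folder paths, Spark reads the Parquet files, and offsets are committed only after the batch succeeds:

    import java.time.Duration
    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("rdr-to-druid").getOrCreate()

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // placeholder
    props.put("group.id", "rdr-to-druid")         // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false")      // commit manually, only after success

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("RDR-X"))

    // Driver-side: collect the folder paths published by the RDR loader
    val paths = consumer.poll(Duration.ofSeconds(30)).asScala.map(_.value).toSeq
    if (paths.nonEmpty) {
      // The real Spark work starts here: read all folders as one DataFrame
      val df = spark.read.parquet(paths: _*)
      // ... aggregate and prepare the data for Druid (omitted) ...
      consumer.commitSync() // advance to the next files only after the batch succeeded
    }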
  26. Meet the scheduler we chose: Airflow. It was developed at Airbnb, was later open-sourced, and entered the Apache Incubator. Until now we worked with Amazon Data Pipeline to schedule our processes, but we decided to move to Airflow because it’s becoming an industry standard, and also because Data Pipeline’s UI is simply awful; later on we’ll see another difference between them that was significant for us.
  27. So did we solve the problems I presented at the beginning? First problem - I said we have a cluster up 24/7 that is idle between micro-batches. That’s solved: since it’s no longer Spark Streaming, once the application finishes working on the files, the Amazon cluster goes down automatically until the next run, when Airflow spins up a new one. This allowed us to save more than $800 a day (these are real numbers). You can see that the Spark Streaming app costs the same every day because, as I said, it’s up 24/7 with the same number of servers, while the new application’s cost varies according to the amount of data that arrived that day.
  28. Second problem - we talked about Kafka reading lots of messages from disk, which creates a very high load. Since we still read every hour, Kafka still fetches most of the messages from disk, but it’s far less significant: instead of reading 300 million messages we read only about 1,000. We no longer depend on the Spark Streaming integration with Kafka, so we got rid of all the weird errors and problems that came with it - remember ConcurrentModificationException? And finally, as we wanted, we can run the application that prepares the data for Druid at intervals longer than an hour, without worrying about hurting Kafka’s performance or about poor utilization of our clusters, making the subsequent load into Druid faster.
  29. Handling late data - relevant when performing timed (=windowed) aggregations