SlideShare a Scribd company logo
1 of 37
Download to read offline
Spark Streaming Tips
for
Devs & Ops
WHO ARE WE?
Fede Fernández
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@fede_fdz
Fran Pérez
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@FPerezP
Overview
Spark Streaming
Spark + Kafka
groupByKey vs reduceByKey
Table Joins
Serializer
Tunning
Spark Streaming
Real-time data processing
Continuous Data Flow
RDD
RDD
RDD
DStream
Output Data
Spark + Kafka
● Receiver-based Approach
○ At least once (with Write Ahead Logs)
● Direct API
○ Exactly once
Spark + Kafka
● Receiver-based Approach
Spark + Kafka
● Direct API
groupByKey VS reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups and combines pairs of data based on a reduce
operation.
groupByKey VS reduceByKey
sc.textFile(“hdfs://….”)
.flatMap(_.split(“ “))
.map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
sc.textFile(“hdfs://….”)
.flatMap(_.split(“ “))
.map((_, 1)).reduceByKey(_ + _)
groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
shuffle shuffle
groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(c, 1)
(c, 1)
(c, 1)
(c, 1)
shuffle shuffle
groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 5)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 6)
(c, 1)
(c, 1)
(c, 1)
(c, 1)
(c, 4)
shuffle shuffle
reduceByKey
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)
reduceByKey
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 2)
(c, 1)
(s, 2)
(j, 1)
(s, 1)
(s, 1)
(j, 1)
(s, 2)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(c, 2)
(s, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
reduceByKey
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 2)
(c, 1)
(s, 2)
(j, 1)
(s, 1)
(s, 1)
(j, 1)
(s, 2)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(c, 2)
(s, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
reduceByKey
(j, 2)
(j, 1)
(j, 1)
(j, 1)
(s, 2)
(s, 2)
(s, 1)
(s, 1)
(c, 1)
(c, 2)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 2)
(c, 1)
(s, 2)
(j, 1)
(s, 1)
(s, 1)
(j, 1)
(s, 2)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(c, 2)
(s, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
shuffle shuffle
reduceByKey
(j, 2)
(j, 1)
(j, 1)
(j, 1)
(j, 5)
(s, 2)
(s, 2)
(s, 1)
(s, 1)
(s, 6)
(c, 1)
(c, 2)
(c, 1)
(c, 4)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 2)
(c, 1)
(s, 2)
(j, 1)
(s, 1)
(s, 1)
(j, 1)
(s, 2)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(c, 2)
(s, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
shuffle shuffle
reduce VS group
● Improve performance
● Can’t always be used
● Out of Memory Exceptions
● aggregateByKey, foldByKey, combineByKey
Table Joins
● Typical operations that can be improved
● Need a previous analysis
● There are no silver bullets
Table Joins: Medium - Large
Table Joins: Medium - Large
FILTER
No Shuffle
Table Joins: Small - Large
...
Shuffled Hash Join
sqlContext.sql("explain <select>").collect.mkString(“n”)
[== Physical Plan ==]
[Project]
[+- SortMergeJoin]
[ :- Sort]
[ : +- TungstenExchange hashpartitioning]
[ : +- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Sort]
[ +- TungstenExchange hashpartitioning]
[ +- ConvertToUnsafe]
[ +- Scan ExistingRDD]
Table Joins: Small - Large
Broadcast Hash Join
sqlContext.sql("explain <select>").collect.mkString(“n”)
[== Physical Plan ==]
[Project]
[+- BroadcastHashJoin]
[ :- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Scan ParquetRelation]
No shuffle!
By default from Spark 1.4 when using DataFrame API
Prior Spark 1.4
ANALYZE TABLE small_table COMPUTE STATISTICS noscan
Broadcast
Table Joins: Small - Large
Serializers
● Java’s ObjectOutputStream framework. (Default)
● Custom serializers: extends Serializable & Externalizable.
● KryoSerializer: register your custom classes.
● Where is our code being run?
● Special care to JodaTime.
Tuning
Garbage Collector
blockInterval
Partitioning
Storage
Tuning: Garbage Collector
• Applications which rely heavily on memory consumption.
• GC Strategies
• Concurrent Mark Sweep (CMS) GC
• ParallelOld GC
• Garbage-First GC
• Tuning steps:
• Review your logic and object management
• Try Garbage-First
• Activate and inspect the logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: Total cores per partition.
● bi: Batch Interval time in milliseconds.
● consumers: number of streaming consumers.
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.
blockInterval: example
● batchIntervalMillis = 600,000
● consumers = 20
● CAT = 120
● sparkCores = 120 - 20 = 100
● partitionFactor = 3
blockInterval = (bi * consumers) / (pf * sc) =
(600,000 * 20) / (3 * 100) =
40,000
Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: Batch Interval time in milliseconds.
● blockInterval: time size to split data before storing into
Spark.
Partitioning: example
● batchIntervalMillis = 600,000
● consumers = 20
● blockInterval = 40,000
partitions = consumers * bi / blockInterval =
20 * 600,000/ 40,000=
30
Tuning: Storage
• Default (MEMORY_ONLY)
• MEMORY_ONLY_SER with Serialization Library
• MEMORY_AND_DISK & DISK_ONLY
• Replicated _2
• OFF_HEAP (Tachyon/Alluxio)
Where to find more information?
Spark Official Documentation
Databricks Blog
Databricks Spark Knowledge Base
Spark Notebook - By Andy Petrella
Databricks YouTube Channel
QUESTIONS
Fede Fernández
@fede_fdz
fede.f@47deg.com
Fran Pérez
@FPerezP
fran.p@47deg.com
Thanks!

More Related Content

What's hot

Storing metrics at scale with Gnocchi
Storing metrics at scale with GnocchiStoring metrics at scale with Gnocchi
Storing metrics at scale with GnocchiGordon Chung
 
Scaling the #2ndhalf
Scaling the #2ndhalfScaling the #2ndhalf
Scaling the #2ndhalfSalo Shp
 
Gnocchi v4 (preview)
Gnocchi v4 (preview)Gnocchi v4 (preview)
Gnocchi v4 (preview)Gordon Chung
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionKaran Singh
 
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...DataStax
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in CloudSteven Wu
 
The State of Lightweight Threads for the JVM
The State of Lightweight Threads for the JVMThe State of Lightweight Threads for the JVM
The State of Lightweight Threads for the JVMVolkan Yazıcı
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Databasewangzhonnew
 
C*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraC*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraDataStax
 
Garbage collection in .net (basic level)
Garbage collection in .net (basic level)Garbage collection in .net (basic level)
Garbage collection in .net (basic level)Larry Nung
 
Unified Data Platform, by Pauline Yeung of Cisco Systems
Unified Data Platform, by Pauline Yeung of Cisco SystemsUnified Data Platform, by Pauline Yeung of Cisco Systems
Unified Data Platform, by Pauline Yeung of Cisco SystemsAltinity Ltd
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and SparkJosef Adersberger
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINEDB
 

What's hot (18)

R and cpp
R and cppR and cpp
R and cpp
 
Scala+data
Scala+dataScala+data
Scala+data
 
Storing metrics at scale with Gnocchi
Storing metrics at scale with GnocchiStoring metrics at scale with Gnocchi
Storing metrics at scale with Gnocchi
 
Scaling the #2ndhalf
Scaling the #2ndhalfScaling the #2ndhalf
Scaling the #2ndhalf
 
Gnocchi v4 (preview)
Gnocchi v4 (preview)Gnocchi v4 (preview)
Gnocchi v4 (preview)
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
 
JEE on DC/OS
JEE on DC/OSJEE on DC/OS
JEE on DC/OS
 
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in Cloud
 
The State of Lightweight Threads for the JVM
The State of Lightweight Threads for the JVMThe State of Lightweight Threads for the JVM
The State of Lightweight Threads for the JVM
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
C*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraC*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with Cassandra
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
Garbage collection in .net (basic level)
Garbage collection in .net (basic level)Garbage collection in .net (basic level)
Garbage collection in .net (basic level)
 
Unified Data Platform, by Pauline Yeung of Cisco Systems
Unified Data Platform, by Pauline Yeung of Cisco SystemsUnified Data Platform, by Pauline Yeung of Cisco Systems
Unified Data Platform, by Pauline Yeung of Cisco Systems
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and Spark
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
 

Viewers also liked

Agile Methodologies and Cost Estimation
Agile Methodologies and Cost EstimationAgile Methodologies and Cost Estimation
Agile Methodologies and Cost Estimationshansa2014
 
Life Sciences: Career Development in Europe and Asia
Life Sciences: Career Development in Europe and AsiaLife Sciences: Career Development in Europe and Asia
Life Sciences: Career Development in Europe and AsiaKelly Services
 
Introduction to Go language
Introduction to Go languageIntroduction to Go language
Introduction to Go languageTzar Umang
 
Learning Analytics - What Do Stakeholders Really Think?
Learning Analytics - What Do Stakeholders Really Think?Learning Analytics - What Do Stakeholders Really Think?
Learning Analytics - What Do Stakeholders Really Think?Neil Witt
 
Software engineer
Software engineerSoftware engineer
Software engineerxccoffey10
 
bti asia salary guide
bti asia salary guidebti asia salary guide
bti asia salary guideFebrian ‎
 
Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...
Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...
Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...Xamarin
 
Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...
Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...
Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...Travis Albee
 
GO Mobile presentation for English Language Centre at the University of Liver...
GO Mobile presentation for English Language Centre at the University of Liver...GO Mobile presentation for English Language Centre at the University of Liver...
GO Mobile presentation for English Language Centre at the University of Liver...Alex Spiers
 
Developing a technology enhanced learning strategy
Developing a technology enhanced learning strategyDeveloping a technology enhanced learning strategy
Developing a technology enhanced learning strategySarah Knight
 
Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015aioughydchapter
 
Natural Resources: Career Development in Europe and Asia
Natural Resources: Career Development in Europe and AsiaNatural Resources: Career Development in Europe and Asia
Natural Resources: Career Development in Europe and AsiaKelly Services
 
Cloud conference - mongodb
Cloud conference - mongodbCloud conference - mongodb
Cloud conference - mongodbMitch Pirtle
 
Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015aioughydchapter
 
Estimation or, "How to Dig your Grave"
Estimation or, "How to Dig your Grave"Estimation or, "How to Dig your Grave"
Estimation or, "How to Dig your Grave"Rowan Merewood
 
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMRACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMMaaz Anjum
 
Project-Based Instruction and the Importance of Self-Directed Learning
Project-Based Instruction and the Importance of Self-Directed LearningProject-Based Instruction and the Importance of Self-Directed Learning
Project-Based Instruction and the Importance of Self-Directed LearningLinkedIn Learning Solutions
 
Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...
Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...
Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...DEVCON
 

Viewers also liked (20)

Agile Methodologies and Cost Estimation
Agile Methodologies and Cost EstimationAgile Methodologies and Cost Estimation
Agile Methodologies and Cost Estimation
 
Life Sciences: Career Development in Europe and Asia
Life Sciences: Career Development in Europe and AsiaLife Sciences: Career Development in Europe and Asia
Life Sciences: Career Development in Europe and Asia
 
Introduction to Go language
Introduction to Go languageIntroduction to Go language
Introduction to Go language
 
Learning Analytics - What Do Stakeholders Really Think?
Learning Analytics - What Do Stakeholders Really Think?Learning Analytics - What Do Stakeholders Really Think?
Learning Analytics - What Do Stakeholders Really Think?
 
Software engineer
Software engineerSoftware engineer
Software engineer
 
Agile cost estimation
Agile cost estimationAgile cost estimation
Agile cost estimation
 
bti asia salary guide
bti asia salary guidebti asia salary guide
bti asia salary guide
 
Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...
Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...
Sharing up to 80% code for iOS, Android, and Windows platforms, a Retail App ...
 
Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...
Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...
Southeast Indonesia: A guide for investors and developers in Lombok, Sumbawa,...
 
GO Mobile presentation for English Language Centre at the University of Liver...
GO Mobile presentation for English Language Centre at the University of Liver...GO Mobile presentation for English Language Centre at the University of Liver...
GO Mobile presentation for English Language Centre at the University of Liver...
 
Developing a technology enhanced learning strategy
Developing a technology enhanced learning strategyDeveloping a technology enhanced learning strategy
Developing a technology enhanced learning strategy
 
MongoDB
MongoDBMongoDB
MongoDB
 
Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015Oracle rac cachefusion - High Availability Day 2015
Oracle rac cachefusion - High Availability Day 2015
 
Natural Resources: Career Development in Europe and Asia
Natural Resources: Career Development in Europe and AsiaNatural Resources: Career Development in Europe and Asia
Natural Resources: Career Development in Europe and Asia
 
Cloud conference - mongodb
Cloud conference - mongodbCloud conference - mongodb
Cloud conference - mongodb
 
Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015Aman sharma hyd_12crac High Availability Day 2015
Aman sharma hyd_12crac High Availability Day 2015
 
Estimation or, "How to Dig your Grave"
Estimation or, "How to Dig your Grave"Estimation or, "How to Dig your Grave"
Estimation or, "How to Dig your Grave"
 
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMRACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
 
Project-Based Instruction and the Importance of Self-Directed Learning
Project-Based Instruction and the Importance of Self-Directed LearningProject-Based Instruction and the Importance of Self-Directed Learning
Project-Based Instruction and the Importance of Self-Directed Learning
 
Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...
Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...
Open Minded? Software Engineer to a UX Engineer. Ask me how. by Micael Diaz d...
 

Similar to Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández

MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012Steven Francia
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介Masayuki Matsushita
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBCody Ray
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Sparkcamp stratasingapore
Sparkcamp stratasingaporeSparkcamp stratasingapore
Sparkcamp stratasingaporeCheng Feng
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamNeville Li
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaShiao-An Yuan
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKShu-Jeng Hsieh
 

Similar to Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández (20)

MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Final_show
Final_showFinal_show
Final_show
 
Sparkcamp stratasingapore
Sparkcamp stratasingaporeSparkcamp stratasingapore
Sparkcamp stratasingapore
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDK
 

More from J On The Beach

Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayJ On The Beach
 
Big Data On Data You Don’t Have
Big Data On Data You Don’t HaveBig Data On Data You Don’t Have
Big Data On Data You Don’t HaveJ On The Beach
 
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...J On The Beach
 
Pushing it to the edge in IoT
Pushing it to the edge in IoTPushing it to the edge in IoT
Pushing it to the edge in IoTJ On The Beach
 
Drinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsDrinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsJ On The Beach
 
How do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternHow do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternJ On The Beach
 
When Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorWhen Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorJ On The Beach
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.J On The Beach
 
Streaming to a New Jakarta EE
Streaming to a New Jakarta EEStreaming to a New Jakarta EE
Streaming to a New Jakarta EEJ On The Beach
 
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...J On The Beach
 
Pushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorPushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorJ On The Beach
 
Axon Server went RAFTing
Axon Server went RAFTingAxon Server went RAFTing
Axon Server went RAFTingJ On The Beach
 
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...J On The Beach
 
Madaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysMadaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysJ On The Beach
 
Servers are doomed to fail
Servers are doomed to failServers are doomed to fail
Servers are doomed to failJ On The Beach
 
Interaction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersInteraction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersJ On The Beach
 
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...J On The Beach
 
Leadership at every level
Leadership at every levelLeadership at every level
Leadership at every levelJ On The Beach
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesJ On The Beach
 

More from J On The Beach (20)

Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard way
 
Big Data On Data You Don’t Have
Big Data On Data You Don’t HaveBig Data On Data You Don’t Have
Big Data On Data You Don’t Have
 
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
 
Pushing it to the edge in IoT
Pushing it to the edge in IoTPushing it to the edge in IoT
Pushing it to the edge in IoT
 
Drinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsDrinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actors
 
How do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternHow do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server pattern
 
Java, Turbocharged
Java, TurbochargedJava, Turbocharged
Java, Turbocharged
 
When Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorWhen Cloud Native meets the Financial Sector
When Cloud Native meets the Financial Sector
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.
 
Streaming to a New Jakarta EE
Streaming to a New Jakarta EEStreaming to a New Jakarta EE
Streaming to a New Jakarta EE
 
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
 
Pushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorPushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and Blazor
 
Axon Server went RAFTing
Axon Server went RAFTingAxon Server went RAFTing
Axon Server went RAFTing
 
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
 
Madaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysMadaari : Ordering For The Monkeys
Madaari : Ordering For The Monkeys
 
Servers are doomed to fail
Servers are doomed to failServers are doomed to fail
Servers are doomed to fail
 
Interaction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersInteraction Protocols: It's all about good manners
Interaction Protocols: It's all about good manners
 
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
 
Leadership at every level
Leadership at every levelLeadership at every level
Leadership at every level
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind Libraries
 

Recently uploaded

MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxalwaysnagaraju26
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 

Recently uploaded (20)

MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 

Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández

  • 2.
  • 3. WHO ARE WE? Fede Fernández Scala Software Engineer at 47 Degrees Spark Certified Developer @fede_fdz Fran Pérez Scala Software Engineer at 47 Degrees Spark Certified Developer @FPerezP
  • 4. Overview Spark Streaming Spark + Kafka groupByKey vs reduceByKey Table Joins Serializer Tunning
  • 5. Spark Streaming Real-time data processing Continuous Data Flow RDD RDD RDD DStream Output Data
  • 6. Spark + Kafka ● Receiver-based Approach ○ At least once (with Write Ahead Logs) ● Direct API ○ Exactly once
  • 7. Spark + Kafka ● Receiver-based Approach
  • 8. Spark + Kafka ● Direct API
  • 9. groupByKey VS reduceByKey ● groupByKey ○ Groups pairs of data with the same key. ● reduceByKey ○ Groups and combines pairs of data based on a reduce operation.
  • 10. groupByKey VS reduceByKey sc.textFile(“hdfs://….”) .flatMap(_.split(“ “)) .map((_, 1)).groupByKey.map(t => (t._1, t._2.sum)) sc.textFile(“hdfs://….”) .flatMap(_.split(“ “)) .map((_, 1)).reduceByKey(_ + _)
  • 11. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 12. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 13. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) shuffle shuffle
  • 14. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (c, 1) (c, 1) (c, 1) (c, 1) shuffle shuffle
  • 15. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 5) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 6) (c, 1) (c, 1) (c, 1) (c, 1) (c, 4) shuffle shuffle
  • 16. reduceByKey (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (s, 1) (s, 1) (s, 1) (c, 1) (c, 1) (j, 1) (c, 1) (s, 1) (j, 1)
  • 17. reduceByKey (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 2) (c, 1) (s, 2) (j, 1) (s, 1) (s, 1) (j, 1) (s, 2) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (c, 2) (s, 1) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 18. reduceByKey (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 2) (c, 1) (s, 2) (j, 1) (s, 1) (s, 1) (j, 1) (s, 2) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (c, 2) (s, 1) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 19. reduceByKey (j, 2) (j, 1) (j, 1) (j, 1) (s, 2) (s, 2) (s, 1) (s, 1) (c, 1) (c, 2) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 2) (c, 1) (s, 2) (j, 1) (s, 1) (s, 1) (j, 1) (s, 2) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (c, 2) (s, 1) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) shuffle shuffle
  • 20. reduceByKey (j, 2) (j, 1) (j, 1) (j, 1) (j, 5) (s, 2) (s, 2) (s, 1) (s, 1) (s, 6) (c, 1) (c, 2) (c, 1) (c, 4) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 2) (c, 1) (s, 2) (j, 1) (s, 1) (s, 1) (j, 1) (s, 2) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (c, 2) (s, 1) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) shuffle shuffle
  • 21. reduce VS group ● Improve performance ● Can’t always be used ● Out of Memory Exceptions ● aggregateByKey, foldByKey, combineByKey
  • 22. Table Joins ● Typical operations that can be improved ● Need a previous analysis ● There are no silver bullets
  • 24. Table Joins: Medium - Large FILTER No Shuffle
  • 25. Table Joins: Small - Large ... Shuffled Hash Join sqlContext.sql("explain <select>").collect.mkString(“n”) [== Physical Plan ==] [Project] [+- SortMergeJoin] [ :- Sort] [ : +- TungstenExchange hashpartitioning] [ : +- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Sort] [ +- TungstenExchange hashpartitioning] [ +- ConvertToUnsafe] [ +- Scan ExistingRDD]
  • 26. Table Joins: Small - Large Broadcast Hash Join sqlContext.sql("explain <select>").collect.mkString(“n”) [== Physical Plan ==] [Project] [+- BroadcastHashJoin] [ :- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Scan ParquetRelation] No shuffle! By default from Spark 1.4 when using DataFrame API Prior Spark 1.4 ANALYZE TABLE small_table COMPUTE STATISTICS noscan Broadcast
  • 28. Serializers ● Java’s ObjectOutputStream framework. (Default) ● Custom serializers: extends Serializable & Externalizable. ● KryoSerializer: register your custom classes. ● Where is our code being run? ● Special care to JodaTime.
  • 30. Tuning: Garbage Collector • Applications which rely heavily on memory consumption. • GC Strategies • Concurrent Mark Sweep (CMS) GC • ParallelOld GC • Garbage-First GC • Tuning steps: • Review your logic and object management • Try Garbage-First • Activate and inspect the logs Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
  • 31. Tuning: blockInterval blockInterval = (bi * consumers) / (pf * sc) ● CAT: Total cores per partition. ● bi: Batch Interval time in milliseconds. ● consumers: number of streaming consumers. ● pf (partitionFactor): number of partitions per core. ● sc (sparkCores): CAT - consumers.
  • 32. blockInterval: example ● batchIntervalMillis = 600,000 ● consumers = 20 ● CAT = 120 ● sparkCores = 120 - 20 = 100 ● partitionFactor = 3 blockInterval = (bi * consumers) / (pf * sc) = (600,000 * 20) / (3 * 100) = 40,000
  • 33. Tuning: Partitioning partitions = consumers * bi / blockInterval ● consumers: number of streaming consumers. ● bi: Batch Interval time in milliseconds. ● blockInterval: time size to split data before storing into Spark.
  • 34. Partitioning: example ● batchIntervalMillis = 600,000 ● consumers = 20 ● blockInterval = 40,000 partitions = consumers * bi / blockInterval = 20 * 600,000/ 40,000= 30
  • 35. Tuning: Storage • Default (MEMORY_ONLY) • MEMORY_ONLY_SER with Serialization Library • MEMORY_AND_DISK & DISK_ONLY • Replicated _2 • OFF_HEAP (Tachyon/Alluxio)
  • 36. Where to find more information? Spark Official Documentation Databricks Blog Databricks Spark Knowledge Base Spark Notebook - By Andy Petrella Databricks YouTube Channel