SlideShare una empresa de Scribd logo
1 de 42
Descargar para leer sin conexión
“ 10 things I wish I'd known
before using
in production ! ”
Himanshu Arora
Lead Data Engineer, NeoLynk France
h.arora@neolynk.fr
@him_aro
Nitya Nand Yadav
Data Engineer, NeoLynk France
n.yadav@neolynk.fr
@nityany
Partenaire Suivez l’actualité de nos Tribus JVM, PHP, JS et Agile
sur nos réseaux sociaux :
JVM
What we are going to cover...
1. RDD vs DataFrame vs DataSet
2. Data Serialisation Formats
3. Storage formats
4. Broadcast join
5. Hardware tuning
6. Level of parallelism
7. GC tuning
8. Common errors
9. Data skew
10. Data locality
5
1/10 - RDD vs DataFrames vs DataSets
6
● RDD - Resilient Distributed Dataset
➔ Main abstraction of Spark.
➔ Low-level transformation, actions and control on partition level.
➔ Unstructured dataset like media streams, text streams.
➔ Manipulate data with functional programming constructs.
➔ No optimization
7
● DataFrame
➔ High level abstractions, rich semantics.
➔ Like a big distributed SQL table.
➔ High level expressions (aggregation, average, sum, sql queries).
➔ Performance and optimizations(Predicate pushdown, QBO, CBO...).
➔ No compile time type check, runtime errors.
8
● DataSet
➔ A collection of strongly-typed JVM objects, dictated by a case class you define
in Scala or a class in Java.
➔ DataFrame = DataSet[Row].
➔ Performance and optimisations.
➔ Type-safety at compile time.
9
2/10 - Data Serialisation Format
➔ Data shuffled in serialized format between executors.
➔ RDDs cached & persisted in disk are serialized too.
➔ Default serialization format of spark: Java Serialization (slow & large).
➔ Better use: Kryo serialisation.
➔ Kryo: Faster and more compact (up to 10x).
➔ DataFrame/DataSets use tungsten serialization (even better than kryo).
10
val sparkConf: SparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sparkSession: SparkSession = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()
// register your own custom classes with kryo
sparkConf.registerKryoClasses(Array(classOf[MyCustomeClass]))
2/10 - Data Serialisation Format
11
3/10 - Storage Formats
12
➔ Avoid using text, json and csv etc. if possible.
➔ Use compressed binary formats instead.
➔ Popular choices: Apache Parquet, Apache Avro & ORC etc.
➔ Use case dictates the choice.
3/10 - Storage Formats
13
➔ Binary formats.
➔ Splittable.
➔ Parquet: Columnar & Avro: Row based
➔ Parquet: Higher compression rates than row based format.
➔ Parquet: read-heavy workload & Avro: write heavy workload
➔ Schema preserved in files itself.
➔ Avro: Better support for schema evolution
3/10 - Storage Formats: Apache Parquet & Avro
14
val sparkConf: SparkConf = new SparkConf()
.set("spark.sql.parquet.compression.codec", "snappy")
val dataframe = sparkSession.read.parquet("s3a://....")
dataframe.write.parquet("s3a://....")
val sparkConf: SparkConf = new SparkConf()
.set("spark.sql.avro.compression.codec", "snappy")
val dataframe = sparkSession.read.avro("s3a://....")
dataframe.write.avro("s3a://....")
3/10 - Storage Formats
15
3/10 - Benchmark
Using AVRO
instead of JSON
16
4/10 - Broadcast Join
17
//spark automatically broadcasts small dataframes (max. 10MB by default)
val sparkConf: SparkConf = new SparkConf()
.set("spark.sql.autoBroadcastJoinThreshold", "2147483648")
.set("spark.sql.broadcastTimeout", "900") //default 300 secs
/*
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
*/
//force broadcast
val result = bigDataFrame.join(broadcast(smallDataFrame))
4/10 - Broadcast Join
18
Know Your Cluster
● Number of nodes
● Cores per node
● RAM per node
● Cluster Manager (Yarn,
Mesos …)
Let’s assume:
● 5 nodes
● 16 cores each
● 64GB RAM
● Yarn as RM
● Spark in client mode
5/10 - Hardware Tuning
19
--num-executors = 80 //( 16 cores x 5 nodes)
--executor-cores = 1
--executor-memory = 4GB //(64 GB / 16 executors per node)
➔ Not running multiple tasks on same JVM (not sharing
broadcast vars, accumulators…).
➔ Risk of running out of memory to compute a partition.
5/10 - Hardware Tuning / Scenario #1 (Small executors)
20
--num-executors = 5
--executor-cores = 16
--executor-memory = 64GB
5/10 - Hardware Tuning / Scenario #2 (Large executors)
➔ Very long garbage collection pauses.
➔ Poor performance with HDFS (handling many
concurrent threads).
21
--executor-cores = 5
--num-executors = 14 //(15 core per node / 5 core per executor = 3 x 5 node -1)
--executor-memory = 18GB //(64 / 3 executors per node - 10% overhead)
5/10 - Hardware Tuning / Scenario #3 (Right Balance)
➔ Recommended concurrent threads for HDFS is 5.
➔ Always leave one core for Yarn daemons.
➔ Always leave one executor for Yarn ApplicationMaster.
➔ Off heap memory for yarn = 10% for executor memory.
22
➔ Hardware tuning.
➔ Moving from
Java serializer to
Kryo.
5/10 - Benchmark
23
rdd = sc.textFile('demo.zip')
rdd = rdd.repartition(100)
6/10 - Level of parallelism/partitions
➔ The maximum size of a partition(s) is limited by the available memory of an
executor.
➔ Increasing partitions count will make each partition to have less data.
➔ Spark can not split compressed files (e.g. zip) and creates only 1 partition so
repartition yourself.
24
➔ Quick wins when using a large JVM heap to avoid long GC pauses.
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages
-XX:+UseTLAB -XX:+ResizeTLAB
// if creating too many objects in driver (ex. collect())
// which is not a very good idea though
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages
-XX:+UseTLAB -XX:+ResizeTLAB
7/10 - GC Tuning
25
Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical
memory used.
8/10 - Knock knock… Who’s there?… An error :(
26
➔ Not enough executor memory.
➔ Too many executor cores (implies too much parallelism).
➔ Not enough spark partitions.
➔ Data skew (let’s talk about that later…).
➔ Increase executor memory.
➔ Reduce number of executor cores.
➔ Increase number of spark partitions.
➔ Persist in memory and disk (or just disk) with serialization.
➔ Off heap memory for caching.
8/10 - Knock knock… Who’s there?… An error :(
27
8/10 - Knock knock… Who’s there?… An error :(
19/01/31 21:03:13 INFO DAGScheduler: Host lost:
ip-172-29-149-243.eu-west-1.compute.internal (epoch 16)
19/01/31 21:03:13 INFO BlockManagerMasterEndpoint: Trying to
remove executors on host
ip-172-29-149-243.eu-west-1.compute.internal from BlockManagerMaster.
19/01/31 21:03:13 INFO BlockManagerMaster: Removed executors on
host ip-172-29-149-243.eu-west-1.compute.internal successfully.
28
{
"Name": "ScaleInContainerPendingRatio",
"Description": "Scale in on ContainerPendingRatio",
"Action": {
"SimpleScalingPolicyConfiguration": {
"AdjustmentType": "CHANGE_IN_CAPACITY",
"ScalingAdjustment": -1,
"CoolDown": 300
}
},
"Trigger": {
"CloudWatchAlarmDefinition": {
"ComparisonOperator": "LESS_THAN_OR_EQUAL",
"EvaluationPeriods": 3,
"MetricName": "ContainerPendingRatio",
"Namespace": "AWS/ElasticMapReduce",
"Dimensions": [
{
"Value": "$${emr.clusterId}",
"Key": "JobFlowId"
}
],
"Period": 300,
"Statistic": "AVERAGE",
"Threshold": 0,
"Unit": "COUNT"
}
}
}
Containers Pending / Containers allocated
Is this really
my cluster
…….?????
{
"Name": "ScaleInMemoryPercentage",
"Description": "Scale in on YARNMemoryAvailablePercentage",
"Action": {
"SimpleScalingPolicyConfiguration": {
"AdjustmentType": "CHANGE_IN_CAPACITY",
"ScalingAdjustment": -2,
"CoolDown": 300
}
},
"Trigger": {
"CloudWatchAlarmDefinition": {
"ComparisonOperator": "GREATER_THAN",
"EvaluationPeriods": 3,
"MetricName": "YARNMemoryAvailablePercentage",
"Namespace": "AWS/ElasticMapReduce",
"Dimensions": [
{
"Key": "JobFlowId",
"Value": "$${emr.clusterId}"
}
],
"Period": 300,
"Statistic": "AVERAGE",
"Threshold": 95.0,
"Unit": "PERCENT"
}
}
}
➔ A condition when data is not uniformly distributed across partitions.
➔ During joins, aggregations etc.
➔ E.g. joining with a column containing lots of null.
➔ Might cause java.lang.OutOfMemoryError: Java heap space.
9/10 - Data Skew
32
df1.join(df2,
Seq(
"make",
"model"
)
)
33
➔ Repartition your data based on key(Rdd) and column(dataframe) ,which will
evenly distribute the data.
➔ Use non-skewed column(s) for join.
➔ Replace null values of join col with NULL_X (X is a random number).
➔ Salting.
9/10 - Data Skew: possible solutions
34
df1.join(df2,
Seq(
"make",
"model",
"engine_size"
)
)
35
Let’s sprinkle
some SALT on
data skew …!!
9/10 - Impossible to find repartitioning key for even data distribution ???
Salting key = Actual partition key + Random fake key
(where fake key takes value between 1 to N, with N being the level of
distribution/partitions)
37
➔ Join DFs : Create salt col on bigger DF and broadcast the smaller one (with
addition col containing 1 to N).
➔ If both are too big to broadcast: Salt one and iterative broadcast other.
38
➔ Why it’s important?
10/10 - Data Locality
39
val sparkSession = SparkSession
.builder()
.appName("spark-app")
.config("spark.locality.wait", "60s") //default 3secs
.config("spark.locality.wait.node", "0") //set to 0 to skip node local
.config("spark.locality.wait.process", "10s")
.config("spark.locality.wait.rack", "30s")
.getOrCreate()
10/10 - Data Locality
40
REFERENCES
41
Thank you...very much !
Slides: http://tiny.cc/tiq56y
@him_aro @nityany

Más contenido relacionado

La actualidad más candente

Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in SparkShiao-An Yuan
 
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Denish Patel
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practiceAlexey Lesovsky
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Denish Patel
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxCloudera, Inc.
 
GlusterFS As an Object Storage
GlusterFS As an Object StorageGlusterFS As an Object Storage
GlusterFS As an Object StorageKeisuke Takahashi
 
Advanced Postgres Monitoring
Advanced Postgres MonitoringAdvanced Postgres Monitoring
Advanced Postgres MonitoringDenish Patel
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...ScaleGrid.io
 
PostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication CheatsheetPostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication CheatsheetAlexey Lesovsky
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph clusterMirantis
 
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph EnterpriseRed Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph EnterpriseRed_Hat_Storage
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXzznate
 
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky HaryadiPGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky HaryadiEqunix Business Solutions
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassDataStax
 

La actualidad más candente (19)

Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
 
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at Box
 
GlusterFS As an Object Storage
GlusterFS As an Object StorageGlusterFS As an Object Storage
GlusterFS As an Object Storage
 
Advanced Postgres Monitoring
Advanced Postgres MonitoringAdvanced Postgres Monitoring
Advanced Postgres Monitoring
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
 
PostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication CheatsheetPostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication Cheatsheet
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph cluster
 
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph EnterpriseRed Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMX
 
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky HaryadiPGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
 
7 Ways To Crash Postgres
7 Ways To Crash Postgres7 Ways To Crash Postgres
7 Ways To Crash Postgres
 
PostgreSQL and RAM usage
PostgreSQL and RAM usagePostgreSQL and RAM usage
PostgreSQL and RAM usage
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break Glass
 
Ceph issue 해결 사례
Ceph issue 해결 사례Ceph issue 해결 사례
Ceph issue 해결 사례
 

Similar a 10 things i wish i'd known before using spark in production

Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleAlex Thompson
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problem16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problemTier1 app
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDinakar Guniguntala
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005dflexer
 
import rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythonimport rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythongroveronline
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problem‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problemTier1 app
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterAndrey Kudryavtsev
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Zabbix
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Cosimo Streppone
 

Similar a 10 things i wish i'd known before using spark in production (20)

Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
20160908 hivemall meetup
20160908 hivemall meetup20160908 hivemall meetup
20160908 hivemall meetup
 
GR740 User day
GR740 User dayGR740 User day
GR740 User day
 
16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problem16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problem
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on Kubernetes
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005
 
import rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythonimport rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Python
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problem‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problem
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Evolution of Spark APIs
Evolution of Spark APIsEvolution of Spark APIs
Evolution of Spark APIs
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
The Accidental DBA
The Accidental DBAThe Accidental DBA
The Accidental DBA
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013
 

Más de Paris Data Engineers !

Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
REX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre schedulerREX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre schedulerParis Data Engineers !
 
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningParis Data Engineers !
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHParis Data Engineers !
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisParis Data Engineers !
 
Scala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan WinandyScala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan WinandyParis Data Engineers !
 

Más de Paris Data Engineers ! (11)

Spark tools by Jonathan Winandy
Spark tools by Jonathan WinandySpark tools by Jonathan Winandy
Spark tools by Jonathan Winandy
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
SCIO : Apache Beam API
SCIO : Apache Beam APISCIO : Apache Beam API
SCIO : Apache Beam API
 
Apache Beam de A à Z
 Apache Beam de A à Z Apache Beam de A à Z
Apache Beam de A à Z
 
REX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre schedulerREX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre scheduler
 
Deeplearning in production
Deeplearning in productionDeeplearning in production
Deeplearning in production
 
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learning
 
Introduction à Apache Pulsar
 Introduction à Apache Pulsar Introduction à Apache Pulsar
Introduction à Apache Pulsar
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
 
Scala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan WinandyScala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan Winandy
 

Último

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Último (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

10 things i wish i'd known before using spark in production

  • 1. “ 10 things I wish I'd known before using in production ! ”
  • 2. Himanshu Arora Lead Data Engineer, NeoLynk France h.arora@neolynk.fr @him_aro Nitya Nand Yadav Data Engineer, NeoLynk France n.yadav@neolynk.fr @nityany
  • 3. Partenaire Suivez l’actualité de nos Tribus JVM, PHP, JS et Agile sur nos réseaux sociaux : JVM
  • 4.
  • 5. What we are going to cover... 1. RDD vs DataFrame vs DataSet 2. Data Serialisation Formats 3. Storage formats 4. Broadcast join 5. Hardware tuning 6. Level of parallelism 7. GC tuning 8. Common errors 9. Data skew 10. Data locality 5
  • 6. 1/10 - RDD vs DataFrames vs DataSets 6
  • 7. ● RDD - Resilient Distributed Dataset ➔ Main abstraction of Spark. ➔ Low-level transformation, actions and control on partition level. ➔ Unstructured dataset like media streams, text streams. ➔ Manipulate data with functional programming constructs. ➔ No optimization 7
  • 8. ● DataFrame ➔ High level abstractions, rich semantics. ➔ Like a big distributed SQL table. ➔ High level expressions (aggregation, average, sum, sql queries). ➔ Performance and optimizations(Predicate pushdown, QBO, CBO...). ➔ No compile time type check, runtime errors. 8
  • 9. ● DataSet ➔ A collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java. ➔ DataFrame = DataSet[Row]. ➔ Performance and optimisations. ➔ Type-safety at compile time. 9
  • 10. 2/10 - Data Serialisation Format ➔ Data shuffled in serialized format between executors. ➔ RDDs cached & persisted in disk are serialized too. ➔ Default serialization format of spark: Java Serialization (slow & large). ➔ Better use: Kryo serialisation. ➔ Kryo: Faster and more compact (up to 10x). ➔ DataFrame/DataSets use tungsten serialization (even better than kryo). 10
  • 11. val sparkConf: SparkConf = new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") val sparkSession: SparkSession = SparkSession .builder() .config(sparkConf) .getOrCreate() // register your own custom classes with kryo sparkConf.registerKryoClasses(Array(classOf[MyCustomeClass])) 2/10 - Data Serialisation Format 11
  • 12. 3/10 - Storage Formats 12
  • 13. ➔ Avoid using text, json and csv etc. if possible. ➔ Use compressed binary formats instead. ➔ Popular choices: Apache Parquet, Apache Avro & ORC etc. ➔ Use case dictates the choice. 3/10 - Storage Formats 13
  • 14. ➔ Binary formats. ➔ Splittable. ➔ Parquet: Columnar & Avro: Row based ➔ Parquet: Higher compression rates than row based format. ➔ Parquet: read-heavy workload & Avro: write heavy workload ➔ Schema preserved in files itself. ➔ Avro: Better support for schema evolution 3/10 - Storage Formats: Apache Parquet & Avro 14
  • 15. val sparkConf: SparkConf = new SparkConf() .set("spark.sql.parquet.compression.codec", "snappy") val dataframe = sparkSession.read.parquet("s3a://....") dataframe.write.parquet("s3a://....") val sparkConf: SparkConf = new SparkConf() .set("spark.sql.avro.compression.codec", "snappy") val dataframe = sparkSession.read.avro("s3a://....") dataframe.write.avro("s3a://....") 3/10 - Storage Formats 15
  • 16. 3/10 - Benchmark Using AVRO instead of JSON 16
  • 17. 4/10 - Broadcast Join 17
  • 18. //spark automatically broadcasts small dataframes (max. 10MB by default) val sparkConf: SparkConf = new SparkConf() .set("spark.sql.autoBroadcastJoinThreshold", "2147483648") .set("spark.sql.broadcastTimeout", "900") //default 300 secs /* Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] */ //force broadcast val result = bigDataFrame.join(broadcast(smallDataFrame)) 4/10 - Broadcast Join 18
  • 19. Know Your Cluster ● Number of nodes ● Cores per node ● RAM per node ● Cluster Manager (Yarn, Mesos …) Let’s assume: ● 5 nodes ● 16 cores each ● 64GB RAM ● Yarn as RM ● Spark in client mode 5/10 - Hardware Tuning 19
  • 20. --num-executors = 80 //( 16 cores x 5 nodes) --executor-cores = 1 --executor-memory = 4GB //(64 GB / 16 executors per node) ➔ Not running multiple tasks on same JVM (not sharing broadcast vars, accumulators…). ➔ Risk of running out of memory to compute a partition. 5/10 - Hardware Tuning / Scenario #1 (Small executors) 20
  • 21. --num-executors = 5 --executor-cores = 16 --executor-memory = 64GB 5/10 - Hardware Tuning / Scenario #2 (Large executors) ➔ Very long garbage collection pauses. ➔ Poor performance with HDFS (handling many concurrent threads). 21
  • 22. --executor-cores = 5 --num-executors = 14 //(15 core per node / 5 core per executor = 3 x 5 node -1) --executor-memory = 18GB //(64 / 3 executors per node - 10% overhead) 5/10 - Hardware Tuning / Scenario #3 (Right Balance) ➔ Recommended concurrent threads for HDFS is 5. ➔ Always leave one core for Yarn daemons. ➔ Always leave one executor for Yarn ApplicationMaster. ➔ Off heap memory for yarn = 10% for executor memory. 22
  • 23. ➔ Hardware tuning. ➔ Moving from Java serializer to Kryo. 5/10 - Benchmark 23
  • 24. rdd = sc.textFile('demo.zip') rdd = rdd.repartition(100) 6/10 - Level of parallelism/partitions ➔ The maximum size of a partition(s) is limited by the available memory of an executor. ➔ Increasing partitions count will make each partition to have less data. ➔ Spark can not split compressed files (e.g. zip) and creates only 1 partition so repartition yourself. 24
  • 25. ➔ Quick wins when using a large JVM heap to avoid long GC pauses. spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB // if creating too many objects in driver (ex. collect()) // which is not a very good idea though spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB 7/10 - GC Tuning 25
  • 26. Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. 8/10 - Knock knock… Who’s there?… An error :( 26
  • 27. ➔ Not enough executor memory. ➔ Too many executor cores (implies too much parallelism). ➔ Not enough spark partitions. ➔ Data skew (let’s talk about that later…). ➔ Increase executor memory. ➔ Reduce number of executor cores. ➔ Increase number of spark partitions. ➔ Persist in memory and disk (or just disk) with serialization. ➔ Off heap memory for caching. 8/10 - Knock knock… Who’s there?… An error :( 27
  • 28. 8/10 - Knock knock… Who’s there?… An error :( 19/01/31 21:03:13 INFO DAGScheduler: Host lost: ip-172-29-149-243.eu-west-1.compute.internal (epoch 16) 19/01/31 21:03:13 INFO BlockManagerMasterEndpoint: Trying to remove executors on host ip-172-29-149-243.eu-west-1.compute.internal from BlockManagerMaster. 19/01/31 21:03:13 INFO BlockManagerMaster: Removed executors on host ip-172-29-149-243.eu-west-1.compute.internal successfully. 28
  • 29. { "Name": "ScaleInContainerPendingRatio", "Description": "Scale in on ContainerPendingRatio", "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": -1, "CoolDown": 300 } }, "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "LESS_THAN_OR_EQUAL", "EvaluationPeriods": 3, "MetricName": "ContainerPendingRatio", "Namespace": "AWS/ElasticMapReduce", "Dimensions": [ { "Value": "$${emr.clusterId}", "Key": "JobFlowId" } ], "Period": 300, "Statistic": "AVERAGE", "Threshold": 0, "Unit": "COUNT" } } } Containers Pending / Containers allocated
  • 30. Is this really my cluster …….?????
  • 31. { "Name": "ScaleInMemoryPercentage", "Description": "Scale in on YARNMemoryAvailablePercentage", "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": -2, "CoolDown": 300 } }, "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN", "EvaluationPeriods": 3, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Dimensions": [ { "Key": "JobFlowId", "Value": "$${emr.clusterId}" } ], "Period": 300, "Statistic": "AVERAGE", "Threshold": 95.0, "Unit": "PERCENT" } } }
  • 32. ➔ A condition when data is not uniformly distributed across partitions. ➔ During joins, aggregations etc. ➔ E.g. joining with a column containing lots of null. ➔ Might cause java.lang.OutOfMemoryError: Java heap space. 9/10 - Data Skew 32
  • 34. ➔ Repartition your data based on key(Rdd) and column(dataframe) ,which will evenly distribute the data. ➔ Use non-skewed column(s) for join. ➔ Replace null values of join col with NULL_X (X is a random number). ➔ Salting. 9/10 - Data Skew: possible solutions 34
  • 36. Let’s sprinkle some SALT on data skew …!!
  • 37. 9/10 - Impossible to find repartitioning key for even data distribution ??? Salting key = Actual partition key + Random fake key (where fake key takes value between 1 to N, with N being the level of distribution/partitions) 37
  • 38. ➔ Join DFs : Create salt col on bigger DF and broadcast the smaller one (with addition col containing 1 to N). ➔ If both are too big to broadcast: Salt one and iterative broadcast other. 38
  • 39. ➔ Why it’s important? 10/10 - Data Locality 39
  • 40. val sparkSession = SparkSession .builder() .appName("spark-app") .config("spark.locality.wait", "60s") //default 3secs .config("spark.locality.wait.node", "0") //set to 0 to skip node local .config("spark.locality.wait.process", "10s") .config("spark.locality.wait.rack", "30s") .getOrCreate() 10/10 - Data Locality 40
  • 42. Thank you...very much ! Slides: http://tiny.cc/tiq56y @him_aro @nityany