PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production

Chetan Khatri, Data Science Practice Leader.
Accionlabs India.
PyconLT’19, May 26 - Vilnius Lithuania
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri

Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
M.Sc. - Computer Science from University of Kachchh, India.

● Apache Spark
● Primary data structures (RDD, Dataset, DataFrame)
● Koalas: pandas API on Apache Spark
● Pragmatic explanation - executors, cores, containers, stages, jobs and tasks in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Alternative to spark default sort
● Why dropDuplicates() doesn’t give consistent results, and what the alternative is
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing

● Apache Spark is a fast, general-purpose cluster computing system: a unified engine for massive data
processing.
● It provides high-level APIs for Scala, Java, Python and R, and an optimized engine that supports general
execution graphs.
Structured Data / SQL - Spark SQL
Graph Processing - GraphX
Machine Learning - MLlib
Streaming - Spark Streaming, Structured Streaming

RDD: a logical model across distributed storage on a cluster (HDFS, S3)

RDD -> T -> RDD -> T -> RDD
T = Transformation

Integer RDD
String or Text RDD
Double or Binary RDD

RDD -> T -> RDD -> T -> RDD -> T -> RDD -> A -> RDD
T = Transformation
A = Action

Operations
Transformation
Action
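As a rough plain-Python analogy (this is not Spark code), transformations only describe a pipeline, and nothing is computed until an action asks for a result. Generators stand in for lazy transformations here; the helper `traced_map` and the sample numbers are made up for illustration.

```python
# Plain-Python analogy, not Spark: generators stand in for lazy transformations.
log = []  # records every element that actually gets processed


def traced_map(func, items):
    """Lazily apply func, logging each element as it is consumed."""
    for item in items:
        log.append(item)
        yield func(item)


numbers = range(1, 6)

# "Transformations": these lines only describe the pipeline, nothing runs yet.
doubled = traced_map(lambda x: x * 2, numbers)
large = (x for x in doubled if x > 4)
processed_before_action = len(log)

# "Action": materializing the result forces the whole pipeline to run.
result = list(large)
print(processed_before_action, result)
```

In Spark the same idea means that a chain of map and filter calls costs nothing until something like collect or count is called.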

TRANSFORMATIONS
General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
Math / Statistical: sample, randomSplit
Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueId, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
General: reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate, treeReduce, foreachPartition, collectAsMap
Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
Set Theory / Relational: takeOrdered
Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile

You want control over the dataset, you know what the data looks like, and you care
about the low-level RDD API.
You don’t mind writing lots of lambda functions instead of using a DSL.
You don’t care about the schema or structure of the data.
You don’t care about optimization, performance and inefficiencies!
It is very slow for non-JVM languages like Python and R.
You don’t care about inadvertent inefficiencies.

                  SQL       DataFrames      Datasets
Syntax Errors     Runtime   Compile Time    Compile Time
Analysis Errors   Runtime   Runtime         Compile Time
Analysis errors are caught before a job runs on the cluster.

from pyspark.sql import functions as F
# convert RDD -> DF with column names
parsedDF = parsedRDD.toDF(["project", "sprint", "numStories"])
# filter on project, groupBy sprint, then agg() with sum
(parsedDF.filter(parsedDF["project"] == "finance")
    .groupBy("sprint")
    .agg(F.sum("numStories").alias("count"))
    .limit(100)
    .show(100))
project sprint numStories
finance 3 20
finance 4 22

parsedDF.createOrReplaceTempView("audits")
results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22

SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD

(employees.join(events, employees["id"] == events["eid"])
    .filter(events["date"] > "2015-01-01"))
Inputs: events file, employees table
Logical Plan: scan (employees) + scan (events) -> join -> filter
Physical Plan: scan (employees) + [scan (events) -> filter] -> join
Physical Plan with Predicate Pushdown and Column Pruning: optimized scan (events) + optimized scan (employees) -> join

● Pandas - Analyze small datasets.
● Spark - Analyze large datasets.
                 Pandas DataFrame               Spark DataFrame
Column           df['col']                      df['col']
Mutability       Mutable                        Immutable
Add a column     df['Z'] = df['X'] + df['Y']    df.withColumn('Z', df['X'] + df['Y'])
Rename columns   df.columns = ['X', 'Y']        df.select(df['Q1'].alias('X'), df['P1'].alias('Y'))
Reference example: https://github.com/chetkhatri/PyConLT2019

Executors
Cores
Containers
Stage
Job
Task

Job - Each action in Spark triggers a separate job, which covers the transformations leading up to that action.
Stage - A set of tasks within a job that can run in parallel using a ThreadPoolExecutor.
Task - The lowest-level unit of concurrent and parallel execution.
Each stage is split into as many tasks as it has partitions,
i.e. number of tasks = number of stages * number of partitions per stage (when every stage has the same partition count).
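The task arithmetic above can be sketched in a few lines of plain Python. The stage names and partition counts below are hypothetical; summing over stages also covers the common case where stages have different partition counts.

```python
# Hypothetical job: stage names and partition counts are made up for illustration.
partitions_per_stage = {"read_and_filter": 8, "shuffle_and_aggregate": 200}


def tasks_in_job(stage_partitions):
    """One task per partition in every stage of the job."""
    return sum(stage_partitions.values())


total_tasks = tasks_in_job(partitions_per_stage)
print(total_tasks)  # 208
```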

yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 6
yarn.scheduler.minimum-allocation-mb = 4096
yarn.scheduler.maximum-allocation-mb = 28832
yarn.nodemanager.resource.memory-mb = 54000
Max number of containers you can run per node = yarn.nodemanager.resource.memory-mb /
yarn.scheduler.minimum-allocation-mb = 54000 / 4096 = 13 (rounded down)
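The container count is just an integer division of the two memory settings quoted above; a minimal check in Python (the variable names are illustrative):

```python
nodemanager_memory_mb = 54000  # yarn.nodemanager.resource.memory-mb
min_allocation_mb = 4096       # yarn.scheduler.minimum-allocation-mb

# Each container gets at least the minimum allocation, so the upper bound
# per node is the integer quotient.
max_containers = nodemanager_memory_mb // min_allocation_mb
print(max_containers)  # 13
```

In practice containers are usually larger than the minimum allocation, so the real count per node is typically lower than this bound.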

What happens when you run this code?
What would be the impact on the database engine side?

The JoinSelection execution planning strategy uses the
spark.sql.autoBroadcastJoinThreshold property (default: 10 MB) to control the maximum size
of a dataset that is broadcast to all worker nodes when performing a join.
# check broadcast join threshold (in MB)
>>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) // 1024 // 1024
10
# logical plan with numbered tree nodes (from PySpark this goes through the internal JVM handle)
print(sampleDF._jdf.queryExecution().logical().numberedTreeString())
# query plan
sampleDF.explain()
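To make the idea behind a broadcast hash join concrete, here is a conceptual plain-Python sketch (not Spark's implementation): the small side is turned into an in-memory hash table, and each row of the large side probes it, so the large side never has to be shuffled or sorted. The function names and the sample rows are hypothetical.

```python
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # the 10 MB default mentioned above


def can_broadcast(estimated_size_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """Size check in the spirit of the threshold; a negative threshold disables it."""
    return 0 <= estimated_size_bytes <= threshold


def broadcast_hash_join(large_rows, small_rows, large_key, small_key):
    """Inner join: build a hash table from the small side, then probe it per large row."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    joined = []
    for row in large_rows:
        for match in lookup.get(row[large_key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data
employees = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Jonas"}]
events = [
    {"eid": 1, "date": "2015-02-01"},
    {"eid": 1, "date": "2015-03-01"},
    {"eid": 3, "date": "2015-04-01"},
]
result = broadcast_hash_join(events, employees, "eid", "id")
print(len(result))  # 2
```

A sort-merge join, by contrast, has to bring matching keys from both sides together and sort them, which is why it is the fallback when neither side is small enough to broadcast.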

Repartition: boost parallelism by increasing the number of partitions. Partition on the join key so that
joins on the same key run faster.
// coalesce reduces the number of partitions without a full shuffle, whereas repartition shuffles data evenly across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))
For example, in a bulk JDBC write with the parameter "bulkCopyBatchSize" -> "2500", the DataFrame has 10 partitions
and each partition writes batches of 2500 records in parallel.
Reduce: the impact on network communication, file I/O, network I/O, bandwidth, etc.
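The difference between the two operations can be illustrated with plain Python lists standing in for partitions. This is a simplified model, not Spark's actual algorithm (Spark's coalesce also takes data locality into account): coalesce merges whole partitions, while repartition redistributes every record.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; no record-level redistribution."""
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index % n].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every record round-robin across n partitions (a full shuffle)."""
    records = [r for part in partitions for r in part]
    return [records[i::n] for i in range(n)]


# Hypothetical skewed input: four partitions of very different sizes
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]
coalesced = coalesce(parts, 2)
rebalanced = repartition(parts, 2)
print([len(p) for p in coalesced])   # [7, 3]
print([len(p) for p in rebalanced])  # [5, 5]
```

The cheaper operation keeps the skew of its input, while the more expensive one produces evenly sized partitions; which trade-off is right depends on the write or join that follows.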

1. Disable autoBroadcastJoin:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. Order doesn't matter:
table1.leftjoin(table2) or table2.leftjoin(table1)
3. Force broadcast, if one DataFrame is not small!
4. Minimize shuffling and boost parallelism: partitioning, bucketing, coalesce, repartition,
HashPartitioner

./bin/spark-submit \
--conf spark.yarn.maxAppAttempts=1 \
--name PyConLT19 \
--master yarn \
--deploy-mode cluster \
--driver-memory 18g \
--executor-memory 24g \
--num-executors 4 \
--executor-cores 6 \
--conf spark.speculation=false \
--conf spark.broadcast.compress=true \
--conf spark.sql.broadcastTimeout=36000 \
--conf spark.dynamicAllocation.executorAllocationRatio=1 \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s \
--conf spark.network.timeout=1200s \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=15s \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.initialExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=6 \
examples/src/main/python/pi.py
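A quick sanity check of these settings can be done in plain Python. The sketch below assumes the default executor memory overhead (the larger of 384 MiB and 10% of executor memory) and assumes the job runs on a cluster with the yarn.scheduler.maximum-allocation-mb value of 28832 quoted earlier; both are assumptions, not something the command itself states.

```python
executor_memory_mb = 24 * 1024   # --executor-memory 24g
executor_cores = 6               # --executor-cores 6
max_executors = 6                # spark.dynamicAllocation.maxExecutors
yarn_max_allocation_mb = 28832   # assumed: yarn.scheduler.maximum-allocation-mb

# Assumed default overhead: max(384 MiB, 10% of executor memory)
overhead_mb = max(384, int(executor_memory_mb * 0.10))
container_request_mb = executor_memory_mb + overhead_mb

# The executor container must fit within YARN's largest allowed allocation.
fits_in_container = container_request_mb <= yarn_max_allocation_mb

# Upper bound on tasks running at the same time once dynamic allocation scales out.
max_parallel_tasks = max_executors * executor_cores

print(container_request_mb, fits_in_container, max_parallel_tasks)
```

If the request did not fit, YARN would refuse to allocate the executor container, so a check like this is worth doing before raising --executor-memory.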

[1] Koalas: pandas API on Apache Spark.
https://github.com/databricks/koalas
[2] Delta Lake: an open-source storage layer that brings scalable, ACID transactions
to Apache Spark™ and big data workloads. https://delta.io
https://github.com/delta-io/delta
[3] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks.
https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/
