2. www.tantusdata.com
About me
• Data Engineer @TantusData
• Have worked for: Spotify, Apple, telcos, startups
• Cluster installations, application architecture and
development, training, data team support
marcin@tantusdata.com
marcin.szymaniuk@gmail.com
@mszymani
5. www.tantusdata.com
Bring analysis to data
[Diagram: DATA → sample → R / Python / SAS]
• Sample only (region, latest month…)
• Coarse aggregate, e.g. month vs. hour (1:720)
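The 1:720 ratio in the last bullet is just calendar arithmetic: a 30-day month has 30 × 24 = 720 hours, so aggregating hourly rows to monthly ones shrinks the data by up to that factor. A tiny Python illustration (synthetic data, my own example, not from the talk):

```python
# A 30-day month contains 30 * 24 = 720 hourly rows per monthly row.
HOURS_PER_MONTH = 30 * 24
print(HOURS_PER_MONTH)                  # 720

# Synthetic hourly measurements for one month.
hourly = [float(h % 24) for h in range(HOURS_PER_MONTH)]

# The coarse aggregate: one monthly average replaces 720 hourly rows,
# small enough to pull into R / Python / SAS on a laptop.
monthly_avg = sum(hourly) / len(hourly)
print(monthly_avg)                      # 11.5
```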
63. www.tantusdata.com
val df = spark.read.parquet("…")
df
  .withColumn("year", year(col("timestamp")))
  .withColumn("month", month(col("timestamp")))
  .withColumn("day", dayofmonth(col("timestamp")))
  .write
  .partitionBy("year", "month", "day")
  .save(output)
[Diagram: each task reads from HDFS (READ → ADDCOL → ADDCOL) and writes into every day partition (Day 1, Day 2, … Day 90) on HDFS]
Case #1
76. [Diagram: STAGE 1: tasks 1…X each read from HDFS, buffer Day 1, Day 2, Day 3 output locally, then write to HDFS]
77. [Diagram: the same flow split into STAGE 1 and STAGE 2, with a shuffle between the stages before the Day 1, Day 2, Day 3 partitions are written to HDFS]
79. www.tantusdata.com
Case #1 - summary
• Each task writing with partitionBy may create one file per output partition, i.e. many small files in HDFS
• repartition to the rescue
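The first bullet can be made concrete with a back-of-the-envelope model (the task count is my own illustrative number; 90 days matches the diagrams): when every write task holds rows for every day partition, the file count multiplies.

```python
# Hypothetical job: 200 write tasks, 90 day partitions.
num_tasks = 200
num_days = 90

# With write.partitionBy("day") alone, each task opens one file per
# partition it has rows for - in the worst case, every partition.
files_worst_case = num_tasks * num_days
print(files_worst_case)         # 18000 small files

# After repartitioning on the partition column, each day's rows are
# collected into a single task, so each partition gets one file.
files_after_repartition = num_days
print(files_after_repartition)  # 90
```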
88. [Diagram: the two-stage flow after repartition: stage-1 tasks shuffle to stage-2 tasks, which write the Day 1, Day 2, Day 3 partitions to HDFS]
89. [Diagram: the same two-stage flow, now with 100G labels on the shuffled data]
100. www.tantusdata.com
Case #2 - recap
• Know your data distribution when repartitioning
• Keep all your tasks busy - repartition by more columns if necessary
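The "more columns" point can be illustrated with a toy partitioner in plain Python (a stand-in for Spark's hash partitioner; the data and mapping are synthetic, my own example): partitioning a skewed dataset by one key sends almost everything to one task, while adding a second column spreads the load.

```python
from collections import Counter

NUM_PARTITIONS = 4

# Skewed synthetic data: 97 of 100 rows belong to day1.
rows = [("day1", user) for user in range(97)] \
     + [("day2", 0), ("day3", 0), ("day4", 0)]

# Toy partitioner: a fixed day -> partition mapping, optionally
# mixed with a second column (user).
day_partition = {"day1": 0, "day2": 1, "day3": 2, "day4": 3}

# repartition("day"): every day1 row lands in the same task.
by_day = Counter(day_partition[day] for day, _ in rows)
print(dict(by_day))       # partition 0 gets 97 rows; the others idle

# repartition("day", "user"): the hot day spreads across all tasks.
by_day_user = Counter((day_partition[day] + user) % NUM_PARTITIONS
                      for day, user in rows)
print(dict(by_day_user))  # 25 rows per partition: all tasks busy
```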
105. [Diagram: STAGE 1: tasks read Day 1, Day 2, Day 3 input from HDFS and buffer users1, users2, users3 partitions locally]
106. [Diagram: STAGE 1 and STAGE 2: the day-partitioned input is shuffled into users1, users2, users3 partitions]
107. [Diagram: STAGE 1, STAGE 2 and STAGE 3: a further shuffle follows the users1, users2, users3 repartitioning before the write to HDFS]
115. .config(
  "spark.sql.shuffle.partitions",
  "200"
)
[Diagram: the three-stage flow with the default 200 shuffle partitions]
117. 10TB / 200 = 50GB / task
[Diagram: the three-stage flow; the shuffled partitions are labelled 50G]
.config(
  "spark.sql.shuffle.partitions",
  "200"
)
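The arithmetic on the slide, spelled out: with the default 200 shuffle partitions, 10TB of shuffled data means roughly 50GB per task, far more than a typical executor can comfortably hold. A quick Python check (the data size is the slide's example):

```python
TB = 1024**4
GB = 1024**3

shuffled_bytes = 10 * TB     # total shuffle data, as on the slide
shuffle_partitions = 200     # Spark's default spark.sql.shuffle.partitions

per_task_gb = shuffled_bytes / shuffle_partitions / GB
print(per_task_gb)           # ~51.2 GB per task

# Raising the partition count shrinks each task's share proportionally.
print(shuffled_bytes / 2000 / GB)  # ~5.12 GB per task
```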
119. www.tantusdata.com
What to do?
• Understand your data!
• Control the level of parallelism
• Choose number of partitions according to your data size
.config("spark.sql.shuffle.partitions", "2000")
.repartition(2000)
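The last bullet can be turned into a rule of thumb: pick the partition count so each task processes a manageable chunk. The helper below is my own sketch (a ~128MB-1GB per-task target is a common heuristic, not something the talk prescribes):

```python
import math

MB = 1024**2
TB = 1024**4

def partitions_for(data_bytes, target_bytes=512 * MB):
    """Rough partition count giving each task ~target_bytes of data."""
    return max(1, math.ceil(data_bytes / target_bytes))

print(partitions_for(10 * TB))  # 20480 partitions for 10TB at ~512MB/task

# Apply the result via either mechanism from the slide:
#   .config("spark.sql.shuffle.partitions", "20480")
#   .repartition(20480)
```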
129. [Diagram: the three-stage flow again; one shuffled partition is labelled 1TB]
149. www.tantusdata.com
Use cases - recap
• Know your data!
• Understand the transformations you are doing
• Keep your tasks busy
Photo credit: shamakern.com