2017.06
A Spark Tutorial for Beginners
Popcorny (Chen-en Lu)
Who am I
• Chen-en Lu (popcorny)
• Director of Engineering @TenMax
• Background
– Institute of Computer Science, National Chiao Tung University
– Champion of the 4th Trend Micro Programming Contest
– MediaTek (2005-2010)
– SmartQ (2011-2014)
– cacaFly/TenMax (2014-present)
• FB: https://fb.me/popcornylu
Target Audience
• Basic Java programming ability
• Ideally, some familiarity with Java 8 Streams or the basic functional-programming concepts from other languages (map, flatMap, filter, reduce, ...)
• You haven't written any Spark yet, or you've read a Spark book but haven't tried it hands-on
Outline
• Understand Spark basics
• Introduce Spark DataFrame/SQL
• Write a Spark application
• Understand Spark basics
• Introduce Spark DataFrame/SQL
• Write a Spark application
Introduction to Spark
• Spark is a distributed computation engine
• A MapReduce framework
• Built on RDDs (Resilient Distributed Datasets)
What Is Spark Good For?
• Good for
– Batch processing of large data volumes
– Stream processing
– ETL and data analysis at any data volume
• Not good for
– Workloads that an RDBMS can already handle
Big Data Architecture
(Layered stack, bottom to top: Distributed File System → Resource Manager → Computation Framework → Application Framework → Application)
Hadoop Architecture
(The same stack as realized by Hadoop: HDFS → YARN → Hadoop MapReduce V2 → Pig / Hive → Hadoop Application)
Spark Architecture
(Spark's version of the stack: DFS → YARN → Spark → Spark DataFrame/SQL, Stream, MLlib, GraphX → Spark Application)
Spark Application
(Diagram: spark-submit ships application.jar to the cluster. The Driver, which holds the Spark context, coordinates several Executors; each Driver and Executor is a JVM process running on a node in the cluster.)
Spark RDD
• Resilient Distributed Dataset
• Think of it as a Java Stream, only distributed
• Characteristics
– Lazy evaluation: nothing is actually computed until an action is triggered; before that, Spark only builds up the lineage
– Partitioned: the data is split into many partitions that can be processed in parallel
– Cacheable: computed data can be cached in the executors
– Reusable: an RDD can be reused, whereas a Java Stream can only be consumed once
Spark RDD
• Data processing boils down to input, transformation, and output
• Also known as ETL (Extract, Transform, Load)
• In Spark
– The input is an RDD created from the Spark context
– The RDD goes through a series of transformations
– Finally, an action kicks off the whole pipeline and produces the output wherever that action points
Input
• Input RDDs are always obtained from the Spark context
• sc.parallelize(list): ship a local list to the Spark cluster
• sc.textFile(path): read a text file from path
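A minimal sketch of these two inputs in Java (assuming a local master; the file path is hypothetical):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class InputExample {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setAppName("input-example").setMaster("local[*]");
          try (JavaSparkContext sc = new JavaSparkContext(conf)) {
              // Ship a local list to the cluster as an RDD
              JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));
              // Read a text file; each element of the RDD is one line
              JavaRDD<String> lines = sc.textFile("input.txt"); // hypothetical path
          }
      }
  }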
Simple Operations
• map(func): one-to-one transformation
T → U
• flatMap(func): one-to-many transformation
T → 0..* U
• mapPartitions(func): many-to-many transformation
0..* T → 0..* U
• filter(func): a filter
T → 0..1 T
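A short sketch of these operations, assuming sc is a JavaSparkContext as in the input example:

  JavaRDD<String> lines = sc.parallelize(Arrays.asList("hello world", "hi spark"));

  // map: one-to-one (each line to its length)
  JavaRDD<Integer> lengths = lines.map(String::length);

  // flatMap: one-to-many (each line to its words)
  JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

  // filter: keep only the elements matching the predicate
  JavaRDD<String> longWords = words.filter(w -> w.length() > 2);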
Shuffle Operations (Single Source)
• groupByKey([numTasks]): collect values with the same key into one list
(K, V) → (K, Iterable<V>)
• reduceByKey(func, [numTasks]): reduce values with the same key
(K, V) → (K, V),
reducer: (V, V) → V
• aggregateByKey(zeroValue, seqOp, combOp, [numTasks]): reduce values with the same key, but through an accumulator
(K, V) → (K, U),
seqOp: (U, V) → U,
combOp: (U, U) → U
• sortByKey([ascending], [numTasks]): sort by key
(K, V) → (K, V)
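For instance, a minimal reduceByKey sketch (assuming sc as before; requires scala.Tuple2 and org.apache.spark.api.java.JavaPairRDD):

  JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
          new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));

  // Values sharing a key are reduced pairwise: ("a", 4), ("b", 2)
  JavaPairRDD<String, Integer> sums = pairs.reduceByKey(Integer::sum);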
Shuffle Operations (Two Sources)
• cartesian(otherDataset): produce all n x m pairings across the two sides. For example, 4 suits x 13 ranks pair up into a full deck of cards.
T, U → (T, U)
• join(otherDataset, [numTasks]): join records that share a key; supports inner join and left/right/full outer join
(K, V), (K, W) → (K, (V, W))
• cogroup(otherDataset, [numTasks]): like groupByKey, but the two-source version
(K, V), (K, W) → (K, (Iterable<V>, Iterable<W>))
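A minimal join sketch (assuming sc as before; the data is made up for illustration):

  JavaPairRDD<String, Integer> ages = sc.parallelizePairs(Arrays.asList(
          new Tuple2<>("alice", 30), new Tuple2<>("bob", 25)));
  JavaPairRDD<String, String> cities = sc.parallelizePairs(Arrays.asList(
          new Tuple2<>("alice", "Taipei")));

  // Inner join on the key: ("alice", (30, "Taipei"))
  JavaPairRDD<String, Tuple2<Integer, String>> joined = ages.join(cities);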
Repartition Operations
• repartition(numPartitions): always shuffles
• coalesce(numPartitions): avoids a shuffle; can only reduce the number of partitions
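For instance (a sketch, assuming the words RDD from above):

  // Spread across 200 partitions, with a full shuffle
  JavaRDD<String> spread = words.repartition(200);

  // Shrink to one partition without shuffling, e.g. before writing a single output file
  JavaRDD<String> merged = spread.coalesce(1);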
Actions
• Write files
– saveAsTextFile(path)
• Return results to the driver
– first(): take the first element
– take(n): take the first n elements
– collect(): fetch all results
– count(): count how many results there are
– reduce(func): gather the data through a reducer
• Run directly inside the executors
– foreach(func): call back on each item inside the executors
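A few actions in Java (a sketch; the output path is hypothetical, and collect should only be used on small results):

  long n = words.count();              // triggers the pipeline and counts the elements
  List<String> all = words.collect();  // pulls every element back to the driver (java.util.List)
  words.saveAsTextFile("out");         // writes one file per partition under out/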
Word Count
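The slide shows the word-count code as an image; a minimal Java version along these lines (a sketch, with hypothetical input and output paths):

  JavaRDD<String> lines = sc.textFile("input.txt");

  JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);

  counts.saveAsTextFile("counts");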
RDD Graph
(Diagram: the lineage graph of the word-count RDDs; a second view marks where each TASK, STAGE, and JOB sits in the graph.)
Shuffle
• The step where data is exchanged between partitions
• The data must first be in key-value form
• Records are grouped by their keys
• Records with the same key always land in the same partition
• This is exactly what MapReduce does
Shuffle
(Diagram. Source: MapReduce Shuffle原理 与 Spark Shuffle原理)
Job, Stage, Task
• An Application is created by spark-submit
• A Job is created by an action operation
• A Stage is created by a shuffle operation; different stages can have different numbers of tasks
• The number of Tasks is determined by the shuffle operation's task count or by the number of input partitions; a task is the smallest indivisible unit of parallel work
• Cardinality: one Cluster runs many Applications, one Application has many Jobs, one Job has many Stages, and one Stage has many Tasks
Operations
(Overview diagram, regrouped:)
• Inputs from the distributed file system: sc.textFile, sc.xxxFile; from the driver program: sc.parallelize
• Narrow transformations: map, flatMap, mapPartitions, filter
• Shuffle transformations: groupByKey, reduceByKey, aggregateByKey, repartition, cartesian, join, cogroup
• Actions returning data to the driver program: collect, first, take, count, reduce
• Actions writing to the distributed file system: saveAsTextFile, saveAsXxxFile
• Actions running inside the executors: foreach, foreachPartition
• Understand Spark basics
• Learn the common Spark DataFrame/SQL operations
• Write a Spark application
Spark DataFrame & Dataset
• DataFrame
– Like a table in an RDBMS
– Has a schema, which can be nested
– Is a Dataset<Row>
– Each row consists of many columns
• Dataset
– Dataset<T>
– A typed dataset
Reader and Writer
• Input/output sources
– RDD
– File
• Supported formats
– CSV
– JSON
– Parquet (recommended)
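A sketch of the reader/writer API (assuming a SparkSession named spark; the paths are hypothetical):

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  SparkSession spark = SparkSession.builder().appName("io-example").getOrCreate();

  Dataset<Row> df = spark.read()
          .option("header", "true")   // treat the first CSV line as the header
          .csv("population.csv");

  df.write().parquet("population.parquet"); // Parquet is the recommended format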
DataFrame Operations
• select(column…)
• distinct()
• join(right, column)
• where(column)
• groupBy(columns…)
• agg(column...)
• orderBy(column…)
DataFrame Functions
• import static org.apache.spark.sql.functions.*
• Normal Functions
– col(name)
• Aggregation Functions
– min(column)
– max(column)
– count(column)
– sum(column)
– avg(column)
DataFrame Operations
• Data: a population table with columns year, region, and people_total (shown as a table on the slide)
DataFrame Operations
• SQL
SELECT year, region, SUM(people_total) AS people_total
FROM population GROUP BY year, region ORDER BY people_total DESC
• Spark DataFrame (the equivalent code, shown on the slide)
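A hedged Java equivalent of the SQL above (assuming population is a Dataset<Row> and org.apache.spark.sql.functions is statically imported):

  Dataset<Row> result = population
          .groupBy(col("year"), col("region"))
          .agg(sum("people_total").as("people_total"))
          .orderBy(col("people_total").desc());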
DataFrame Schema
• Defining a schema
– JavaBean with an Encoder
– Specified programmatically
– From the metastore (Hive)
– Inferred from the file contents
• Inspecting a schema
– df.printSchema()
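A sketch of the programmatic option (assuming spark is a SparkSession; the path is hypothetical):

  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructType;

  StructType schema = new StructType()
          .add("year", DataTypes.IntegerType)
          .add("region", DataTypes.StringType)
          .add("people_total", DataTypes.LongType);

  Dataset<Row> df = spark.read().schema(schema).csv("population.csv");
  df.printSchema();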
Spark SQL
• Query DataFrames with SQL syntax
• SQL is a declarative language, so a built-in optimization engine turns it into physical DataFrame operations
• The output is yet another DataFrame
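A minimal sketch (assuming spark is a SparkSession and population is the Dataset<Row> from earlier):

  population.createOrReplaceTempView("population");

  Dataset<Row> result = spark.sql(
          "SELECT year, region, SUM(people_total) AS people_total " +
          "FROM population GROUP BY year, region ORDER BY people_total DESC");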
1. Understand Spark basics
2. Learn the common Spark DataFrame/SQL operations
3. Write a Spark application
Spark Application
• Packaged as a single application jar
• Run via spark-submit
• Submitting requires specifying a master
• The master represents a resource manager, a.k.a. a cluster manager; after submission, the application acquires the resources it needs from that resource manager
• The Spark application interacts with those resources through the Spark context
Uber jar
• The application jar has to be shipped to every executor, so how do the libraries it uses get shipped as well?
• Unpack every dependency jar and repackage the classes directly into the application jar; this is called an uber jar
• Also known as a fat jar or shadow jar
Spark Template Project
• https://github.com/popcornylu/spark-wordcount
• Commands
– Application jar:
  ./gradlew jar
  spark-submit --master local[*] build/libs/spark-wordcount.jar
– Application uber jar:
  ./gradlew shadowJar
  spark-submit --master local[*] build/libs/spark-wordcount-all.jar
Resource Manager
• Local
• Standalone cluster
• YARN cluster
• Mesos cluster
Spark Web UI
• By default, a running Spark application serves a Web UI (ports 4040, 4041, ...)
• Use it to watch the progress of Jobs, Stages, and Tasks
• A great debugging tool
History Server
• The Web UI only shows Spark applications that are currently running
• The history server, however, lets you view the records of applications that have already finished
Configurations
• conf/log4j.properties: log configuration. You can change the default log level from INFO to WARN
• conf/core-site.xml: file system configuration. Set this up if you use a DFS
• conf/spark-defaults.conf: default application configuration, e.g. the default master, or enabling history logging by default
• conf/spark-env.sh: default environment variables, mainly for the various daemons
Recap
• Spark is a distributed computation engine
• Built on RDDs, with inputs, transformations, and actions
• Executing an action produces a Job; a Job may contain many Stages, and each Stage has its own number of Tasks
• How shuffle works
• Spark DataFrame and Spark SQL
• How to write a Spark application