When an R User Tries Spark

Atsushi Hayakawa (早川 敦士)

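About me
Atsushi Hayakawa (早川 敦士)
· Recruit Communications Co., Ltd. (joined as a new graduate in April 2015)
· University specialty: quality control and reliability engineering
· Organizer of Japan.R 2015
· Main work: ad hoc analysis and development of visualization tools
· Usually found in the R language community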

Spark as seen by an R user
· Scala?
· A functional language?
· I don't really understand it, but it looks hard

Strata Hadoop
The main theme of the conference was Spark.

Me after attending Strata Hadoop
· This is the age of Spark!
· I want Spark to process the large-scale data that R can't handle!
· The familiar DataFrame is available too!

The data.frame every R user knows
· R users live in data.frame
· dplyr is awesome
· Aggregating and transforming data becomes easy!

DataFrame on Google Trends

Data wrangling example with dplyr
library(dplyr)  # provides %>%, filter, group_by, summarise
iris %>% head(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
# Total sepal length for two of the three species
iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  group_by(Species) %>%
  summarise(sepal_length_sum = sum(Sepal.Length)) %>%
  head()
## Source: local data frame [2 x 2]
##
## Species sepal_length_sum

DataFrame in Spark
http://spark.apache.org/docs/latest/sql-programming-guide.html
Most of what you need is documented here.

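When stuck, dig deeper
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Make an educated guess, then search and dig around.
The programming guide often lacks the information I need, so I end up looking here more often.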

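Data flow in Spark
· spark-whats-new-whats-coming
· Machine Learning Library (MLlib) Guide
[Figure: S3 → Spark Core → RDD / DataFrame API → org.apache.spark.mllib / org.apache.spark.ml. Paths are marked "current" or "future"; the "future" mark appears next to org.apache.spark.ml.]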

Comparing Spark DataFrame and R

Loading data
R
df <- read.csv("iris.csv")
Spark
To read a CSV file, either parse it yourself or use a separate library.
// "ほげ" (hoge) is a placeholder: substitute your own bucket and credentials
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "ほげ")
hadoopConf.set("fs.s3.awsSecretAccessKey", "ほげ")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json(s"s3://ほげ/iris.json")
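The "parse it yourself" option for CSV can be sketched in plain Scala, without Spark. This is only a minimal illustration under stated assumptions: the names CsvSketch, Iris, parseLine, and parse are hypothetical, the file is assumed to have exactly five comma-separated columns, and quoted fields or embedded commas are not handled.

```scala
// Hypothetical helper, not part of Spark or of the original slides.
object CsvSketch {
  // One parsed row of an iris-style CSV file.
  case class Iris(sepalLength: Double, sepalWidth: Double,
                  petalLength: Double, petalWidth: Double, species: String)

  // Split one line on commas. Assumes no quoted fields and no embedded commas.
  // Returns None for a header row or any line that does not fit the layout.
  def parseLine(line: String): Option[Iris] = {
    val cols = line.split(",", -1).map(_.trim)
    if (cols.length != 5) None
    else try {
      Some(Iris(cols(0).toDouble, cols(1).toDouble,
                cols(2).toDouble, cols(3).toDouble, cols(4)))
    } catch {
      case _: NumberFormatException => None // header row or malformed number
    }
  }

  // Parse many lines, skipping the header and malformed rows.
  def parse(lines: Seq[String]): Seq[Iris] = lines.flatMap(parseLine(_))
}
```

In spark-shell, a function like parseLine could be applied to each line returned by sc.textFile; a dedicated CSV library is the safer choice when the data contains quoted fields.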

Viewing the first n rows
R
df %>% head(1)
Spark
df.limit(1)

Selecting a column
R
df %>% select("Sepal_Length")
Spark
df.select("Sepal_Length")

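Extracting rows that match a condition
R
df %>% filter(Sepal_Length > 5)
Spark
import sqlContext.implicits._  // enables the $"column" syntax; spark-shell imports it automatically
df.filter($"Sepal_Length".gt(5))
or
df.filter($"Sepal_Length" > 5)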

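Counting rows with groupBy
R
df %>%
  group_by(Species) %>%
  summarize(count = n())
Spark
import org.apache.spark.sql.functions._  // count, max, min, asc, rank, lag, sum, ...
df.groupBy("Species").count()
or
// Since Spark 1.4 the grouping column is kept automatically, so passing $"Species" to agg repeats it
df.groupBy("Species").agg($"Species", count("Sepal_Length"))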

Aggregating with groupBy
R
df %>%
  group_by(Species) %>%
  summarize(max(Sepal_Length), min(Sepal_Length))
Spark
df.groupBy("Species").
  agg($"Species", max("Sepal_Length"), min("Sepal_Length"))

Sorting
R
df %>%
  arrange(Sepal_Length)
Spark
df.sort(asc("Sepal_Length"))

Computing a rank
R
# rank() averages ties by default; dplyr's min_rank() gives tied rows the same (minimum) rank
df %>% mutate(rank = rank(Sepal_Length))
Spark
A HiveContext is required.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
val df = sqlContext.read.json(s"s3://ほげ/iris.json")
val wSpec = Window.orderBy($"Sepal_Length")
df.select($"Sepal_Length", rank().over(wSpec).alias("rank"))

Computing a rank with partitionBy
R
df %>%
  group_by(Species) %>%
  mutate(rank = rank(Sepal_Length))
Spark
val wSpec = Window.partitionBy(df("Species")).orderBy(df("Sepal_Length"))
df.select($"Sepal_Length",
  $"Species",
  rank().over(wSpec).alias("rank"))

Computing a rank with rowNumber
R
df %>%
  arrange(Sepal_Length) %>%
  mutate(row_number = 1:n())
Spark
// rowNumber() was renamed row_number() in Spark 1.6
val wSpec = Window.orderBy(df("Sepal_Length"))
df.select($"Sepal_Length",
  $"Species", rowNumber().over(wSpec).alias("row_number"))
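The rank and rowNumber slides differ only in how tied values are numbered. The following is a minimal sketch in plain Scala collections (no Spark) of the two numbering rules. The names RankSketch, rank, and rowNumber are hypothetical, and breaking ties by input order is an assumption of this sketch, not something Spark guarantees.

```scala
// Hypothetical illustration, not Spark's implementation.
object RankSketch {
  // SQL-style rank: 1 + the number of strictly smaller values.
  // Tied values share a rank, and the following rank is skipped.
  def rank(xs: Seq[Double]): Seq[Int] =
    xs.map(x => xs.count(_ < x) + 1)

  // Row number: position after a stable sort, so tied values still get
  // distinct, consecutive numbers (here in input order).
  def rowNumber(xs: Seq[Double]): Seq[Int] = {
    // Original indices, listed in ascending order of value.
    val order = xs.zipWithIndex.sortWith(_._1 < _._1).map(_._2)
    val result = Array.ofDim[Int](xs.length)
    order.zipWithIndex.foreach { case (origIdx, pos) => result(origIdx) = pos + 1 }
    result.toList
  }
}
```

For the input Seq(5.1, 4.9, 5.1, 4.7), rank yields 3, 2, 3, 1 while rowNumber yields 3, 2, 4, 1: the two 5.1 values tie under rank but not under rowNumber.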

Sorting by Sepal_Length within each Species and taking differences
R
df %>%
  arrange(Species, Sepal_Length) %>%
  group_by(Species) %>%
  mutate(Sepal_Length_lag = Sepal_Length - lag(Sepal_Length))
Spark
val wSpec = Window.partitionBy($"Species").orderBy($"Sepal_Length")
df.select($"Sepal_Length", $"Species",
  ($"Sepal_Length" - lag($"Sepal_Length", 1).
    over(wSpec)).alias("Sepal_Length_lag"))
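What the partitioned lag computes can be sketched with plain Scala collections (no Spark): group the rows, sort each group, and subtract the previous value. The names LagSketch, Row, and lagDiff are hypothetical. The first row of each group has no predecessor, which this sketch models as None, in the same spirit as NA in R or null in Spark.

```scala
// Hypothetical illustration, not Spark's implementation.
object LagSketch {
  case class Row(species: String, sepalLength: Double)

  // For each species (in alphabetical order), sort by sepalLength and
  // pair every row with the difference from the previous row in its group.
  def lagDiff(rows: Seq[Row]): Seq[(Row, Option[Double])] =
    rows.groupBy(_.species).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val sorted = group.sortWith(_.sepalLength < _.sepalLength)
      // Shift the values down by one row; the first row gets None.
      val previous: Seq[Option[Double]] = None +: sorted.map(r => Option(r.sepalLength))
      sorted.zip(previous).map { case (r, prev) => (r, prev.map(p => r.sepalLength - p)) }
    }
}
```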

Sorting by Sepal_Length within each Species and taking a cumulative sum
R
df %>%
  arrange(Species, Sepal_Length) %>%
  group_by(Species) %>%
  mutate(Sepal_Length_cumsum = cumsum(Sepal_Length))
Spark
val wSpec = Window.partitionBy($"Species").orderBy($"Sepal_Length")
df.select($"Sepal_Length",
  $"Species",
  sum($"Sepal_Length").over(wSpec).alias("Sepal_Length_cumsum"))
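The per-group cumulative sum can likewise be sketched in plain Scala (no Spark). The names CumsumSketch, cumsum, and cumsumByGroup are hypothetical, and the sketch follows the row-by-row behavior of R's cumsum().

```scala
// Hypothetical illustration, not Spark's implementation.
object CumsumSketch {
  // Running total in the given order, like R's cumsum().
  def cumsum(xs: Seq[Double]): Seq[Double] =
    xs.scanLeft(0.0)(_ + _).tail

  // Sort each group's values in ascending order, then take the running
  // total within the group. Input rows are (group key, value) pairs.
  def cumsumByGroup(rows: Seq[(String, Double)]): Map[String, Seq[Double]] =
    rows.groupBy(_._1).map { case (key, group) =>
      key -> cumsum(group.map(_._2).sortWith(_ < _))
    }
}
```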

Inner join
R
sample1 %>%
  inner_join(., sample2, by = c("prefecture", "city"))
Spark
val sample1_url = s"s3://ほげ/sample1.json"
val sample2_url = s"s3://ほげ/sample2.json"
val sample1 = sqlContext.read.json(sample1_url).as("sample1")
val sample2 = sqlContext.read.json(sample2_url).as("sample2")
sample1.join(sample2).
  where($"sample1.prefecture" === $"sample2.prefecture" &&
    $"sample1.city" === $"sample2.city")
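The matching rule behind an inner join on a composite key such as (prefecture, city) can be sketched in plain Scala (no Spark). The names JoinSketch and innerJoin are hypothetical, and the payload types are placeholders, since the slides do not show the other columns of sample1 and sample2.

```scala
// Hypothetical illustration, not Spark's implementation.
object JoinSketch {
  // Inner join of two keyed collections: keep only keys present on both
  // sides and emit one output row per matching pair.
  def innerJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, A, B)] = {
    // Index the right side by key so each left row is matched by lookup.
    val rightByKey: Map[K, Seq[B]] =
      right.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    for {
      (k, a) <- left
      b <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, a, b)
  }
}
```

Using a tuple such as ("Tokyo", "Chuo") as the key reproduces the two-column condition in the where clause: a row survives only when both prefecture and city agree.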

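In closing
This talk compared the (data.frame|DataFrame) operations that come up most often in data wrangling, using R's dplyr and Spark's DataFrame.
Spark is likely to center on DataFrame rather than RDD from here on, so start using it tomorrow!!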
