SlideShare una empresa de Scribd logo
1 de 31
Visualization of
Supervised Learning with
{arules} + {arulesViz}
Takashi J. OZAKI, Ph. D.
Recruit Communications Co., Ltd.
2014/4/17 1
About me
 Twitter: @TJO_datasci
 Data Scientist (Quant Analyst) in Recruit group
 A group of companies in advertisement media and
human resources
 Known as a major player with big data
 Current mission: ad-hoc analysis on various
marketing data
 Actually, still I’m new to the field of data science
2014/4/17 2
About me
 Original background: neuroscience in the human
brain (6 years experience as postdoc researcher)
2014/4/17 3
(Ozaki, PLoS One, 2011)
About me
 English version of my blog
http://tjo-en.hatenablog.com/
2014/4/17 4
2014/4/17 5
Tonight’s topic is:
2014/4/17 6
Graphical Visualization of
Supervised Learning
Advantage of this technique
More intuitive
Easy to grasp even for high-
dimensional data
Even lay guys can easily understand
Useful for presentation
2014/4/17 7
Supervised learning: lower dimension, more intuitive
 In case of 2D data… (e.g. nonlinear SVM)
2014/4/17 8
x y label
0.924335 -1.0665Yes
2.109901 2.615284No
0.988192 -0.90812Yes
1.299749 0.944518No
-0.60885 0.457816Yes
-2.25484 1.615489Yes
Supervised learning: higher dimension, less intuitive
 In case of 7D… no way!!!
2014/4/17 9
game1 game2 game3 social1 social2 app1 app2 cv
0 0 0 1 0 0 0No
1 0 0 1 1 0 0No
0 1 1 1 1 1 0Yes
0 0 1 1 0 1 1Yes
1 0 1 0 1 1 1Yes
0 0 0 1 1 1 0No
… … … … … … ……
???
2014/4/17 10
Is there any technique
that can easily visualize
supervised learning with
higher dimension?
(…for lay people?)
2014/4/17 11
 {arules} + {arulesViz}
Why association rules and its visualization?
 Much roughly, association rules can be interpreted
as a kind of (likeness of) generative modeling
 A large set of conditional probability
 If it can be regarded as a set of conditional
probability, it also can be described as (likeness of)
Bayesian network
“XY”
 If it’s like a Bayesian network, it can be visualized
as graph representation, e.g. by {igraph}
2014/4/17 12
𝑠𝑢𝑝𝑝 𝑋 → 𝑌 =
𝜎(𝑋 ∪ 𝑌)
𝑀
𝑐𝑜𝑛𝑓 𝑋 → 𝑌 =
𝑠𝑢𝑝𝑝(𝑋 → 𝑌)
𝑠𝑢𝑝𝑝(𝑋)
𝑙𝑖𝑓𝑡 𝑋 → 𝑌 =
𝑐𝑜𝑛𝑓(𝑋 → 𝑌)
𝑠𝑢𝑝𝑝(𝑌)
X Y
Further points…
 Only when all of independent variables are bivariate,
they can be handled as “basket transaction”
2014/4/17 13
game1 game2 game3 social1 social2 app1 app2 cv
0 0 0 1 0 0 0No
1 0 0 1 1 0 0No
0 1 1 1 1 1 0Yes
0 0 1 1 0 1 1Yes
1 0 1 0 1 1 1Yes
0 0 0 1 1 1 0No
… … … … … … ……
{social1, No}
{game1, social1, social2, No}
{game2, game3, social1, social2, app1, Yes}
{game3, social1, app1, app2, Yes}
{game1, game3, social2, app1, app2, Yes}
{socia1, social2, app1, No}
…
2014/4/17 14
Let’s try in R!
Sample data “d1”
2014/4/17 15
game1 game2 game3 social1 social2 app1 app2 cv
0 0 0 1 0 0 0No
1 0 0 1 1 0 0No
0 1 1 1 1 1 0Yes
0 0 1 1 0 1 1Yes
1 0 1 0 1 1 1Yes
0 0 0 1 1 1 0No
… … … … … … ……
Imagine you’re working on a certain platform for web entertainment.
It has 3 SP games, 2 SP social networking, 2 apps.
The data records user’s history of any activity on each content in a
month after registration, and “cv” label describes they are still active
after a month passed.
In the case with svm {e1071}…
2014/4/17 16
> d1.svm<-svm(cv~.,d1) # install and require {e1071}
# svm {e1071}
> table(d1$cv,predict(d1.svm,d1[,-8]))
No Yes
No 1402 98
Yes 80 1420
# Good accuracy (only for training data)
In the case with randomForest {randomForest}…
2014/4/17 17
> tuneRF(d1[,-8],d1[,8],doBest=T) # install and require {randomForest}
# (omitted)
> d1.rf<-randomForest(cv~.,d1,mtry=2)
# randomForest {randomForest}
> table(d1$cv,predict(d1.rf,d1[,-8]))
No Yes
No 1413 87
Yes 92 1408
# Good accuracy
> importance(d1.rf)
MeanDecreaseGini
game1 20.640253
game2 12.115196
game3 2.355584
social1 189.053648
social2 76.476470
app1 796.937087
app2 2.804019
# Variable importance (without any directionality)
In the case with glm {stats}…
2014/4/17 18
> d1.glm<-glm(cv~.,d1,family=binomial)
> summary(d1.glm)
Call:
glm(formula = cv ~ ., family = binomial, data = d1)
# (omitted)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.37793 0.25979 -5.304 1.13e-07 ***
game1 1.05846 0.17344 6.103 1.04e-09 ***
game2 -0.54914 0.16752 -3.278 0.00105 **
game3 0.12035 0.16803 0.716 0.47386
social1 -3.00110 0.21653 -13.860 < 2e-16 ***
social2 1.53098 0.17349 8.824 < 2e-16 ***
app1 5.33547 0.19191 27.802 < 2e-16 ***
app2 0.07811 0.16725 0.467 0.64048
---
# (omitted)
Sample data converted for transactions “d2”
2014/4/17 19
game1 game2 game3 social1 social2 app1 app2 yes no
0 0 0 1 0 0 0 0 1
1 0 0 1 1 0 0 0 1
0 1 1 1 1 1 0 1 0
0 0 1 1 0 1 1 1 0
1 0 1 0 1 1 1 1 0
0 0 0 1 1 1 0 0 1
… … … … … … … … …
Just “cv” column was divided into 2 columns: “yes” and “no” with
bivariate (0 or 1)
Run apriori {arules} to get association rules
2014/4/17 20
> d2.ap.small<-apriori(as.matrix(d2)) # install and require {arules}
parameter specification:
confidence minval smax arem aval originalSupport support minlen
maxlen target ext
0.8 0.1 1 none FALSE TRUE 0.1 1 10 rules FALSE
algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [50 rule(s)] done [0.00s]. # only 50 rules…
creating S4 object ... done [0.00s].
Run apriori {arules} to get association rules
2014/4/17 21
> d2.ap.large<-apriori(as.matrix(d2),parameter=list(support=0.001))
parameter specification:
confidence minval smax arem aval originalSupport support minlen
maxlen target ext
0.8 0.1 1 none FALSE TRUE 0.001 1 10 rules FALSE
algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
writing ... [182 rule(s)] done [0.00s]. # as much as 182 rules
creating S4 object ... done [0.00s].
OK, just visualize it
2014/4/17 22
> require(“arulesViz”)
# (omitted)
> plot(d2.ap.small, method=“graph”, control=list(type=“items”,
layout=layout.fruchterman.reingold,))
> plot(d2.ap.large, method=“graph”, control=list(type=“items”,
layout=layout.fruchterman.reingold,))
# Fruchterman – Reingold force-directed graph drawing algorithm can
locate nodes with distances that is proportional to “shortest path
length” between them
# Then nodes (items) should be located based on their “closeness”
between each other
Small set of rules visualized with {arulesViz}
2014/4/17 23
Compare with a result of glm
2014/4/17 24
> d1.glm<-glm(cv~.,d1,family=binomial)
> summary(d1.glm)
Call:
glm(formula = cv ~ ., family = binomial, data = d1)
# (omitted)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.37793 0.25979 -5.304 1.13e-07 ***
game1 1.05846 0.17344 6.103 1.04e-09 ***
game2 -0.54914 0.16752 -3.278 0.00105 **
game3 0.12035 0.16803 0.716 0.47386
social1 -3.00110 0.21653 -13.860 < 2e-16 ***
social2 1.53098 0.17349 8.824 < 2e-16 ***
app1 5.33547 0.19191 27.802 < 2e-16 ***
app2 0.07811 0.16725 0.467 0.64048
---
# (omitted)
Large set of rules visualized with {arulesViz}
2014/4/17 25
Compare with a result of randomForest
2014/4/17 26
> tuneRF(d1[,-8],d1[,8],doBest=T) # install and require {randomForest}
# (omitted)
> d1.rf<-randomForest(cv~.,d1,mtry=2)
# randomForest {randomForest}
> table(d1$cv,predict(d1.rf,d1[,-8]))
No Yes
No 1413 87
Yes 92 1408
# Good accuracy
> importance(d1.rf)
MeanDecreaseGini
game1 20.640253
game2 12.115196
game3 2.355584
social1 189.053648
social2 76.476470
app1 796.937087
app2 2.804019
# Variable importance (without any directionality)
See how far nodes are from yes / no
2014/4/17 27
Large set of rules visualized with {arulesViz}
2014/4/17 28
Advantage of this technique
More intuitive
Easy to grasp even for high-
dimensional data
Even lay guys can easily understand
Useful for presentation
2014/4/17 29
Disadvantage of this technique
Less strict
Never quantitative
2014/4/17 30
Any questions or comments?
2014/4/17 31
Don’t hesitate to ask me!
@TJO_datasci

Más contenido relacionado

La actualidad más candente

Palestra sobre Collections com Python
Palestra sobre Collections com PythonPalestra sobre Collections com Python
Palestra sobre Collections com Pythonpugpe
 
밑바닥부터 시작하는 의료 AI
밑바닥부터 시작하는 의료 AI밑바닥부터 시작하는 의료 AI
밑바닥부터 시작하는 의료 AINAVER Engineering
 
Seistech SQL code
Seistech SQL codeSeistech SQL code
Seistech SQL codeSimon Hoyle
 
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and OptimizationPgDay.Seoul
 
M12 random forest-part01
M12 random forest-part01M12 random forest-part01
M12 random forest-part01Raman Kannan
 
Clustering com numpy e cython
Clustering com numpy e cythonClustering com numpy e cython
Clustering com numpy e cythonAnderson Dantas
 
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...Raman Kannan
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
 
Database API, your new friend
Database API, your new friendDatabase API, your new friend
Database API, your new friendkikoalonsob
 
M11 bagging loo cv
M11 bagging loo cvM11 bagging loo cv
M11 bagging loo cvRaman Kannan
 
M09-Cross validating-naive-bayes
M09-Cross validating-naive-bayesM09-Cross validating-naive-bayes
M09-Cross validating-naive-bayesRaman Kannan
 
第7回 大規模データを用いたデータフレーム操作実習(1)
第7回 大規模データを用いたデータフレーム操作実習(1)第7回 大規模データを用いたデータフレーム操作実習(1)
第7回 大規模データを用いたデータフレーム操作実習(1)Wataru Shito
 
手把手教你 R 語言分析實務
手把手教你 R 語言分析實務手把手教你 R 語言分析實務
手把手教你 R 語言分析實務Helen Chang
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature SelectionJames Huang
 

La actualidad más candente (20)

Five
FiveFive
Five
 
Palestra sobre Collections com Python
Palestra sobre Collections com PythonPalestra sobre Collections com Python
Palestra sobre Collections com Python
 
밑바닥부터 시작하는 의료 AI
밑바닥부터 시작하는 의료 AI밑바닥부터 시작하는 의료 AI
밑바닥부터 시작하는 의료 AI
 
Seistech SQL code
Seistech SQL codeSeistech SQL code
Seistech SQL code
 
03
0303
03
 
PHP 5.4
PHP 5.4PHP 5.4
PHP 5.4
 
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
 
M12 random forest-part01
M12 random forest-part01M12 random forest-part01
M12 random forest-part01
 
Clustering com numpy e cython
Clustering com numpy e cythonClustering com numpy e cython
Clustering com numpy e cython
 
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
 
Database API, your new friend
Database API, your new friendDatabase API, your new friend
Database API, your new friend
 
M11 bagging loo cv
M11 bagging loo cvM11 bagging loo cv
M11 bagging loo cv
 
Session 02
Session 02Session 02
Session 02
 
Python 1
Python 1Python 1
Python 1
 
M09-Cross validating-naive-bayes
M09-Cross validating-naive-bayesM09-Cross validating-naive-bayes
M09-Cross validating-naive-bayes
 
第7回 大規模データを用いたデータフレーム操作実習(1)
第7回 大規模データを用いたデータフレーム操作実習(1)第7回 大規模データを用いたデータフレーム操作実習(1)
第7回 大規模データを用いたデータフレーム操作実習(1)
 
手把手教你 R 語言分析實務
手把手教你 R 語言分析實務手把手教你 R 語言分析實務
手把手教你 R 語言分析實務
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection
 
Drupal 8 database api
Drupal 8 database apiDrupal 8 database api
Drupal 8 database api
 

Destacado

Taste of Wine vs. Data Science
Taste of Wine vs. Data ScienceTaste of Wine vs. Data Science
Taste of Wine vs. Data ScienceTakashi J OZAKI
 
「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探る
「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探る「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探る
「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探るTakashi J OZAKI
 
Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014
Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014
Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014Takashi J OZAKI
 
直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測する
直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測する直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測する
直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測するTakashi J OZAKI
 
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」Takashi J OZAKI
 
データ分析というお仕事のこれまでとこれから(HCMPL2014)
データ分析というお仕事のこれまでとこれから(HCMPL2014)データ分析というお仕事のこれまでとこれから(HCMPL2014)
データ分析というお仕事のこれまでとこれから(HCMPL2014)Takashi J OZAKI
 
Trading volume mapping R in recent environment
Trading volume mapping R in recent environment Trading volume mapping R in recent environment
Trading volume mapping R in recent environment Nagi Teramo
 
最新業界事情から見るデータサイエンティストの「実像」
最新業界事情から見るデータサイエンティストの「実像」最新業界事情から見るデータサイエンティストの「実像」
最新業界事情から見るデータサイエンティストの「実像」Takashi J OZAKI
 
『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいもの
『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいもの『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいもの
『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいものTakashi J OZAKI
 
Granger因果による 時系列データの因果推定(因果フェス2015)
Granger因果による時系列データの因果推定(因果フェス2015)Granger因果による時系列データの因果推定(因果フェス2015)
Granger因果による 時系列データの因果推定(因果フェス2015)Takashi J OZAKI
 
計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo Webmining
計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo Webmining計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo Webmining
計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo WebminingTakashi J OZAKI
 
21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫る
21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫る21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫る
21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫るTakashi J OZAKI
 
Rで計量時系列分析~CRANパッケージ総ざらい~
Rで計量時系列分析~CRANパッケージ総ざらい~ Rで計量時系列分析~CRANパッケージ総ざらい~
Rで計量時系列分析~CRANパッケージ総ざらい~ Takashi J OZAKI
 
ビジネスの現場のデータ分析における理想と現実
ビジネスの現場のデータ分析における理想と現実ビジネスの現場のデータ分析における理想と現実
ビジネスの現場のデータ分析における理想と現実Takashi J OZAKI
 
Tech Lab Paak講演会 20150601
Tech Lab Paak講演会 20150601Tech Lab Paak講演会 20150601
Tech Lab Paak講演会 20150601Takashi J OZAKI
 
なぜ統計学がビジネスの 意思決定において大事なのか?
なぜ統計学がビジネスの 意思決定において大事なのか?なぜ統計学がビジネスの 意思決定において大事なのか?
なぜ統計学がビジネスの 意思決定において大事なのか?Takashi J OZAKI
 
Simple perceptron by TJO
Simple perceptron by TJOSimple perceptron by TJO
Simple perceptron by TJOTakashi J OZAKI
 
傾向スコアを使ったキャンペーン効果検証V1
傾向スコアを使ったキャンペーン効果検証V1傾向スコアを使ったキャンペーン効果検証V1
傾向スコアを使ったキャンペーン効果検証V1Kazuya Obanayama
 

Destacado (20)

Taste of Wine vs. Data Science
Taste of Wine vs. Data ScienceTaste of Wine vs. Data Science
Taste of Wine vs. Data Science
 
「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探る
「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探る「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探る
「データサイエンティスト・ブーム」後の企業におけるデータ分析者像を探る
 
Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014
Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014
Deep Learningと他の分類器をRで比べてみよう in Japan.R 2014
 
直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測する
直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測する直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測する
直感的な単変量モデルでは予測できない「ワインの味」を多変量モデルで予測する
 
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
 
データ分析というお仕事のこれまでとこれから(HCMPL2014)
データ分析というお仕事のこれまでとこれから(HCMPL2014)データ分析というお仕事のこれまでとこれから(HCMPL2014)
データ分析というお仕事のこれまでとこれから(HCMPL2014)
 
Trading volume mapping R in recent environment
Trading volume mapping R in recent environment Trading volume mapping R in recent environment
Trading volume mapping R in recent environment
 
最新業界事情から見るデータサイエンティストの「実像」
最新業界事情から見るデータサイエンティストの「実像」最新業界事情から見るデータサイエンティストの「実像」
最新業界事情から見るデータサイエンティストの「実像」
 
Salmon cycle
Salmon cycleSalmon cycle
Salmon cycle
 
Jc 20141003 tjo
Jc 20141003 tjoJc 20141003 tjo
Jc 20141003 tjo
 
『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいもの
『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいもの『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいもの
『手を動かしながら学ぶ ビジネスに活かすデータマイニング』で目指したもの・学んでもらいたいもの
 
Granger因果による 時系列データの因果推定(因果フェス2015)
Granger因果による時系列データの因果推定(因果フェス2015)Granger因果による時系列データの因果推定(因果フェス2015)
Granger因果による 時系列データの因果推定(因果フェス2015)
 
計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo Webmining
計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo Webmining計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo Webmining
計量時系列分析の立場からビジネスの現場のデータを見てみよう - 30th Tokyo Webmining
 
21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫る
21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫る21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫る
21世紀で最もセクシーな職業!?「データサイエンティスト」の実像に迫る
 
Rで計量時系列分析~CRANパッケージ総ざらい~
Rで計量時系列分析~CRANパッケージ総ざらい~ Rで計量時系列分析~CRANパッケージ総ざらい~
Rで計量時系列分析~CRANパッケージ総ざらい~
 
ビジネスの現場のデータ分析における理想と現実
ビジネスの現場のデータ分析における理想と現実ビジネスの現場のデータ分析における理想と現実
ビジネスの現場のデータ分析における理想と現実
 
Tech Lab Paak講演会 20150601
Tech Lab Paak講演会 20150601Tech Lab Paak講演会 20150601
Tech Lab Paak講演会 20150601
 
なぜ統計学がビジネスの 意思決定において大事なのか?
なぜ統計学がビジネスの 意思決定において大事なのか?なぜ統計学がビジネスの 意思決定において大事なのか?
なぜ統計学がビジネスの 意思決定において大事なのか?
 
Simple perceptron by TJO
Simple perceptron by TJOSimple perceptron by TJO
Simple perceptron by TJO
 
傾向スコアを使ったキャンペーン効果検証V1
傾向スコアを使ったキャンペーン効果検証V1傾向スコアを使ったキャンペーン効果検証V1
傾向スコアを使ったキャンペーン効果検証V1
 

Similar a Visualization of Supervised Learning with {arules} + {arulesViz}

TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RFayan TAO
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxPremaGanesh1
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
Is your excel production code?
Is your excel production code?Is your excel production code?
Is your excel production code?ProCogia
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageMajid Abdollahi
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
 
mobl presentation @ IHomer
mobl presentation @ IHomermobl presentation @ IHomer
mobl presentation @ IHomerzefhemel
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Vivian S. Zhang
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureRai University
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with RDr Nisha Arora
 
Bca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureBca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureRai University
 
Data Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with NData Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with NOllieShoresna
 
Presentation on use of r statistics
Presentation on use of r statisticsPresentation on use of r statistics
Presentation on use of r statisticsKrishna Dhakal
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureRai University
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R langsenthil0809
 

Similar a Visualization of Supervised Learning with {arules} + {arulesViz} (20)

TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with R
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
 
R studio
R studio R studio
R studio
 
ML .pptx
ML .pptxML .pptx
ML .pptx
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Is your excel production code?
Is your excel production code?Is your excel production code?
Is your excel production code?
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R Language
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
 
UNIT V.docx
UNIT V.docxUNIT V.docx
UNIT V.docx
 
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
 
mobl presentation @ IHomer
mobl presentation @ IHomermobl presentation @ IHomer
mobl presentation @ IHomer
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with R
 
Bca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureBca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structure
 
UNIT IV -Data Structures.pdf
UNIT IV -Data Structures.pdfUNIT IV -Data Structures.pdf
UNIT IV -Data Structures.pdf
 
Data Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with NData Manipulation with Numpy and Pandas in PythonStarting with N
Data Manipulation with Numpy and Pandas in PythonStarting with N
 
Presentation on use of r statistics
Presentation on use of r statisticsPresentation on use of r statistics
Presentation on use of r statistics
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structure
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R lang
 

Último

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Último (20)

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Visualization of Supervised Learning with {arules} + {arulesViz}

  • 1. Visualization of Supervised Learning with {arules} + {arulesViz} Takashi J. OZAKI, Ph. D. Recruit Communications Co., Ltd. 2014/4/17 1
  • 2. About me  Twitter: @TJO_datasci  Data Scientist (Quant Analyst) in Recruit group  A group of companies in advertisement media and human resources  Known as a major player with big data  Current mission: ad-hoc analysis on various marketing data  Actually, still I’m new to the field of data science 2014/4/17 2
  • 3. About me  Original background: neuroscience in the human brain (6 years experience as postdoc researcher) 2014/4/17 3 (Ozaki, PLoS One, 2011)
  • 4. About me  English version of my blog http://tjo-en.hatenablog.com/ 2014/4/17 4
  • 6. 2014/4/17 6 Graphical Visualization of Supervised Learning
  • 7. Advantage of this technique More intuitive Easy to grasp even for high- dimensional data Even lay guys can easily understand Useful for presentation 2014/4/17 7
  • 8. Supervised learning: lower dimension, more intuitive  In case of 2D data… (e.g. nonlinear SVM) 2014/4/17 8 x y label 0.924335 -1.0665Yes 2.109901 2.615284No 0.988192 -0.90812Yes 1.299749 0.944518No -0.60885 0.457816Yes -2.25484 1.615489Yes
  • 9. Supervised learning: higher dimension, less intuitive  In case of 7D… no way!!! 2014/4/17 9 game1 game2 game3 social1 social2 app1 app2 cv 0 0 0 1 0 0 0No 1 0 0 1 1 0 0No 0 1 1 1 1 1 0Yes 0 0 1 1 0 1 1Yes 1 0 1 0 1 1 1Yes 0 0 0 1 1 1 0No … … … … … … …… ???
  • 10. 2014/4/17 10 Is there any technique that can easily visualize supervised learning with higher dimension? (…for lay people?)
  • 11. 2014/4/17 11  {arules} + {arulesViz}
  • 12. Why association rules and its visualization?  Much roughly, association rules can be interpreted as a kind of (likeness of) generative modeling  A large set of conditional probability  If it can be regarded as a set of conditional probability, it also can be described as (likeness of) Bayesian network “XY”  If it’s like a Bayesian network, it can be visualized as graph representation, e.g. by {igraph} 2014/4/17 12 𝑠𝑢𝑝𝑝 𝑋 → 𝑌 = 𝜎(𝑋 ∪ 𝑌) 𝑀 𝑐𝑜𝑛𝑓 𝑋 → 𝑌 = 𝑠𝑢𝑝𝑝(𝑋 → 𝑌) 𝑠𝑢𝑝𝑝(𝑋) 𝑙𝑖𝑓𝑡 𝑋 → 𝑌 = 𝑐𝑜𝑛𝑓(𝑋 → 𝑌) 𝑠𝑢𝑝𝑝(𝑌) X Y
  • 13. Further points…  Only when all of independent variables are bivariate, they can be handled as “basket transaction” 2014/4/17 13 game1 game2 game3 social1 social2 app1 app2 cv 0 0 0 1 0 0 0No 1 0 0 1 1 0 0No 0 1 1 1 1 1 0Yes 0 0 1 1 0 1 1Yes 1 0 1 0 1 1 1Yes 0 0 0 1 1 1 0No … … … … … … …… {social1, No} {game1, social1, social2, No} {game2, game3, social1, social2, app1, Yes} {game3, social1, app1, app2, Yes} {game1, game3, social2, app1, app2, Yes} {socia1, social2, app1, No} …
  • 15. Sample data “d1” 2014/4/17 15 game1 game2 game3 social1 social2 app1 app2 cv 0 0 0 1 0 0 0No 1 0 0 1 1 0 0No 0 1 1 1 1 1 0Yes 0 0 1 1 0 1 1Yes 1 0 1 0 1 1 1Yes 0 0 0 1 1 1 0No … … … … … … …… Imagine you’re working on a certain platform for web entertainment. It has 3 SP games, 2 SP social networking, 2 apps. The data records user’s history of any activity on each content in a month after registration, and “cv” label describes they are still active after a month passed.
  • 16. In the case with svm {e1071}… 2014/4/17 16 > d1.svm<-svm(cv~.,d1) # install and require {e1071} # svm {e1071} > table(d1$cv,predict(d1.svm,d1[,-8])) No Yes No 1402 98 Yes 80 1420 # Good accuracy (only for training data)
  • 17. In the case with randomForest {randomForest}… 2014/4/17 17 > tuneRF(d1[,-8],d1[,8],doBest=T) # install and require {randomForest} # (omitted) > d1.rf<-randomForest(cv~.,d1,mtry=2) # randomForest {randomForest} > table(d1$cv,predict(d1.rf,d1[,-8])) No Yes No 1413 87 Yes 92 1408 # Good accuracy > importance(d1.rf) MeanDecreaseGini game1 20.640253 game2 12.115196 game3 2.355584 social1 189.053648 social2 76.476470 app1 796.937087 app2 2.804019 # Variable importance (without any directionality)
  • 18. In the case with glm {stats}… 2014/4/17 18 > d1.glm<-glm(cv~.,d1,family=binomial) > summary(d1.glm) Call: glm(formula = cv ~ ., family = binomial, data = d1) # (omitted) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.37793 0.25979 -5.304 1.13e-07 *** game1 1.05846 0.17344 6.103 1.04e-09 *** game2 -0.54914 0.16752 -3.278 0.00105 ** game3 0.12035 0.16803 0.716 0.47386 social1 -3.00110 0.21653 -13.860 < 2e-16 *** social2 1.53098 0.17349 8.824 < 2e-16 *** app1 5.33547 0.19191 27.802 < 2e-16 *** app2 0.07811 0.16725 0.467 0.64048 --- # (omitted)
  • 19. Sample data converted for transactions “d2” 2014/4/17 19 game1 game2 game3 social1 social2 app1 app2 yes no 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0 1 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 … … … … … … … … … Just “cv” column was divided into 2 columns: “yes” and “no” with bivariate (0 or 1)
  • 20. Run apriori {arules} to get association rules 2014/4/17 20 > d2.ap.small<-apriori(as.matrix(d2)) # install and require {arules} parameter specification: confidence minval smax arem aval originalSupport support minlen maxlen target ext 0.8 0.1 1 none FALSE TRUE 0.1 1 10 rules FALSE algorithmic control: filter tree heap memopt load sort verbose 0.1 TRUE TRUE FALSE TRUE 2 TRUE apriori - find association rules with the apriori algorithm version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt set item appearances ...[0 item(s)] done [0.00s]. set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s]. sorting and recoding items ... [9 item(s)] done [0.00s]. creating transaction tree ... done [0.00s]. checking subsets of size 1 2 3 4 5 done [0.00s]. writing ... [50 rule(s)] done [0.00s]. # only 50 rules… creating S4 object ... done [0.00s].
  • 21. Run apriori {arules} to get association rules 2014/4/17 21 > d2.ap.large<-apriori(as.matrix(d2),parameter=list(support=0.001)) parameter specification: confidence minval smax arem aval originalSupport support minlen maxlen target ext 0.8 0.1 1 none FALSE TRUE 0.001 1 10 rules FALSE algorithmic control: filter tree heap memopt load sort verbose 0.1 TRUE TRUE FALSE TRUE 2 TRUE apriori - find association rules with the apriori algorithm version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt set item appearances ...[0 item(s)] done [0.00s]. set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s]. sorting and recoding items ... [9 item(s)] done [0.00s]. creating transaction tree ... done [0.00s]. checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s]. writing ... [182 rule(s)] done [0.00s]. # as much as 182 rules creating S4 object ... done [0.00s].
  • 22. OK, just visualize it 2014/4/17 22 > require(“arulesViz”) # (omitted) > plot(d2.ap.small, method=“graph”, control=list(type=“items”, layout=layout.fruchterman.reingold,)) > plot(d2.ap.large, method=“graph”, control=list(type=“items”, layout=layout.fruchterman.reingold,)) # Fruchterman – Reingold force-directed graph drawing algorithm can locate nodes with distances that is proportional to “shortest path length” between them # Then nodes (items) should be located based on their “closeness” between each other
  • 23. Small set of rules visualized with {arulesViz} 2014/4/17 23
  • 24. Compare with a result of glm 2014/4/17 24 > d1.glm<-glm(cv~.,d1,family=binomial) > summary(d1.glm) Call: glm(formula = cv ~ ., family = binomial, data = d1) # (omitted) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.37793 0.25979 -5.304 1.13e-07 *** game1 1.05846 0.17344 6.103 1.04e-09 *** game2 -0.54914 0.16752 -3.278 0.00105 ** game3 0.12035 0.16803 0.716 0.47386 social1 -3.00110 0.21653 -13.860 < 2e-16 *** social2 1.53098 0.17349 8.824 < 2e-16 *** app1 5.33547 0.19191 27.802 < 2e-16 *** app2 0.07811 0.16725 0.467 0.64048 --- # (omitted)
  • 25. Large set of rules visualized with {arulesViz} 2014/4/17 25
  • 26. Compare with a result of randomForest 2014/4/17 26 > tuneRF(d1[,-8],d1[,8],doBest=T) # install and require {randomForest} # (omitted) > d1.rf<-randomForest(cv~.,d1,mtry=2) # randomForest {randomForest} > table(d1$cv,predict(d1.rf,d1[,-8])) No Yes No 1413 87 Yes 92 1408 # Good accuracy > importance(d1.rf) MeanDecreaseGini game1 20.640253 game2 12.115196 game3 2.355584 social1 189.053648 social2 76.476470 app1 796.937087 app2 2.804019 # Variable importance (without any directionality)
  • 27. See how far nodes are from yes / no 2014/4/17 27
  • 28. Large set of rules visualized with {arulesViz} 2014/4/17 28
  • 29. Advantage of this technique More intuitive Easy to grasp even for high- dimensional data Even lay guys can easily understand Useful for presentation 2014/4/17 29
  • 30. Disadvantage of this technique Less strict Never quantitative 2014/4/17 30
  • 31. Any questions or comments? 2014/4/17 31 Don’t hesitate to ask me! @TJO_datasci