A HOW-TO guide to exploratory data analysis using xgboost (EDAXGB): feature importance, sensitivity analysis, feature contribution, and feature interaction. It is based only on the built-in predict() function of the R package.
All of the sample code is available at: https://github.com/katokohaku/EDAxgboost
3. Exploratory Data Analysis (EDA)
https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
is an approach/philosophy for data analysis that employs a variety of
techniques (mostly graphical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.
4. EDA (or explanation) after modelling
Taxonomy of Interpretation / Explanation
https://christophm.github.io/interpretable-ml-book/
5. EDA using Random Forest (EDARF)
Exploratory data analysis using randomForest (off-topic)
• Random Forest model
• Imputation for missing values: rfImpute(), {missForest}
• Rule Extraction: {inTrees}, defragTrees (@Python)
• Proximity / tree structure: EDARF::plot_prox(), getTree()
• Feature importance: Gini / Accuracy, permutation based
• Sensitivity analysis: Partial Dependence Plot (PDP), feature contribution based {forestFloor}
• Suggestion: Feature Tweaking
6. Today's topic
Taxonomy of Interpretation / Explanation methods (Intrinsic vs. Post hoc):

Model-Specific Methods
• Intrinsic: Linear Regression, Logistic Regression, GLM / GAM and more, Decision Tree, Decision Rules, RuleFit, Naive Bayes Classifier, K-Nearest Neighbors
• Post hoc: Feature Importance (OOB error @RF; gain/cover/weight @XGB), Feature Contribution (forestFloor @RF; xgboostExplainer, lightgbmExplainer), Alternate / Enumerate lasso (@LASSO), inTrees / defragTrees (@RF/XGB), Actionable feature tweaking (@RF/XGB)

Model-Agnostic Methods
• Intrinsic: also applicable to intrinsically interpretable models
• Post hoc: Partial Dependence Plot, Individual Conditional Expectation, Accumulated Local Effects Plot, Feature Interaction, Permutation Feature Importance, Global Surrogate, Local Explanation (LIME, Shapley Values, breakDown)

Example-based Explanations
• Post hoc: Counterfactual Explanations, Adversarial Examples, Prototypes and Criticisms, Influential Instances

Focus of this talk: EDA × XGBoost
14. EDA (or explanation) after modelling
1. Build XGBoost model
2. Feature importance
• Structure based (Gain & Cover)
• Permutation based
3. Variable response (1)
• Partial Dependence Plot (PDP / ICE / ALE)
4. Rule Extraction
• xgb.model.dt.tree()
• inTrees
5. Individual explanation
• Shapley value (predcontrib)
• Structure based (approxcontrib)
6. Variable response (2)
• Shapley value (predcontrib)
• Structure based (approxcontrib)
7. Feature interaction
• 2-way SHAP (predinteraction)
URL: EDA tools for XGBoost
Suggestion (off-topic): Feature Tweaking
15. Human Resources Analytics Data Set
Preparation
• left (target to predict): whether the employee left the workplace or not (1 or 0; factor)
• satisfaction_level: level of satisfaction (0-1)
• last_evaluation: time since last performance evaluation (in years)
• number_project: number of projects completed while at work
• average_montly_hours: average monthly hours at the workplace
• time_spend_company: number of years spent in the company
• Work_accident: whether the employee had a workplace accident
• promotion_last_5years: whether the employee was promoted in the last five years
• sales: department in which they work
• salary: relative level of salary (low / medium / high)
Source:
https://github.com/ryankarlos/Human-Resource-Analytics-Kaggle-Dataset/tree/master/Original_Kaggle_Dataset
17. + Random Noise
Add random noise to the continuous features, in the same way as:
• https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Preparation
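A minimal sketch of this step, assuming a data.frame HR_data; the noise scale (5% of each feature's range) and the choice of continuous columns are assumptions, see the linked post for the original recipe:

set.seed(1)
num_cols <- c("satisfaction_level", "last_evaluation", "average_montly_hours")
for (col in num_cols) {
  rng <- diff(range(HR_data[[col]]))   # range of this feature
  HR_data[[col]] <- HR_data[[col]] + runif(nrow(HR_data), -0.05 * rng, 0.05 * rng)
}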
19. Convert Train / Test set to xgb.DMatrix
Preparation
1. Factor variable → Integer (or dummy variables)
2. Separate train set / test set (+ under-sampling)
3. (data.frame →) matrix → xgb.DMatrix
20. Convert Train / Test set to xgb.DMatrix
Factor → Integer; separate the train set (+ under-sampling, to minimize the intercept of the xgb model); convert to xgb.DMatrix.
Factor → Integer; separate the test set; convert to xgb.DMatrix.
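A minimal sketch of this preparation plus model fitting, so the later examples have a booster to explain (column names and the 70/30 split are assumptions; under-sampling is omitted for brevity):

library(xgboost)

df <- HR_data
df$sales  <- as.integer(as.factor(df$sales))     # factor -> integer
df$salary <- as.integer(as.factor(df$salary))

y <- as.integer(as.character(df$left))           # 0/1 target
X <- as.matrix(df[, setdiff(names(df), "left")]) # data.frame -> matrix

set.seed(1)
idx      <- sample(nrow(X), round(0.7 * nrow(X)))
train_dm <- xgb.DMatrix(X[idx, ],  label = y[idx])   # matrix -> xgb.DMatrix
test_dm  <- xgb.DMatrix(X[-idx, ], label = y[-idx])

model <- xgb.train(params = list(objective = "binary:logistic",
                                 eta = 0.1, max_depth = 6),
                   data = train_dm, nrounds = 100)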
27. EDA (or explanation) after modelling
1. Build XGBoost model
2. Feature importance
• Structure based (Gain & Cover)
• Permutation based
3. Variable response (1)
• Partial Dependence Plot (PDP / ICE / ALE)
4. Rule Extraction
• xgb.model.dt.tree()
• inTrees
5. Individual explanation
• Shapley value (predcontrib)
• Structure based (approxcontrib)
6. Variable response (2)
• Shapley value (predcontrib)
• Structure based (approxcontrib)
7. Feature interaction
• 2-way SHAP (predinteraction)
URL: EDA tools for XGBoost
Suggestion (off-topic): Feature Tweaking
28. xgb.importance()
Feature importance
For a tree model:
Gain
• represents the fractional contribution of each feature to the model, based on the total gain of this feature's splits. A higher percentage means a more important predictive feature.
Cover
• a metric of the relative number of observations related to this feature.
Frequency
• the percentage representing the relative number of times a feature has been used in trees.
For a linear model:
Weight
• the linear coefficient of the feature.
https://www.rdocumentation.org/packages/xgboost/versions/0.6.4.1/topics/xgb.importance
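A minimal usage sketch, assuming the booster model from the preparation step:

imp <- xgb.importance(model = model)  # columns: Feature, Gain, Cover, Frequency
head(imp)
xgb.plot.importance(imp, top_n = 10)  # bar chart of the top features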
29. Feature importance (structure based)
1. Calculate the weight of each node as if it were not split further
2. Distribute the weight differences to each child node
3. Accumulate the weights along the path passed by each observation, for each booster and for each feature (node)
30. Feature importance (structure based)
Feature importance
Gain
• represents the fractional contribution of each feature to the model, based on the total gain of this feature's splits. A higher percentage means a more important predictive feature.
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
The gain of the i-th feature at the k-th node in the j-th booster is the split gain from the linked slides:
$$ \mathrm{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} \right] - \gamma $$
where $G_L, G_R$ ($H_L, H_R$) are the sums of the first-order (second-order) gradients of the loss in the left and right child, $\lambda$ is the L2 regularization weight, and $\gamma$ is the cost of adding a leaf.
31. Feature importance (permutation based)
Feature importance
• Calculated as the increase in the model's prediction error after permuting the feature.
• A feature is "important" if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.
https://christophm.github.io/interpretable-ml-book/feature-importance.html
FROM: https://www.kaggle.com/dansbecker/permutation-importance
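A minimal sketch of permutation importance (log-loss as the error metric is an assumption), reusing model, X, y, and idx from the preparation step:

logloss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

perm_importance <- function(model, X, y, n_rep = 5) {
  base <- logloss(y, predict(model, X))          # baseline error
  sapply(colnames(X), function(col) {
    mean(replicate(n_rep, {
      Xp <- X
      Xp[, col] <- sample(Xp[, col])             # shuffle one feature
      logloss(y, predict(model, Xp)) - base      # increase in error
    }))
  })
}

sort(perm_importance(model, X[-idx, ], y[-idx]), decreasing = TRUE)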
32. Structure based vs Permutation based
Feature Importance
(Plots: structure-based and permutation-based importance, side by side.)
Use them as a consistency check, rather than to ask "which is better?".
35. EDA (or explanation) after modelling
1. Build XGBoost model
2. Feature importance
• Structure based (Gain & Cover)
• Permutation based
3. Variable response (1)
• Partial Dependence Plot (PDP / ICE / ALE)
4. Rule Extraction
• xgb.model.dt.tree()
• inTrees
5. Individual explanation
• Shapley value (predcontrib)
• Structure based (approxcontrib)
6. Variable response (2)
• Shapley value (predcontrib)
• Structure based (approxcontrib)
7. Feature interaction
• 2-way SHAP (predinteraction)
URL: EDA tools for XGBoost
Suggestion (off-topic): Feature Tweaking
36. Marginal Response for a Single Variable
Sensitivity Analysis: ICE+PDP vs ALE Plot
(Plots: variable response comparison, ICE+PD plot vs. ALE plot.)
37. What-If & other observations (ICE) + average line (PD)
Ceteris Paribus Plot (blue line)
• shows possible scenarios for the model prediction, allowing changes in a single dimension while keeping all other features constant (the ceteris paribus principle).
Individual Conditional Expectation (ICE) plot (gray lines)
• visualizes one line per instance.
Partial Dependence plot (red line)
• is the average line over all observations.
https://christophm.github.io/interpretable-ml-book/ice.html
(Plot: model output vs. feature value.)
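A minimal sketch with the {pdp} package (the package choice is an assumption; the deck's own code may differ), reusing model and the train matrix:

library(pdp)
# ICE curves (one per observation) plus their centered PD average for one feature
partial(model, pred.var = "satisfaction_level", train = X[idx, ],
        ice = TRUE, center = TRUE, prob = TRUE, plot = TRUE)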
38. The assumption of independence
• is the biggest issue with Partial Dependence plots. When features are correlated, PD creates new data points in areas of the feature distribution where the actual probability is very low.
Disadvantage of Ceteris Paribus Plots and PDP
https://christophm.github.io/interpretable-ml-book/pdp.html#disadvantages-5
For example, it is unlikely that someone is 2 meters tall but weighs less than 50 kg.
39. A Solution
Local Effect
• averages the prediction differences (derivatives) of observations over the conditional distribution, instead of averaging over the whole distribution of the target feature.
Accumulated Local Effects (ALE)
• accumulates the local effects after they have been calculated within each window (interval) of the feature.
https://arxiv.org/abs/1612.08468
(Plot: local effects per window; ALE = mean(Local Effects).)
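A minimal sketch with the {ALEPlot} package (an assumption, any ALE tool works; the wrapper follows ALEPlot's pred.fun convention):

library(ALEPlot)
pred_fun <- function(X.model, newdata) predict(X.model, as.matrix(newdata))
ALEPlot(as.data.frame(X[idx, ]), model, pred.fun = pred_fun,
        J = which(colnames(X) == "satisfaction_level"), K = 40)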
52. Build a simplified tree ensemble learner (STEL)
Rule Extraction: {inTrees}
All of the sample code is at:
https://github.com/katokohaku/EDAxgboost/blob/master/310_rule_extraction_inTrees.md
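A minimal sketch of the STEL workflow (assumes an inTrees version that provides XGB2List() for xgboost boosters):

library(inTrees)
X_train  <- X[idx, ]
treeList <- XGB2List(model, X_train)                # booster -> list of trees
rules    <- extractRules(treeList, X_train)         # candidate rules
metrics  <- getRuleMetric(rules, X_train, y[idx])   # evaluate rules
metrics  <- pruneRule(metrics, X_train, y[idx])     # prune redundant conditions
learner  <- buildLearner(metrics, X_train, y[idx])  # the simplified learner (STEL)
presentRules(learner, colnames(X_train))            # human-readable rules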
54. EDA (or explanation) after modelling
1. Build XGBoost model
2. Feature importance
• Structure based (Gain & Cover)
• Permutation based
3. Variable response (1)
• Partial Dependence Plot (PDP / ICE / ALE)
4. Rule Extraction
• xgb.model.dt.tree()
• inTrees
5. Individual explanation
• Shapley value (predcontrib)
• Structure based (approxcontrib)
6. Variable response (2)
• Shapley value (predcontrib)
• Structure based (approxcontrib)
7. Feature interaction
• 2-way SHAP (predinteraction)
URL: EDA tools for XGBoost
Suggestion (off-topic): Feature Tweaking
55. Shapley value
A method for assigning payouts to players depending on their contribution to
the total payout. Players cooperate in a coalition and receive a certain profit
from this cooperation.
The “game”
• is the prediction task for a single instance of the dataset.
The “gain”
• is the actual prediction for this instance minus the average prediction for all instances.
The “players”
• are the feature values of the instance that collaborate to receive the gain (= predict a
certain value).
• https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
• https://christophm.github.io/interpretable-ml-book/shapley.html
Feature contribution based on cooperative game theory
56. Shapley value
The Shapley value is the average of a feature's marginal contributions to all possible coalitions.
• One solution to keep the computation time manageable is to compute contributions for only a few samples of the possible coalitions.
• https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
• https://christophm.github.io/interpretable-ml-book/shapley.html
Feature contribution based on cooperative game theory
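With xgboost, per-instance Shapley values come straight from the built-in predict(); a minimal sketch reusing model and test_dm:

contrib <- predict(model, test_dm, predcontrib = TRUE)
head(contrib)  # one row per observation: per-feature SHAP values + BIAS column
# rows sum to the margin (log-odds) prediction, up to floating-point tolerance
all.equal(unname(rowSums(contrib)), predict(model, test_dm, outputmargin = TRUE))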
58. Breakdown individual explanation path
Feature contribution based on tree structure
Based on the xgboost model structure:
1. Calculate the weight of each node as if it were not split further
2. Distribute the weight differences to each child node
3. Accumulate the weights along the path passed by each observation, for each booster and for each feature (node)
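The structure-based breakdown is exposed through the same predict() call; a minimal sketch:

contrib_approx <- predict(model, test_dm, predcontrib = TRUE, approxcontrib = TRUE)
head(contrib_approx)  # same layout as the SHAP version, computed from the tree paths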
61. Individual explanation path
Enumerate feature contributions based on Shapley values / tree structure
Each row explains one observation (prediction breakdown)
66. Identify clusters based on xgboost
Rtsne::Rtsne() → hclust() → cutree() → ggrepel::geom_label_repel()
• Class labeling using hierarchical clustering (hclust)
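A minimal sketch of this pipeline (clustering the SHAP contribution matrix contrib is an assumption, as is k = 6):

library(Rtsne); library(ggplot2); library(ggrepel)

ts <- Rtsne(contrib[, colnames(contrib) != "BIAS"], check_duplicates = FALSE)
hc <- hclust(dist(ts$Y), method = "ward.D2")   # hierarchical clustering
cl <- cutree(hc, k = 6)                        # cut into k clusters

dat     <- data.frame(x = ts$Y[, 1], y = ts$Y[, 2], cluster = factor(cl))
centers <- aggregate(cbind(x, y) ~ cluster, dat, mean)
ggplot(dat, aes(x, y, colour = cluster)) +
  geom_point(alpha = 0.5) +
  geom_label_repel(data = centers, aes(label = cluster), colour = "black")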
75. Individual Feature Impact (1)
Sensitivity Analysis
Each column shows the impact of one feature (variable response)
76. Individual Feature Impact (2-1)
Sensitivity Analysis
Each column shows the impact of one feature (variable response)
78. Individual Feature Impact (2-2)
Sensitivity Analysis
Each column shows the impact of one feature (variable response)
80. Contribution dependency plots
Sensitivity Analysis
xgb.plot.shap()
• displays the estimated contribution (Shapley value) of a feature to the model prediction for each individual case.
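A minimal usage sketch, reusing the test matrix and booster from the preparation step:

# SHAP contribution vs. feature value, one panel per top feature
xgb.plot.shap(data = X[-idx, ], model = model, top_n = 4, n_col = 2)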
81. Feature Impact Summary
Sensitivity Analysis
http://www.f1-predictor.com/model-interpretability-with-shap/
Similar to a SHAP summary plot,
• a contribution breakdown from the prediction path (model structure).
90. Absolute mean of all interactions
• SHAP values can be decomposed into 2-way feature interactions:
xgboost:::predict.xgb.Booster(..., predinteraction = TRUE)
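A minimal sketch of the full interaction tensor via the built-in predict():

inter <- predict(model, test_dm, predinteraction = TRUE)
# a 3-d array: n_obs x (p + 1) x (p + 1), including the BIAS column
imat <- apply(abs(inter), c(2, 3), mean)  # absolute mean over observations
imat                                      # global 2-way interaction strengths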
92. xgboost
Original Paper
• https://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system
Tasks, Metrics & other Parameters
• https://xgboost.readthedocs.io/en/latest/
For R
• http://dmlc.ml/rstats/2016/03/10/xgboost.html
• https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
• https://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html
Explanatory blog posts & slides (in Japanese)
• http://kefism.hatenablog.com/entry/2017/06/11/182959
• https://speakerdeck.com/hoxomaxwell/dive-into-xgboost
References
93. Data & Model explanation
Generic interpretability/explainability
• Interpretable Machine Learning (IML) book
• https://christophm.github.io/interpretable-ml-book/
Exploratory Data Analysis (EDA)
• What is EDA?
• https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
• DALEX
• Descriptive mAchine Learning EXplanations
• https://pbiecek.github.io/DALEX/
• DrWhy
• a collection of tools for Explainable AI (XAI)
• https://pbiecek.github.io/DALEX/
References