Amazon Employee Access Challenge (Kaggle)
NYC Open Data Meetup / NYC Data Science Academy
Vivian Zhang, SupStat Inc
2. Agenda
1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
2 of 65 6/13/14, 2:01 PM
4. Introduction to the Challenge
the mission
· build an automatic access model from the historical data
· determine an employee's access privilege from their job role and the resource they applied for
5. Introduction to the Challenge
the data
· The data consists of real historical data collected during 2010 and 2011.
· Employees were manually allowed or denied access to resources over time.
the files
· train.csv - the training set. Each row has the ACTION (ground truth), the RESOURCE, and information about the employee's role at the time of approval.
· test.csv - the test set for which predictions should be made. Each row asks whether an employee with the listed characteristics should have access to the listed resource.
6. Introduction to the Challenge
the variables
COLUMN NAME DESCRIPTION
ACTION ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE An ID for each resource
MGR_ID The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1 Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2 Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME Company role department description (e.g. Retail)
ROLE_TITLE Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY Company role family description (e.g. Retail Manager)
ROLE_CODE Company role code; this code is unique to each role (e.g. Manager)
7. Introduction to the Challenge
the metric
AUC (area under the ROC curve)
· a metric used to judge predictions in binary-response (0/1) problems
· sensitive only to the ordering of the predictions, not their magnitudes
· R packages: verification or ROCR
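Because AUC depends only on the ordering, it can also be computed without any package via the rank-sum (Wilcoxon/Mann-Whitney) identity; a minimal base-R sketch on an illustrative toy example:

```r
# AUC as the probability that a randomly chosen positive is scored
# above a randomly chosen negative, computed from rank sums
auc <- function(y, pred) {
  n1 <- sum(y == 1)   # number of positives
  n0 <- sum(y == 0)   # number of negatives
  r  <- rank(pred)    # ranks of predicted scores (ties averaged)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y    <- c(0, 0, 1, 1)
pred <- c(0.1, 0.4, 0.35, 0.8)
auc(y, pred)  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```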
22. Look into the Data
check the distribution - role_family_desc
hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)
23. Look into the Data
check the distribution - resource
hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)
24. Look into the Data
check the distribution - mgr_id
hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)
25. Look into the Data
treat the features as Categorical or Numerical?
YetiMan shared his findings in the forum:
· 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
· 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leaderboard result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
26. Look into the Data
our approach
1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
29. Model Building
Feature Extraction
1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in a sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)
30. Model Building
1. the raw features (as numerical)
31. Model Building
2. the raw features (as categorical) with level reduction
2.1 choose the top-frequency categories
VAR_RAW FREQUENCY VAR_WITH_LEVEL_REDUCTION
a 3 a
a 3 a
a 3 a
b 2 b
b 2 b
c 1 other
d 1 other
# for each column, keep the two most frequent levels
# and collapse everything else into "other"
for (i in 1:ncol(x)) {
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}
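The toy table above can be reproduced directly with the loop (the data-frame and column names are illustrative):

```r
# toy data matching the VAR_RAW column of the table
x <- data.frame(var = c("a", "a", "a", "b", "b", "c", "d"),
                stringsAsFactors = FALSE)

# keep the two most frequent levels; collapse the rest into "other"
for (i in 1:ncol(x)) {
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}

x$var  # "a" "a" "a" "b" "b" "other" "other"
```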
32. Model Building
2. the raw features (as categorical) with level reduction
2.2 use Pearson's chi-squared test
table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
## mgr_770 mgr_not_770
## 0 5 1892
## 1 147 30725
chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
A large p-value means this level's approval rate is not significantly different from the rest, so such a level can be merged into "other".
33. Model Building
3. the dummies (in a sparse Matrix)
ID VAR VAR_A VAR_B VAR_C
1 a 1 0 0
2 a 1 0 0
3 a 1 0 0
4 b 0 1 0
5 c 0 0 1
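One common way to build these dummies sparsely is `sparse.model.matrix()` from the Matrix package; the toy data frame below is illustrative:

```r
library(Matrix)

# toy data matching the VAR column of the table
d <- data.frame(var = factor(c("a", "a", "a", "b", "c")))

# one dummy column per level, stored as a sparse matrix
# (the zeros are not materialized in memory)
m <- sparse.model.matrix(~ var - 1, data = d)
dim(m)  # 5 rows, 3 dummy columns (vara, varb, varc)
```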
35. Model Building
4. the dummies including the interaction
ID M N MN_AP MN_AQ MN_BP MN_BQ
1 a p 1 0 0 0
2 a p 1 0 0 0
3 a q 0 1 0 0
4 b p 0 0 1 0
5 b q 0 0 0 1
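Interaction dummies like MN_AP can be produced from a formula; a base-R sketch of the same toy table (R's generated column names differ from the table's but encode the same pairs):

```r
# toy data matching the M and N columns of the table
d <- data.frame(M = factor(c("a", "a", "a", "b", "b")),
                N = factor(c("p", "p", "q", "p", "q")))

# one dummy per (M, N) combination, no intercept
mm <- model.matrix(~ M:N - 1, data = d)
colnames(mm)  # one column per (M, N) pair, e.g. "Ma:Np"
```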
36. Model Building
5. some derived variables (count & ratio)
· the frequency of every category
· the frequency of the interactions
· the proportion
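These count and ratio features can be sketched in base R with `ave()`; the data frame and column names below are illustrative:

```r
d <- data.frame(mgr_id   = c(1, 1, 1, 2, 2),
                resource = c(10, 10, 20, 10, 20))

# count: how often each category value occurs
d$mgr_count <- ave(d$mgr_id, d$mgr_id, FUN = length)

# interaction count: how often each (mgr_id, resource) pair occurs
d$pair_count <- ave(seq_len(nrow(d)), d$mgr_id, d$resource, FUN = length)

# ratio: the share of a manager's rows involving this resource
d$pair_ratio <- d$pair_count / d$mgr_count
```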
38. Model Building
Base Learners
1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine
39. Model Building
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold CV holdout predictions
40. Model Building
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold CV holdout predictions
· level-1 algorithms: Regularized Generalized Linear Model & Gradient Boosting Machine
· level-2 algorithm: Regularized Generalized Linear Model
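The stacking scheme above can be sketched in base R, with plain `glm()` standing in for the regularized GLM and GBM actually used; all data and variable names here are illustrative:

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2))

k     <- 5
folds <- sample(rep(1:k, length.out = n))
hold1 <- numeric(n)  # 5-fold holdout predictions of the level-1 model

# level 1: fit on 4 folds, predict the held-out fold
for (f in 1:k) {
  m1 <- glm(y ~ x1 + x2, data = d[folds != f, ], family = binomial)
  hold1[folds == f] <- predict(m1, newdata = d[folds == f, ],
                               type = "response")
}

# level 2: fit a model on the stacked holdout predictions
# (with several level-1 learners there would be one column each)
d$hold1 <- hold1
m2 <- glm(y ~ hold1, data = d, family = binomial)
```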
41. Model Building
1. Regularized Generalized Linear Model
· generalized linear model (glm)
· convex penalties
42. Model Building
1. Regularized Generalized Linear Model
· logistic regression
# toy data: the probability that y = 1 rises with x
x <- sort(rnorm(100))
set.seed(114)
y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
       sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
m1 <- lm(y ~ x)                                    # linear probability model
m2 <- glm(y ~ x, family = binomial(link = logit))  # logistic regression
y2 <- predict(m2, type = "response")               # fitted probabilities
par(mar = c(5, 4, 0, 0))
plot(y ~ x); abline(m1, lwd = 3, col = 2)          # linear fit in red
lines(x, y2, lwd = 3, col = 3)                     # logistic curve in green
43. Model Building
1. Regularized Generalized Linear Model
· logistic regression
· convex penalties
44. Model Building
1. Regularized Generalized Linear Model
· convex penalties
  - L1 (lasso)
  - L2 (ridge regression)
  - mixture of L1 & L2 (elastic net)
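In glmnet's formulation the elastic-net objective combines the two norms through a mixing parameter $\alpha$:

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\log L\!\left(y_i,\;\beta_0 + x_i^{\top}\beta\right)
\;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]
$$

where $\alpha = 1$ gives the lasso, $\alpha = 0$ ridge regression, and intermediate values the elastic net.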
45. Model Building
1. Regularized Generalized Linear Model
· the dummies (in a sparse Matrix)
· the dummies including the interaction
· R package: glmnet
46. Model Building
2. Support Vector Machine (just for diversity)
49. Model Building
2. Support Vector Machine (just for diversity)
· the dummies including the interaction
· some derived variables (count & ratio)
· R packages: kernlab, e1071
52. Model Building
3. Random Forest
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest
54. Model Building
4. Gradient Boosting Machine
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm
64. Summary
useful discussions
· Python code to achieve 0.90 AUC with Logistic Regression
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
· Starter code in python with scikit-learn (AUC .885)
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
· Patterns in Training data set
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set