Amazon Employee Access Challenge
Predict an employee's access needs, given his/her job role
Yibo Chen
Data Scientist @ Supstat Inc
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
1 of 65 6/13/14, 2:01 PM
Agenda
1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary
Introduction to the Challenge
the story
http://www.kaggle.com/c/amazon-employee-access-challenge
It is all about the access employees need to do their daily work.
the mission
build an auto-access model from the historical data that determines the access privilege according to the employee's job role and the resource applied for
the data
The data consists of real historical data collected from 2010 & 2011.
Employees are manually allowed or denied access to resources over time.
the files
· train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval
· test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
the variables
COLUMN NAME        DESCRIPTION
ACTION             ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE           An ID for each resource
MGR_ID             The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1      Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2      Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME      Company role department description (e.g. Retail)
ROLE_TITLE         Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC   Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY        Company role family description (e.g. Retail Manager)
ROLE_CODE          Company role code; this code is unique to each role (e.g. Manager)
the metric
AUC (area under the ROC curve)
· is a metric used to judge predictions in binary response (0/1) problems
· is sensitive only to the order determined by the predictions, not to their magnitudes
· can be computed in R with the packages verification or ROCR
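The order-only property can be checked with a few lines of base R. This is a minimal sketch, not the code used in the talk: `auc_rank` is a hypothetical helper that computes AUC from the rank sum of the positive cases (the Mann-Whitney form), and the scores are made-up values.

```r
# AUC from ranks: only the ordering of the scores enters the formula.
# auc_rank is a hypothetical helper name, not part of ROCR or verification.
auc_rank <- function(score, label) {
  r <- rank(score)                      # average ranks, so ties are handled
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
score      <- c(0.1, 0.2, 0.3, 0.6, 0.5, 0.4, 0.7, 0.8)  # illustrative scores

auc_original    <- auc_rank(score, true_label)
# A strictly increasing transform changes the magnitudes but not the order.
auc_transformed <- auc_rank(exp(10 * score), true_label)

print(c(original = auc_original, transformed = auc_transformed))
```

Both values agree (0.875 here), because rescaling the predictions leaves their ranks unchanged.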
(t <- data.frame(true_label=c(0,0,0,0,1,1,1,1),
predict_1=c(1,2,3,4,5,6,7,8),
predict_2=c(1,2,3,6,5,4,7,8),
predict_3=c(1,7,6,4,5,3,2,8)))
## true_label predict_1 predict_2 predict_3
## 1 0 1 1 1
## 2 0 2 2 7
## 3 0 3 3 6
## 4 0 4 6 4
## 5 1 5 5 5
## 6 1 6 4 3
## 7 1 7 7 2
## 8 1 8 8 8
P:4
N:4
TP:2
FP:1
TPR=TP/P=0.5
FPR=FP/N=0.25
table(t$predict_2 >= 6, t$true_label)
##
## 0 1
## FALSE 3 2
## TRUE 1 2
P:4
N:4
TP:3
FP:1
TPR=TP/P=0.75
FPR=FP/N=0.25
table(t$predict_2 >= 5, t$true_label)
##
## 0 1
## FALSE 3 1
## TRUE 1 3
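The two thresholds above give two points of the ROC curve for predict_2. Sweeping every distinct score as a threshold traces the whole curve. The sketch below uses only base R; `roc_points` is a hypothetical helper name introduced for illustration.

```r
true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)

# For each threshold, predict "1" when score >= threshold and compute
# TPR = TP / P and FPR = FP / N.
roc_points <- function(score, label) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thresholds, function(th) sum(score >= th & label == 1) / sum(label == 1))
  fpr <- sapply(thresholds, function(th) sum(score >= th & label == 0) / sum(label == 0))
  data.frame(threshold = thresholds, tpr = tpr, fpr = fpr)
}

roc <- roc_points(predict_2, true_label)
# The rows for thresholds 6 and 5 match the two tables above.
print(roc[roc$threshold %in% c(6, 5), ])
```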
require(ROCR, quietly = T)
pred <- prediction(t$predict_1, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 1
require(verification, quietly = T)
roc.area(t$true_label, t$predict_1)$A
## [1] 1
pred <- prediction(t$predict_1, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
pred <- prediction(t$predict_2, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.875
roc.area(t$true_label, t$predict_2)$A
## [1] 0.875
pred <- prediction(t$predict_2, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
pred <- prediction(t$predict_3, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.5
roc.area(t$true_label, t$predict_3)$A
## [1] 0.5
pred <- prediction(t$predict_3, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
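The three AUC values above can also be reproduced without any package, using the pairwise reading of AUC: the share of (positive, negative) pairs in which the positive case gets the higher score. This is a sketch for checking the numbers; `auc_pairwise` is a hypothetical helper, and the data frame is named `d` here to avoid masking base R's `t()`.

```r
d <- data.frame(true_label = c(0, 0, 0, 0, 1, 1, 1, 1),
                predict_1  = c(1, 2, 3, 4, 5, 6, 7, 8),
                predict_2  = c(1, 2, 3, 6, 5, 4, 7, 8),
                predict_3  = c(1, 7, 6, 4, 5, 3, 2, 8))

# Compare every positive score with every negative score;
# a tie counts as half a win.
auc_pairwise <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

aucs <- sapply(d[, c("predict_1", "predict_2", "predict_3")],
               auc_pairwise, label = d$true_label)
print(aucs)
```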
Look into the Data
load data from files
the target
table(y, useNA = "ifany")
## y
## 0 1 <NA>
## 1897 30872 58921
the predictor
treat the features as Categorical or Numerical?
sapply(x, function(z) {
length(unique(z))
})
## resource mgr_id role_rollup_1 role_rollup_2
## 7518 4913 130 183
## role_deptname role_title role_family_desc role_family
## 476 361 2951 68
## role_code
## 361
par(mar = c(5, 4, 0, 2))
plot(x$role_title, x$role_code)
length(unique(x$role_title))
## [1] 361
length(unique(x$role_code))
## [1] 361
length(unique(paste(x$role_code, x$role_title)))
## [1] 361
x <- x[, names(x) != "role_code"]
sapply(x, function(z) {
length(unique(z))
})
## resource mgr_id role_rollup_1 role_rollup_2
## 7518 4913 130 183
## role_deptname role_title role_family_desc role_family
## 476 361 2951 68
check the distribution - role_family_desc
hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)
check the distribution - resource
hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)
check the distribution - mgr_id
hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)
treat the features as Categorical or Numerical?
YetiMan shared his findings in the forum:
1) My analyses so far leads me to believe that there is "information" in some of the categorical
labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using
plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric
gbm. Food for thought.
our approach
1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
Model Building
workflow
· Feature Extraction
· Base Learners
· Ensemble
Feature Extraction
1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)
1. the raw features(as numerical)
2. the raw features(as categorical) with level reduction
2.1 choose the top frequency categories
VAR_RAW FREQUENCY VAR_WITH_LEVEL_REDUCTION
a 3 a
a 3 a
a 3 a
b 2 b
b 2 b
c 1 other
d 1 other
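# keep the two most frequent levels of each column and recode the rest as "other"
# (the columns are assumed to be character vectors, not factors)
for (i in seq_len(ncol(x))) {
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}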
2.2 use Pearson's Chi-squared Test
table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
## mgr_770 mgr_not_770
## 0 5 1892
## 1 147 30725
chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
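The slide shows the test for a single level (mgr_id 770) only. One way to turn it into a level-reduction rule, which is an assumption and not necessarily what was done in the talk, is to run the same one-vs-rest test for every level and merge the levels whose p-value is above a cutoff into "other". The data, the cutoff `alpha` and all variable names below are hypothetical.

```r
set.seed(114)
n <- 5000
# Synthetic categorical feature: level "a" is rare and behaves differently.
level <- sample(letters[1:6], n, replace = TRUE,
                prob = c(0.02, rep(0.196, 5)))
y <- rbinom(n, 1, ifelse(level == "a", 0.5, 0.95))

alpha <- 0.05  # hypothetical significance cutoff

# One-vs-rest Pearson chi-squared test for each level.
p_value <- sapply(sort(unique(level)), function(l) {
  chisq.test(table(y, level == l))$p.value
})

# Keep the levels that are significantly associated with the target,
# collapse everything else into "other".
keep <- names(p_value)[p_value < alpha]
level_reduced <- ifelse(level %in% keep, level, "other")

print(round(p_value, 4))
print(table(level_reduced))
```

With a cutoff of 0.05, mgr_id 770 above (p-value 0.2507) would be merged into "other" under this rule.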
3. the dummies(in sparse Matrix)
ID VAR VAR_A VAR_B VAR_C
1 a 1 0 0
2 a 1 0 0
3 a 1 0 0
4 b 0 1 0
5 c 0 0 1
use package Matrix to create the dummies
require(Matrix)
set.seed(114)
Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
## 5 x 8 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . 1 . . . 1
## [2,] . 1 . . . . 1 .
## [3,] 1 . . . . . . .
## [4,] . . . . . 1 . .
## [5,] . . . . . . . .
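The random matrix above only shows what a sparse matrix looks like. A sketch of how the dummies themselves could be built with the same package follows; it assumes `sparse.model.matrix()` from Matrix, which may or may not be the function used in the talk, and the toy column `var` mirrors the small table shown earlier.

```r
library(Matrix)

d <- data.frame(var = factor(c("a", "a", "a", "b", "c")))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
# Only the non-zero entries are stored.
m <- sparse.model.matrix(~ var - 1, data = d)
print(m)
print(colSums(m))  # number of rows per level
```

Storing only the non-zero cells matters here because a feature such as resource has thousands of levels, so a dense dummy matrix would be almost entirely zeros.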
4. the dummies including the interaction
ID M N MN_AP MN_AQ MN_BP MN_BQ
1 a p 1 0 0 0
2 a p 1 0 0 0
3 a q 0 1 0 0
4 b p 0 0 1 0
5 b q 0 0 0 1
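The interaction dummies in the table can be sketched in base R by first combining the two variables into one factor and then expanding it. The column names produced by `model.matrix()` differ from the MN_AP style used on the slide; the variable names are illustrative.

```r
d <- data.frame(M = c("a", "a", "a", "b", "b"),
                N = c("p", "p", "q", "p", "q"))

# One combined level per (M, N) pair; lex.order = TRUE gives ap, aq, bp, bq.
d$MN <- interaction(d$M, d$N, sep = "", lex.order = TRUE)

# One 0/1 column per combined level, no intercept.
mm <- model.matrix(~ MN - 1, data = d)
print(mm)
```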
5. some derived variables(count & ratio)
· the frequency of every category
· the frequency of the interactions
· the proportion (interaction frequency divided by the frequency of each category involved)
tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
'c2_resource_role_deptname_ratio_i',
'c2_resource_role_deptname_ratio_j')]
cbind(tmp1, tmp2)
## c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
## 114 1 1645 1
## 115 36 1312 4
## 116 45 465 24
## 117 374 2377 169
## c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
## 114 1.0000 0.0006079
## 115 0.1111 0.0030488
## 116 0.5333 0.0516129
## 117 0.4519 0.0710980
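In the output above, ratio_i equals cnt_ij divided by c1_resource (for example 4 / 36 = 0.1111) and ratio_j equals cnt_ij divided by c1_role_deptname (4 / 1312 = 0.0030488). A sketch of how such columns could be derived with base R follows; the toy data and the shortened column names are assumptions, and the original `cnt_1` and `cnt_2` objects may have been built differently.

```r
d <- data.frame(resource      = c(1, 1, 1, 2, 2, 3),
                role_deptname = c("x", "x", "y", "x", "y", "y"))
ones <- rep(1, nrow(d))

# Frequency of each single category.
d$c1_resource      <- ave(ones, d$resource, FUN = sum)
d$c1_role_deptname <- ave(ones, d$role_deptname, FUN = sum)

# Frequency of each (resource, role_deptname) pair.
pair_key <- paste(d$resource, d$role_deptname)
d$cnt_ij <- ave(ones, pair_key, FUN = sum)

# Share of the pair within the resource (i) and within the department (j).
d$ratio_i <- d$cnt_ij / d$c1_resource
d$ratio_j <- d$cnt_ij / d$c1_role_deptname
print(d)
```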
Base Learners
1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold cv holdout predictions
· algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
· algorithms in level-2 (Regularized Generalized Linear Model)
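The mechanics of two-stage stacking on holdout predictions can be sketched as follows. To keep the example self-contained, plain `glm()` logistic regressions stand in for the glmnet and gbm models named above, and the data are simulated, so this illustrates the procedure rather than the models from the talk.

```r
set.seed(114)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))

# Assign every row to one of 5 folds.
fold <- sample(rep(1:5, length.out = n))

# Level 1: each base learner predicts only rows it was not trained on.
oof <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("m1", "m2")))
for (k in 1:5) {
  train   <- d[fold != k, ]
  holdout <- d[fold == k, ]
  m1 <- glm(y ~ x1, family = binomial, data = train)  # stand-in learner 1
  m2 <- glm(y ~ x2, family = binomial, data = train)  # stand-in learner 2
  oof[fold == k, "m1"] <- predict(m1, newdata = holdout, type = "response")
  oof[fold == k, "m2"] <- predict(m2, newdata = holdout, type = "response")
}

# Level 2: fit a model on the holdout predictions.
level2 <- glm(y ~ m1 + m2, family = binomial,
              data = data.frame(y = d$y, oof))
print(coef(level2))
```

Using holdout predictions as level-2 inputs keeps the second stage from seeing predictions that were fitted on the same rows.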
1. Regularized Generalized Linear Model
· generalized linear model (glm)
· convex penalties
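· logistic regression
set.seed(114)  # set the seed first so that x is reproducible as well
x <- sort(rnorm(100))
# simulate 0/1 labels whose probability of 1 rises with x
y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
       sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
m1 <- lm(y ~ x)                                    # linear fit, not bounded to [0, 1]
m2 <- glm(y ~ x, family = binomial(link = logit))  # logistic fit
y2 <- predict(m2, type = "response")               # fitted probabilities
par(mar = c(5, 4, 0, 0))
plot(y ~ x); abline(m1, lwd = 3, col = 2)
points(x, y2, type = "l", lwd = 3, col = 3)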
· convex penalties
- L1 (lasso)
- L2 (ridge regression)
- mixture of L1 & L2 (elastic net)
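The three penalties differ only in how they weigh the absolute and the squared coefficients. The sketch below follows the parameterisation documented for glmnet, lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))); the function name and the example coefficients are illustrative.

```r
# alpha = 1 gives the lasso (L1), alpha = 0 gives ridge (L2),
# values in between give the elastic net.
elastic_net_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))
  l2 <- sum(beta^2)
  lambda * ((1 - alpha) / 2 * l2 + alpha * l1)
}

beta <- c(1, -2, 0)  # hypothetical coefficients
penalties <- c(lasso       = elastic_net_penalty(beta, lambda = 0.5, alpha = 1),
               ridge       = elastic_net_penalty(beta, lambda = 0.5, alpha = 0),
               elastic_net = elastic_net_penalty(beta, lambda = 0.5, alpha = 0.5))
print(penalties)
```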
· the dummies (in sparse Matrix)
· the dummies including the interaction
· R package: glmnet
2. Support Vector Machine(just for Diversity)
· the dummies including the interaction
· some derived variables (count & ratio)
· R package: kernlab, e1071
decision tree
3. Random Forest
decision trees + bagging
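The bagging half of this recipe can be sketched with the rpart package: fit one tree per bootstrap sample and average the predicted probabilities. This is an illustration on simulated data only. A random forest additionally samples a random subset of features at each split, which this sketch leaves out, and the talk itself used the randomForest package.

```r
library(rpart)

set.seed(114)
n <- 400
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(as.integer(d$x1 + d$x2 + rnorm(n, sd = 0.1) > 1))

n_trees <- 25
prob <- matrix(NA_real_, n, n_trees)
for (b in seq_len(n_trees)) {
  idx <- sample(n, replace = TRUE)                 # bootstrap sample of rows
  fit <- rpart(y ~ x1 + x2, data = d[idx, ], method = "class")
  prob[, b] <- predict(fit, newdata = d, type = "prob")[, "1"]
}

# Bagged prediction: the mean probability over all trees.
bagged <- rowMeans(prob)
print(summary(bagged))
```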
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest
4. Gradient Boosting Machine
decision trees + boosting
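Boosting builds the trees one after another, each one fitted to what the current ensemble still gets wrong. The sketch below is a simplified version for a 0/1 target: small regression trees from rpart are fitted to the residual y - p (the negative gradient of the log-loss) and added with a shrinkage factor. It is not the gbm implementation, which among other things uses a different leaf update; the data and parameter values are made up.

```r
library(rpart)

set.seed(114)
n <- 400
d <- data.frame(x1 = runif(n), x2 = runif(n))
y <- as.integer(d$x1 + d$x2 + rnorm(n, sd = 0.1) > 1)

# Mean log-loss for scores f on the log-odds scale.
log_loss <- function(y, f) {
  -mean(y * plogis(f, log.p = TRUE) + (1 - y) * plogis(-f, log.p = TRUE))
}

n_trees   <- 50
shrinkage <- 0.1
f    <- rep(0, n)            # start from log-odds 0, i.e. probability 0.5
loss <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  d$residual <- y - plogis(f)                        # negative gradient
  tree <- rpart(residual ~ x1 + x2, data = d, method = "anova",
                control = rpart.control(maxdepth = 2))
  f <- f + shrinkage * predict(tree, newdata = d)    # small step per tree
  loss[m] <- log_loss(y, f)
}
print(round(loss[c(1, 10, 50)], 4))
```

The training loss starts at log(2) for the constant prediction 0.5 and falls as trees are added.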
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm
Summary
some insights
VARIABLE NAME REL.INF
cnt2_resource_role_deptname_cnt_ij 2.542974017
cnt2_resource_role_rollup_2_ratio_i 2.107624216
cnt2_resource_role_deptname_ratio_j 2.017153645
cnt2_resource_role_rollup_2_ratio_j 1.910465811
cnt2_resource_role_family_ratio_i 1.770737494
... ...
cnt4_resource_mgr_id_role_rollup_2_role_family_desc 0.008938286
cnt4_resource_role_rollup_1_role_rollup_2_role_title 0.008930661
cnt4_resource_mgr_id_role_rollup_1_role_family_desc 0.002106958
summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
'cnt2_resource_role_deptname_ratio_j')])
## cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j
## Min. : 1.0 Min. :0.0003
## 1st Qu.: 2.0 1st Qu.:0.0061
## Median : 7.0 Median :0.0172
## Mean : 15.6 Mean :0.0315
## 3rd Qu.: 17.0 3rd Qu.:0.0368
## Max. :201.0 Max. :1.0000
xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
tt <- t.test(xx ~ y)
list(estimate=tt$estimate,
conf.int=tt$conf.int, p.value=tt$p.value)
## $estimate
## mean in group 0 mean in group 1
## 10.04 13.82
##
## $conf.int
## [1] -4.851 -2.710
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 5.838e-12
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
xxx <- cut(xx, include.lowest=T,
breaks=c(0,1,3,7,14,30,300))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
tt <- t.test(xx ~ y)
list(estimate=tt$estimate,
conf.int=tt$conf.int, p.value=tt$p.value)
## $estimate
## mean in group 0 mean in group 1
## 0.01955 0.02902
##
## $conf.int
## [1] -0.011732 -0.007205
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 3.93e-16
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
xxx <- cut(xx, include.lowest=T,
breaks=quantile(xx, seq(0,1,0.2)))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
overfitting
MODEL AUC_CV AUC_PUBLIC AUC_PRIVATE
num_glmnet_0 0.8985069 0.87737 0.87385
stacking_gbm_with_the_glmnet 0.9277316 0.90695 0.90478
stacking_gbm_without_the_glmnet 0.9182303 0.91529 0.91130
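One way to read the table is to look at the gap between the cross-validated AUC and the private leaderboard AUC for each model. The sketch below only redoes that arithmetic on the numbers from the table; the column name `gap_cv_private` is introduced here for illustration.

```r
res <- data.frame(
  model       = c("num_glmnet_0",
                  "stacking_gbm_with_the_glmnet",
                  "stacking_gbm_without_the_glmnet"),
  auc_cv      = c(0.8985069, 0.9277316, 0.9182303),
  auc_public  = c(0.87737, 0.90695, 0.91529),
  auc_private = c(0.87385, 0.90478, 0.91130)
)

# A large positive gap means the CV estimate was too optimistic.
res$gap_cv_private <- res$auc_cv - res$auc_private
print(res[order(res$gap_cv_private), c("model", "gap_cv_private")])
```

The stacking model without the glmnet has the lowest CV score of the two stacked models but the smallest gap (about 0.007 versus about 0.023) and the best private score.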
Winning solution code and methodology
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology
useful discussions
Python code to achieve 0.90 AUC with Logistic Regression
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
Starter code in python with scikit-learn (AUC .885)
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
Patterns in Training data set
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set
thank you
65/65
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
65 of 65 6/13/14, 2:01 PM

Más contenido relacionado

La actualidad más candente

The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsNYC Predictive Analytics
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Yao Yao
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIRaouf KESKES
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ FyberDaniel Hen
 
Ensembling & Boosting 概念介紹
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹Wayne Chen
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionSeonho Park
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionJaroslaw Szymczak
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsVillu Ruusmann
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoostJoonyoung Yi
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLVillu Ruusmann
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson ChallengeRaouf KESKES
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?Villu Ruusmann
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsOleksandr Pryymak
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning AlgorithmsHichem Felouat
 

La actualidad más candente (20)

The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive Models
 
Xgboost
XgboostXgboost
Xgboost
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Ensembling & Boosting 概念介紹
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
Converting R to PMML
Converting R to PMMLConverting R to PMML
Converting R to PMML
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
 

Similar a Kaggle talk series top 0.2% kaggler on amazon employee access challenge

Open06
Open06Open06
Open06butest
 
Target Leakage in Machine Learning
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine LearningYuriy Guts
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challengesMarc Borowczak
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireDatabricks
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupRussell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoffmrphilroth
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Data Science Salon Miami Example - Churn Rate Predictor
Data Science Salon Miami Example - Churn Rate PredictorData Science Salon Miami Example - Churn Rate Predictor
Data Science Salon Miami Example - Churn Rate PredictorGreg Werner
 
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가JaeCheolKim10
 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...PAPIs.io
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Miguel González-Fierro
 
ITB Term Paper - 10BM60066
ITB Term Paper - 10BM60066ITB Term Paper - 10BM60066
ITB Term Paper - 10BM60066rahulsm27
 
MACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISK
MACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISKMACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISK
MACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISKIRJET Journal
 
Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Yuriy Guts
 

Kaggle talk series top 0.2% kaggler on amazon employee access challenge

  • 1. Amazon Employee Access Challenge: Predict an employee's access needs, given his/her job role. Yibo Chen, Data Scientist @ Supstat Inc. Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html 6/13/14, 2:01 PM
  • 2. Agenda: 1. Introduction to the Challenge 2. Look into the Data 3. Model Building 4. Summary
  • 3. Introduction to the Challenge: the story. http://www.kaggle.com/c/amazon-employee-access-challenge It is all about the access we need to do our daily work.
  • 4. Introduction to the Challenge: the mission. Build an auto-access model based on the historical data, to determine the access privilege according to the employee's job role and the resource he or she applied for.
  • 5. Introduction to the Challenge: the data. The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time. The files:
       train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
       test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
  • 6. Introduction to the Challenge: the variables.
       COLUMN NAME        DESCRIPTION
       ACTION             ACTION is 1 if the resource was approved, 0 if the resource was not
       RESOURCE           An ID for each resource
       MGR_ID             The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
       ROLE_ROLLUP_1      Company role grouping category id 1 (e.g. US Engineering)
       ROLE_ROLLUP_2      Company role grouping category id 2 (e.g. US Retail)
       ROLE_DEPTNAME      Company role department description (e.g. Retail)
       ROLE_TITLE         Company role business title description (e.g. Senior Engineering Retail Manager)
       ROLE_FAMILY_DESC   Company role family extended description (e.g. Retail Manager, Software Engineering)
       ROLE_FAMILY        Company role family description (e.g. Retail Manager)
       ROLE_CODE          Company role code; this code is unique to each role (e.g. Manager)
  • 7. Introduction to the Challenge: the metric. AUC (area under the ROC curve):
       is a metric used to judge predictions in binary response (0/1) problems
       is only sensitive to the order determined by the predictions, not their magnitudes
       is available in the R packages verification and ROCR
  • 8. Introduction to the Challenge: the metric.
       (t <- data.frame(true_label=c(0,0,0,0,1,1,1,1),
                        predict_1=c(1,2,3,4,5,6,7,8),
                        predict_2=c(1,2,3,6,5,4,7,8),
                        predict_3=c(1,7,6,4,5,3,2,8)))
       ##   true_label predict_1 predict_2 predict_3
       ## 1          0         1         1         1
       ## 2          0         2         2         7
       ## 3          0         3         3         6
       ## 4          0         4         6         4
       ## 5          1         5         5         5
       ## 6          1         6         4         3
       ## 7          1         7         7         2
       ## 8          1         8         8         8
  • 9. Introduction to the Challenge: the metric. P:4 N:4 TP:2 FP:1, TPR=TP/P=0.5, FPR=FP/N=0.25
       table(t$predict_2 >= 6, t$true_label)
       ##
       ##         0 1
       ##   FALSE 3 2
       ##   TRUE  1 2
  • 10. Introduction to the Challenge: the metric. P:4 N:4 TP:3 FP:1, TPR=TP/P=0.75, FPR=FP/N=0.25
       table(t$predict_2 >= 5, t$true_label)
       ##
       ##         0 1
       ##   FALSE 3 1
       ##   TRUE  1 3
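The two confusion tables on slides 9 and 10 each give one point on the ROC curve. As a minimal sketch in base R, the helper below (`roc_point` is a hypothetical name, not from the talk) computes the (TPR, FPR) pair for a given threshold, using the `predict_2` scores from slide 8.

```r
# Hypothetical helper: one ROC point (TPR, FPR) for a score threshold.
# Assumes labels are coded 0/1 and "score >= threshold" means "predict 1".
roc_point <- function(label, score, threshold) {
  pred_pos <- score >= threshold
  c(tpr = sum(pred_pos & label == 1) / sum(label == 1),
    fpr = sum(pred_pos & label == 0) / sum(label == 0))
}

true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)

p6 <- roc_point(true_label, predict_2, 6)  # threshold used on slide 9
p5 <- roc_point(true_label, predict_2, 5)  # threshold used on slide 10
print(rbind(p6, p5))
```

Sweeping the threshold over every distinct score and connecting the resulting points traces the full ROC curve.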
  • 11. Introduction to the Challenge: the metric.
  • 12. Introduction to the Challenge: the metric.
       require(ROCR, quietly = TRUE)
       pred <- prediction(t$predict_1, t$true_label)
       performance(pred, "auc")@y.values[[1]]
       ## [1] 1
       require(verification, quietly = TRUE)
       roc.area(t$true_label, t$predict_1)$A
       ## [1] 1
       pred <- prediction(t$predict_1, t$true_label)
       perf <- performance(pred, "tpr", "fpr")
       plot(perf, col = 2, lwd = 3)
  • 13. Introduction to the Challenge: the metric.
       pred <- prediction(t$predict_2, t$true_label)
       performance(pred, "auc")@y.values[[1]]
       ## [1] 0.875
       roc.area(t$true_label, t$predict_2)$A
       ## [1] 0.875
       pred <- prediction(t$predict_2, t$true_label)
       perf <- performance(pred, "tpr", "fpr")
       plot(perf, col = 2, lwd = 3)
  • 14. Introduction to the Challenge: the metric.
       pred <- prediction(t$predict_3, t$true_label)
       performance(pred, "auc")@y.values[[1]]
       ## [1] 0.5
       roc.area(t$true_label, t$predict_3)$A
       ## [1] 0.5
       pred <- prediction(t$predict_3, t$true_label)
       perf <- performance(pred, "tpr", "fpr")
       plot(perf, col = 2, lwd = 3)
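To illustrate the claim that AUC depends only on the order of the predictions, here is a sketch that computes AUC from ranks alone (the Mann-Whitney form) in base R, without ROCR or verification. The function name `auc_rank` is an assumption made for this example; the data are the three prediction vectors from slide 8.

```r
# Rank-based AUC (Mann-Whitney form). Assumes labels are coded 0/1.
# rank() assigns average ranks to ties, which matches the usual AUC definition.
auc_rank <- function(label, score) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

t <- data.frame(true_label = c(0, 0, 0, 0, 1, 1, 1, 1),
                predict_1  = c(1, 2, 3, 4, 5, 6, 7, 8),
                predict_2  = c(1, 2, 3, 6, 5, 4, 7, 8),
                predict_3  = c(1, 7, 6, 4, 5, 3, 2, 8))

auc_1 <- auc_rank(t$true_label, t$predict_1)
auc_2 <- auc_rank(t$true_label, t$predict_2)
auc_3 <- auc_rank(t$true_label, t$predict_3)

# A strictly increasing transform changes magnitudes but not the order,
# so the AUC is unchanged.
auc_2_exp <- auc_rank(t$true_label, exp(t$predict_2))
print(c(auc_1, auc_2, auc_3, auc_2_exp))
```

The three values agree with the ROCR and verification results on slides 12 to 14, and the transformed scores give the same AUC as the originals.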
  • 15. Look into the Data: load data from files.
  • 16. Look into the Data: the target.
       table(y, useNA = "ifany")
       ## y
       ##     0     1  <NA>
       ##  1897 30872 58921
  • 17. Look into the Data: the predictor.
  • 18. Look into the Data: treat the features as Categorical or Numerical?
       sapply(x, function(z) { length(unique(z)) })
       ##         resource           mgr_id    role_rollup_1    role_rollup_2
       ##             7518             4913              130              183
       ##    role_deptname       role_title role_family_desc      role_family
       ##              476              361             2951               68
       ##        role_code
       ##              361
  • 19. Look into the Data.
       par(mar = c(5, 4, 0, 2))
       plot(x$role_title, x$role_code)
  • 20. Look into the Data.
       length(unique(x$role_title))
       ## [1] 361
       length(unique(x$role_code))
       ## [1] 361
       length(unique(paste(x$role_code, x$role_title)))
       ## [1] 361
  • 21. Look into the Data.
       x <- x[, names(x) != "role_code"]
       sapply(x, function(z) { length(unique(z)) })
       ##         resource           mgr_id    role_rollup_1    role_rollup_2
       ##             7518             4913              130              183
       ##    role_deptname       role_title role_family_desc      role_family
       ##              476              361             2951               68
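Slides 20 and 21 drop role_code because it maps one-to-one onto role_title: both have 361 unique values, and so does their pasted combination. A small sketch of that check as a reusable function follows; `is_one_to_one` and the toy vectors are made up for illustration and are not the competition data.

```r
# Hypothetical helper: TRUE when every value of a pairs with exactly one
# value of b and vice versa, i.e. one column carries no extra information.
is_one_to_one <- function(a, b) {
  n_pairs <- nrow(unique(data.frame(a, b)))
  n_pairs == length(unique(a)) && n_pairs == length(unique(b))
}

# Toy columns (illustrative values only)
role_title  <- c(101, 102, 103, 101, 102)
role_code   <- c("A", "B", "C", "A", "B")
role_family <- c("X", "X", "Y", "X", "X")

redundant     <- is_one_to_one(role_title, role_code)    # same information
not_redundant <- is_one_to_one(role_title, role_family)  # many titles per family
print(c(redundant, not_redundant))
```

Comparing pairs through a data frame avoids the edge case where pasted strings collide by accident.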
  • 22. Look into the Data: check the distribution - role_family_desc.
       hist(train$role_family_desc, breaks = 100)
       hist(test$role_family_desc, breaks = 100)
  • 23. Look into the Data: check the distribution - resource.
       hist(train$resource, breaks = 100)
       hist(test$resource, breaks = 100)
  • 24. Look into the Data: check the distribution - mgr_id.
       hist(train$mgr_id, breaks = 100)
       hist(test$mgr_id, breaks = 100)
  • 25. Look into the Data: treat the features as Categorical or Numerical? YetiMan shared his findings in the forum:
       1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
       2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
  • 26. Look into the Data: our approach.
       1. treat all features as Categorical
       2. treat all features as Numerical
       3. treat mgr_id as Numerical, the others as Categorical
  • 27. Model Building: workflow. Feature Extraction, Base Learners, Ensemble.
  • 28. Model Building: workflow.
  • 29. Model Building: Feature Extraction.
       1. the raw features (as numerical)
       2. the raw features (as categorical) with level reduction
       3. the dummies (in sparse Matrix)
       4. the dummies including the interaction
       5. some derived variables (count & ratio)
  • 30. Model Building: 1. the raw features (as numerical).
  • 31. Model Building: 2. the raw features (as categorical) with level reduction. 2.1 choose the top frequency categories
       VAR_RAW   FREQUENCY   VAR_WITH_LEVEL_REDUCTION
       a         3           a
       a         3           a
       a         3           a
       b         2           b
       b         2           b
       c         1           other
       d         1           other
       for (i in 1:ncol(x)) {
         # keep the two most frequent levels of column i, collapse the rest
         the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
         x[!x[, i] %in% the_labels, i] <- "other"
       }
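The loop on slide 31 keeps the two most frequent levels of each column. Below is a sketch of the same idea as a per-vector function with a configurable cutoff. `reduce_levels`, `k` and `other` are names chosen for this example. The function converts its input to character first, since assigning "other" into a factor column would produce NA.

```r
# Hypothetical helper: keep the k most frequent levels, collapse the rest.
reduce_levels <- function(v, k = 2, other = "other") {
  v <- as.character(v)                       # avoid NA when v is a factor
  keep <- head(names(sort(table(v), decreasing = TRUE)), k)
  ifelse(v %in% keep, v, other)
}

# The toy column from the slide's table
var_raw <- c("a", "a", "a", "b", "b", "c", "d")
reduced <- reduce_levels(var_raw, k = 2)
print(data.frame(var_raw, reduced))
```

Applied to every column of a data frame, `x[] <- lapply(x, reduce_levels, k = 2)` would reproduce the slide's loop.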
  • 32. Model Building: 2. the raw features (as categorical) with level reduction. 2.2 use Pearson's Chi-squared Test
       table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
       ##
       ##     mgr_770 mgr_not_770
       ##   0       5        1892
       ##   1     147       30725
       chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
       ## [1] 0.2507
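Slide 32 tests a single level (mgr_id 770) against all others. One way this could be turned into a level-reduction rule is sketched here: test every level against the rest and collapse the levels whose p-value exceeds a cutoff. The slide does not state the rule the author actually used, so the 0.05 cutoff, the function name `level_p_values` and the toy data are all assumptions.

```r
# Hypothetical helper: chi-squared p-value of "level vs. rest" for each level.
# suppressWarnings() hides the small-expected-count warning of chisq.test().
level_p_values <- function(x, y) {
  x <- as.character(x)
  sapply(unique(x), function(lv) {
    suppressWarnings(chisq.test(y, x == lv)$p.value)
  })
}

# Toy data: levels "a" and "b" are strongly tied to the target, "c" is not.
x <- rep(c("a", "b", "c"), times = c(50, 50, 20))
y <- c(rep(1, 50), rep(0, 50), rep(c(0, 1), 10))

p <- level_p_values(x, y)
keep <- names(p)[p < 0.05]                   # assumed cutoff
x_reduced <- ifelse(x %in% keep, x, "other")
print(p)
print(table(x_reduced))
```

With thousands of levels, as in mgr_id, a multiple-testing adjustment such as `p.adjust()` may be worth considering before applying a cutoff.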
  • 33. Model Building: 3. the dummies (in sparse Matrix).
       ID   VAR   VAR_A   VAR_B   VAR_C
       1    a     1       0       0
       2    a     1       0       0
       3    a     1       0       0
       4    b     0       1       0
       5    c     0       0       1
  • 34. Model Building: 3. the dummies (in sparse Matrix). Use package Matrix to create the dummies.
       require(Matrix)
       set.seed(114)
       Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
       ## 5 x 8 sparse Matrix of class "dgCMatrix"
       ##
       ## [1,] . . . 1 . . . 1
       ## [2,] . 1 . . . . 1 .
       ## [3,] 1 . . . . . . .
       ## [4,] . . . . . 1 . .
       ## [5,] . . . . . . . .
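Slide 34 shows what a sparse matrix looks like but not how the dummies themselves are built. A possible route, sketched here, is `sparse.model.matrix()` from the Matrix package, applied to the toy column of slide 33. Whether the author used this function is not stated in the slides.

```r
library(Matrix)

# Toy column from slide 33
d <- data.frame(var = factor(c("a", "a", "a", "b", "c")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- sparse.model.matrix(~ var - 1, data = d)
print(dummies)
```

Only the non-zero cells are stored, which matters here because resource alone has 7518 distinct values.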
  • 35. Model Building: 4. the dummies including the interaction.
       ID   M   N   MN_AP   MN_AQ   MN_BP   MN_BQ
       1    a   p   1       0       0       0
       2    a   p   1       0       0       0
       3    a   q   0       1       0       0
       4    b   p   0       0       1       0
       5    b   q   0       0       0       1
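The interaction dummies in the table above can be sketched by pasting the two columns into one combined categorical variable and then building sparse dummies from it. The column names `m`, `n` and `mn` mirror the slide's toy table; the use of `sparse.model.matrix()` is an assumption, not something the slides confirm.

```r
library(Matrix)

# Toy columns M and N from slide 35
d <- data.frame(m = c("a", "a", "a", "b", "b"),
                n = c("p", "p", "q", "p", "q"))

# One level per observed (m, n) pair
d$mn <- factor(paste(d$m, d$n, sep = "_"))

mn_dummies <- sparse.model.matrix(~ mn - 1, data = d)
print(mn_dummies)
```

Only combinations that actually occur become columns, which keeps the width manageable when both variables have many levels.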
  • 36. Model Building
    5. some derived variables (count & ratio)
       · the frequency of every category
       · the frequency of the interactions
       · the proportion
  • 37. Model Building
    5. some derived variables (count & ratio)

      tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
      tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
                               'c2_resource_role_deptname_ratio_i',
                               'c2_resource_role_deptname_ratio_j')]
      cbind(tmp1, tmp2)
      ##     c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
      ## 114           1             1645                                1
      ## 115          36             1312                                4
      ## 116          45              465                               24
      ## 117         374             2377                              169
      ##     c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
      ## 114                            1.0000                         0.0006079
      ## 115                            0.1111                         0.0030488
      ## 116                            0.5333                         0.0516129
      ## 117                            0.4519                         0.0710980
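The printed rows are consistent with `ratio_i = cnt_ij / c1_resource` and `ratio_j = cnt_ij / c1_role_deptname` (row 115: 4/36 = 0.1111 and 4/1312 = 0.0030488). A sketch of that reading in base R; the toy values and the helper `count_of()` are invented for illustration:

```r
# Toy data: values are invented, column names follow the slide.
d <- data.frame(resource      = c("r1", "r1", "r1", "r2", "r2", "r3"),
                role_deptname = c("d1", "d1", "d2", "d1", "d1", "d1"))

# Hypothetical helper: for each row, the number of rows sharing its key.
count_of <- function(key) ave(rep(1, length(key)), key, FUN = sum)

c1_resource      <- count_of(d$resource)
c1_role_deptname <- count_of(d$role_deptname)
cnt_ij           <- count_of(paste(d$resource, d$role_deptname, sep = "|"))

feats <- data.frame(c1_resource, c1_role_deptname, cnt_ij,
                    ratio_i = cnt_ij / c1_resource,       # share of the resource's rows
                    ratio_j = cnt_ij / c1_role_deptname)  # share of the department's rows
print(feats)
```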
  • 38. Model Building
    Base Learners
    1. Regularized Generalized Linear Model
    2. Support Vector Machine
    3. Random Forest
    4. Gradient Boosting Machine
  • 39. Model Building
    Ensemble
    1. mean prediction of all models
    2. two-stage stacking
       · based on 5-fold CV holdout predictions
  • 40. Model Building
    Ensemble
    1. mean prediction of all models
    2. two-stage stacking
       · based on 5-fold CV holdout predictions
       · algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
       · algorithms in level-2 (Regularized Generalized Linear Model)
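The two ensembling options on this slide can be sketched in base R. This is an illustration under stated assumptions, not the talk's pipeline: the data are simulated, and plain `glm()` fits on different feature subsets stand in for the glmnet and gbm level-1 learners and for the glmnet level-2 learner.

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))          # simulated features
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2))               # simulated binary target

fold <- sample(rep(1:5, length.out = n))               # 5-fold assignment
level1 <- list(m1 = y ~ x1, m2 = y ~ x2)               # stand-ins for level-1 learners

# Holdout predictions: each row is scored by models that never saw it.
holdout <- matrix(NA_real_, n, length(level1),
                  dimnames = list(NULL, names(level1)))
for (k in 1:5) {
  for (m in names(level1)) {
    fit <- glm(level1[[m]], data = d[fold != k, ], family = binomial)
    holdout[fold == k, m] <- predict(fit, newdata = d[fold == k, ],
                                     type = "response")
  }
}

# Option 1: mean prediction of all models.
mean_pred <- rowMeans(holdout)

# Option 2: level-2 model trained on the level-1 holdout predictions.
level2 <- glm(y ~ m1 + m2, data = data.frame(holdout, y = d$y),
              family = binomial)
stacked <- predict(level2, type = "response")
```

Training level 2 on holdout predictions rather than in-sample fits keeps the level-2 learner from rewarding a level-1 model merely for memorizing the training rows.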
  • 41. Model Building
    1. Regularized Generalized Linear Model
       · generalized linear model (glm)
       · convex penalties
  • 42. Model Building
    1. Regularized Generalized Linear Model
       · logistic regression

      x <- sort(rnorm(100))
      set.seed(114)
      y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
             sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
             sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
             sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
      m1 <- lm(y ~ x)                                    # straight-line fit, for comparison
      m2 <- glm(y ~ x, family = binomial(link = logit))  # logistic regression
      y2 <- predict(m2, type = 'response')               # fitted probabilities on the training x
      par(mar = c(5, 4, 0, 0))
      plot(y ~ x); abline(m1, lwd = 3, col = 2)
      points(x, y2, type = 'l', lwd = 3, col = 3)
  • 43. Model Building
    1. Regularized Generalized Linear Model
       · logistic regression
       · convex penalties
  • 44. Model Building
    1. Regularized Generalized Linear Model
       · convex penalties
         - L1 (lasso)
         - L2 (ridge regression)
         - mixture of L1 & L2 (elastic net)
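The three penalties are one family indexed by a mixing weight. A small sketch using the parameterization documented for glmnet, where `alpha = 1` gives the lasso, `alpha = 0` gives ridge, and values in between give the elastic net; the function name `elastic_net_penalty()` is hypothetical:

```r
# Hypothetical helper: penalty added to the (negative) log-likelihood,
# lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))).
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(1, -2)
lasso   <- elastic_net_penalty(beta, lambda = 1, alpha = 1)    # L1 only
ridge   <- elastic_net_penalty(beta, lambda = 1, alpha = 0)    # L2 only
mixture <- elastic_net_penalty(beta, lambda = 1, alpha = 0.5)  # elastic net
print(c(lasso = lasso, ridge = ridge, mixture = mixture))
```

The intercept is conventionally left out of `beta` here, since it is not penalized.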
  • 45. Model Building
    1. Regularized Generalized Linear Model
       · the dummies (in sparse Matrix)
       · the dummies including the interaction
       · R package: glmnet
  • 46-48. Model Building
    2. Support Vector Machine (just for diversity)
  • 49. Model Building
    2. Support Vector Machine (just for diversity)
       · the dummies including the interaction
       · some derived variables (count & ratio)
       · R package: kernlab, e1071
  • 50. Model Building
    decision tree
  • 51. Model Building
    3. Random Forest
    decision trees + bagging
  • 52. Model Building
    3. Random Forest
       · the raw features (as numerical)
       · the raw features (as categorical) with level reduction
       · some derived variables (count & ratio)
       · R package: randomForest
  • 53. Model Building
    4. Gradient Boosting Machine
    decision trees + boosting
  • 54. Model Building
    4. Gradient Boosting Machine
       · the raw features (as numerical)
       · the raw features (as categorical) with level reduction
       · some derived variables (count & ratio)
       · R package: gbm
  • 55. Summary
    some insights

      VARIABLE NAME                                          REL.INF
      cnt2_resource_role_deptname_cnt_ij                     2.542974017
      cnt2_resource_role_rollup_2_ratio_i                    2.107624216
      cnt2_resource_role_deptname_ratio_j                    2.017153645
      cnt2_resource_role_rollup_2_ratio_j                    1.910465811
      cnt2_resource_role_family_ratio_i                      1.770737494
      ...                                                    ...
      cnt4_resource_mgr_id_role_rollup_2_role_family_desc    0.008938286
      cnt4_resource_role_rollup_1_role_rollup_2_role_title   0.008930661
      cnt4_resource_mgr_id_role_rollup_1_role_family_desc    0.002106958
  • 56. Summary
    some insights

      summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
                    'cnt2_resource_role_deptname_ratio_j')])
      ##  cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j
      ##  Min.   :  1.0                      Min.   :0.0003
      ##  1st Qu.:  2.0                      1st Qu.:0.0061
      ##  Median :  7.0                      Median :0.0172
      ##  Mean   : 15.6                      Mean   :0.0315
      ##  3rd Qu.: 17.0                      3rd Qu.:0.0368
      ##  Max.   :201.0                      Max.   :1.0000
  • 57. Summary
    some insights

      xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
      tt <- t.test(xx ~ y)
      list(estimate=tt$estimate, conf.int=tt$conf.int, p.value=tt$p.value)
      ## $estimate
      ## mean in group 0 mean in group 1
      ##           10.04           13.82
      ##
      ## $conf.int
      ## [1] -4.851 -2.710
      ## attr(,"conf.level")
      ## [1] 0.95
      ##
      ## $p.value
      ## [1] 5.838e-12

      par(mar=c(5,4,2,2))
      boxplot(xx ~ y)
  • 58. Summary
    some insights

      xxx <- cut(xx, include.lowest=TRUE, breaks=c(0,1,3,7,14,30,300))
      par(mar=c(5,2,0,0))
      barplot(table(xxx))

      tb <- table(y, xxx)
      r_0 <- tb[1, ] / colSums(tb)   # share of y == 0 within each bin
      par(mar=c(5,2,0,0))
      plot(r_0, type='l', lwd=3)
  • 59. Summary
    some insights

      xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
      tt <- t.test(xx ~ y)
      list(estimate=tt$estimate, conf.int=tt$conf.int, p.value=tt$p.value)
      ## $estimate
      ## mean in group 0 mean in group 1
      ##         0.01955         0.02902
      ##
      ## $conf.int
      ## [1] -0.011732 -0.007205
      ## attr(,"conf.level")
      ## [1] 0.95
      ##
      ## $p.value
      ## [1] 3.93e-16

      par(mar=c(5,4,2,2))
      boxplot(xx ~ y)
  • 60. Summary
    some insights

      xxx <- cut(xx, include.lowest=TRUE, breaks=quantile(xx, seq(0,1,0.2)))
      par(mar=c(5,2,0,0))
      barplot(table(xxx))

      tb <- table(y, xxx)
      r_0 <- tb[1, ] / colSums(tb)   # share of y == 0 within each quintile bin
      par(mar=c(5,2,0,0))
      plot(r_0, type='l', lwd=3)
  • 61. Summary
    overfitting

      MODEL                            AUC_CV     AUC_PUBLIC  AUC_PRIVATE
      num_glmnet_0                     0.8985069  0.87737     0.87385
      stacking_gbm_with_the_glmnet     0.9277316  0.90695     0.90478
  • 62. Summary
    overfitting

      MODEL                            AUC_CV     AUC_PUBLIC  AUC_PRIVATE
      num_glmnet_0                     0.8985069  0.87737     0.87385
      stacking_gbm_with_the_glmnet     0.9277316  0.90695     0.90478
      stacking_gbm_without_the_glmnet  0.9182303  0.91529     0.91130
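One way to read this table is through the gap between the cross-validated AUC and the private leaderboard AUC. The sketch below only does arithmetic on the numbers from the slide; treating that gap as an overfitting indicator is an interpretation, and the column name `gap_cv_private` is made up for the example:

```r
# AUC values copied from the table on the slide.
auc <- data.frame(
  model   = c("num_glmnet_0",
              "stacking_gbm_with_the_glmnet",
              "stacking_gbm_without_the_glmnet"),
  cv      = c(0.8985069, 0.9277316, 0.9182303),
  public  = c(0.87737, 0.90695, 0.91529),
  private = c(0.87385, 0.90478, 0.91130)
)

# How much the CV estimate overstates the private leaderboard score.
auc$gap_cv_private <- auc$cv - auc$private
print(auc[order(auc$gap_cv_private), ])
```

The stack without the glmnet has a lower CV score than the stack with it, yet it shows the smallest gap and the highest private score of the three.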
  • 63. Summary
    overfitting
    Winning solution code and methodology
    http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology
  • 64. Summary
    useful discussions
       · Python code to achieve 0.90 AUC with Logistic Regression
         http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
       · Starter code in python with scikit-learn (AUC .885)
         http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
       · Patterns in Training data set
         http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set
  • 65. Thank you