Amazon Employee Access Challenge (Kaggle)
NYC Open Data Meetup / NYC Data Science Academy
Vivian Zhang, SupStat Inc
2. Agenda
1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
2 of 65 6/13/14, 2:01 PM
4. Introduction to the Challenge
the mission
· build an automatic access model from the historical data
· determine an employee's access privilege from their job role and the resource they applied for
5. Introduction to the Challenge
the data
· The data consists of real historical data collected during 2010 and 2011.
· Employees were manually allowed or denied access to resources over time.
the files
· train.csv - the training set. Each row has the ACTION (ground truth), the RESOURCE, and information about the employee's role at the time of approval.
· test.csv - the test set for which predictions should be made. Each row asks whether an employee with the listed characteristics should have access to the listed resource.
6. Introduction to the Challenge
the variables
COLUMN NAME DESCRIPTION
ACTION ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE An ID for each resource
MGR_ID The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1 Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2 Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME Company role department description (e.g. Retail)
ROLE_TITLE Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY Company role family description (e.g. Retail Manager)
ROLE_CODE Company role code; this code is unique to each role (e.g. Manager)
7. Introduction to the Challenge
the metric
AUC (area under the ROC curve)
· a metric used to judge predictions in binary-response (0/1) problems
· sensitive only to the ordering of the predictions, not their magnitudes
· R packages: verification or ROCR
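Because AUC depends only on the ordering, it can also be computed without any package via the rank-sum (Wilcoxon/Mann-Whitney) identity; a minimal base-R sketch on an illustrative toy example:

```r
# AUC as the probability that a randomly chosen positive is scored
# above a randomly chosen negative, computed from rank sums
auc <- function(y, pred) {
  n1 <- sum(y == 1)   # number of positives
  n0 <- sum(y == 0)   # number of negatives
  r  <- rank(pred)    # ranks of predicted scores (ties averaged)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y    <- c(0, 0, 1, 1)
pred <- c(0.1, 0.4, 0.35, 0.8)
auc(y, pred)  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```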
22. Look into the Data
check the distribution - role_family_desc
hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)
23. Look into the Data
check the distribution - resource
hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)
24. Look into the Data
check the distribution - mgr_id
hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)
25. Look into the Data
treat the features as Categorical or Numerical?
YetiMan shared his findings in the forum:
· 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
· 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leaderboard result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
26. Look into the Data
our approach
1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
29. Model Building
Feature Extraction
1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in a sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)
30. Model Building
1. the raw features (as numerical)
31. Model Building
2. the raw features (as categorical) with level reduction
2.1 choose the top-frequency categories
VAR_RAW FREQUENCY VAR_WITH_LEVEL_REDUCTION
a 3 a
a 3 a
a 3 a
b 2 b
b 2 b
c 1 other
d 1 other
# for each column, keep the two most frequent levels
# and collapse everything else into "other"
for (i in 1:ncol(x)) {
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}
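The toy table above can be reproduced directly with the loop (the data-frame and column names are illustrative):

```r
# toy data matching the VAR_RAW column of the table
x <- data.frame(var = c("a", "a", "a", "b", "b", "c", "d"),
                stringsAsFactors = FALSE)

# keep the two most frequent levels; collapse the rest into "other"
for (i in 1:ncol(x)) {
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}

x$var  # "a" "a" "a" "b" "b" "other" "other"
```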
32. Model Building
2. the raw features (as categorical) with level reduction
2.2 use Pearson's chi-squared test
table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
## mgr_770 mgr_not_770
## 0 5 1892
## 1 147 30725
chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
A large p-value means this level's approval rate is not significantly different from the rest, so such a level can be merged into "other".
33. Model Building
3. the dummies (in a sparse Matrix)
ID VAR VAR_A VAR_B VAR_C
1 a 1 0 0
2 a 1 0 0
3 a 1 0 0
4 b 0 1 0
5 c 0 0 1
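One common way to build these dummies sparsely is `sparse.model.matrix()` from the Matrix package; the toy data frame below is illustrative:

```r
library(Matrix)

# toy data matching the VAR column of the table
d <- data.frame(var = factor(c("a", "a", "a", "b", "c")))

# one dummy column per level, stored as a sparse matrix
# (the zeros are not materialized in memory)
m <- sparse.model.matrix(~ var - 1, data = d)
dim(m)  # 5 rows, 3 dummy columns (vara, varb, varc)
```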
35. Model Building
4. the dummies including the interaction
ID M N MN_AP MN_AQ MN_BP MN_BQ
1 a p 1 0 0 0
2 a p 1 0 0 0
3 a q 0 1 0 0
4 b p 0 0 1 0
5 b q 0 0 0 1
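Interaction dummies like MN_AP can be produced from a formula; a base-R sketch of the same toy table (R's generated column names differ from the table's but encode the same pairs):

```r
# toy data matching the M and N columns of the table
d <- data.frame(M = factor(c("a", "a", "a", "b", "b")),
                N = factor(c("p", "p", "q", "p", "q")))

# one dummy per (M, N) combination, no intercept
mm <- model.matrix(~ M:N - 1, data = d)
colnames(mm)  # one column per (M, N) pair, e.g. "Ma:Np"
```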
36. Model Building
5. some derived variables (count & ratio)
· the frequency of every category
· the frequency of the interactions
· the proportion
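These count and ratio features can be sketched in base R with `ave()`; the data frame and column names below are illustrative:

```r
d <- data.frame(mgr_id   = c(1, 1, 1, 2, 2),
                resource = c(10, 10, 20, 10, 20))

# count: how often each category value occurs
d$mgr_count <- ave(d$mgr_id, d$mgr_id, FUN = length)

# interaction count: how often each (mgr_id, resource) pair occurs
d$pair_count <- ave(seq_len(nrow(d)), d$mgr_id, d$resource, FUN = length)

# ratio: the share of a manager's rows involving this resource
d$pair_ratio <- d$pair_count / d$mgr_count
```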
38. Model Building
Base Learners
1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine
39. Model Building
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold CV holdout predictions
40. Model Building
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold CV holdout predictions
· level-1 algorithms: Regularized Generalized Linear Model & Gradient Boosting Machine
· level-2 algorithm: Regularized Generalized Linear Model
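The stacking scheme above can be sketched in base R, with plain `glm()` standing in for the regularized GLM and GBM actually used; all data and variable names here are illustrative:

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2))

k     <- 5
folds <- sample(rep(1:k, length.out = n))
hold1 <- numeric(n)  # 5-fold holdout predictions of the level-1 model

# level 1: fit on 4 folds, predict the held-out fold
for (f in 1:k) {
  m1 <- glm(y ~ x1 + x2, data = d[folds != f, ], family = binomial)
  hold1[folds == f] <- predict(m1, newdata = d[folds == f, ],
                               type = "response")
}

# level 2: fit a model on the stacked holdout predictions
# (with several level-1 learners there would be one column each)
d$hold1 <- hold1
m2 <- glm(y ~ hold1, data = d, family = binomial)
```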
41. Model Building
1. Regularized Generalized Linear Model
· generalized linear model (glm)
· convex penalties
42. Model Building
1. Regularized Generalized Linear Model
· logistic regression
# toy data: the probability that y = 1 rises with x
x <- sort(rnorm(100))
set.seed(114)
y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
       sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
m1 <- lm(y ~ x)                                    # linear probability model
m2 <- glm(y ~ x, family = binomial(link = logit))  # logistic regression
y2 <- predict(m2, type = "response")               # fitted probabilities
par(mar = c(5, 4, 0, 0))
plot(y ~ x); abline(m1, lwd = 3, col = 2)          # linear fit in red
lines(x, y2, lwd = 3, col = 3)                     # logistic curve in green
43. Model Building
1. Regularized Generalized Linear Model
· logistic regression
· convex penalties
44. Model Building
1. Regularized Generalized Linear Model
· convex penalties
  - L1 (lasso)
  - L2 (ridge regression)
  - mixture of L1 & L2 (elastic net)
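In glmnet's formulation the elastic-net objective combines the two norms through a mixing parameter $\alpha$:

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\log L\!\left(y_i,\;\beta_0 + x_i^{\top}\beta\right)
\;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]
$$

where $\alpha = 1$ gives the lasso, $\alpha = 0$ ridge regression, and intermediate values the elastic net.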
45. Model Building
1. Regularized Generalized Linear Model
· the dummies (in a sparse Matrix)
· the dummies including the interaction
· R package: glmnet
46. Model Building
2. Support Vector Machine (just for diversity)
49. Model Building
2. Support Vector Machine (just for diversity)
· the dummies including the interaction
· some derived variables (count & ratio)
· R packages: kernlab, e1071
52. Model Building
3. Random Forest
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest
54. Model Building
4. Gradient Boosting Machine
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm
64. Summary
useful discussions
· Python code to achieve 0.90 AUC with Logistic Regression
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
· Starter code in python with scikit-learn (AUC .885)
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
· Patterns in Training data set
  http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set