1. ACML 2012 contest
Hierarchical Committee Machines to Detect
Frauds in Mobile Advertising
A description of the contest and the dataset provided can
be found at http://palanteer.sis.smu.edu.sg/fdma2012/
Evaluation metric: MAP (mean average precision)
S. Shivashankar and P. Manoj
Ericsson Research India
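As a minimal sketch of the evaluation metric: for a single ranked list of publishers (most suspicious first), MAP reduces to average precision over that list. The function below assumes binary relevance labels (1 = fraudulent publisher, 0 = not); this is an illustration, not the contest's official scoring script.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list.
    ranked_labels: relevance labels (1 = fraud, 0 = not), in ranked order."""
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # precision@k at each relevant position
    return precision_sum / hits if hits else 0.0
```

MAP over several rankings is then just the mean of the per-ranking average precisions.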
2. Feature engineering / derived attributes (1)
S.No | Feature | Description | Comment
1 | Number of unique ip | Number of unique IPs per pid | Helped
2 | Number of unique cid | Number of unique cids per pid | Helped
3 | Number of unique cntr | Number of unique cntrs per pid | Helped
4 | Number of unique category | Number of unique categories per pid | Helped
5 | Total clicks | Total clicks per pid | Helped
6 | Category | Category name for which clicks exist per pid | Helped
7 | Country feature vector | Country-wise clicks per pid; 1 x C vector, where C is the total number of countries | Did not help much
8 | Category feature vector | Category-wise clicks per pid; 1 x N vector, where N is the number of categories | Did not help much
9 | Clicks per category | Number of clicks per category per pid | Helped
10 | Countries with highest number of clicks | Countries sorted by number of clicks; the K countries with highest clicks per cid are appended | Did not help much
Ericsson Internal | 2012-10-19 | Page 2
3. Feature engineering / derived attributes (2)
S.No | Feature | Description | Comment
11 | Bank account given or not | Boolean attribute: 0 if the bank account for pid is not given, else 1 | Did not help much
12 | Address given or not | Boolean attribute: 0 if the address for pid is not given, else 1 | Did not help much
13 | Top country | Country with highest clicks per pid | Did not help much
14 | Cluster id | Cluster the click data into a predefined number of clusters (say 5) and add the distribution of clicks over the clusters as a feature vector | Did not help much
15 | No of referrers | Number of unique referrers per pid | Helped
16 | Number of days | Number of days pid is active | Helped
17 | Clicks per day | Number of clicks per day per pid | Helped (A)
18 | Sum of difference in time | Sum of time differences between consecutive clicks for a pid | Helped
19 | Average of difference in time | Average of time differences between consecutive clicks for a pid | Helped
20 | Standard deviation of difference in time | SD of time differences between consecutive clicks for a pid | Did not help much
4. Feature engineering / derived attributes (3)
S.No | Feature | Description | Comment
21 | Clicks per category | Total number of clicks per category per pid | Helped (B)
22 | Average clicks – day | Average clicks per day per pid | Did not help much
23 | Average clicks – referrer | Average clicks per referrer per pid | Did not help much
24 | No of agents | Number of unique agents per pid | Helped (C)
25 | Sum of difference of clicks – ip and cid | Sum of difference of clicks per ip per cid per pid | Did not help much
26 | Sum of clicks | Duplicate clicks sum | Did not help much
27 | Average clicks – agent | Average clicks per agent per pid | Helped (LAD tree)
28 | Average clicks – ip | Average clicks per ip per pid | Helped (LAD tree)
29 | Average clicks – cid | Average clicks per cid per pid | Helped (LAD tree)
30 | Average clicks – cntr | Average clicks per cntr per pid | Helped (LAD tree) (D)
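A few of the derived attributes above (unique-IP count, total clicks, and the time-difference features) can be sketched from a raw click log as follows. The (pid, ip, timestamp) tuple layout is illustrative, not the contest's actual schema:

```python
from collections import defaultdict
from statistics import mean

def derive_features(clicks):
    """clicks: iterable of (pid, ip, timestamp) tuples.
    Returns, per pid: unique-IP count, total clicks, and the sum and
    average of time differences between consecutive clicks."""
    by_pid = defaultdict(list)
    for pid, ip, ts in clicks:
        by_pid[pid].append((ts, ip))
    features = {}
    for pid, rows in by_pid.items():
        rows.sort()  # order this pid's clicks by timestamp
        times = [ts for ts, _ in rows]
        diffs = [b - a for a, b in zip(times, times[1:])]
        features[pid] = {
            "n_unique_ip": len({ip for _, ip in rows}),
            "total_clicks": len(rows),
            "sum_time_diff": sum(diffs),
            "avg_time_diff": mean(diffs) if diffs else 0.0,
        }
    return features
```

Very small time differences concentrated on one publisher are exactly the kind of pattern the time features on the previous slide are meant to expose.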
5. Methods used
› We posed this as a two-class problem rather than a three-class one, since there are efficient methods for binary classification. Two groupings were tried:
– Fraud and Observation grouped together (vs. OK)
– Observation and OK grouped together (vs. Fraud)
› Grouping Fraud and Observation together worked better than the three-class setup and the other two-class setups.
› Datasets
– The first 10 attributes that helped were grouped into dataset A, the first 13 into dataset B, 14 into dataset C and 18 into dataset D; the cut-off points A, B, C and D are marked on the previous slides.
› Algorithms
– J48, REP tree, LAD tree, AODE
– Note that dataset D performs well with LAD tree only
› Approaches for class imbalance
– Cost sensitive classification
– Ensemble learning
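Cost-sensitive classification, listed above as one remedy for class imbalance, can be sketched as choosing the label that minimizes expected cost under an asymmetric cost matrix. The costs below are illustrative defaults, not the ones used in the contest:

```python
def cost_sensitive_label(p_fraud, cost_fn=10.0, cost_fp=1.0):
    """Pick the label with the lower expected cost.
    cost_fn: cost of a missed fraud (false negative, assumed high)
    cost_fp: cost of a false alarm (false positive, assumed low)"""
    expected_cost_ok = p_fraud * cost_fn          # cost if we say 'ok' but it is fraud
    expected_cost_fraud = (1 - p_fraud) * cost_fp  # cost if we say 'fraud' but it is ok
    return "fraud" if expected_cost_fraud < expected_cost_ok else "ok"
```

With cost_fn=10 and cost_fp=1 the decision threshold drops from 0.5 to 1/11, so far more publishers are flagged; that rebalancing is what helps under heavy class imbalance.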
6. Observations
Method | Dataset A | Dataset B | Dataset C | Dataset D
Decorate with J48 | 38.54 | 41.99 | 43.19 | 43.29
Bagging with REP tree | 32.99 | 39.64 | 41.64 | 40.99
Bagging with cost-sensitive classifier with LAD tree | 38.57 | 43.06 | 46.28 | 47.57
KStar | 17.87 | 27.54 | 29.87 | –
AODE | 19.01 | 38.75 | – | 41.27
Only classifiers that performed well and gave diverse results (useful for ensemble learning) are shown here; not all classifiers we tried are presented.
Ericsson Internal | 2012-10-19 | Page 6
7. Hierarchical committee machines
[Diagram: datasets with different groupings of attributes (A–D) each feed a committee machine (CMA, CMB, CMC, CMD); their outputs are combined into a final committee machine CM, which maps an input x to p(fraud|x).]
Score on the validation set – 51.49
Score on the test set – 38.0744
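The combining step of the hierarchy can be sketched as below. Unweighted averaging of the member committees' fraud probabilities is an assumption for illustration; the slides do not specify the combination rule:

```python
def combine_committees(member_probs):
    """member_probs: p(fraud|x) scores for one publisher, one per
    committee machine (e.g. CMA..CMD). Returns the combined score."""
    return sum(member_probs) / len(member_probs)

def rank_publishers(scores_by_pid):
    """Rank publishers by combined fraud probability, highest first,
    as needed for a MAP-style evaluation."""
    return sorted(scores_by_pid, key=scores_by_pid.get, reverse=True)
```

Averaging only pays off when the members err differently, which is why diversity of the member classifiers is stressed on the previous slides.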
8. Discussions (1)
› Typical methods such as over-sampling, under-sampling, SMOTE and HDDT did not help
– Sampling methods may need to be investigated more carefully to see how they can be made useful, since they are widely accepted for class-imbalance scenarios
› Cost-sensitive classification helped with a few classifiers, such as LAD tree
› Random Forest did not perform better than its tree counterparts under ensemble learning, and its results were not diverse enough to help in the final committee machine
› Bayesian ranking methods such as AODE perform well with more attributes
› With more attributes LAD tree performs well individually, but does not produce very diverse results on datasets C and D
› Memory-based methods such as KStar do not perform well individually, but helped as part of the committee
9. Discussions (2)
› Most of the fraud clicks belonged to publishers whose category was ‘AD’ or ‘MC’
› The common intuition of using duplicate IPs per publisher did not help much
› Country information did not help much
› Surprisingly, the phone agent (model) information of the users helped
› Time information was critically important for good performance
– Might need further investigation/refinement to improve results