1. Comparative Study on Classification
Techniques to Identify 3G Customers
GUO Jing
GD405, Motorola (China) Electronics Ltd. Beijing Second Branch, No.39A
Zi Zhu Yuan Road, Hai Dian District, Beijing China, 100089
E-mail: w22411@motorola.com
1. Introduction
Along with the worldwide spread of 3G networks, the mobile market is
undergoing a major transition. In many countries, penetration rates are beginning to
peak, and therefore operators are moving from a strategy of new customer
acquisition to one of retention and of increasing customer spending and profitability.
Identifying which customers are likely to switch to the 3G network will
definitely be a major approach to achieving this goal.
Many mobile operators already hold tremendous amounts of customer usage and demographic
data in their systems. However, data is characterized as recorded facts, while
information is the set of patterns or expectations that underlie the data. There is a
huge amount of information locked up there, in the operators’ databases - information
that is potentially important but has not yet been discovered or articulated. Our
mission is to bring it forth. As stated, the data, in many cases, already exist. It is a
matter of knowing which kind of data contributes more useful information and finding
the methodologies to make use of the data to present the information.
Data mining is the extraction of implicit, previously unknown, and potentially
useful information from data. The idea is to build computer programs that examine
through databases automatically, seeking regularities or patterns. Strong patterns, if
found, will likely generalize to make accurate predictions on future data.
The data mining task in this paper aims at solving a given classification problem for
which the objective is to accurately predict as many current 3G customers as possible
from the “holdout” sample provided (See Table 1). An original sample dataset of
20,000 2G network customers and 4,000 3G network customers has been provided
with more than 200 data fields. The target categorical variable is “Customer_Type”
(2G/3G). Of course, in the total 24,000 instances and 251 attributes, there exist
problems. Many of the attributes will be uninteresting and banal. Others will be
spurious, contingent on accidental coincidences in the particular dataset used. In a
word, the real data is imperfect.
Table 1 2G/3G Network Customers’ Distribution in Training/Test Set and Prediction Set
Training /Testing Set Prediction Set
2G user number 3G user number 2G user number 3G user number
15K 3K 5K 1K
For this classification case, various models were built up to find strong patterns,
which will make accurate enough predictions on customer type (2G/3G). In this
paper, we discuss the employed approaches and the full technical details of the algorithms, as
well as the trained classification model. Although the trained model will never be an
elixir for all kinds of cases, the whole procedure of solving a practical problem will
surely give us a glimpse of the whole forest. Finally, we discuss the insight gained
from the model in terms of identifying current 2G customers with the potential to
switch to 3G.
It should be mentioned that all the analysis results and constructed patterns obtained
in this paper are based on a well-known data mining tool, Weka, which is a collection
of state-of-the-art machine learning algorithms and data preprocessing tools (see Figure 1).
Figure 1 The Weka Explorer Interface
2. Data Preprocessing
We have 24,000 instances at hand, each of which contains 251 attributes.
Obviously there exist some problems in the raw data. Data types vary (nominal or
numeric), missing values exist, and not all attributes contain the information we need.
Therefore, we take several steps to make sure the data for later machine learning is
reasonable and clean.
2.1 Remove the banal and uninteresting attributes
Table 2 The List of Removed Attributes in the Data Preprocessing Step
Removed Attributes Reasons
SERIAL_NUMBER {SERIAL_NUMBER<=3000|3G}
{SERIAL_NUMBER>3000|2G}
This attribute merely encodes how the sample was assembled and is meaningless as a predictor.
HS_CHANGE, TOT_PAST_DEMAND, All the instances share the same value in
VAS_DRIVE_FLAG, the training/test set or the prediction set
VAS_VMN_FLAG, VAS_INFOSRV, with regard to this attribute.
VAS_SN_FLAG, VAS_CSMS_FLAG,
DELINQ_FREQ, AVG_VAS_IDU,
AVG_VAS_WLINK, AVG_VAS_MILL,
AVG_VAS_IFSMS, AVG_VAS_#123#,
AVG_VAS_CG, AVG_VAS_IEM,
AVG_VAS_ISMS, AVG_VAS_SS,
STD_VAS_IDU, STD_VAS_WLINK,
STD_VAS_MILL, STD_VAS_IFSMS,
STD_VAS_#123#, STD_VAS_CG,
STD_VAS_IEM, STD_VAS_ISMS,
STD_VAS_SS
VAS_VMP_FLAG, Except for fewer than 5 instances, nearly all
TELE_CHANGE_FLAG the instances share the same value with
regard to these attributes.
TOT_DIS_1900, TOT_USAGE_DAYS There are other similar attributes
(AVG_DIS_1900, AVG_USAGE_DAYS)
in the data set.
2.2 Remove certain instances that are regarded as outliers
In the training and testing set, there are 108 numeric attributes, out of the total 251,
that are regarded as interesting. However, certain instances have merely 0
values for most of these numeric attributes. We can simply remove those outliers at the
outset by setting a low threshold of 0.001. Instances whose values fall below the
threshold for important numeric attributes like AVG_CALL
(average number of calls in the last 6 months) are deleted. In this way, we
clean and reduce the data to 17,577 instances out of the 18,000 in the training and testing
set.
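This filtering step can be sketched in Python with pandas; the column names and toy values below are illustrative stand-ins, since the real field list comes from the 108 interesting numeric attributes of the (non-public) customer set:

```python
import pandas as pd

def drop_zero_usage(df, key_columns, threshold=0.001):
    """Drop instances whose key numeric attributes are all effectively zero."""
    all_zero = (df[key_columns].abs() < threshold).all(axis=1)
    return df[~all_zero]

# Toy data: the third customer has no recorded usage at all.
data = pd.DataFrame({
    "AVG_CALL": [12.4, 3.1, 0.0],
    "AVG_BILL_AMT": [150.0, 48.5, 0.0],
})
cleaned = drop_zero_usage(data, ["AVG_CALL", "AVG_BILL_AMT"])
print(len(cleaned))  # 2
```

Note that an instance is removed only when all of its key attributes fall below the threshold, so a customer with a single quiet month is not mistaken for an outlier.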
2.3 Deal with missing values
The next enhancement to the data mining problem deals with the problems of
missing values. Missing values are endemic in real-world data sets. Most machine
learning methods make the implicit assumption that there is no particular significance
in the fact that a certain instance has an attribute value missing: the value is simply
not known. However, some attributes, like OCCUP_CD (occupation code), show great
importance in the attribute selection procedure while containing more than
60% missing values. With regard to this kind of attribute, we find the value with
the highest frequency (e.g., OTH appears to be the most common value of
OCCUP_CD) and replace the missing values with it. Since instances with missing
values often provide a good deal of information, taking this approach is much better
than simply ignoring all instances in which some of the values are missing.
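The mode-imputation step above can be sketched as follows (the toy occupation codes are illustrative, though OTH is named in the text as the most frequent value of OCCUP_CD):

```python
import pandas as pd

def impute_with_mode(df, column):
    """Replace missing values with the most frequent value of the column."""
    mode_value = df[column].mode().iloc[0]
    return df[column].fillna(mode_value)

customers = pd.DataFrame({
    "OCCUP_CD": ["OTH", None, "ENG", "OTH", None],
})
customers["OCCUP_CD"] = impute_with_mode(customers, "OCCUP_CD")
print(customers["OCCUP_CD"].tolist())  # ['OTH', 'OTH', 'ENG', 'OTH', 'OTH']
```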
3. Construct the Prediction Model
Experience shows that no single machine learning scheme is appropriate to all data
mining problems. The universal learner is an idealistic fantasy. Real datasets vary,
and to obtain accurate models the bias of the learning algorithm must match the
structure of the domain. Therefore, after attribute selection, we build the prediction
model by learning a series of models and combining them to make trustworthy
and wiser decisions on true positives.
It should be mentioned that, since the error rate on the training set is too optimistic to
be a good indicator of future performance, in this section we always measure a
classifier’s performance using 10-fold cross-validation.
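As a rough sketch of this evaluation protocol, using scikit-learn and synthetic data as stand-ins (the paper itself relies on Weka's built-in cross-validation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer data (the real set is not public).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 10 folds is held out once; the model never sees its own test
# fold, so the averaged accuracy is less optimistic than training-set error.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())
```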
3.1 Attribute Selection
Most machine learning algorithms are designed to learn which attributes are the most
appropriate to use for making decisions. For example, decision tree methods
choose the most promising attribute to split on at each point and should - in theory -
never select irrelevant or unhelpful attributes. Having more features should surely - in
theory - result in more discriminating power. However, in practice, adding irrelevant
or distracting attributes to a data set often "confuses" machine learning systems.
In the data preprocessing step, we have managed to remove some obviously
uninteresting attributes. However, there are still hundreds of attributes in the dataset,
some of which may be irrelevant and could cause negative effects on the
machine learning schemes. Weka’s attribute selection panel provides a tool to
specify an attribute subset evaluator and a search method. A subset evaluator takes a
subset of attributes and returns a numeric measure that guides the search. For
instance, CfsSubsetEval assesses the predictive ability of each attribute individually
and the degree of redundancy among them, preferring sets of attributes that are highly
correlated with the class but have low inter-correlation. Search methods traverse the
attribute space to find a good subset whose quality is measured by the chosen attribute
subset evaluator. For example, BestFirst performs greedy hill climbing with
backtracking, while RankSearch sorts attributes using a single-attribute evaluator and
then ranks promising subsets using an attribute subset evaluator. By this means, we
ran many experiments to determine the attributes that should be retained for model
training, as shown in the following list.
Table 3 The List of Retained Attributes in the Attribute Selection Step
AGE VAS_CND_FLAG AVG_MINS_INTT3
MARITAL_STATUS VAS_CNND_FLAG AVG_VAS_GAMES
HIGHEND_PROGRAM_FLAG VAS_NR_FLAG AVG_VAS_GPRS
NUM_ACT_TEL VAS_VM_FLAG AVG_VAS_CWAP
NUM_DELINQ_TEL VAS_AR_FLAG STD_VAS_GAMES
HS_AGE AVG_CALL_OB STD_VAS_GPRS
HS_MODEL AVG_CALL_MOB STD_VAS_CWAP
LOYALTY_POINTS_USAGE AVG_MINS_MOB STD_VAS_ESMS
BLACK_LIST_FLAG AVG_BILL_AMT CUSTOMER_TYPE
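The idea behind CfsSubsetEval can be sketched with a simple greedy filter: rank attributes by correlation with the class, then skip any candidate that is too strongly correlated with an already-chosen attribute. This is only a rough analogue of Weka's implementation (which uses a merit formula over the whole subset), and the column names are synthetic:

```python
import numpy as np
import pandas as pd

def cfs_like_selection(df, target, max_intercorr=0.8):
    """Greedy filter in the spirit of CfsSubsetEval: prefer attributes highly
    correlated with the class but weakly correlated with each other."""
    corr = df.corr().abs()
    ranked = corr[target].drop(target).sort_values(ascending=False)
    chosen = []
    for attr in ranked.index:
        if all(corr.loc[attr, c] < max_intercorr for c in chosen):
            chosen.append(attr)
    return chosen

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
frame = pd.DataFrame({
    "useful": signal + 0.1 * rng.normal(size=200),
    "duplicate": signal + 0.1 * rng.normal(size=200),  # redundant copy
    "noise": rng.normal(size=200),
    "CLASS": (signal > 0).astype(int),
})
selected = cfs_like_selection(frame, "CLASS")
print(selected)
```

The redundant copy is dropped because it adds no information beyond the attribute already chosen, while the weakly class-correlated noise column survives the inter-correlation check; a real run would also weigh its low class correlation.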
3.2 Decision Tree
The first machine learning scheme that we use to train the customer type model
derives from the simple divide-and-conquer algorithm for producing decision trees.
We employ the popular decision tree algorithm C4.5 to construct the model, which, with
its commercial successor C5.0, has emerged as the industry workhorse for off-the-shelf
machine learning. Compared with the ID3 algorithm, C4.5 makes a series of
improvements, including methods for dealing with numeric attributes, missing
values, and noisy data, and for generating rules from trees. Figure 2 shows the whole C4.5
training procedure and provides us a platform to think in terms of how data flows
through the system.
Figure 2 The Weka Knowledge Flow Interface
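The effect of pruning controls on tree scale, discussed below, can be illustrated with scikit-learn's CART on synthetic data (an assumption of this sketch; the paper's actual learner is Weka's J48 implementation of C4.5):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 2G/3G customer data.
X, y = make_classification(n_samples=1000, n_features=15, flip_y=0.05,
                           random_state=1)

# An unconstrained tree grows until every leaf is pure.
unpruned = DecisionTreeClassifier(random_state=1).fit(X, y)
# min_samples_leaf plays a role similar to C4.5's -M option (minimum
# instances per leaf) and yields a much smaller tree.
pruned = DecisionTreeClassifier(min_samples_leaf=20, random_state=1).fit(X, y)
print(unpruned.get_n_leaves(), pruned.get_n_leaves())
```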
Surprisingly, the decision tree constructed with C4.5 is of very large scale. The most
popular attribute on which the tree splits is HS_MODEL (Handset Model). There are
332 distinct handset models in the data set, which can be merged
into several categories as described in Table 4 and Figure 3. We hope that, by this
means, the decision tree induced from the training data can be simplified without loss of
accuracy.
Table 4 Handset Models Merge into Different Categories
Handset Model Category Characteristic
baicj, bagia, bagib, bagic, 3G Very high proportions of customers
bagid, baicb, bgcbj, bbcea, with these handset models are 3G
beccj (Total number: 9) customers.
bahji, baggb, bdaac, 2G Few customers own handsets of these
bgbba, baggg, bagfc... kinds, and according to the training
(Total number: 145) and test set, these handsets can only
support 2G services.
begac, bbacj, begaf, 0to1 Normal proportion patterns: 2G vs.
bdaab, bbafa, baibj… 3G; instance number ranges from 0 to
(Total number: 148) 100.
bbbei, bbcdd, bbadi, 1to5 Normal proportion patterns: 2G vs.
bagfh, bgcac, bajci (Total 3G; instance number ranges from 101
number: 25) to 500.
bgcab, baiai, baiaj, bbcch, 5to10 Normal proportion patterns: 2G vs.
bahfi (Total number:5) 3G; instance number ranges from 501
to 1000.
Figure 3 Handset Models Merge into Different Categories (in the histogram: blue stands for
3G instances while red stands for 2G instances)
Experiments show that this merging hardly affects the classification accuracy of C4.5
(evaluated by 10-fold cross-validation). What the technique does affect is the decision
tree size. The resulting trees are invariably much smaller than the original ones, even
though they perform almost the same. The comparison is shown in Table 5.
Table 5 Accuracy and Tree Size Comparison of Different Decision Tree Models
Before Merging After Merging
Handset Models Handset Models
Correctly Classified Instances 16124 (89.5778 %) 16110 (89.5%)
Confusion Matrix 3G 2G 3G 2G
1505 1495 | 3G 1614 1386 | 3G
381 14619 | 2G 504 14496 | 2G
Number of Leaves 773 240
Size of the tree 857 470
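The frequency-based merge of handset models can be sketched as a simple mapping from each model code to a bucket named after how often the model occurs, in the spirit of Table 4 (the model codes and bucket boundaries below are toy stand-ins):

```python
import pandas as pd

def merge_rare_categories(series, bins):
    """Map each category to a bucket name based on how often it occurs,
    mimicking the frequency-based merge of handset models."""
    counts = series.value_counts()
    def bucket(value):
        n = counts[value]
        for upper, name in bins:
            if n <= upper:
                return name
        return "other"
    return series.map(bucket)

# Toy handset codes: "a" is rare, "b" mid-frequency, "c" common.
models = pd.Series(["a"] * 3 + ["b"] * 120 + ["c"] * 600)
bins = [(100, "0to1"), (500, "1to5"), (1000, "5to10")]
merged = merge_rare_categories(models, bins)
print(merged.unique().tolist())  # ['0to1', '1to5', '5to10']
```

Collapsing 332 distinct values into a handful of buckets is what lets the induced tree shrink from 773 leaves to 240 without giving up accuracy: the split on HS_MODEL no longer needs one branch per model.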
3.3 Lazy Classifier
We did plenty of experiments and found the second appropriate approach to the
given classification problem to be lazy learners, which store the training instances
and do no real work until classification time. IB1 is a basic instance-based learner
which finds the training instance closest in Euclidean distance to the given test
instance and predicts the same class as this training instance. If several instances
qualify as the closest, the first one found is used. IBk is a k-nearest-neighbours
classifier; k can be specified explicitly or determined automatically using leave-one-out
cross-validation, subject to an upper limit given by the specified value. In addition,
predictions from more than one neighbor can be weighted according to their distance
from the test instance. We set k to 3 to trade off between accuracy and complexity.
Results could be found in Table 6.
Table 6 Accuracy Comparison of Different Lazy Classifiers
1 nearest neighbour 3 nearest neighbours
Correctly Classified Instances 14919 (85.0522 %) 15488 (88.1152 %)
Confusion Matrix 3G 2G 3G 2G
1538 1413 | 3G 1501 1477 | 3G
1209 13381 | 2G 612 13987 | 2G
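The 1-NN versus 3-NN comparison can be sketched with scikit-learn's k-nearest-neighbours classifier on synthetic data (a stand-in; the paper's runs use Weka's IB1 and IBk on the customer set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the customer data.
X, y = make_classification(n_samples=400, n_features=8, random_state=2)

# IB1 corresponds to k=1; the paper's IBk run uses k=3, which smooths out
# the influence of single noisy neighbours at some extra prediction cost.
accuracies = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    for k in (1, 3)
}
print(accuracies)
```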
3.4 Boosting
When wise people make critical decisions, they usually take into account the
opinions of several experts rather than relying on their own judgment or that of a
solitary trusted adviser. In data mining, an obvious approach to making decisions
more reliable is to combine the output of different models. The boosting method for
combining multiple models exploits this insight by explicitly seeking models that
complement one another. Boosting uses voting to combine the output of individual
models which are of the same type (here we employ C4.5). In boosting each new
model is influenced by the performance of those built previously, which means that
boosting encourages new models to become experts for instances handled incorrectly
by earlier ones. Boosting is perhaps the most powerful method for achieving true
positives. Results of boosting with the learner AdaBoostM1 are shown in Table 7.
Table 7 Results of Boosting with learner AdaBoostM1
Boosting with C4.5
Correctly Classified Instances 15571 (88.5874 %)
Confusion Matrix 3G 2G
1775 1203 | 3G
803 13796 | 2G
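The reweighting idea can be sketched with scikit-learn's AdaBoost on synthetic data; note this is a stand-in for Weka's AdaBoostM1 over C4.5, and here the boosted base learner is the default depth-1 tree rather than a full C4.5 tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer data, with some label noise.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.1,
                           random_state=3)

# Each boosting round reweights the instances that earlier rounds
# misclassified, so later models specialise on the hard cases.
boost_acc = cross_val_score(
    AdaBoostClassifier(n_estimators=50, random_state=3), X, y, cv=10
).mean()
single_acc = cross_val_score(
    DecisionTreeClassifier(max_depth=1), X, y, cv=10
).mean()
print(round(single_acc, 3), round(boost_acc, 3))
```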
4. Conclusions
We have constructed many models with different machine learning algorithms.
According to the 10-fold cross-validation performance results, identifying true negatives
(2G customers) is much easier than identifying true positives (3G customers). Since
the number of 3G instances in the training and test set is much smaller than that of 2G,
there is not enough information to identify 3G customers among the massive number of
2G customers. The data set at hand is limited, but in the future we can make use of all
kinds of potentially useful data to build a strong enough dataset in the first place.
As mentioned before, the real data is imperfect: Some parts are garbled, and some
are missing. Anything discovered will be inexact: There will be exceptions to every
rule and cases not covered by any rule. Therefore, algorithms need to be robust
enough to cope with imperfect data and to extract regularities that are inexact but
useful. Nothing in the universe is perfect; what we can do is dig a little
deeper to make better predictions and come a step closer to the truth.
We combined all the meaningful models to vote for the final decision on
classification. It is an exciting fact that the overlap among the classification results
derived from different algorithms and concepts is large (see Table 8), which also
suggests that the models we obtained are dependable for making future predictions. Finally, we
successfully identified 884 3G customers in the prediction set. The decision trees
derived from the given training and test set are of great scale, so we visualize one of
them in text form as an example, which can be found in Appendix A.
Table 8 A segment of Prediction Results Decided by Three Different Algorithms: AdaBoostM1,
LazyIBk, C4.5
AdaBoostM1 LazyIBk C4.5 Comparison
3G 3G 3G
2G 2G 2G
2G 2G 2G
2G 2G 2G
2G 2G 2G
2G 2G 2G
2G 2G 2G
2G 2G 2G
2G 3G 3G +
2G 2G 2G
2G 2G 2G
2G 2G 2G
2G 2G 2G
2G 3G 3G +
3G 2G 2G +
2G 2G 2G
2G 2G 2G
3G 3G 3G
2G 3G 3G +
2G 2G 2G
2G 3G 3G +
2G 2G 2G
2G 2G 2G
2G 2G 2G
3G 3G 3G
2G 2G 2G
3G 3G 3G
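The voting step itself reduces to taking the majority label across the three models' predictions for each customer. The rows below are hypothetical examples in the spirit of Table 8, not values copied from it:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-customer predictions from several models by simple majority."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical rows in the order (AdaBoostM1, LazyIBk, C4.5).
rows = [("3G", "3G", "3G"), ("2G", "3G", "3G"),
        ("3G", "2G", "2G"), ("2G", "2G", "2G")]
final = [majority_vote(r) for r in rows]
print(final)  # ['3G', '3G', '2G', '2G']
```

With three binary voters a tie is impossible, so the combined decision is always well defined.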
The “false positives”, which belong to the 2G category but are classified into the 3G
category, answer our remaining question - how to identify current 2G customers with
the potential to switch to 3G. According to the trained model, those
false positives are recognized as 3G customers because they must have something in
common with the real 3G customers. Some of them may use brand-new handset
models developed for 3G purposes, some may play a lot of
games, while others like to download via GPRS. The false positives in the
confusion matrix tell us the truth and give us valuable information about
which customers the operators should pay more attention to when they intend to gain
more 3G customers and increase customer spending and profitability.
Appendix A
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: Classify_2G_3G-weka.filters.unsupervised.instance.RemoveWithValues-S0.0010-C71-Lfirst-last-
weka.filters.unsupervised.attribute.Remove-
R2,4-6,8-15,17,19-21,23,25-28,31-46,49-50,53-54,56-63,65-75,77-88,91-104,106-115,117,119-121,123-193,195,197,200-219
Instances: 17577
Attributes: 27
AGE
MARITAL_STATUS
HIGHEND_PROGRAM_FLAG
NUM_ACT_TEL
NUM_DELINQ_TEL
HS_AGE
HS_MODEL
LOYALTY_POINTS_USAGE
BLACK_LIST_FLAG
VAS_CND_FLAG
VAS_CNND_FLAG
VAS_NR_FLAG
VAS_VM_FLAG
VAS_AR_FLAG
AVG_BILL_AMT
AVG_CALL_OB
AVG_CALL_MOB
AVG_MINS_MOB
AVG_MINS_INTT3
AVG_VAS_GAMES
AVG_VAS_GPRS
AVG_VAS_CWAP