ITB tutorial WEKA Prabhat Agarwal
ITB WEKA Tutorial
Data Mining Techniques using WEKA for
Clustering (K-Means), and
Classification (J48 Decision Tree)
VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR
In partial fulfillment
Of the requirements for the degree of
MASTER OF BUSINESS ADMINISTRATION
SUBMITTED BY:
Prabhat Agarwal 10BM60059
VGSOM, IIT KHARAGPUR
About WEKA
WEKA (Waikato Environment for Knowledge Analysis) is machine learning software written
in Java and developed at the University of Waikato, New Zealand. It is a collection of
machine learning algorithms for data mining tasks; the algorithms can either be applied directly
to a dataset (through the WEKA GUI or the command line) or called from your own Java code.
WEKA contains tools for data pre-processing, classification, regression, clustering, association
rules, and visualization. It is also well-suited for developing new machine learning schemes.
WEKA is open source software issued under the GNU General Public License.
WEKA is a powerful tool for business research. It helps managers identify trends in past data,
consumer surveys and similar sources, and so prepares them to take better decisions. It also
relieves them of carrying out complex mathematical computations by hand.
The WEKA GUI Chooser provides a starting point for launching WEKA’s main GUI (Graphical
User Interface) applications and supporting tools.
The GUI Chooser has four buttons, one for each of the four major WEKA applications:
1. Explorer – Environment for exploring data with WEKA. It gives access to all the
facilities using menu selection.
2. Experimenter – An environment for performing experiments and conducting statistical
tests between learning schemes.
3. Knowledge Flow – Supports essentially the same functions as the Explorer, but with a
drag-and-drop interface. It also supports incremental learning.
4. Simple CLI – Provides a simple command-line interface that allows direct execution of
WEKA commands for operating systems that do not provide their own command line
interface.
This tutorial describes two data mining techniques:
1. Clustering (K-Means)
2. Classification using decision trees (J48 tree)
Clustering using WEKA
Cluster analysis, or clustering, means assigning a set of objects to homogeneous groups
(called clusters) so that the objects in the same cluster are more similar (in some sense or
another) to each other than to those in other clusters. Clustering is a main task of
exploratory data mining and a common technique for statistical data analysis used in many fields.
There are two major types of clustering techniques:
1. Hierarchical Clustering
2. Non-Hierarchical Clustering (of which K-means is the method used here)
HIERARCHICAL CLUSTERING - Some measure of distance (usually Euclidean or squared
Euclidean) is used to compute the distances between all pairs of objects to be clustered. We start
with every object in its own cluster, so the number of clusters equals the number of data points.
The two closest objects are joined to form a cluster. The process then continues: points join
existing clusters (because they are closest to an existing cluster) and clusters join other
clusters, always based on the shortest-distance criterion. In this way a range of possible
solutions is formed, from an n-cluster solution at the beginning to a single-cluster solution at the end.
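The merging procedure described above can be sketched in a few lines of Python. This is only an illustration, not WEKA's implementation: it assumes plain Euclidean distance and the "single linkage" rule (the distance between two clusters is the distance between their closest members), and the four sample points and the name single_linkage are invented for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(points):
    """Agglomerative clustering; returns the merges in the order they happen."""
    # Start with every point in its own cluster (n clusters for n points).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        # Join cluster b into cluster a (a < b, so index a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Hypothetical 2-D points, invented for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
merges = single_linkage(points)
for left, right in merges:
    print(left, "+", right)
```

Each printed line is one step from the n-cluster solution towards the single-cluster solution; stopping after any step gives one of the intermediate solutions mentioned above.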
NON-HIERARCHICAL (K-MEANS) CLUSTERING - Here we have to specify in advance the number of
clusters we want the data set to be divided into; that is, we start from a hypothesis that the
objects will group into a certain number of clusters.
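A minimal sketch of the K-means idea, assuming Euclidean distance and starting centroids supplied by hand (WEKA chooses its own starting points, which this sketch does not try to reproduce). The function name k_means and the sample points are assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, centroids, max_iter=100):
    """Basic K-means; the number of clusters is len(centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two obvious groups, two clusters requested.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, labels = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(labels)
```

The number of passes through the loop before the centroids stop moving corresponds to the iteration count that WEKA reports in its clustering output.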
This tutorial demonstrates K-means clustering. The primary data set comes from a survey
conducted by a major apparel store to understand buyer behavior. The data was collected from
100 individuals.
Problem Statement:
A major apparel store (name not disclosed) has conducted a survey to understand buyer behavior
when purchasing items from the store. The survey was filled in by people visiting the stores, who
were selected at random to keep the data free from bias.
The questionnaire comprised 7 questions on factors that the store feels may influence buyer
behavior. Respondents rated each statement on a six-point scale (1 = Strongly Agree, 2 = Agree,
3 = Slightly Agree, 4 = Slightly Disagree, 5 = Disagree, 6 = Strongly Disagree).
The Questions in the data set are:
1. Please rate your frequency in making unplanned casual wear purchase for:
Own Consumption
Other’s Consumption
2. How strongly do you agree with the following sentences
I shop to change my mood
I tend to buy more casual wear unplanned when I feel happy
I tend to buy more casual wear unplanned when I feel unhappy
3. I tend to buy more casual wear unplanned when I see sales promotion such as:
Buy 1 Get 1 free
Cash rebate
Complimentary accessories (ex: Belt, bracelet, necklace)
Complimentary vouchers
Prize Draws
Joint promotions (ex: specific movie ticket given away with purchase of certain
brand of casual wear)
Buy 1 Get the next one at 50 % off
4. I tend to buy more casual wear unplanned when I see sales promotion such as:
50 % discount
20 % discount
Member discount period
Storewide discount
5. Gender
6. Age
7. Monthly income range:
10000 & Below (represented by 1)
10001 to 15000 (represented by 2)
15001 to 20000 (represented by 3)
20001 to 25000 (represented by 4)
25001 to 30000 (represented by 5)
Above 30000 (represented by 6)
A snapshot of the questionnaire is also included.
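The numeric coding used in the questionnaire can be written down as a small Python sketch. The scale labels and income bands come from the text above; the names LIKERT_CODES, INCOME_UPPER_BOUNDS and income_code are invented for illustration and are not part of the survey or of WEKA.

```python
from bisect import bisect_left

# Agreement scale from the questionnaire (1 = Strongly Agree ... 6 = Strongly Disagree).
LIKERT_CODES = {
    "Strongly Agree": 1,
    "Agree": 2,
    "Slightly Agree": 3,
    "Slightly Disagree": 4,
    "Disagree": 5,
    "Strongly Disagree": 6,
}

# Upper bound of each monthly income band; anything above the last bound is band 6.
INCOME_UPPER_BOUNDS = [10000, 15000, 20000, 25000, 30000]


def income_code(monthly_income):
    """Map a monthly income to its band code (1 to 6)."""
    return bisect_left(INCOME_UPPER_BOUNDS, monthly_income) + 1


print(LIKERT_CODES["Slightly Disagree"])
print(income_code(10000), income_code(14000), income_code(30001))
```

Coding every answer as a number in this way is what allows a distance-based method such as K-means to be applied to the survey.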
The store wants to cluster the market based on the above attributes. This will help the store
cater effectively to the demands of the most lucrative segment.
This tutorial demonstrates how WEKA can be used to do this.
The data collected in the spreadsheet is converted into .csv format. The attributes are named
“Var 1” to “Var 19”. The data file contains 100 instances.
The WEKA Tutorial Steps:
1. Click the “Explorer” button in the WEKA GUI Chooser to start the software.
2. Then click on “Preprocess” -> “Open file” to select the data file to be opened.
Once we click on “Open”, the data file is loaded.
The window will look like this:
The bottom right-hand corner shows the distribution of data values for Variable 1. The small
window above it shows the mean and standard deviation of the variable. In this way we can
inspect the distribution of each variable in turn.
3. To see the distributions of all variables at once, we can click the “Visualize All” button,
which shows the distribution of every variable in the sample population.
4. The main window also has an “Edit data” option, where we can correct the data of
the .csv file if the data set contains any errors.
5. For clustering, we select the “Cluster” tab in the main window and click the “Choose”
button to select K-means clustering. We then click on the text box beside “Choose” to
customize the clustering settings. The settings used for this clustering are
shown in the snapshot below.
The distance function used is the Euclidean distance and the number of clusters is set to 5.
6. Then we click on the “Start” button to do the analysis. The result will be displayed on
the right hand side panel.
7. We can view the result in a separate window by right-clicking the last result set (inside
the "Result list" panel on the left) and selecting "View in separate window" from the
pop-up menu.
The result that is displayed is given in the snapshot below:
It shows that 8 iterations were needed to arrive at the result.
There are 5 clusters. 3 % of the population lies in the first cluster, 22 % in the second
cluster, 23 % in the third cluster, 34 % in the fourth cluster and 18 % in the fifth cluster.
So cluster 3 (the fourth cluster, since WEKA numbers clusters from 0) has the largest share of
the population.
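The percentages follow directly from counting how many instances were assigned to each cluster. As a sketch, assuming the per-instance assignments are available as a list of cluster numbers (rebuilt here from the percentages reported above, since with 100 instances each percentage equals a count):

```python
from collections import Counter


def cluster_shares(labels):
    """Percentage of instances assigned to each cluster number."""
    counts = Counter(labels)
    total = len(labels)
    return {k: 100 * counts[k] / total for k in sorted(counts)}


# Assignments reconstructed from the reported shares (3, 22, 23, 34, 18 out of 100).
labels = [0] * 3 + [1] * 22 + [2] * 23 + [3] * 34 + [4] * 18

shares = cluster_shares(labels)
largest = max(shares, key=shares.get)
print(shares)
print("largest cluster:", largest)
```

The name cluster_shares is illustrative only; WEKA prints the same breakdown at the end of its clustering output.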
Cluster 3 characteristics
They do not make unplanned casual wear purchases for their own consumption.
They sometimes make unplanned casual wear purchases for others' consumption.
They shop to change their mood.
They slightly agree that they buy more casual wear unplanned when they feel happy.
They slightly disagree that they buy more casual wear unplanned when they feel unhappy.
They slightly disagree that they buy more casual wear unplanned when they see sales
promotion such as Buy 1 Get 1 free.
They slightly agree that they buy more casual wear unplanned when they see sales
promotion such as cash rebate.
They slightly disagree that they buy more casual wear unplanned when they see sales
promotion such as complimentary accessories.
They slightly disagree that they buy more casual wear unplanned when they see sales
promotion such as complimentary vouchers.
They slightly disagree that they buy more casual wear unplanned when they see sales
promotion such as Prize Draws.
Purchasers are mostly female.
Purchasers are 16 to 25 years old.
Income is on the higher side of the range 10001 to 15000 (approximately 14000).
In this way we can understand the different kinds of customers in the different clusters and
their behaviour. This helps the store manager take important decisions regarding marketing
activities, sales promotions, etc., and target the product offering at a particular segment.
Other kinds of clustering that WEKA offers include:
1. Farthest First Cluster
2. Filtered Clusterer
3. Hierarchical Clusterer
4. Make Density Based Clusterer
Classification using WEKA
Classification with decision trees (also known as classification trees) is a data mining
technique that creates a step-by-step guide for determining the output for a data entry. Each
node in the tree represents a point where a decision must be made based on the input data. We
move from node to node, applying one decision criterion after another, until we reach a leaf
that tells us the predicted output.
This model can be applied to any unknown data instance to predict which class that instance
falls into. That is the advantage of classification trees: they do not require a lot of
information about the data to create a tree that can be very accurate and very informative.
In this tutorial we use the J48 decision tree (WEKA's implementation of the C4.5 algorithm) to
build the decision structure.
Problem Statement:
A bank is analyzing the records of a number of individuals to determine whether they can be
given a loan. (The data set used here is secondary data collected from a free data source.)
The bank considers the following attributes:
Age
Education (1 - Middle School, 2 - High School, 3 - Graduation, 4 - Post graduation)
Employment (1 - Not employed, 2 - Student, 3 - Business, 4 - Post graduation)
Income
Credit (1 and 2 - Bad credit rating, 3 and 4 - Good credit rating)
Default (Yes or No)
The WEKA Tutorial Steps:
1. Click the “Explorer” button in the WEKA GUI Chooser to start the software.
2. Then click on “Preprocess” -> “Open file” to select the file to be opened.
3. Next, we select the "Classify" tab and click the "Choose" button to select the J48
classifier. We then click on the text box beside "Choose" to review the settings (here we
have kept the defaults). With the default settings J48 performs some pruning (using the
subtree raising approach) but does not perform reduced-error pruning.
4. To learn more about the settings, we can click the “More” button in the top right-hand
corner, which describes each option in detail.
5. Under "Test options" in the main panel we select 10-fold cross-validation as our
evaluation approach, as we do not have a separate evaluation data set.
6. We now click "Start" to generate the model. The ASCII version of the tree as well as
evaluation statistics will appear in the panel.
7. We can view this information in a separate window by right-clicking the last result set
(inside the "Result list" panel on the left) and selecting "View in separate window"
from the pop-up menu.
The number of leaves is 4 and the size of tree is 7.
The confusion matrix shows how many are correctly categorized and how many are wrongly
categorized. Here we see that out of the data set of 50 entries, 37 are correctly categorized and so
the accuracy of our model is 74 %.
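The evaluation described above has two parts: splitting the data into 10 folds, and reading the accuracy off the confusion matrix. The sketch below illustrates both. It is not WEKA's procedure (it omits any shuffling or stratification WEKA may apply), and the individual cells of the confusion matrix are invented; only the totals, 37 correct out of 50, come from the tutorial.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def accuracy(confusion):
    """Correctly classified instances (the diagonal) divided by all instances."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# 50 instances in 10 folds: each model is trained on 45 and tested on 5.
folds = list(k_fold_indices(50, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))

# Hypothetical cell values; rows are actual classes, columns are predicted classes.
confusion = [[20, 6],
             [7, 17]]
acc = accuracy(confusion)
print(f"{acc:.0%}")
```

Because every instance is used for testing exactly once, the confusion matrix covers all 50 entries even though no separate test set exists.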
8. WEKA also lets us view a graphical rendition of the classification tree. This can be done
by right-clicking the last result set (as before) and selecting "Visualize tree" from the
pop-up menu.
The tree shows that the bank will consider an applicant for a loan if the age is less than 30,
so that repayment is more assured. It then looks at the credit rating of the individual and
gives the loan if it is more than 2. If the rating is 2 or less, the bank looks at the age
again: if it is less than 22, the bank grants the loan.
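Read this way, the tree can be written as a small nested structure and walked from the root to a leaf. The sketch below is one interpretation of the description above, not the exact J48 output: the outcome labels "grant" and "refuse" are invented, a credit rating of exactly 2 is assumed to count as bad, and every branch the text does not mention is assumed to end in "refuse".

```python
# Internal node: (attribute, threshold, subtree_if_below, subtree_otherwise).
# Leaf: a string naming the outcome (labels are assumptions for this sketch).
TREE = (
    "age", 30,
    (
        "credit", 3,                     # below 3 means a rating of 1 or 2
        ("age", 22, "grant", "refuse"),  # bad credit: look at the age again
        "grant",                         # rating more than 2
    ),
    "refuse",                            # assumed outcome for age 30 and above
)


def decide(tree, applicant):
    """Walk from the root to a leaf, testing one attribute at each node."""
    while not isinstance(tree, str):
        attribute, threshold, below, otherwise = tree
        tree = below if applicant[attribute] < threshold else otherwise
    return tree


print(decide(TREE, {"age": 25, "credit": 4}))
print(decide(TREE, {"age": 21, "credit": 1}))
print(decide(TREE, {"age": 25, "credit": 2}))
```

This structure has three decision nodes and four leaves, seven nodes in total, which matches the tree size and number of leaves reported by WEKA.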
References
Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining: Practical Machine Learning Tools and
Techniques, 3rd edition, Morgan Kaufmann.
The University of Waikato, WEKA Manual for Version 3-6-2.