1. IT for Business Intelligence
Data Mining Techniques: Classification and
Regression Using WEKA
A.Kranthikumar (10BM60001)
2. Classification via decision trees using WEKA
Problem:
A bank is introducing a new financial product and wants to classify new
customers according to whether or not they are likely to buy it. The bank
already has data on its existing clients, including whether each of them is
interested in buying the new product.
Classification is a statistical technique that assigns a new client to one of
the existing groups. It builds a model from the available training data (the
existing clients) and then classifies new data using that model.
Steps to do classification in WEKA
Step 1: Create a data file in ARFF or CSV format. WEKA can read both
formats. Here we use a CSV file, Bank.csv.
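To make the two formats concrete, the sketch below converts a small CSV table into ARFF text. The column names (age, income, buys) and the sample rows are made up for illustration and are not the contents of Bank.csv; the type guess (numeric if every value parses as a number, nominal otherwise) is a simplification.

```python
import csv
import io

def _is_number(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(_is_number(v) for v in column):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the distinct values in braces.
            values = sorted(set(column))
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

# Hypothetical sample data, not the real Bank.csv.
sample_csv = """age,income,buys
25,30000,no
47,82000,yes
36,51000,yes
"""
arff_text = csv_to_arff(sample_csv, "bank")
print(arff_text)
```

The ARFF header declares each attribute and its type once, which is the main difference from CSV, where the types have to be inferred from the values.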
Step 2: Open the WEKA application. This shows the following screen.
Now click on the Explorer button. This opens the following window.
3. Step 3: Loading data into WEKA.
To do that, click on the “Open file” button and browse for the Bank.csv file.
WEKA then lists all the attributes, as shown in the figure below.
4. Step 4: View the data
In the selected attribute panel you can see the name and type of the
selected attribute and the values it takes.
You can also visualize the frequency distribution of all the attributes at once
by clicking on the “Visualize All” button. It shows the following screen.
This view shows the range of the data and the frequency distribution for each
attribute; the selected attribute panel reports summary statistics such as the
minimum, maximum and mean. For example, the value of age in our case
ranges from 18 to 67 with an average of 42.5.
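The same kind of summary can be reproduced outside WEKA. The sketch below computes the range, mean, median and a coarse frequency distribution for one numeric attribute; the age values are invented for illustration and are not taken from Bank.csv.

```python
import statistics
from collections import Counter

# Hypothetical ages, chosen only to illustrate the calculation.
ages = [18, 25, 33, 42, 51, 60, 67]

summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
}

# Frequency distribution in 10-year bins, keyed by the lower bin edge.
bins = Counter((age // 10) * 10 for age in ages)

print(summary)
for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {bins[lower]}")
```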
Step 5: Classify the training data
To do this, select the Classify tab, which shows the following screen.
5. Then click on the “Choose” button and select the J48 algorithm, which is
under the “trees” node. This shows the following screen.
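J48 is WEKA's implementation of the C4.5 decision tree learner, which chooses the attribute to split on using an information-gain-based criterion (gain ratio). The sketch below computes plain information gain, the quantity that gain ratio is built on, for a toy data set; the attribute names (savings, married, buys) and rows are hypothetical and this is not J48 itself.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical customers: "savings" separates the classes, "married" does not.
rows = [
    {"savings": "yes", "married": "yes", "buys": "yes"},
    {"savings": "yes", "married": "no", "buys": "yes"},
    {"savings": "no", "married": "yes", "buys": "no"},
    {"savings": "no", "married": "no", "buys": "no"},
]

gain_savings = information_gain(rows, "savings", "buys")
gain_married = information_gain(rows, "married", "buys")
print(gain_savings, gain_married)
```

A tree learner of this kind would split on the attribute with the higher score first, here savings, and then repeat the process within each branch.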
6. Step 6: Run the classification Algorithm
Select the dependent variable (the class attribute) to be predicted and
click on “Start”.
The output appears in the classifier output panel and includes an ASCII
version of the tree.
This is difficult to read. To view the output as a tree, right-click on the
trees.J48 entry in the result list and select the “Visualize tree” option.
Right-clicking again in the tree window and selecting the full screen
option shows the following screen.
Step 7: Analyze the model created by existing data
From the classifier output we can see that the classification accuracy of the
model is 89%.
This means that the model classified 89% of the evaluated instances
correctly. If new customers resemble the existing ones, we can expect the
model to predict a new customer's buying decision correctly roughly 89% of
the time.
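Classification accuracy is simply the share of instances whose predicted class matches the actual class. The sketch below shows the calculation on ten made-up labels; the values are illustrative and unrelated to the 89% reported by WEKA.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical labels: nine of the ten predictions are correct.
actual = ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "no", "no", "no", "yes", "no", "yes", "no"]

score = accuracy(actual, predicted)
print(f"Accuracy: {score:.0%}")
```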
Step 8: Test the New customer data
Create your new customer data in ARFF or CSV format with the same attributes
as the training data.
Now supply the data by selecting the “Supplied test set” radio button and
clicking on “Set” to browse for the new data set.
7. Then click on the “Start” button. WEKA rebuilds the tree from the training
data and applies it to the supplied data set.
Save the classification result as ARFF. This file contains a copy of the new
instances along with an additional column for the predicted value. The result
will look like the following.
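The saved result is essentially the new customer table with one extra column. The sketch below mimics that by applying a small hand-written rule set to new rows and appending a "predicted" column. The rules, thresholds and column names are hypothetical stand-ins and are not the tree that J48 learned from the bank data.

```python
import csv
import io

def predict(row):
    """Hypothetical two-level decision tree, for illustration only."""
    if float(row["income"]) > 50000:
        return "yes"
    if int(row["age"]) < 30:
        return "no"
    return "yes"

def classify_new_customers(csv_text):
    """Return the input CSV with an added 'predicted' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out,
                            fieldnames=list(reader.fieldnames) + ["predicted"],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["predicted"] = predict(row)
        writer.writerow(row)
    return out.getvalue()

# Hypothetical new customers.
new_customers = "age,income\n28,30000\n45,72000\n"
result_csv = classify_new_customers(new_customers)
print(result_csv)
```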
8. Regression Using WEKA
Problem: The idea is to find out how CPU performance is related to
attributes such as machine cycle time, minimum main memory, cache memory, etc.
Regression is a statistical tool that helps in finding out how the dependent
variable (CPU performance) is related to the independent attributes.
Steps to do Regression in WEKA
Step 1: Create the data file and open WEKA in the same way as we did for
classification.
Step 2: Load the regression data file CPU.arff into WEKA.
Click on “Open file” and browse for the file. This shows the following screen.
Step 3: Run the regression
Click on the Classify tab and choose “Linear Regression” from the
“functions” node. This shows the following screen.
9. Click on “Start”. The classifier output panel then shows the output, which
includes a regression equation.
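To show where a regression equation comes from, the sketch below fits a straight line to one predictor by ordinary least squares. This is a simplified single-variable illustration, not WEKA's Linear Regression implementation, which handles several attributes at once; the data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```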
10. Interpretation of the output:
From the output you can see that CPU performance depends most strongly on
CHMAX, followed by CACHE.
The high correlation coefficient of 0.912 in the output indicates that the
predicted values track the actual values closely, which suggests that the
dependent variable is strongly associated with the independent variables.
We can also estimate the performance of a new CPU from the regression
equation if we have the values of its attributes.
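The correlation coefficient compares the values predicted by the model with the actual values. A minimal sketch of the Pearson correlation is given below; the actual and predicted numbers are invented for illustration and do not come from CPU.arff.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical actual and predicted performance values.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]

r = pearson(actual, predicted)
print(f"Correlation coefficient: {r:.3f}")
```

A value close to 1 means the predictions rise and fall with the actual values; a value near 0 would mean the model explains little of the variation.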
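Applying a regression equation to a new machine amounts to multiplying each attribute value by its coefficient and adding the intercept. In the sketch below the coefficients, the intercept and the attribute values are hypothetical placeholders; substitute the numbers from your own WEKA output.

```python
def predict_performance(attributes, coefficients, intercept):
    """Evaluate a linear regression equation for one instance."""
    return intercept + sum(coefficients[name] * attributes[name]
                           for name in coefficients)

# Hypothetical coefficients, NOT the values produced by WEKA.
coefficients = {"CHMAX": 1.5, "CACHE": 0.6}
intercept = -50.0

# Hypothetical attribute values for a new CPU.
new_cpu = {"CHMAX": 32, "CACHE": 64}

estimate = predict_performance(new_cpu, coefficients, intercept)
print(f"Estimated performance: {estimate:.1f}")
```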