ITB WEKA Tutorial


Data Mining Techniques using WEKA for
   Clustering (K-Means), and
   Classification (J48 Decision Tree)

VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR

                     In partial fulfillment
            Of the requirements for the degree of
       MASTER OF BUSINESS ADMINISTRATION




                             SUBMITTED BY:

                             Prabhat Agarwal        10BM60059

                             VGSOM, IIT KHARAGPUR
About WEKA
WEKA (Waikato Environment for Knowledge Analysis) is machine learning software written
in Java and developed at the University of Waikato, New Zealand. It is a collection of
machine learning algorithms for data mining tasks. The algorithms can either be applied directly
to a dataset (through the WEKA GUI or its command line) or called from your own Java code.
WEKA contains tools for data pre-processing, classification, regression, clustering, association
rules, and visualization. It is also well-suited for developing new machine learning schemes.
WEKA is open source software issued under the GNU General Public License.

WEKA is a useful tool for business research. It helps managers find trends in past data,
consumer surveys, and similar sources, and so prepares them to take better decisions. It also
spares them from carrying out complex mathematical computations by hand.

The WEKA GUI Chooser provides a starting point for launching WEKA’s main GUI (Graphical
User Interface) applications and supporting tools. It has four buttons, one for each of the
four major WEKA applications:

    1. Explorer – An environment for exploring data with WEKA. It gives access to all the
       facilities through menu selection.
    2. Experimenter – An environment for performing experiments and conducting statistical
       tests between learning schemes.
    3. Knowledge Flow – Supports the same functions as the Explorer, but with a drag-and-drop
       interface. It also supports incremental learning.
    4. Simple CLI – Provides a simple command-line interface that allows direct execution of
       WEKA commands on operating systems that do not provide their own command-line
       interface.
This tutorial describes two data mining techniques:
   1. Clustering (K-Means)
   2. Classification with decision trees (J48 tree)
Clustering using WEKA

Cluster analysis, or clustering, means assigning a set of objects to homogeneous groups
(called clusters) so that the objects in the same cluster are more similar (in some sense or
another) to each other than to those in other clusters. Clustering is a main task of
exploratory data mining and a common technique for statistical data analysis used in many fields.


There are two major types of clustering techniques:


   1. Hierarchical Clustering
   2. Non-Hierarchical Clustering (for example, K-means Clustering)


HIERARCHICAL CLUSTERING - Some measure of distance (usually Euclidean or squared
Euclidean) is used to find the distances between all pairs of objects to be clustered. We start with
every object in its own cluster, so the number of clusters equals the number of data points. The two
closest objects are joined to form a cluster. The process then continues: points keep joining
existing clusters (because they are closest to an existing cluster), and clusters join other
clusters, based on the shortest-distance criterion. In this way a range of possible solutions is
formed, from an n-cluster solution at the beginning to a single-cluster solution at the end.
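
The merging procedure described above can be sketched in a few lines of Python. This is only an illustration and not WEKA's implementation: the function names and the five sample points are made up, and the distance between two clusters is assumed to be the shortest Euclidean distance between any two of their members.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(c1, c2, points):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Return every solution from n clusters down to a single cluster."""
    clusters = [[i] for i in range(len(points))]  # each object starts alone
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]], points),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append([list(c) for c in clusters])
    return history

# Made-up sample data: two tight pairs and one isolated point.
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 5.5), (9.0, 1.0)]
history = agglomerate(points)
for level in history:
    print(len(level), "clusters:", level)
```

Each entry of `history` is one of the possible solutions, so the analyst can pick the level whose number of clusters makes most sense for the problem.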



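NON-HIERARCHICAL (K-MEANS) CLUSTERING - We have to specify in advance the number of
clusters we want the data set to be divided into, based on a hypothesis that the objects will
group into a certain number of clusters. K-means then assigns each object to the nearest cluster
centre, recomputes each centre as the mean of its members, and repeats until the assignments
stop changing.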
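
A minimal Python sketch of the K-means idea is given below. It is an illustration under stated assumptions, not WEKA's code: the function names and sample points are invented, and the first k points are used as starting centres only to keep the example repeatable (WEKA chooses its own starting centres, so its iteration counts will differ).

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iterations=100):
    """Return (centroids, assignments, iterations) for k clusters."""
    # Assumption for this sketch: the first k points are the starting centres.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for iteration in range(1, max_iterations + 1):
        # Assign every point to its nearest centre.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Recompute each centre as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments, iteration

# Made-up sample data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, assignments, iterations = k_means(points, k=2)
print(assignments, iterations)
```

The number of clusters `k` is the only thing we must decide beforehand, which is exactly the hypothesis mentioned above.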




This tutorial demonstrates K-means clustering. It uses a primary data set from a survey
conducted by a major apparel store to understand buyer behavior. The data was collected from
100 individuals.
Problem Statement:
A major apparel store (name not disclosed) conducted a survey to understand buyer behavior
in purchasing items from the store. The survey was filled in by people visiting the stores, who
were selected at random to reduce bias in the data.
The questionnaire was a set of 7 questions on factors that the store feels may alter buyer
behavior in making purchases. Respondents had to agree or disagree on a six-point scale
(1 = Strongly Agree, 2 = Agree, 3 = Slightly Agree, 4 = Slightly Disagree, 5 = Disagree,
6 = Strongly Disagree).


The Questions in the data set are:


   1. Please rate your frequency in making unplanned casual wear purchase for:
             Own consumption
             Others' consumption

   2. How strongly do you agree with the following sentences:
             I shop to change my mood
             I tend to buy more casual wear unplanned when I feel happy
             I tend to buy more casual wear unplanned when I feel unhappy

   3. I tend to buy more casual wear unplanned when I see sales promotion such as:
             Buy 1 Get 1 free
             Cash rebate
             Complimentary accessories (ex: belt, bracelet, necklace)
             Complimentary vouchers
             Prize draws
             Joint promotions (ex: specific movie ticket given away with purchase of certain
             brand of casual wear)
             Buy 1 Get the next one at 50 % off

   4. I tend to buy more casual wear unplanned when I see sales promotion such as:
             50 % discount
             20 % discount
             Member discount period
             Storewide discount

   5. Gender

   6. Age

   7. Monthly income range:
             10000 & below    (represented by 1)
             10001 to 15000   (represented by 2)
             15001 to 20000   (represented by 3)
             20001 to 25000   (represented by 4)
             25001 to 30000   (represented by 5)
             Above 30000      (represented by 6)




   A snapshot of the questionnaire is also included.
The store wants to cluster the market based on the above attributes. This will help the store
cater effectively to the demands of the most lucrative segment.
This tutorial demonstrates how WEKA can be used to do this.
The data collected in the spreadsheet is converted into .csv format. The 19 attributes (one per
questionnaire item) are named “Var 1” to “Var 19”. The data file contains 100 instances.
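
Before loading such a file into WEKA, it can be worth checking it quickly. The Python sketch below counts attributes and instances and computes the mean and standard deviation of each attribute. The three-column sample text is made up for illustration (the survey file itself is not reproduced here), the function name is hypothetical, and the sample standard deviation is an assumption that may not match the formula WEKA uses.

```python
import csv
import io
import statistics

def summarize_csv(text):
    """Return (attribute names, number of instances, {name: (mean, std dev)})."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # first row holds the attribute names
    rows = [[float(value) for value in row] for row in reader if row]
    summary = {}
    for name, column in zip(header, zip(*rows)):
        # Assumption: sample standard deviation (n - 1 in the denominator).
        summary[name] = (statistics.mean(column), statistics.stdev(column))
    return header, len(rows), summary

# Made-up stand-in for the survey file, with 3 attributes and 3 instances.
sample = "Var 1,Var 2,Var 3\n1,4,2\n2,5,2\n3,6,5\n"
header, n_instances, summary = summarize_csv(sample)
print(header, n_instances)
for name, (mean, std_dev) in summary.items():
    print(name, round(mean, 3), round(std_dev, 3))
```

For the survey data we would expect 19 attribute names and 100 instances; any other count suggests a problem in the spreadsheet export.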
The WEKA Tutorial Steps:

   1. Click the “Explorer” button in the WEKA GUI Chooser to start the Explorer.
   2. Then click on “Preprocess” -> “Open file” to select the data file to be opened.




Once we click on “Open”, the data file is loaded.
The window will look like this:
The bottom right-hand corner shows the distribution of data values for Variable 1. The small
window above it shows the mean and standard deviation of the variable. In this way we can see
the distribution of each variable.
   3. To see the distributions of all variables at one go, we can click the
      “Visualize All” button.
   4. The main window also has an “Edit data” option, where we can edit the data loaded from
      the .csv file if the data set contains any errors.




   5. For clustering, we select the “Cluster” tab in the main window and click the “Choose”
      button to select K-means clustering (SimpleKMeans). We then click the text box beside
      “Choose” to customize the clustering settings. The settings used here are shown in the
      snapshot below.
The distance function used is the Euclidean distance and the number of clusters is 5.

   6. Then we click on the “Start” button to run the analysis. The result will be displayed in
      the right-hand panel.
   7. We can view the result in a separate window by right-clicking the last result set (inside
      the "Result list" panel on the left) and selecting "View in separate window" from the
      pop-up menu.


   The result is shown in the snapshot below:
It shows that K-means needed 8 iterations to arrive at the result.

There are 5 clusters. 3 % of the population lies in the first cluster, 22 % in the second
cluster, 23 % in the third cluster, 34 % in the fourth cluster, and 18 % in the fifth cluster.

WEKA numbers clusters from 0, so the fourth cluster, labelled cluster 3, has the largest population.
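
The shares above follow directly from the cluster assignments. The short Python sketch below rebuilds a list of assignments that matches the reported shares for the 100 instances and recovers the percentages and the largest cluster. The function name is hypothetical, and the order of the assignments is not taken from WEKA's output.

```python
from collections import Counter

def cluster_shares(assignments):
    """Return ({cluster: percentage of instances}, label of the largest cluster)."""
    counts = Counter(assignments)
    total = len(assignments)
    shares = {cluster: 100.0 * n / total for cluster, n in sorted(counts.items())}
    largest = max(counts, key=counts.get)
    return shares, largest

# Instance counts implied by the reported shares of the 100 respondents,
# with clusters labelled from 0 as WEKA does.
reported_counts = {0: 3, 1: 22, 2: 23, 3: 34, 4: 18}
assignments = [cluster for cluster, n in reported_counts.items() for _ in range(n)]
shares, largest = cluster_shares(assignments)
print(shares)
print("largest cluster:", largest)
```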

Cluster 3 characteristics
       They do not make unplanned casual wear purchases for own consumption.
       They sometimes make unplanned casual wear purchases for others' consumption.
       They shop to change their mood.
       Slightly agree that they buy more casual wear unplanned when happy.
       Slightly disagree that they buy more casual wear unplanned when they feel unhappy.
       Slightly disagree that they buy more casual wear unplanned when they see sales
       promotion such as Buy 1 Get 1 free.
       Slightly agree that they buy more casual wear unplanned when they see sales
       promotion such as cash rebate.
       Slightly disagree that they buy more casual wear unplanned when they see sales
       promotion such as complimentary accessories.
       Slightly disagree that they buy more casual wear unplanned when they see sales
       promotion such as complimentary vouchers.
       Slightly disagree that they buy more casual wear unplanned when they see sales
       promotion such as prize draws.
       Purchasers are mostly female.
       Purchasers are 16 to 25 years old.
       Income is at the higher end of the 10001 to 15000 range (approximately 14000).




In this way we can understand the different kinds of customers in the different clusters and their
behaviour. This helps the store manager take important decisions regarding marketing
activities, sales promotions, etc., and target product offerings at a particular segment.

Other kinds of clustering that WEKA provides:

   1. Farthest First
   2. Filtered Clusterer
   3. Hierarchical Clusterer
   4. Make Density Based Clusterer
Classification using WEKA

Classification with decision trees (also known as classification trees) is a data mining technique
that creates a step-by-step guide for determining the output for a data entry. The nodes in the
tree represent points where a decision must be made based on the input data. From each node we
move to the next according to its decision criterion, and so on until we reach a leaf that tells us
the predicted output.

The model can be applied to any unknown data instance to predict the class into which
that instance falls. An advantage of classification trees is that they do not require a lot of
information about the data to create a tree that can be very accurate and very informative.

In this WEKA tutorial we have used the J48 decision tree to form a decision structure.

Problem Statement:

A bank is analyzing the data entries of some individuals to determine whether they can be given
a loan or not. (The data set used here is secondary data collected from a free data source.)
The following attributes are considered by the bank.

       Age
       Education - (1 - Middle School, 2 - High School, 3 - Graduation, 4 - Post graduation)
       Employment - (1 - Not employed, 2 - Student, 3 - Business, 4 - Post graduation)
       Income
       Credit - (1 and 2 - Bad credit rating, 3 and 4 - Good credit rating)
       Default - (Yes and No)



The WEKA Tutorial Steps:

   1. Click the “Explorer” button in the WEKA GUI Chooser to start the Explorer.
   2. Then click on “Preprocess” -> “Open file” to select the file to be opened.
   3. Next, we select the "Classify" tab and click the "Choose" button to select the J48
      classifier. Clicking the text box beside "Choose" opens the settings
      (here we have kept the default settings). By default J48 does perform some
      pruning (using the subtree raising approach), but does not perform reduced-error pruning.




   4. To learn more about the settings, we can click the “More” button in the top right-hand
      corner, which describes the different options.
   5. Under "Test options" in the main panel we select 10-fold cross-validation as our
      evaluation approach, as we do not have a separate evaluation data set.
   6. We now click "Start" to generate the model. The ASCII version of the tree as well as
      the evaluation statistics will appear in the panel.
   7. We can view this information in a separate window by right-clicking the last result set
      (inside the "Result list" panel on the left) and selecting "View in separate window"
      from the pop-up menu.
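
To see what 10-fold cross-validation does with a small data set, the Python sketch below splits 50 instance indices into 10 folds and uses each fold once for testing while the other nine are used for training. It is a plain, unstratified split written for illustration, with hypothetical function and parameter names; WEKA's own procedure may differ in details such as how the folds are drawn.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]  # deal indices into k folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(50, k=10))
for number, (train, test) in enumerate(splits, start=1):
    print("fold", number, "train:", len(train), "test:", len(test))
```

Every instance is used for testing exactly once, so the reported accuracy is based on all instances even though no separate evaluation set exists.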




The number of leaves is 4 and the size of tree is 7.

The confusion matrix shows how many are correctly categorized and how many are wrongly
categorized. Here we see that out of the data set of 50 entries, 37 are correctly categorized and so
the accuracy of our model is 74 %.
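
The accuracy figure is the sum of the diagonal of the confusion matrix divided by the total number of entries. The Python sketch below shows the calculation. Only the totals (37 correct out of 50) come from the result above; the individual cell values are made up for illustration, since the matrix itself appears only in the snapshot.

```python
def accuracy(confusion_matrix):
    """Share of instances on the diagonal, i.e. correctly categorized."""
    correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
    total = sum(sum(row) for row in confusion_matrix)
    return correct / total

# Hypothetical 2 x 2 matrix (rows = actual class, columns = predicted class).
# The cells are invented so that the diagonal sums to 37 and the total to 50.
example_matrix = [
    [20, 6],
    [7, 17],
]
model_accuracy = accuracy(example_matrix)
print(f"accuracy = {model_accuracy:.0%}")
```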
   8. WEKA also lets us view a graphical rendition of the classification tree. This can be done
      by right-clicking the last result set (as before) and selecting "Visualize tree" from the
      pop-up menu.




The tree shows that the bank considers an applicant for a loan only if the age is less than 30,
so that there is some assurance of repayment. It then looks at the credit rating of the individual
and grants the loan if the rating is more than 2. If the rating is 2 or lower, the bank looks at the
age again and grants the loan only if the age is less than 22.
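
Read as rules, the tree can be written as a small Python function. This is only one reading of the description above: the function name is hypothetical, and the exact handling of the boundary values (age exactly 30 or 22, rating exactly 2) is an assumption, because the tree itself is shown only in the snapshot.

```python
def grant_loan(age, credit):
    """Follow the three tests described above and return True if a loan is granted."""
    if age >= 30:
        return False  # applicants aged 30 or more are not considered
    if credit > 2:
        return True   # good credit rating (3 or 4)
    return age < 22   # bad credit rating: only the youngest applicants qualify

# A few made-up applicants to trace through the rules.
for age, credit in [(35, 4), (25, 3), (25, 1), (20, 1)]:
    print(age, credit, grant_loan(age, credit))
```

This reading has three tests and four possible outcomes, which agrees with the reported tree size of 7 with 4 leaves.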
References


    Data Mining by Ian H. Witten, Eibe Frank and Mark A. Hall (3rd edition, Morgan
    Kaufmann publisher)

    WEKA Manual for version 3-6-2 by The University of Waikato

ITB tutorial WEKA Prabhat Agarwal

  • 1. ITB WEKA Tutorial Data Mining Techniques using WEKA for Clustering (K-Means), and Classification (J48 Decision Tree) VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR In partial fulfillment Of the requirements for the degree of MASTER OF BUSINESS ADMINISTRATION SUBMITTED BY: Prabhat Agarwal 10BM60059 VGSOM, IIT KHARAGPUR
  • 2. About WEKA Weka (Waikato Environment for Knowledge Analysis) is machine learning software written in Java and developed at the University of Waikato, New Zealand. WEKA is a collection of machine learning algorithms for data mining tasks which can either be applied directly (WEKA GUI) to a dataset or called from the Java code (WEKA CLI). WEKA contains tools for data pre- processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. WEKA is open source software issued under the GNU General Public License. WEKA is a powerful tool that helps in Business Research methods and thus empowers managers to find out the trends based on past data, consumer surveys, etc and help them prepare to take better decisions. The managers are greatly benefitted in computing complex mathematical problems through this software. The WEKA GUI Chooser provides a starting point for launching WEKA’s main GUI (Graphic User Interface) applications and supporting tools. The GUI Chooser consists of four buttons—one for each of the four major WEKA applications— 1. Explorer – Environment for exploring data with WEKA. It gives access to all the facilities using menu selection. 2. Experimenter – An environment for performing experiments and conducting statistical tests between learning schemes. 3. Knowledge Flow – It supports the same function as the Explorer but with Drag and Drop interface. It also supports Incremental learning. 4. Simple CLI – Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface. In the tutorial we have described two techniques of Data Mining. 1. Clustering (K-Means) 2. Classification Decision Trees (J48 Tree)
  • 3. Clustering using WEKA Cluster analysis or clustering means assigning a set of objects into homogenous groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. So the objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields There are two major types of clustering techniques: 1. Hierarchical Clustering 2. Non-Hierarchical Clustering or K-means Clustering HIERARCHICAL CLUSTERING - Some measure of distance (usually Euclidean or squared Euclidean) is used to find out distances between all pairs of objects to be clustered. We start with all objects in separate clusters so number of clusters is same as the number of data points. Two closest objects are joined to form a cluster. This process continues, until points keep joining to some existing clusters (because they are closest to an existing cluster), and clusters join other clusters, based on the shortest distance criterion. In this way, a range of possible solutions is formed, from n-cluster solution in the beginning, to a single cluster solution at the end. NON-HIERARCHICAL (K-MEANS) CLUSTERING - We have to specify the number of clusters we want our data set to be clustered into. We have a hypothesis that the objects will group into a certain number of clusters. In the tutorial I have made the demonstration of using K-means clustering. For this primary data- set of a survey is collected done by a major apparel store to understand the buyer behavior. The data is collected for 100 individuals.
  • 4. Problem Statement:
    A major apparel store (name not disclosed) conducted a survey to understand buyer behavior when purchasing items from the store. The questionnaire was filled in by store visitors selected at random, to keep the data free from bias. It consisted of 7 questions on factors that the store believes may influence purchase behavior. Respondents had to agree or disagree on a six-point scale (1 = Strongly Agree, 2 = Agree, 3 = Slightly Agree, 4 = Slightly Disagree, 5 = Disagree, 6 = Strongly Disagree).

    The questions in the data set are:
    1. Please rate your frequency in making unplanned casual wear purchases for:
       - Own consumption
       - Others' consumption
    2. How strongly do you agree with the following sentences:
       - I shop to change my mood
       - I tend to buy more casual wear unplanned when I feel happy
       - I tend to buy more casual wear unplanned when I feel unhappy
    3. I tend to buy more casual wear unplanned when I see sales promotion such as:
       - Buy 1 Get 1 free
       - Cash rebate
       - Complimentary accessories (ex: belt, bracelet, necklace)
       - Complimentary vouchers
       - Prize draws
       - Joint promotions (ex: specific movie ticket given away with purchase of certain brand of casual wear)
       - Buy 1 Get the next one at 50 % off
  • 5.
    4. I tend to buy more casual wear unplanned when I see sales promotion such as:
       - 50 % discount
       - 20 % discount
       - Member discount period
       - Storewide discount
    5. Gender
    6. Age
    7. Monthly income range:
       - 10000 & below (represented by 1)
       - 10001 to 15000 (represented by 2)
       - 15001 to 20000 (represented by 3)
       - 20001 to 25000 (represented by 4)
       - 25001 to 30000 (represented by 5)
       - Above 30000 (represented by 6)

    A snapshot of the questionnaire is also included.
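The income coding above can be written as a small lookup. The following is only a sketch of that coding scheme; the function name `income_code` is made up for illustration and is not part of the survey or of WEKA.

```python
def income_code(monthly_income):
    """Map a monthly income to the 1-6 code used in the questionnaire."""
    upper_bounds = [10000, 15000, 20000, 25000, 30000]  # codes 1 to 5
    for code, bound in enumerate(upper_bounds, start=1):
        if monthly_income <= bound:
            return code
    return 6  # above 30000


for income in (8000, 14000, 30000, 45000):
    print(income, "->", income_code(income))
```

Storing the bracket as a number is what lets a distance-based method such as K-means treat income like the other numeric answers.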
  • 6. The store wants to segment the market based on the above attributes. This will help the store cater effectively to the demands of the most lucrative segment. In this tutorial we demonstrate how WEKA can be used to do this. The data collected in the spreadsheet is converted into .csv format. The attributes are named “Var 1” to “Var 19”. The data file contains 100 instances.
  • 7. The WEKA Tutorial Steps:
    1. Click the “Explorer” button in the WEKA GUI Chooser to start the Explorer.
    2. Then click on “Preprocess” -> “Open file” to select the data file to be opened. Once we click on “Open”, the data file is loaded. The window will look like this:
  • 8. The bottom right-hand corner shows the distribution of data values for Variable 1. The small window above it shows the mean and standard deviation of the variable. In this way we can inspect the distribution of each variable in turn.
    3. However, if we want to see the distributions of all variables at once, we can click “Visualize All” to view them for the whole sample population.
  • 9.
    4. The main window also has an “Edit data” option, where we can edit the data of the .csv file if there is an error in the data set.
    5. For clustering, we select the “Cluster” tab in the main window and click the “Choose” button to select K-means clustering (listed as SimpleKMeans). We then click the text box beside “Choose” to customize the clustering settings. The settings used for this clustering are shown in the snapshot below.
  • 10. The distance function used is the Euclidean distance, and the number of clusters to be formed is 5.
    6. Then we click the “Start” button to run the analysis. The result is displayed in the right-hand panel.
    7. We can view the result in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu. The result is shown in the snapshot below:
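To make the two settings concrete (a Euclidean distance function and a fixed number of clusters), here is a minimal K-means sketch in plain Python. It is not WEKA's implementation: the random initialization, the `seed` and `max_iterations` defaults, and the toy points are assumptions chosen for illustration, and the toy run uses 2 clusters rather than 5 to stay readable.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, seed=10, max_iterations=500):
    """Basic K-means. Returns (centroids, assignments, iterations)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved: converged
            return centroids, assignments, iteration
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments, max_iterations


# Toy data with two obvious groups (illustrative only, not the survey data).
toy_points = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
centroids, assignments, iterations = k_means(toy_points, k=2)
print("assignments:", assignments)
print("iterations:", iterations)
```

The iteration count returned here plays the same role as the number of iterations that WEKA reports in its clustering output.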
  • 11. The output shows that 8 iterations were needed to arrive at the result. There are 5 clusters: 3 % of the population lies in the first cluster, 22 % in the second, 23 % in the third, 34 % in the fourth and 18 % in the fifth. WEKA numbers clusters from 0, so cluster 3 (the fourth cluster) has the largest population.
    Cluster 3 characteristics:
    - They do not make unplanned casual wear purchases for their own consumption.
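The arithmetic behind these shares is simple enough to check by hand or in code. The sketch below assumes the 100 survey responses described earlier, so that each percentage equals an instance count; the variable names are illustrative.

```python
# Instances per cluster, indexed 0 to 4 as in the WEKA output.
# With 100 instances in total, each count equals its percentage.
cluster_sizes = [3, 22, 23, 34, 18]

total = sum(cluster_sizes)
shares = [100 * n / total for n in cluster_sizes]
largest = max(range(len(cluster_sizes)), key=lambda c: cluster_sizes[c])

for index, share in enumerate(shares):
    print(f"cluster {index}: {share:.0f} %")
print(f"largest: cluster {largest} (position {largest + 1} in the list)")
```

The zero-based index explains why the largest group is called cluster 3 even though it is the fourth one listed.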
  • 12.
    - They sometimes make unplanned casual wear purchases for others' consumption.
    - They shop to change their mood.
    - They slightly agree that they buy more casual wear unplanned when happy.
    - They slightly disagree that they buy more casual wear unplanned when they feel unhappy.
    - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as Buy 1 Get 1 free.
    - They slightly agree that they buy more casual wear unplanned when they see sales promotions such as cash rebate.
    - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as complimentary accessories.
    - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as complimentary vouchers.
    - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as prize draws.
    - Purchasers are mostly female.
    - Purchasers are 16 to 25 years old.
    - Income is at the higher end of the 10001 to 15000 range (approximately 14000).

    In this way we can understand the different kinds of customers in the different clusters and their behaviour. This helps the store manager take important decisions about marketing activities, sales promotions, etc., and target the product offering at a particular segment.

    Other clustering methods available in WEKA:
    1. Farthest First
    2. Filtered Clusterer
    3. Hierarchical Clusterer
    4. Make Density Based Clusterer
  • 13. Classification using WEKA
    Classification with decision trees (also known as classification trees) is a data mining technique that creates a step-by-step guide for determining the output of a data entry. Each node in the tree represents a point where a decision must be made based on the input data. From there we move to the next node and its decision criterion, and so on, until we reach a leaf that tells us the predicted output. The resulting model can be applied to any unknown data instance to predict which class it falls into. That is the advantage of classification trees: they do not require a lot of information about the data to create a tree that can be very accurate and very informative.

    In this WEKA tutorial we use the J48 decision tree to form a decision structure.

    Problem Statement:
    A bank is analyzing the data entries of some individuals to determine whether they can be given a loan or not. (The data set used here is secondary data collected from a free data source.) The following attributes are considered by the bank:
    - Age
    - Education (1 - Middle School, 2 - High School, 3 - Graduation, 4 - Post graduation)
    - Employment (1 - Not employed, 2 - Student, 3 - Business, 4 - Post graduation)
    - Income
    - Credit (1 and 2 - Bad credit rating, 3 and 4 - Good credit rating)
    - Default (Yes or No)

    The WEKA Tutorial Steps:
    1. Click the “Explorer” button in the WEKA GUI Chooser to start the Explorer.
    2. Then click on “Preprocess” -> “Open file” to select the file to be opened.
  • 14.
    3. Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier. Clicking the text box beside "Choose" opens the settings shown below. (Here we have kept the default settings.) The default version does perform some pruning (using the subtree raising approach), but does not perform reduced-error pruning.
    4. To learn more about the settings, we can click the “More” button in the top right-hand corner, which describes the different options in detail.
    5. Under "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach, as we do not have a separate evaluation data set.
    6. We now click "Start" to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the panel.
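The idea behind 10-fold cross-validation is that every instance is used for testing exactly once and for training in the other nine rounds. The sketch below shows only that splitting logic in plain Python; it is a simple unstratified split with made-up names, and the instance count of 50 and the `seed` value are assumptions for the example.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Assumed data set size of 50 instances, for illustration.
splits = list(k_fold_indices(50, k=10))
for number, (train, test) in enumerate(splits, start=1):
    print(f"fold {number}: train on {len(train)}, test on {len(test)}")
```

A model is built on each training part and scored on the matching test part, and the ten scores are combined into the reported evaluation statistics.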
  • 15.
    7. We can view this information in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.

    The number of leaves is 4 and the size of the tree is 7. The confusion matrix shows how many instances are correctly classified and how many are misclassified. Here, out of the 50 entries in the data set, 37 are correctly classified, so the accuracy of our model is 74 %.
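Accuracy is the number of correctly classified instances divided by the total, which for a confusion matrix means the diagonal divided by the sum of all cells. The sketch below checks the 37-out-of-50 figure and shows the general computation; the 2x2 matrix is a made-up toy example, since the per-cell counts are not given in the text.

```python
def accuracy(confusion_matrix):
    """Share of instances on the diagonal (correctly classified)."""
    correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
    total = sum(sum(row) for row in confusion_matrix)
    return correct / total


# Figures stated in the text: 37 correct out of 50 instances.
reported = 37 / 50
print(f"reported accuracy: {reported:.0%}")

# Toy matrix, invented only to exercise the function (rows are actual
# classes, columns are predicted classes). It is NOT the tutorial's matrix.
toy_matrix = [[3, 1],
              [1, 5]]
print(f"toy accuracy: {accuracy(toy_matrix):.0%}")
```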
  • 16.
    8. WEKA also lets us view a graphical rendition of the classification tree. This can be done by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.

    The tree shows that the bank will consider an applicant for a loan only if the age is less than 30, so that repayment is more likely to be guaranteed. It then looks at the credit rating of the individual and grants the loan if the rating is more than 2. If the rating is 2 or less (a bad credit rating), the bank looks at the age again: if it is less than 22, the bank will grant the loan.
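A decision tree of this kind is equivalent to a set of nested if-statements. The sketch below is one reading of the description above, not the exact tree WEKA produced: the outcomes for applicants aged 30 or more and for applicants aged 22 to 29 with a low credit rating are assumed to be "no loan", the handling of values exactly at a threshold is also an assumption, and the function name is made up.

```python
def grant_loan(age, credit):
    """One reading of the tree: 3 decision nodes and 4 leaves."""
    if age >= 30:       # node 1: applicants aged 30 or more (assumed: no loan)
        return False    # leaf 1
    if credit > 2:      # node 2: good credit rating (3 or 4)
        return True     # leaf 2
    if age < 22:        # node 3: bad credit rating, so check age again
        return True     # leaf 3
    return False        # leaf 4 (assumed: no loan)


for age, credit in [(35, 4), (25, 3), (20, 1), (25, 1)]:
    print(age, credit, "->", "loan" if grant_loan(age, credit) else "no loan")
```

Three decision nodes plus four leaves gives seven nodes in total, which is consistent with a tree that has 4 leaves and a size of 7.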
  • 17. References
    Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Morgan Kaufmann.
    The University of Waikato, WEKA Manual for Version 3-6-2.