Brief Weka Introduction
Shuang Wu
Guided by Dr. Thanh Tran
Weka
• The software: Waikato Environment for
Knowledge Analysis
– Machine learning/data mining software written in
Java (distributed under the GNU General Public License)
• The bird: an endemic bird of New Zealand
Outline
• ARFF format and loading files to Weka
• Basic preprocess and classifier Demo
• Attribute selection & Demo
• Filtering datasets & Demo
ARFF format and loading files to Weka
Attribute-Relation File Format (ARFF)
• Two distinct sections
– Header & Data
• Four data types supported
– numeric
– <nominal-specification>
– string
– date [<date-format>]
• E.g.: DATE "yyyy-MM-dd HH:mm:ss"
(http://www.cs.waikato.ac.nz/ml/weka/arff.html)
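As a concrete (hand-made, hypothetical) example, a minimal ARFF file exercising all four data types might look like this:

```
% Lines starting with % are comments.
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute note string
@attribute recorded date "yyyy-MM-dd HH:mm:ss"
@attribute play {yes, no}

@data
sunny,85,'hot and humid','2014-04-05 10:30:00',no
overcast,64,'cool morning','2014-04-05 11:00:00',yes
```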
Converting Files to ARFF
• Weka has converters for the following file
formats:
– Spreadsheet files with extension .csv.
– C4.5’s native file format with extensions .names
and .data.
– Serialized instances with extension .bsi.
– LIBSVM format files with extension .libsvm.
– SVM-Light format files with extension .dat.
– XML-based ARFF format files with extension .xrff.
(Witten, Frank & Hall, 2011)
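The converters above are built into Weka; purely to illustrate what a CSV-to-ARFF conversion produces, here is a small stdlib Python sketch (the helper name `csv_to_arff` is hypothetical, not a Weka API) that writes an ARFF header for a CSV file, treating chosen columns as nominal and the rest as numeric:

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal_cols):
    """Sketch of a CSV-to-ARFF conversion: columns listed in
    nominal_cols get a {value,...} specification, everything else
    is assumed numeric."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        if i in nominal_cols:
            values = sorted({r[i] for r in data})
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
        else:
            lines.append("@attribute %s numeric" % name)
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(r) for r in data)
    return "\n".join(lines)

csv_text = "outlook,temperature\nsunny,85\nrainy,65\n"
print(csv_to_arff(csv_text, "weather", nominal_cols={0}))
```

Weka's own CSVLoader additionally infers types from the data; here the nominal columns are declared by hand to keep the sketch short.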
Basic preprocess and classifier Demo
More information can be found here.
Attribute selection
Why Feature Selection
• Not all the features contained in the datasets
of a classification problem are useful
• Redundant or irrelevant features may even
reduce the classification performance
• Eliminating noisy and unnecessary features
can
– Improve classification performance
– Make the learning and execution processes faster
– Simplify the structure of the learned models
Feature Selection
• Two categories of feature selection
– Wrapper approaches:
• Conduct a search for the best feature subset using the learning
algorithm itself as part of the evaluation function
• A feature selection algorithm exists as a wrapper around a learning
algorithm
– Filter approaches:
• Independent of a learning algorithm
• Argued to be computationally less expensive and more general
• By considering the performance of the selected feature
subset on a particular learning algorithm, wrappers can
usually achieve better results than filter approaches
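To make the wrapper idea concrete, here is a small Python sketch (illustrative only, not Weka code): greedy forward selection wrapped around a toy 1-nearest-neighbour learner, with leave-one-out accuracy as the evaluation function:

```python
def loo_accuracy(data, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour learner
    restricted to the given feature subset (the wrapper's
    evaluation function)."""
    if not features:
        return 0.0
    correct = 0
    for i, (x, y) in enumerate(data):
        rest = [d for j, d in enumerate(data) if j != i]
        nearest = min(rest, key=lambda d: sum((x[f] - d[0][f]) ** 2
                                              for f in features))
        correct += (nearest[1] == y)
    return correct / len(data)

def forward_selection(data, n_features):
    """Greedy wrapper search: repeatedly add the feature that most
    improves the learner's estimated accuracy, stop when nothing helps."""
    selected, best = set(), 0.0
    improved = True
    while improved:
        improved = False
        for f in set(range(n_features)) - selected:
            acc = loo_accuracy(data, selected | {f})
            if acc > best:
                best, choice, improved = acc, f, True
        if improved:
            selected.add(choice)
    return selected, best

# Toy data: feature 0 determines the class, feature 1 is noise.
data = [((0.0, 0.9), 'a'), ((0.1, 0.1), 'a'),
        ((0.9, 0.8), 'b'), ((1.0, 0.2), 'b')]
print(forward_selection(data, 2))
```

Because the learner itself scores each candidate subset, the search correctly keeps only the informative feature; this repeated retraining is also why wrappers cost more than filters.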
Wrapper vs. Filter
(Kohavi & John, 1997)
Filter: one example
• One algorithm that falls into the filter approach: the
FOCUS algorithm
– Exhaustively examines all subsets of features, selecting the
minimal subset of features that is sufficient to determine
the label value for all instances in the training set.
– May introduce the MIN-FEATURES bias.
– For example, in a medical diagnosis task, a set of features
describing a patient might include the patient’s social
security number (SSN). When FOCUS searches for the
minimum set of features, it will pick the SSN as the only
feature needed to uniquely determine the label. Given
only the SSN, any induction algorithm is expected to
generalize very poorly.
(Kohavi & John, 1997)
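The SSN effect is easy to reproduce. Below is a Python sketch of a FOCUS-style exhaustive search (illustrative, not the original implementation): it returns the smallest subset consistent with the training labels, so a unique-ID feature wins immediately:

```python
from itertools import combinations

def focus(instances):
    """FOCUS-style search: return the smallest feature subset that
    suffices to determine the label of every training instance."""
    n = len(instances[0][0])
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            projected = {}
            consistent = True
            for x, y in instances:
                key = tuple(x[f] for f in subset)
                if projected.setdefault(key, y) != y:
                    consistent = False
                    break
            if consistent:
                return subset
    return tuple(range(n))

# Feature 0 is a hypothetical SSN-like unique ID; features 1-2 are
# the genuinely predictive ones. FOCUS returns the ID alone.
patients = [((101, 'hi', 'yes'), 'sick'),
            ((102, 'lo', 'yes'), 'well'),
            ((103, 'hi', 'no'), 'well'),
            ((104, 'lo', 'no'), 'well')]
print(focus(patients))
```

Dropping the ID column first forces the search to fall back on the two real features, which is exactly the generalization problem the MIN-FEATURES bias causes.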
Searching Attribute Space
• The size of the search space for n features is 2^n, so it is
impractical to search the whole space exhaustively in
most situations
• Single Feature Ranking
– A relaxed version of feature selection that only requires
the computation of the relative importance of the features
and subsequently sorting them
– Computationally cheap, but the combination of the top-
ranked features may be a redundant subset
• Feature Subset Ranking, such as
– Greedy Algorithms
– Genetic Algorithm (GA)
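Single feature ranking can be sketched in a few lines of Python (illustrative only; the crude consistency score below stands in for measures like information gain):

```python
def single_feature_score(instances, f):
    """Score one feature by how consistently each of its values maps
    to a single class label (a crude stand-in for information gain)."""
    by_value = {}
    for x, y in instances:
        by_value.setdefault(x[f], []).append(y)
    agree = sum(max(labels.count(l) for l in set(labels))
                for labels in by_value.values())
    return agree / len(instances)

def rank_features(instances):
    """Rank features individually, best first: cheap, but says nothing
    about redundancy among the top-ranked features."""
    n = len(instances[0][0])
    return sorted(range(n),
                  key=lambda f: single_feature_score(instances, f),
                  reverse=True)

data = [(('sunny', 'hot'), 'no'), (('sunny', 'mild'), 'no'),
        (('rainy', 'hot'), 'yes'), (('rainy', 'mild'), 'yes')]
print(rank_features(data))
```

Each feature is scored in isolation, which is why two highly correlated features can both end up at the top of the ranking even though one of them is redundant.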
WEKA Attribute Selection Function
• Two ways to do attribute selection:
– Normally done by searching the space of attribute
subsets, evaluating each one (Feature Subset Ranking)
• By combining one attribute subset evaluator and one search
method
– A potentially faster but less accurate approach is to
evaluate the attributes individually and sort them,
discarding attributes that fall below a chosen cutoff
point (Single Feature Ranking)
• By using one single-attribute evaluator and the ranking
method
Two Wrapper Methods in Weka
• ClassifierSubsetEval
– Uses a classifier, specified in the object editor as a
parameter, to evaluate sets of attributes on the
training data or on a separate holdout set.
• WrapperSubsetEval
– Also uses a classifier to evaluate attribute sets, but
employs cross-validation to estimate the accuracy
of the learning scheme for each set
Attribute Subset Evaluators
(Witten, Frank & Hall, 2011)
This one will be used in the Demo.
Search Methods
(Witten, Frank & Hall, 2011)
This one will be used in the Demo.
Single-Attribute Evaluators
(Witten, Frank & Hall, 2011)
Ranking Method
(Witten, Frank & Hall, 2011)
Attribute selection Demo
Filtering datasets
Filtering Algorithms
• There are two kinds of filters
– Supervised: take advantage of the class
information. A class must be assigned; the default
behavior uses the last attribute as the class.
– Unsupervised: the class is not taken into
consideration.
• Both unsupervised and supervised filters have
– Attribute filters, which work on the attributes in the
datasets, and
– Instance filters, which work on the instances
Unsupervised Attribute Filters
• Including operations of
– Adding and Removing Attributes
– Changing Values
– Converting attributes from one form to another
– Converting multi-instance data into single-
instance format
– Working with time series data
– Randomizing
(Witten, Frank & Hall, 2011)
This one will be used in the Demo.
Unsupervised Instance Filters
(Witten, Frank & Hall, 2011)
This one will be used in the Demo.
Supervised Attribute and Instance
Filters
(Witten, Frank & Hall, 2011)
Filtering datasets Demo
Note that the data type of
the attribute “temperature”
is numeric.
First, let’s filter the attributes.
Set the “attributeIndices” to 2
(the “temperature” attribute)
and the “bins” to 5 (which
means the values will be
discretized into 5 bins)
Note the discretization result.
We can also filter the instances.
Note here that
there are 3
instances that have
the label (-inf-68.2].
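Assuming the temperatures are those of Weka's standard weather.numeric dataset, both the cut point 68.2 and that count of 3 can be reproduced with a short Python sketch of equal-width discretization (illustrative, not Weka's implementation):

```python
def equal_width_bins(values, bins):
    """Equal-width discretization, as in Weka's unsupervised
    Discretize filter: split [min, max] into `bins` equal intervals
    and label each value with its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    cuts = [lo + width * i for i in range(1, bins)]
    def label(v):
        for i, c in enumerate(cuts):
            if v <= c:
                if i == 0:
                    return "(-inf-%g]" % c
                return "(%g-%g]" % (cuts[i - 1], c)
        return "(%g-inf)" % cuts[-1]
    return [label(v) for v in values]

# Temperatures assumed from the standard weather.numeric dataset.
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
labels = equal_width_bins(temps, 5)
print(labels.count("(-inf-68.2]"))  # 3 instances fall into the first bin
```

With min 64, max 85 and 5 bins, the bin width is 4.2, so the first cut point is 64 + 4.2 = 68.2, matching the label shown in the demo.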
Set the “attributeIndex” to 2 (the
“temperature” attribute) and the
“nominalIndices” to 1 (which
means all the instances with the
label (-inf-68.2] will be removed).
All the instances labeled
as (-inf-68.2] have been
removed.
Then, when you run the
classification, it will be based
on the filtered dataset, as
shown here.
Resources
• Weka official website:
http://www.cs.waikato.ac.nz/ml/weka/
• Two Weka tutorials on YouTube:
– https://www.youtube.com/user/WekaMOOC
– https://www.youtube.com/user/rushdishams/videos
• Book: Data Mining:
Practical Machine Learning Tools and Techniques.
Please refer to
http://www.cs.waikato.ac.nz/ml/weka/book.html
for more details.
References
• Frank, E., Machine Learning with WEKA. Retrieved April 05, 2014
from http://www.cs.waikato.ac.nz/ml/weka/documentation.html
• Kohavi, R. & John, G.H. (1997), Wrappers for feature subset
selection, Artificial Intelligence 97(1–2), 273–324.
• Reservoir sampling. Retrieved April 05, 2014, from
http://en.wikipedia.org/wiki/Reservoir_sampling
• Witten, I. H., Frank, E., Hall, M. (2011) Data Mining: Practical
Machine Learning Tools and Techniques (Third Edition). Morgan
Kaufmann.
• Xue, B., Zhang, M., & Browne, W. N. (2012). Single feature ranking
and binary particle swarm optimisation based feature subset
ranking for feature selection. Paper presented at the Proceedings of
the Thirty-fifth Australasian Computer Science Conference - Volume
122, Melbourne, Australia.
