2. Weka
• The software: Waikato Environment for
Knowledge Analysis
– Machine learning/data mining software written in
Java (distributed under the GNU General Public License)
• The bird: an endemic bird of New Zealand
3. Outline
• ARFF format and loading files to Weka
• Basic preprocessing and classifier demo
• Attribute selection & demo
• Filtering datasets & demo
5. Attribute-Relation File Format (ARFF)
• Two distinct sections
– Header & Data
• Four data types supported
– numeric
– <nominal-specification>
– string
– date [<date-format>]
• E.g.: DATE "yyyy-MM-dd HH:mm:ss"
(http://www.cs.waikato.ac.nz/ml/weka/arff.html)
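For illustration, here is a minimal ARFF file in the spirit of Weka's bundled weather data (the relation, attribute names, and values are just an example); it shows the two sections and the numeric and nominal types:

    @relation weather

    % Header section: declares the relation and its attributes
    @attribute outlook {sunny, overcast, rainy}   % nominal specification
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    % Data section: one comma-separated instance per line
    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes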
6. Converting Files to ARFF
• Weka has converters for the following file
formats:
– Spreadsheet files with extension .csv.
– C4.5’s native file format with extensions .names
and .data.
– Serialized instances with extension .bsi.
– LIBSVM format files with extension .libsvm.
– SVM-Light format files with extension .dat.
– XML-based ARFF format files with extension .xrff.
(Witten, Frank & Hall, 2011)
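These converters can also be invoked from Java code; a minimal sketch using Weka's CSVLoader and ArffSaver classes (the file names are hypothetical):

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.CSVLoader;

    public class Csv2Arff {
        public static void main(String[] args) throws Exception {
            // Load the spreadsheet file (hypothetical name)
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("weather.csv"));
            Instances data = loader.getDataSet();

            // Write the same instances back out in ARFF format
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("weather.arff"));
            saver.writeBatch();
        }
    }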
27. Why Feature Selection
• Not all the features contained in the datasets
of a classification problem are useful
• Redundant or irrelevant features may even
reduce the classification performance
• Eliminating noisy and unnecessary features
can
– Improve classification performance
– Make learning and executing processes faster
– Simplify the structure of the learned models
28. Feature Selection
• Two categories of feature selection
– Wrapper approaches:
• Conduct a search for the best feature subset using the learning
algorithm itself as part of the evaluation function
• A feature selection algorithm exists as a wrapper around a learning
algorithm
– Filter approaches:
• Independent of a learning algorithm
• Argued to be computationally less expensive and more general
• By considering the performance of the selected feature
subset on a particular learning algorithm, wrappers can
usually achieve better results than filter approaches
30. Filter: one example
• One algorithm that falls into the filter approach: the
FOCUS algorithm
– Exhaustively examines all subsets of features, selecting the
minimal subset of features that is sufficient to determine
the label value for all instances in the training set.
– May introduce the MIN-FEATURES bias.
– For example, in a medical diagnosis task, a set of features
describing a patient might include the patient’s social
security number (SSN). When FOCUS searches for the
minimum set of features, it will pick the SSN as the only
feature needed to uniquely determine the label. Given
only the SSN, any induction algorithm is expected to
generalize very poorly.
(Kohavi & John, 1997)
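A toy sketch of the FOCUS search (hypothetical code, not part of Weka; instances are rows of strings with the label in the last column): subsets are enumerated from smallest to largest, and the first subset that never assigns two different labels to the same combination of feature values is returned. This minimality preference is exactly what makes an SSN-like feature so attractive to it.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FocusSketch {

        // True if no two rows agree on every feature in 'subset'
        // but carry different labels (the label is the last column).
        static boolean consistent(String[][] rows, List<Integer> subset) {
            Map<String, String> seen = new HashMap<>();
            for (String[] row : rows) {
                StringBuilder key = new StringBuilder();
                for (int f : subset) key.append(row[f]).append('|');
                String label = row[row.length - 1];
                String prev = seen.putIfAbsent(key.toString(), label);
                if (prev != null && !prev.equals(label)) return false;
            }
            return true;
        }

        // Exhaustive search from smallest subsets upward: the first
        // consistent subset found is minimal (the MIN-FEATURES bias).
        static List<Integer> focus(String[][] rows, int numFeatures) {
            for (int size = 0; size <= numFeatures; size++)
                for (List<Integer> s : subsetsOfSize(numFeatures, size))
                    if (consistent(rows, s)) return s;
            return null; // only if duplicates carry contradictory labels
        }

        static List<List<Integer>> subsetsOfSize(int n, int k) {
            List<List<Integer>> out = new ArrayList<>();
            build(0, n, k, new ArrayList<>(), out);
            return out;
        }

        static void build(int start, int n, int k,
                          List<Integer> cur, List<List<Integer>> out) {
            if (cur.size() == k) { out.add(new ArrayList<>(cur)); return; }
            for (int i = start; i < n; i++) {
                cur.add(i);
                build(i + 1, n, k, cur, out);
                cur.remove(cur.size() - 1);
            }
        }
    }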
31. Searching Attribute Space
• The size of the search space for n features is 2^n, so it is
impractical to search the whole space exhaustively in
most situations
• Single Feature Ranking
– A relaxed version of feature selection that only requires
the computation of the relative importance of the features
and subsequently sorting them
– Computationally cheap, but the combination of the top-
ranked features may be a redundant subset
• Feature Subset Ranking, such as
– Greedy Algorithms
– Genetic Algorithm (GA)
32. WEKA Attribute Selection Function
• Two ways to do attribute selection:
– Normally done by searching the space of attribute
subsets, evaluating each one (Feature Subset Ranking)
• By combining one attribute subset evaluator and one
search method
– A potentially faster but less accurate approach is to
evaluate the attributes individually and sort them,
discarding attributes that fall below a chosen cutoff
point (Single Feature Ranking)
• By using one single-attribute evaluator and the Ranker
search method
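Both routes can also be driven from Java code. A minimal sketch (the dataset file name is hypothetical, and CfsSubsetEval/BestFirst and InfoGainAttributeEval/Ranker are just one possible pairing for each route):

    import java.util.Arrays;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttributeSelectionDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // Route 1: subset evaluator + search method
            AttributeSelection subsetSel = new AttributeSelection();
            subsetSel.setEvaluator(new CfsSubsetEval());
            subsetSel.setSearch(new BestFirst());
            subsetSel.SelectAttributes(data);
            System.out.println(Arrays.toString(subsetSel.selectedAttributes()));

            // Route 2: single-attribute evaluator + Ranker with a cutoff
            AttributeSelection rankSel = new AttributeSelection();
            rankSel.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(3); // the chosen cutoff point
            rankSel.setSearch(ranker);
            rankSel.SelectAttributes(data);
            System.out.println(Arrays.toString(rankSel.selectedAttributes()));
        }
    }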
33. Two Wrapper Methods in Weka
• ClassifierSubsetEval
– Uses a classifier, specified in the object editor as a
parameter, to evaluate sets of attributes on the
training data or on a separate holdout set.
• WrapperSubsetEval
– Also uses a classifier to evaluate attribute sets, but
employs cross-validation to estimate the accuracy
of the learning scheme for each set
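A sketch of configuring WrapperSubsetEval from code (J48, five folds, and GreedyStepwise are arbitrary choices for illustration; the file name is hypothetical):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.GreedyStepwise;
    import weka.attributeSelection.WrapperSubsetEval;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WrapperDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Wrapper: score each candidate subset by cross-validating J48
            WrapperSubsetEval eval = new WrapperSubsetEval();
            eval.setClassifier(new J48());
            eval.setFolds(5);

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(eval);
            sel.setSearch(new GreedyStepwise());
            sel.SelectAttributes(data);
            System.out.println(sel.toResultsString());
        }
    }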
49. Filtering Algorithms
• There are two kinds of filters
– Supervised: take advantage of the class
information. A class must be assigned; by default
the last attribute is used as the class.
– Unsupervised: the class is not taken into
consideration.
• Both unsupervised and supervised filters have
– Attribute filters, which work on the attributes in the
datasets, and
– Instance filters, which work on the instances
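In the code this taxonomy shows up directly in the package names (for example, weka.filters.supervised.instance.Resample versus weka.filters.unsupervised.attribute.Remove), and all filters follow the same usage pattern; a sketch with the unsupervised Remove attribute filter (the index chosen is arbitrary):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    // 'data' is an Instances object loaded beforehand
    Remove remove = new Remove();
    remove.setAttributeIndices("1"); // drop the first attribute (example)
    remove.setInputFormat(data);     // must be called before filtering
    Instances newData = Filter.useFilter(data, remove);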
50. Unsupervised Attribute Filters
• Including operations of
– Adding and Removing Attributes
– Changing Values
– Converting attributes from one form to another
– Converting multi-instance data into single-
instance format
– Working with time series data
– Randomizing
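The interval label (-inf-68.2] that appears in the demo below is the kind of label the unsupervised Discretize filter produces when converting a numeric attribute into a nominal one; a sketch (the bin count is arbitrary):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    // Equal-width binning of attribute 2 (e.g., temperature)
    Discretize disc = new Discretize();
    disc.setAttributeIndices("2");
    disc.setBins(5);            // arbitrary number of bins
    disc.setInputFormat(data);  // 'data' as before
    Instances discretized = Filter.useFilter(data, disc);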
51. [Figure: table of unsupervised attribute filters, from Witten,
Frank & Hall (2011); the filter highlighted there is the one
used in the demo below.]
68. Set “attributeIndex” to 2 (the
“temperature” attribute) and
“nominalIndices” to 1, which
removes all instances with the
label (-inf-68.2].
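The same step from code, a sketch using weka.filters.unsupervised.instance.RemoveWithValues (here 'data' stands for the discretized dataset):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemoveWithValues;

    // Drop every instance whose attribute 2 has the first nominal label
    RemoveWithValues rwv = new RemoveWithValues();
    rwv.setAttributeIndex("2");   // the temperature attribute
    rwv.setNominalIndices("1");   // the first label, i.e. (-inf-68.2]
    rwv.setInputFormat(data);
    Instances filtered = Filter.useFilter(data, rwv);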
71. When you then run the
classification, it is based on
the filtered dataset, as
shown here.
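In the Explorer this happens automatically, because the Classify panel works on whatever the Preprocess panel produced. From code, one way to get a comparable effect is Weka's FilteredClassifier meta-classifier, sketched here with J48 as an arbitrary base learner:

    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;

    // The filter is applied to the training data before the base
    // classifier sees it, and consistently to any test data later.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(rwv);            // the RemoveWithValues filter from above
    fc.setClassifier(new J48());
    fc.buildClassifier(data);     // unfiltered 'data'; the filter runs inside

Keeping the filter inside the classifier also keeps it inside cross-validation, so each training fold is filtered independently of its test fold.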
72. Resources
• Weka official website:
http://www.cs.waikato.ac.nz/ml/weka/
• Two Weka tutorials on YouTube:
– https://www.youtube.com/user/WekaMOOC
– https://www.youtube.com/user/rushdishams/videos
• Book: Data Mining:
Practical Machine Learning Tools and Techniques.
Please refer to
http://www.cs.waikato.ac.nz/ml/weka/book.html
for more details.
73. References
• Frank, E. Machine Learning with WEKA. Retrieved April 05, 2014,
from http://www.cs.waikato.ac.nz/ml/weka/documentation.html
• Kohavi, R., & John, G. H. (1997). Wrappers for feature subset
selection. Artificial Intelligence, 97(1–2), 273–324.
• Reservoir sampling. Retrieved April 05, 2014, from
http://en.wikipedia.org/wiki/Reservoir_sampling
• Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical
Machine Learning Tools and Techniques (3rd ed.). Morgan
Kaufmann.
• Xue, B., Zhang, M., & Browne, W. N. (2012). Single feature ranking
and binary particle swarm optimisation based feature subset
ranking for feature selection. In Proceedings of the Thirty-Fifth
Australasian Computer Science Conference (Vol. 122), Melbourne,
Australia.