Aggressive feature selection for text categorization
1. Aggressive feature selection for text categorization
E Gabrilovich and S Markovitch, “Text categorization with many
redundant features: Using aggressive feature selection to make
SVMs competitive with C4.5,” 21st International Conference on
Machine Learning, ACM, 2004.
Presented by Hershel Safer
at the Machine Learning :: Reading Group Meetup
on 12 February 2014
2. Results
Key result: They introduce a measure of the redundancy of
words in a document collection that predicts whether feature
selection will improve categorization of those documents.
Also:
A method to generate labeled datasets for testing text-categorization algorithms (previous work)
A platform for testing text-categorization algorithms
3. Background: Text categorization
Text categorization: Given a set of natural-language documents
and a set of labels, assign one or more labels to each document.
Most algorithms treat a document as a collection of words, with
each word as a feature; so even modest collections have
thousands or tens of thousands of features.
For such high-dimensional problems, feature selection is often
used to reduce noise and avoid overfitting.
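As a concrete illustration of the bag-of-words view, here is a minimal sketch (not from the paper; the toy documents and the use of scikit-learn are my assumptions) in which each document becomes a row of word counts, one column per distinct word:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy collection: two invented documents for illustration.
docs = [
    "support vector machines classify text documents",
    "decision trees also classify text documents",
]

# Each document becomes a row in a sparse term-count matrix;
# every distinct word in the collection is one feature (column).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary (features)
print(X.toarray())                         # per-document word counts
```

On a realistic collection the vocabulary, and hence the feature count, easily reaches the tens of thousands mentioned above.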
4. Background: Feature selection
Use various methods to measure how well specific words
discriminate between categories: information gain (IG), chi-
squared, bi-normal separation, document frequency, etc.
Feature selection: Choose the most informative features using a
score cutoff or a fixed percentage of the top-scoring features.
Previous work on standard document collections found that
even words with low discriminative power improved
classification.
Question asked by this work: When does aggressive feature
selection (using ~1% of the words in the collection) improve text
categorization?
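A hedged sketch of what such aggressive selection might look like: score every word and keep only the top ~1%. Here scikit-learn's mutual_info_classif stands in for the paper's IG computation, and keep_fraction is an illustrative parameter, not the paper's exact protocol:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def aggressive_select(X, y, keep_fraction=0.01):
    """Keep only the top-scoring ~1% of features (illustrative cutoff).

    X is an (n_docs, n_words) count matrix, y the category labels.
    """
    # Score each word's discriminative power (proxy for IG).
    scores = mutual_info_classif(X, y, discrete_features=True)
    n_keep = max(1, int(keep_fraction * X.shape[1]))
    top = np.argsort(scores)[::-1][:n_keep]  # highest-scoring word indices
    return X[:, top], top
```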
5. The data
The data consist of 100 datasets created from Web directories,
each containing documents from 2 categories.
The categorization difficulty ranges from very easy to very hard.
Baseline accuracy of categorization using SVM is fairly uniformly
distributed between 0.6 and 0.92.
6. Distribution of IG and effect of feature selection
Key is not the level of IG values but rather the rate of decrease.
For dataset D and feature set F, the Outlier Count (OC) is the
number of features whose IG is at least 3 standard deviations
above the mean:

OC(D, F) = |{ f ∈ F : IG(f) ≥ mean(IG) + 3·std(IG) }|
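A direct transcription of that definition, as a sketch assuming the per-feature IG scores have already been computed:

```python
import numpy as np

def outlier_count(ig_scores):
    """Outlier Count: the number of features whose IG is at least
    3 standard deviations above the mean IG over all features."""
    ig = np.asarray(ig_scores, dtype=float)
    threshold = ig.mean() + 3.0 * ig.std()
    return int(np.sum(ig >= threshold))
```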
7. Effect of Outlier Count on SVM accuracy
OC has a strong negative correlation with the improvement in
SVM accuracy that results from aggressive feature selection.
Studies that found no benefit from aggressive feature selection
used datasets with very large OC.
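This suggests using OC as a cheap diagnostic before committing to aggressive selection. A hypothetical decision rule, reusing outlier_count from the sketch above (the cutoff value is invented for illustration, not taken from the paper):

```python
def should_select_aggressively(ig_scores, oc_cutoff=10):
    """Low OC suggests aggressive feature selection is likely to help;
    oc_cutoff is an invented, illustrative threshold."""
    return outlier_count(ig_scores) < oc_cutoff
```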
8. Choosing classifier and feature-selection methods
Using feature selection may affect the choice of classifier.
Different feature-selection methods give different results;
they report information gain, chi-squared, and bi-normal
separation as the best performers.