Aggressive feature selection for text categorization
1. Aggressive feature selection for text categorization
E Gabrilovich and S Markovitch, “Text categorization with many
redundant features: Using aggressive feature selection to make
SVMs competitive with C4.5,” 21st International Conference on
Machine Learning, ACM, 2004.
Presented by Hershel Safer
at the Machine Learning :: Reading Group Meetup
on 12 February 2014
2. Results
Key result: They introduce a measure of the redundancy of
words in a document collection that predicts whether feature
selection will improve categorization of those documents.
Also:
A method to generate labeled datasets for testing text-categorization algorithms (previous work)
A platform for testing text-categorization algorithms
3. Background: Text categorization
Text categorization: Given a set of natural-language documents
and a set of labels, assign one or more labels to each document.
Most algorithms treat a document as a collection of words, with
each word as a feature; so even modest collections have
thousands or tens of thousands of features.
For such high-dimensional problems, feature selection is often
used to reduce noise and avoid overfitting.
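As a concrete illustration of the bag-of-words view, here is a minimal sketch (not from the paper; the toy documents and the use of scikit-learn are my assumptions) in which each document becomes a row of word counts, one column per distinct word:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy collection: two invented documents for illustration.
docs = [
    "support vector machines classify text documents",
    "decision trees also classify text documents",
]

# Each document becomes a row in a sparse term-count matrix;
# every distinct word in the collection is one feature (column).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary (features)
print(X.toarray())                         # per-document word counts
```

On a realistic collection the vocabulary, and hence the feature count, easily reaches the tens of thousands mentioned above.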
4. Background: Feature selection
Use various methods to measure how well specific words
discriminate between categories: information gain (IG), chi-
squared, bi-normal separation, document frequency, etc.
Feature selection: Choose the most informative features using a
score cutoff or a fixed percentage of the top-scoring features.
Previous work on standard document collections found that
even words with low discriminative power improved
classification.
Question asked by this work: When does aggressive feature
selection (using ~1% of the words in the collection) improve text
categorization?
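A hedged sketch of what such aggressive selection might look like: score every word and keep only the top ~1%. Here scikit-learn's mutual_info_classif stands in for the paper's IG computation, and keep_fraction is an illustrative parameter, not the paper's exact protocol:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def aggressive_select(X, y, keep_fraction=0.01):
    """Keep only the top-scoring ~1% of features (illustrative cutoff).

    X is an (n_docs, n_words) count matrix, y the category labels.
    """
    # Score each word's discriminative power (proxy for IG).
    scores = mutual_info_classif(X, y, discrete_features=True)
    n_keep = max(1, int(keep_fraction * X.shape[1]))
    top = np.argsort(scores)[::-1][:n_keep]  # highest-scoring word indices
    return X[:, top], top
```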
5. The data
The data consist of 100 datasets created from Web directories,
each containing documents from 2 categories.
The categorization difficulty ranges from very easy to very hard.
Baseline accuracy of categorization using SVM is fairly uniformly
distributed between 0.6 and 0.92.
6. Distribution of IG and effect of feature selection
Key is not the level of IG values but rather the rate of decrease.
For dataset D and feature set F, the Outlier Count (OC) is the
number of features whose IG is at least 3 standard deviations
above the mean:

OC(D, F) = |{ f ∈ F : IG(f) ≥ mean(IG) + 3·std(IG) }|
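A direct transcription of that definition, as a sketch assuming the per-feature IG scores have already been computed:

```python
import numpy as np

def outlier_count(ig_scores):
    """Outlier Count: the number of features whose IG is at least
    3 standard deviations above the mean IG over all features."""
    ig = np.asarray(ig_scores, dtype=float)
    threshold = ig.mean() + 3.0 * ig.std()
    return int(np.sum(ig >= threshold))
```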
7. Effect of Outlier Count on SVM accuracy
OC has a strong negative correlation with the improvement in
SVM accuracy that results from aggressive feature selection.
Studies that found no benefit from aggressive feature selection
used datasets with very large OC.
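This suggests using OC as a cheap diagnostic before committing to aggressive selection. A hypothetical decision rule, reusing outlier_count from the sketch above (the cutoff value is invented for illustration, not taken from the paper):

```python
def should_select_aggressively(ig_scores, oc_cutoff=10):
    """Low OC suggests aggressive feature selection is likely to help;
    oc_cutoff is an invented, illustrative threshold."""
    return outlier_count(ig_scores) < oc_cutoff
```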
8. Choosing classifier and feature-selection methods
Using feature selection may affect the choice of classifier.
Different feature-selection methods give different results;
they report information gain, chi-squared, and bi-normal
separation as the best performers.