Aggressive feature selection for text categorization
E Gabrilovich and S Markovitch, “Text categorization with many
redundant features: Using aggressive feature selection to make
SVMs competitive with C4.5,” 21st International Conference on
Machine Learning, ACM, 2004.
Presented by Hershel Safer
at the Machine Learning :: Reading Group Meetup
on 12 February 2014
Aggressive feature selection for text categorization – Hershel Safer, page 1, 12 February 2014
Results
Key result: They introduce a measure of the redundancy of
words in a collection of documents that predicts whether feature
selection will improve categorization of the documents.
Also:
A method to generate labeled datasets for testing
text-categorization algorithms (previous work)
A platform for testing text-categorization algorithms
Background: Text categorization
Text categorization: Given a set of natural-language documents
and a set of labels, assign one or more labels to each document.
Most algorithms treat a document as a collection of words, with
each word as a feature; so even modest collections have
thousands or tens of thousands of features.
For such high-dimensional problems, feature selection is often
used to reduce noise and avoid overfitting.
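
To make the "each word is a feature" representation concrete, here is a minimal sketch in Python. The tokenizer (lowercase alphabetic runs) and the function names `tokenize` and `bag_of_words` are assumptions chosen for illustration, not something taken from the paper or the talk.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, rows), with one word-count mapping per document.

    Every distinct word in the collection becomes one feature, which is
    why the feature count grows quickly with the size of the collection.
    """
    rows = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*rows)) if rows else []
    return vocabulary, rows


# Two tiny illustrative documents (made up for this sketch).
docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocab, rows = bag_of_words(docs)
print(len(vocab), "features:", vocab)
```

Even these two short sentences produce seven features, so a realistic collection reaching thousands of features is unsurprising.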
Background: Feature selection
Use various methods to measure how well specific words
discriminate between categories: information gain (IG), chi-
squared, bi-normal separation, document frequency, etc.
Feature selection: Choose the most informative features using a
score cutoff or a fixed percentage of the top-scoring features.
Previous work on standard document collections found that
even words with low discriminative power improved
classification.
Question asked by this work: When does aggressive feature
selection (using ~1% of the words in the collection) improve text
categorization?
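
The scoring and selection steps described on this slide can be sketched as follows. This is a simplified illustration, assuming each word is treated as a binary feature (present or absent in a document) and that selection keeps a fixed fraction of the top-scoring words. The function names, the tie-breaking rule, and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a label distribution given as counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, word):
    """IG of the binary feature 'word occurs in the document'.

    docs is a list of token sets; labels is a parallel list of categories.
    """
    with_word = Counter(l for d, l in zip(docs, labels) if word in d)
    without = Counter(l for d, l in zip(docs, labels) if word not in d)
    n = len(docs)
    base = entropy(list(Counter(labels).values()))
    conditional = sum(
        sum(part.values()) / n * entropy(list(part.values()))
        for part in (with_word, without)
    )
    return base - conditional


def select_top_fraction(docs, labels, fraction=0.01):
    """Keep the given fraction of words with the highest IG (at least one)."""
    vocab = set().union(*docs)
    scores = {w: information_gain(docs, labels, w) for w in vocab}
    keep = max(1, round(fraction * len(vocab)))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(vocab, key=lambda w: (-scores[w], w))
    return ranked[:keep], scores


# Toy two-category collection, made up for this sketch.
docs = [
    {"goal", "match", "the"},
    {"goal", "team", "the"},
    {"vote", "party", "the"},
    {"vote", "election", "the"},
]
labels = ["sport", "sport", "politics", "politics"]
selected, scores = select_top_fraction(docs, labels, fraction=0.25)
print(selected)
```

With `fraction=0.01`, the default, the same function corresponds to the aggressive setting the slide asks about: only about 1% of the vocabulary survives.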
The data
The data consist of 100 datasets created from Web directories,
each containing documents from 2 categories.
The categorization difficulty ranges from very easy to very hard.
Baseline accuracy of categorization using SVM is fairly uniformly
distributed between 0.6 and 0.92.
Distribution of IG and effect of feature selection
The key is not the level of the IG values but rather their rate of decrease.
For a dataset D with feature set F, the Outlier Count (OC) is the number of
features whose IG is at least 3 standard deviations above the mean IG.
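
The slide's original formula did not survive extraction. One reading of the verbal definition is OC(D) = |{ f in F : IG(f) >= mean(IG) + 3 * std(IG) }|. The sketch below computes that quantity from a list of IG scores. Using the population standard deviation, returning 0 when all scores are equal, and the name `outlier_count` are assumptions made for illustration.

```python
import statistics


def outlier_count(ig_scores, num_std=3.0):
    """Count features whose IG is at least num_std std devs above the mean.

    ig_scores: iterable of information-gain values, one per feature.
    Assumption: population standard deviation; the slide does not say
    which variant is intended.
    """
    values = list(ig_scores)
    if len(values) < 2:
        return 0
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All features score the same, so none stands out.
        return 0
    threshold = mean + num_std * std
    return sum(1 for v in values if v >= threshold)


# Illustrative input: 99 uninformative features and one informative one.
print(outlier_count([0.0] * 99 + [1.0]))
```

A small OC means only a handful of words carry most of the information, which is the situation in which the slides report that aggressive selection helps.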
Effect of Outlier Count on SVM accuracy
OC has a strong negative correlation with the improvement in
SVM accuracy that results from aggressive feature selection.
Studies that found no benefit from aggressive feature selection
used datasets with very large OC.
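
A relationship like the one stated above could be checked with a Pearson correlation between per-dataset OC values and per-dataset accuracy gains. The sketch below shows only the computation. The slides do not say which correlation statistic the authors used, so Pearson is an assumption, and no numbers from the paper are reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In use, `xs` would hold the OC of each dataset and `ys` the change in SVM accuracy after selection on that dataset; a value near -1 would match the strong negative correlation the slide describes.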
Choosing classifier and feature-selection methods
Using feature selection may affect the choice of classifier.
Different feature-selection methods give different results.
They report information gain, chi-squared, and bi-normal
separation as performing best.
