SlideShare una empresa de Scribd logo
1 de 22
Stemming Technology
E-Business Technologies
Prof. Dr. Eduard Heindl
By Ajay Singh
Introduction
Information Retrieval (IR) is the
arrangement of documents in a collection
to meet user's need for information.
Representation
 Query or profile,
 One or more search terms
 Information importance weights.
Query has
 Index terms (important words or phrases).
 Decision may be binary (retrieve/reject).
 Degree of relevance
Definition
Stemming is the process for Reducing inflected (or
sometimes derived) words to their stem, base or root form
– generally a written word form. The process of stemming
is often called conflation. These programs are commonly
referred to as stemming algorithms or stemmers
Utility
The process of stemming is
useful in search engines for
 Query expansion.
 Indexing.
 Natural language
processing
Algorithms
There are several types of stemming algorithms which
differ in respect to performance and accuracy and how
certain stemming obstacles are overcome.
A stemmer for ENGLISH, for example, should identify
the STRING "cats" (and possibly "catlike", "catty" etc.) as
based on the root "cat", and "stemmer", "stemming",
"stemmed" as based on "stem". A stemming algorithm
reduces the words "fishing", "fished", "fish", and "fisher" to
the root word, "fish".
Brute Force Algorithms
These stemmers employ a lookup table which contains relations between
root forms and inflected forms. To stem a word, the table is queried to find
a matching inflection. If a matching inflection is found, the associated root
form is returned.
Advantages.
 Stemming error less.
 User friendly.
Problems
 They lack elegance to converge to the result fast.
 Time consuming.
 Back end updating
 Difficult to design.
.
Suffix Stripping Algorithms
Suffix stripping algorithms do not rely on a lookup table that
consists of inflected forms and root form relations. Instead, a
typically smaller list of "rules" are stored which provide a path for
the algorithm, given an input word form, to find its root form.
Some examples of the rules include:
 if the word ends in 'ed', remove the 'ed'
 if the word ends in 'ing', remove the 'ing'
 if the word ends in 'ly', remove the 'ly'
Benefits
 Simple
Lemmatisation Algorithms
The more complex approach to the problem of
determining a stem of a word is lemmatisation. This
process involves first determining the part of speech of a
word, and applying different normalization rules for each
part of speech. The part of speech is first detected prior to
attempting to find the root since for some languages, the
stemming rules change depending on a word's part of
speech.
This approach is highly conditional upon obtaining the
correct lexical category (part of speech). While there is
overlap between the normalization rules for certain
categories, identifying the wrong category or being unable
to produce the right category limits the added benefit of
this approach over suffix stripping algorithms. The basic
idea is that, if we are able to grasp more information
about the word to be stemmed, then we are able to more
accurately apply normalization rules (which are, more or
less, suffix stripping rules).
Hybrid Approaches
Hybrid approaches use two or more of the approaches
described above in unison. A simple example is a
algorithm which first consults a lookup table using brute
force. However, instead of trying to store the entire set of
relations between words in a given language, the lookup
table is kept small and is only used to store a minute
amount of "frequent exceptions" like "ran => run". If the
word is not in the exception list, apply suffix stripping or
lemmatisation and output the result
Affix Stemmers
In linguistics, the term affix refers to either a prefix and suffix. In
addition to dealing with suffixes, several approaches also attempt
to remove common prefixes. For example, given the word
indefinitely, identify that the leading "in" is a prefix that can be
removed. Many of the same approaches mentioned earlier apply,
but go by the name affix stripping.
Matching Algorithms
These algorithms use a stem database (for example a set
of documents that contain stem words). These stems, as
mentioned above, are not necessarily valid words
themselves (but rather common sub-strings, as the
"brows" in "browse" and in "browsing"). In order to stem a
word the algorithm tries to match it with stems from the
database, applying various constraints, such as on the
relative length of the candidate stem within the word (so
that, for example, the short prefix "be", which is the stem
of such words as "be", "been" and "being", would not be
considered as the stem of the word "beside").
Multilingual Stemming
Multilingual stemming applies morphological rules of two or more
languages simultaneously instead of rules for only a single
language when interpreting a search query. Commercial systems
using multilingual stemming exist.
Challenges
 Hebrew and Arabic are tough languages for
stemming.
 The morphology, orthography, and character
encoding of the target language becomes more
complex for stemmer design in some languages.
 Italian stemmer is more complex than an English
one (because of more possible verb inflections), a
Russian one is more complex (more possible noun
declensions),
 Hebrew one is even more complex (due to non-
catenative morphology and a writing system without
vowels).
 Stemmer for Hungarian is easier to due to the
precise rules in the language for flexion.
Stemming algorithm for
German language.
Recently a stemming algorithm for morphological
complex languages like German or Dutch is presented.
The main idea is not to use stems as common forms in
order to make the algorithm simple and fast.
The algorithm consists of two steps:
 The certain characters and/or character sequences
are substituted. This step takes linguistic rules and
statistical heuristics into account.
 A very simple, context free suffix-stripping algorithm is
applied. Three variations of the algorithm are described:
The simplest one can easily be implemented with 50 lines
of C++ code while the most complex one requires about
100 lines of code and a small wordlist. Speed and quality
of the algorithm can be scaled by applying further
linguistic rules and statistical heuristics
Performance
Direct Assessment
The most primitive method for assessing the
perform ance of a stemmer is to examine its
behaviour when applied to samples of words -
especially words which have already been arranged
into 'conflation groups'. This way, specific errors
(e.g., failing to merge "maintained" with
"maintenance", or wrongly merging "experiment"
with "experience") can be identified, and the rules
adjusted accordingly. This approach is of very
limited utility on its own, but can be used to
complement other methods, such as the error-
counting approach outlined later.
Components
Information Retrieval Components.
 Precision
 Recall
 Fall-Out
 F-measure
Error counting
There is a possibility to evaluate stemming by counting the
numbers of two kinds of errors that occur during
stemming, namely;
 Under Stemming.
This refers to words that should be grouped together by
stemming, but aren't. This causes a single concept to be
spread over various different stems, which will tend to
decrease the Recall in an IR search.
 Over-Stemming
This refers to words that shouldn’t be grouped together by
stemming, but are. This causes the meanings of the
stems to be diluted, which will effect Precision of IR.
Using a sample file of grouped words, these errors are
then counted.
Mathematical Notation
There is a method that returns a value for an Under-
Stemming (or Conflation) index;
UI = Under-Stemming Index
CI = Conflation Index: proportion of equivalent word
pairs which were successfully grouped to the same
stem.
UI= 1 - CI
Also the value for an Over-Stemming (or
Distinctness) index;
OI = Over-Stemming index
DI = Distinctness Index: proportion of non-
equivalent word pairs which remained distinct after
stemming.
OI= 1 - DI
Stemmer Strength
Number of words per conflation class
This is the average size of the groups of words
coverted to a particular stem (regardless of whether
they are all correct).
This metric is obviously dependent on the number
of words processed, but for a word collection of
given size, a higher value indicates a heavier
stemmer. The value is easily calculated as follows:
WC = Mean number of words per conflation class
N = Number of unique words before Stemming
S = Number of unique stems after Stemming
MWC=N/S
Index Compression
The Index Compression Factor represents the extent to
which a collection of unique words is reduced
(compressed) by stemming, the idea being that the
heavier the Stemmer, the greater the Index Compression
Factor. This can be calculated by;
IC = Index Compression Factor
N = Number of unique words before Stemming
S = Number of unique stems after Stemming
ICF =( N-S)/N
Applications
 Information retrieval
 Usage in commercial products .
References.
 J. Carlberger and V. Kann. 1999. Implementing an efficient part-of-speech tagger,
Software Practice and Experience, 29, 815-832, 1999.
 D. Harman. 1991. How effective is suffixing? Journal of the American Society for
Information Science, 42(1): 7-15.
 D.A. Hull. 1996. Stemming Algorithms - A Case Study for Detailed Evaluation. Journal
of the American Society for Information Science, 47(1): 70-84
 R. Krovetz. 1993. Viewing Morphology as an Inference Process. In Proceedings of the
16th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, ACM, New York, pp 191-202.
 M.F. Porter. 1980. An algorithm for suffix stripping. Program, vol 14, no 3, pp 130-130.
Xu and W. B. Croft. 1998. Corpus-based Stemming using Co-occurrence of Word
Variants. ACM Transactions on Information Systems, Volume 16, Number 1, pp 61-81,
January 1998.
 W. Kraaij and R.Pohlmann. 1994. Porter's stemming algorithm for Dutch. In L.G.M.
Noordman and W.A.M. de Vroomen,editors, Informatie wetenschap 1994:
Wetenschappelijke bijdragen aan de derde STINFON Conferentie, pp. 167-180.
 M. Hassel. 2001. Internet as Corpus – Automatic Construction of a Swedish News
Corpus. NODALIDA ’01 - 13th Nordic Conference on Computational Linguistics, May
21-22 2001, Uppsala, Sweden
 M. Popovic and P. Willett. 1992. The effectiveness of stemming for natural-language
access to Slovene textual data. Journal of the American Society for Information
Science, 43(5): 384-390.
 Stemming algorithms research. www.cranfield.ac.uk/research.
 Key word stemming. www.cybertouch.info
 Stemming technology research. www.vlex.be
 Lovins, Julie B. Development of a Stemming Algorithm, Electronic systems lab, MIT,
USA

Más contenido relacionado

Similar a Stemming is one of several text normalization techniques that converts raw text data into a readable format for natural language processing tasks

A NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVALA NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
IJNSA Journal
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
beshahashenafe20
 
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language TextDesigning A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
CSCJournals
 
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language TextDesigning A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
CSCJournals
 
Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLP
Andi Wu
 
Unsupervised Software-Specific Morphological Forms Inference from Informal Di...
Unsupervised Software-Specific Morphological Forms Inference from Informal Di...Unsupervised Software-Specific Morphological Forms Inference from Informal Di...
Unsupervised Software-Specific Morphological Forms Inference from Informal Di...
Chunyang Chen
 
COMPILER DESIGN LECTURES -UNIT-2 ST.pptx
COMPILER DESIGN LECTURES -UNIT-2 ST.pptxCOMPILER DESIGN LECTURES -UNIT-2 ST.pptx
COMPILER DESIGN LECTURES -UNIT-2 ST.pptx
Ranjeet Reddy
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Design
ijtsrd
 

Similar a Stemming is one of several text normalization techniques that converts raw text data into a readable format for natural language processing tasks (20)

A NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVALA NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
 
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVALA NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
 
7 probability and statistics an introduction
7 probability and statistics an introduction7 probability and statistics an introduction
7 probability and statistics an introduction
 
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language TextDesigning A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
 
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language TextDesigning A Rule Based Stemming Algorithm for Kambaata Language Text
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
 
Parser
ParserParser
Parser
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languages
 
Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLP
 
Speech recognition an overview
Speech recognition   an overviewSpeech recognition   an overview
Speech recognition an overview
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
 
IRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & Autocorrection
 
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
 
Unsupervised Software-Specific Morphological Forms Inference from Informal Di...
Unsupervised Software-Specific Morphological Forms Inference from Informal Di...Unsupervised Software-Specific Morphological Forms Inference from Informal Di...
Unsupervised Software-Specific Morphological Forms Inference from Informal Di...
 
COMPILER DESIGN LECTURES -UNIT-2 ST.pptx
COMPILER DESIGN LECTURES -UNIT-2 ST.pptxCOMPILER DESIGN LECTURES -UNIT-2 ST.pptx
COMPILER DESIGN LECTURES -UNIT-2 ST.pptx
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Design
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
Deciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsDeciphering voice of customer through speech analytics
Deciphering voice of customer through speech analytics
 

Más de NALESVPMEngg

Introduction to Csharp (C-Sharp) is a programming language developed by Micro...
Introduction to Csharp (C-Sharp) is a programming language developed by Micro...Introduction to Csharp (C-Sharp) is a programming language developed by Micro...
Introduction to Csharp (C-Sharp) is a programming language developed by Micro...
NALESVPMEngg
 

Más de NALESVPMEngg (13)

a simple idealized machine used to recognize patterns within input taken from...
a simple idealized machine used to recognize patterns within input taken from...a simple idealized machine used to recognize patterns within input taken from...
a simple idealized machine used to recognize patterns within input taken from...
 
Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...
 
Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...
 
Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...
Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...
Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...
 
Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...
Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...
Class diagrams are a type of UML (Unified Modeling Language) diagram used in ...
 
Activity diagrams show the flow of one activity to another within a system or...
Activity diagrams show the flow of one activity to another within a system or...Activity diagrams show the flow of one activity to another within a system or...
Activity diagrams show the flow of one activity to another within a system or...
 
Activity diagrams show the flow of one activity to another within a system or...
Activity diagrams show the flow of one activity to another within a system or...Activity diagrams show the flow of one activity to another within a system or...
Activity diagrams show the flow of one activity to another within a system or...
 
Introduction to Csharp (C-Sharp) is a programming language developed by Micro...
Introduction to Csharp (C-Sharp) is a programming language developed by Micro...Introduction to Csharp (C-Sharp) is a programming language developed by Micro...
Introduction to Csharp (C-Sharp) is a programming language developed by Micro...
 
Wk5_UML_ActivityDiagram.pptx
Wk5_UML_ActivityDiagram.pptxWk5_UML_ActivityDiagram.pptx
Wk5_UML_ActivityDiagram.pptx
 
TutorialUML.pptx
TutorialUML.pptxTutorialUML.pptx
TutorialUML.pptx
 
6 Use Case Modeling.pptx
6 Use Case Modeling.pptx6 Use Case Modeling.pptx
6 Use Case Modeling.pptx
 
Introduction To Data Structures.ppt
Introduction To Data Structures.pptIntroduction To Data Structures.ppt
Introduction To Data Structures.ppt
 
Introduction To Algorithms.ppt
Introduction To Algorithms.pptIntroduction To Algorithms.ppt
Introduction To Algorithms.ppt
 

Último

Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..
MaherOthman7
 
Microkernel in Operating System | Operating System
Microkernel in Operating System | Operating SystemMicrokernel in Operating System | Operating System
Microkernel in Operating System | Operating System
Sampad Kar
 

Último (20)

Multivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptxMultivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptx
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Theory for How to calculation capacitor bank
Theory for How to calculation capacitor bankTheory for How to calculation capacitor bank
Theory for How to calculation capacitor bank
 
Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..
 
Lesson no16 application of Induction Generator in Wind.ppsx
Lesson no16 application of Induction Generator in Wind.ppsxLesson no16 application of Induction Generator in Wind.ppsx
Lesson no16 application of Induction Generator in Wind.ppsx
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...
 
Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission line
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded Systems
 
How to Design and spec harmonic filter.pdf
How to Design and spec harmonic filter.pdfHow to Design and spec harmonic filter.pdf
How to Design and spec harmonic filter.pdf
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptx
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docx
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 
Linux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message QueuesLinux Systems Programming: Semaphores, Shared Memory, and Message Queues
Linux Systems Programming: Semaphores, Shared Memory, and Message Queues
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdf
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AIIntroduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
 
Artificial Intelligence Bayesian Reasoning
Artificial Intelligence Bayesian ReasoningArtificial Intelligence Bayesian Reasoning
Artificial Intelligence Bayesian Reasoning
 
Microkernel in Operating System | Operating System
Microkernel in Operating System | Operating SystemMicrokernel in Operating System | Operating System
Microkernel in Operating System | Operating System
 

Stemming is one of several text normalization techniques that converts raw text data into a readable format for natural language processing tasks

  • 1. Stemming Technology E-Business Technologies Prof. Dr. Eduard Heindl By Ajay Singh
  • 2. Introduction Information Retrieval (IR) is the arrangement of documents in a collection to meet user's need for information. Representation  Query or profile,  One or more search terms  Information importance weights. Query has  Index terms (important words or phrases).  Decision may be binary (retrieve/reject).  Degree of relevance
  • 3. Definition Stemming is the process for Reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form. The process of stemming is often called conflation. These programs are commonly referred to as stemming algorithms or stemmers
  • 4. Utility The process of stemming is useful in search engines for  Query expansion.  Indexing.  Natural language processing
  • 5. Algorithms There are several types of stemming algorithms which differ in respect to performance and accuracy and how certain stemming obstacles are overcome. A stemmer for ENGLISH, for example, should identify the STRING "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".
  • 6. Brute Force Algorithms These stemmers employ a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the associated root form is returned. Advantages.  Stemming error less.  User friendly. Problems  They lack elegance to converge to the result fast.  Time consuming.  Back end updating  Difficult to design. .
  • 7. Suffix Stripping Algorithms Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" are stored which provide a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:  if the word ends in 'ed', remove the 'ed'  if the word ends in 'ing', remove the 'ing'  if the word ends in 'ly', remove the 'ly' Benefits  Simple
  • 8. Lemmatisation Algorithms The more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word's part of speech. This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if we are able to grasp more information about the word to be stemmed, then we are able to more accurately apply normalization rules (which are, more or less, suffix stripping rules).
  • 9. Hybrid Approaches Hybrid approaches use two or more of the approaches described above in unison. A simple example is a algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, apply suffix stripping or lemmatisation and output the result
  • 10. Affix Stemmers In linguistics, the term affix refers to either a prefix and suffix. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, identify that the leading "in" is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping.
  • 11. Matching Algorithms These algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the "brows" in "browse" and in "browsing"). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix "be", which is the stem of such words as "be", "been" and "being", would not be considered as the stem of the word "beside").
  • 12. Multilingual Stemming Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.
  • 13. Challenges  Hebrew and Arabic are tough languages for stemming.  The morphology, orthography, and character encoding of the target language becomes more complex for stemmer design in some languages.  Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex (more possible noun declensions),  Hebrew one is even more complex (due to non- catenative morphology and a writing system without vowels).  Stemmer for Hungarian is easier to due to the precise rules in the language for flexion.
  • 14. Stemming algorithm for German language. Recently a stemming algorithm for morphological complex languages like German or Dutch is presented. The main idea is not to use stems as common forms in order to make the algorithm simple and fast. The algorithm consists of two steps:  The certain characters and/or character sequences are substituted. This step takes linguistic rules and statistical heuristics into account.  A very simple, context free suffix-stripping algorithm is applied. Three variations of the algorithm are described: The simplest one can easily be implemented with 50 lines of C++ code while the most complex one requires about 100 lines of code and a small wordlist. Speed and quality of the algorithm can be scaled by applying further linguistic rules and statistical heuristics
  • 15. Performance Direct Assessment The most primitive method for assessing the perform ance of a stemmer is to examine its behaviour when applied to samples of words - especially words which have already been arranged into 'conflation groups'. This way, specific errors (e.g., failing to merge "maintained" with "maintenance", or wrongly merging "experiment" with "experience") can be identified, and the rules adjusted accordingly. This approach is of very limited utility on its own, but can be used to complement other methods, such as the error- counting approach outlined later.
  • 16. Components Information Retrieval Components.  Precision  Recall  Fall-Out  F-measure
  • 17. Error counting There is a possibility to evaluate stemming by counting the numbers of two kinds of errors that occur during stemming, namely;  Under Stemming. This refers to words that should be grouped together by stemming, but aren't. This causes a single concept to be spread over various different stems, which will tend to decrease the Recall in an IR search.  Over-Stemming This refers to words that shouldn’t be grouped together by stemming, but are. This causes the meanings of the stems to be diluted, which will effect Precision of IR. Using a sample file of grouped words, these errors are then counted.
  • 18. Mathematical Notation There is a method that returns a value for an Under- Stemming (or Conflation) index; UI = Under-Stemming Index CI = Conflation Index: proportion of equivalent word pairs which were successfully grouped to the same stem. UI= 1 - CI Also the value for an Over-Stemming (or Distinctness) index; OI = Over-Stemming index DI = Distinctness Index: proportion of non- equivalent word pairs which remained distinct after stemming. OI= 1 - DI
  • 19. Stemmer Strength Number of words per conflation class This is the average size of the groups of words coverted to a particular stem (regardless of whether they are all correct). This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is easily calculated as follows: WC = Mean number of words per conflation class N = Number of unique words before Stemming S = Number of unique stems after Stemming MWC=N/S
  • 20. Index Compression The Index Compression Factor represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the Stemmer, the greater the Index Compression Factor. This can be calculated by; IC = Index Compression Factor N = Number of unique words before Stemming S = Number of unique stems after Stemming ICF =( N-S)/N
  • 21. Applications  Information retrieval  Usage in commercial products .
  • 22. References.  J. Carlberger and V. Kann. 1999. Implementing an efficient part-of-speech tagger, Software Practice and Experience, 29, 815-832, 1999.  D. Harman. 1991. How effective is suffixing? Journal of the American Society for Information Science, 42(1): 7-15.  D.A. Hull. 1996. Stemming Algorithms - A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1): 70-84  R. Krovetz. 1993. Viewing Morphology as an Inference Process. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, pp 191-202.  M.F. Porter. 1980. An algorithm for suffix stripping. Program, vol 14, no 3, pp 130-130. Xu and W. B. Croft. 1998. Corpus-based Stemming using Co-occurrence of Word Variants. ACM Transactions on Information Systems, Volume 16, Number 1, pp 61-81, January 1998.  W. Kraaij and R.Pohlmann. 1994. Porter's stemming algorithm for Dutch. In L.G.M. Noordman and W.A.M. de Vroomen,editors, Informatie wetenschap 1994: Wetenschappelijke bijdragen aan de derde STINFON Conferentie, pp. 167-180.  M. Hassel. 2001. Internet as Corpus – Automatic Construction of a Swedish News Corpus. NODALIDA ’01 - 13th Nordic Conference on Computational Linguistics, May 21-22 2001, Uppsala, Sweden  M. Popovic and P. Willett. 1992. The effectiveness of stemming for natural-language access to Slovene textual data. Journal of the American Society for Information Science, 43(5): 384-390.  Stemming algorithms research. www.cranfield.ac.uk/research.  Key word stemming. www.cybertouch.info  Stemming technology research. www.vlex.be  Lovins, Julie B. Development of a Stemming Algorithm, Electronic systems lab, MIT, USA