A Comparison of Stemmers on
                  Source Code Identifiers for
                       Software Search
                            Andrew Wiese, Valerie Ho, Emily Hill
                                Montclair State University




Thursday, October 6, 2011
Problem: Source Code Search
                     • Challenge: Query words may not exactly match
                            source code words, which can hurt search
                     •      Example: “add item” query should match
                            • add, adds, adding, added
                            • item, items
                     •      Stemming used by Information Retrieval (IR)
                            systems to strip suffixes
                            • reduce all words to root form, or stem
                            • a.k.a. word conflation
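
A minimal sketch of the conflation idea on this slide, assuming a toy suffix list. The rules and the names `toy_stem` and `matches` are illustrative assumptions, not Porter's rules or any stemmer from the study:

```python
# Toy suffix stripper: illustrative only, not a published stemmer.
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list, tried in order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, identifier_words: list[str]) -> bool:
    """True if every stemmed query word appears among the stemmed identifier words."""
    stems = {toy_stem(w) for w in identifier_words}
    return all(toy_stem(q) in stems for q in query.split())


# The query "add item" matches code words such as "adding" and "items"
# only once both sides are reduced to their stems.
found = matches("add item", ["adding", "items"])
```

Without stemming, neither "add" nor "item" appears literally in `["adding", "items"]`, so an exact-match search would miss this identifier.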


What makes stemming source code
                  different from traditional IR?
                    •       Word choice more restrictive in naming identifiers
                            than in natural language (NL) documents
                            • NL: stem, stems, stemmer, stemming, stemmed
                            • Code: stem, stemmer
                    •       Classes that encapsulate actions have names with
                            nominalized verbs:
                            • play → player
                            • compile → compiler
                    •       Traditional IR prefers light stemmers such as Porter’s
                            • tends not to stem across parts of speech
                            • E.g., noun ‘player’ will not stem to verb ‘play’
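
The contrast between stemming within and across parts of speech can be sketched as follows. Both rule lists are hypothetical stand-ins chosen for this example; they are not the rules of Porter or of any heavy stemmer named in these slides:

```python
INFLECTIONAL = ("ing", "ed", "s")  # hypothetical "light" rules
DERIVATIONAL = ("ers", "er")       # hypothetical extra "heavy" rules


def strip_suffix(word: str, suffixes: tuple[str, ...]) -> str:
    """Strip the first matching suffix, then drop a final 'e'."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Dropping a final "e" lets "compile" and "compil(er)" share a stem.
    return word[:-1] if word.endswith("e") else word


def light_stem(word: str) -> str:
    return strip_suffix(word, INFLECTIONAL)


def heavy_stem(word: str) -> str:
    return strip_suffix(word, DERIVATIONAL + INFLECTIONAL)
```

With these assumed rules, `light_stem` leaves "player" and "play" apart, while `heavy_stem` conflates them, and likewise for "compiler" and "compile".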

Stemming Challenges
       •       Understemming
             •   stemmer assigns different stems to words in the same concept
             •   reduces number of relevant results in search
                 (i.e., reduces recall)
       •       Overstemming
             •   stemmer assigns the same stem for words with different
                 meanings (e.g., business conflated with busy,
                 university with universe)
             •   increases number of irrelevant results (i.e., reduces precision)
       •       Stemmers categorized by type of error
             •   Light stemmers: understem
             •   Heavy stemmers: overstem
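
The effect of each error type on search quality can be illustrated with the standard precision and recall definitions. The document ids below are made-up placeholders, not data from the study:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of a retrieved set against the relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search results for one query.
relevant = {"m1", "m2", "m3", "m4"}

# Understemming: some word variants are not conflated, so relevant methods are missed.
understemmed = {"m1", "m2"}

# Overstemming: unrelated words share a stem, so irrelevant methods are returned.
overstemmed = {"m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"}

under_p, under_r = precision_recall(understemmed, relevant)
over_p, over_r = precision_recall(overstemmed, relevant)
```

In this made-up case the understemmed result keeps precision at 1.0 but halves recall, and the overstemmed result keeps recall at 1.0 but halves precision.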


A Brief History of Stemming
              • Light Stemmers (tend not to stem across parts of speech)
                • Porter (1980): rule-based, simple & efficient
                  • Most popular stemmer in IR & SE
                  • Snowball (2001): minor rule improvements
                • KStem (1993): morphology-based
                  • based on word’s structure & hand-tuned dictionary
                  • shown in experiments to outperform Porter’s
              • Heavy Stemmers
                • Lovins (1968): rule-based
                • Paice (1990): rule-based
                • MStem: morphological (PC-Kimmo), specialized
                  for source code using word frequencies

Our Contribution
                 • Compare performance of 5 stemmers on
                   source code identifiers
                 • Evaluation 1: compare conflated word classes
                  • started from 100 most frequently occurring
                     words in 9,000 open source Java programs
                  • analyzed by 2 human Java programmers in
                     terms of accuracy & completeness
                 • Evaluation 2: compare effect of using 5
                        stemmers vs not stemming on 8 search tasks
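
One way to picture a "conflated word class" is to group a vocabulary by the stem each word receives. This is a sketch under assumptions: `word_classes` is a hypothetical helper, and the prefix-truncating stand-in is a crude placeholder, not one of the 5 stemmers compared:

```python
from collections import defaultdict
from typing import Callable, Iterable


def word_classes(vocabulary: Iterable[str], stem: Callable[[str], str]) -> dict[str, set[str]]:
    """Group words that share a stem into one word class per stem."""
    classes: dict[str, set[str]] = defaultdict(set)
    for word in vocabulary:
        classes[stem(word)].add(word)
    return dict(classes)


vocab = ["import", "imports", "imported", "important", "importance"]

# Stand-in "stemmers": identity (no stemming) and a crude 6-character prefix.
no_stem_classes = word_classes(vocab, lambda w: w)
prefix_classes = word_classes(vocab, lambda w: w[:6])
```

With no stemming every word sits in its own class. The prefix stand-in puts all five words under "import", including "important" and "importance".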



Stemmer Word Classes Comparison
            •      accurate: word class contains no unrelated words
            •      complete: word class not missing related words
                   (rely on greediness & diversity of stemmers)
            •      context sensitive (CS): multiple senses or disagreement
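
The two criteria above can be read as set comparisons against the words a human judges to be related. In the sketch below the word lists come from the "import" row of Table I; treating KStem's class as the human-judged related set is an assumption based on the "(KStem)" label under that word:

```python
def is_accurate(word_class: set[str], related: set[str]) -> bool:
    """Accurate: the class contains no words outside the related set."""
    return word_class <= related


def is_complete(word_class: set[str], related: set[str]) -> bool:
    """Complete: the class is not missing any related word."""
    return related <= word_class


# Assumed human judgement for "import" (taken to equal KStem's class).
related = {"import", "importable", "imported", "importer",
           "importers", "importing", "imports"}

kstem_class = set(related)
porter_class = related | {"importance", "important"}

kstem_ok = is_accurate(kstem_class, related) and is_complete(kstem_class, related)
porter_ok = is_accurate(porter_class, related) and is_complete(porter_class, related)
```

Under that assumption Porter's class is complete but not accurate, because it also pulls in "importance" and "important".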
            [Figure: bar chart, y-axis "No. Accurate & Complete" (ticks 10 to 100).
             Bars, left to right: None, Context Sensitive (CS), Porter, Paice,
             Snowball, KStem, MStem. Value labels, left to right: 29%, 32%, 37%,
             53%, 58%.]

Word Classes Example
                     • Stemmer comparison for 2 examples
                     • Underlined words in all stemmer classes

Table I: Stemmer word class comparisons for 4 examples (underlined
words are in the word classes for all stemmers)

Word (A & C)      Stemmer    Word Class
element (MStem)   Porter     element, elemental, elemente, elements
                  Snowball   element, elemental, elemente, elements
                  KStem      element
                  MStem      element, elemental, elements
                  Paice      el, ela, ele, element, elemental, elementary,
                             elemente, elementen, elements, elen, eles,
                             eli, elif, elise, elist, ell, elle, ellen,
                             eller, els, else, elseif, elses, elsif
import (KStem)    Porter     import, importable, importance, important,
                             imported, importer, importers, importing,
                             imports
                  Snowball   import, importable, importance, important,
                             importantly, imported, importer, importers,
                             importing, imports
                  KStem      import, importable, imported, importer,
                             importers, importing, imports
                  MStem      import, importable, importance, important,
                             importantly, imported, importer, importers,
                             importing, imports
                  Paice      import, importable, importance, important,
                             importantly, importar, imported, importer,
                             importers, importing, imports
add (CS)          Porter     add, adde, addes, adds
                  Snowball   add, adde, addes, adds
                  KStem      add, addable, added, addes, adding, adds
                  MStem      add, addable, added, adder, adding, addition,
                             additional, additionally, additions, additive,
                             additivity, adds
                  Paice      ad, ada, add, addable, adde, added, adder,
                             addes, adding, adds, ade, ads
name              Porter     name, named, namely, names, naming
                  Snowball   name, named, namely, names, naming
                  KStem      name, nameable, named, namer, names,
                             [rest of table cut off in source]

Stemming and Source Code Search
            •      search technique: tf-idf
            •      search tasks: 8 with 48 queries from prior study
                   [Shepherd, et al. ’07]
            •      Paice: overstemming & understemming mistakes improved
                   results for 2 tasks (e.g., textfield report element)
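
A minimal sketch of tf-idf ranking with a pluggable stemmer, to show where stemming enters the search pipeline. The method names, the toy documents, the `strip_s` rule, and the particular idf formula (`log(N/df) + 1`, one of several common variants) are all assumptions for illustration, not the setup used in the study:

```python
import math
from collections import Counter
from typing import Callable


def tfidf_rank(query: str, docs: dict[str, str],
               stem: Callable[[str], str] = lambda w: w) -> list[str]:
    """Return ids of matching documents, best tf-idf score first."""
    bags = {doc_id: Counter(stem(w) for w in text.lower().split())
            for doc_id, text in docs.items()}
    terms = {stem(w) for w in query.lower().split()}
    scores = {}
    for doc_id, bag in bags.items():
        score = 0.0
        for term in terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in bags.values() if term in other)
            if df:
                score += bag[term] * (math.log(len(bags) / df) + 1.0)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def strip_s(word: str) -> str:
    """Toy stemmer: drop a trailing 's' from words longer than three letters."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Hypothetical method names mapped to the words found in their code.
DOCS = {
    "addItem": "adds items to list",
    "removeItem": "removes item from list",
    "render": "draw text field",
}

unstemmed = tfidf_rank("add item", DOCS)
stemmed = tfidf_rank("add item", DOCS, stem=strip_s)
```

In this toy corpus the unstemmed query finds only `removeItem`, because "adds" and "items" do not literally match the query words. With `strip_s` applied to both query and documents, `addItem` is found and ranked first.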
            [Figure: Area Under the Curve (y-axis, 0.5 to 1.0) for NoStem,
             Porter, Snowbl, KStem, MStem, Paice]




Conclusion
                     •      Morphological stemmers appear to be more
                            accurate & complete than rule-based

                     •      In search, stemming more consistently produces
                            relevant results than not stemming

                     •      Heavy stemmers like MStem & Paice appear to be
                            more effective in searching source code than light
                            stemmers like Porter

                     •      Future work: more examples (less frequent &
                            more domain-specific), more human judgements,
                            more search tasks, other SE tasks beyond search



Thursday, October 6, 2011

Más contenido relacionado

Destacado

Components - Graph Based Detection of Library API Limitations
Components - Graph Based Detection of Library API LimitationsComponents - Graph Based Detection of Library API Limitations
Components - Graph Based Detection of Library API LimitationsICSM 2011
 
Faults and Regression Testing - Fault interaction and its repercussions
Faults and Regression Testing - Fault interaction and its repercussionsFaults and Regression Testing - Fault interaction and its repercussions
Faults and Regression Testing - Fault interaction and its repercussionsICSM 2011
 
ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...
ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...
ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...ICSM 2011
 
Impact analysis - A Seismology-inspired Approach to Study Change Propagation
Impact analysis - A Seismology-inspired Approach to Study Change PropagationImpact analysis - A Seismology-inspired Approach to Study Change Propagation
Impact analysis - A Seismology-inspired Approach to Study Change PropagationICSM 2011
 
ERA - Clustering and Recommending Collections of Code Relevant to Task
ERA - Clustering and Recommending Collections of Code Relevant to TaskERA - Clustering and Recommending Collections of Code Relevant to Task
ERA - Clustering and Recommending Collections of Code Relevant to TaskICSM 2011
 
Lionel Briand ICSM 2011 Keynote
Lionel Briand ICSM 2011 KeynoteLionel Briand ICSM 2011 Keynote
Lionel Briand ICSM 2011 KeynoteICSM 2011
 
Richard Kemmerer Keynote icsm11
Richard Kemmerer Keynote icsm11Richard Kemmerer Keynote icsm11
Richard Kemmerer Keynote icsm11ICSM 2011
 
Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...
Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...
Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...ICSM 2011
 
Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...
Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...
Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...ICSM 2011
 
Industry - Relating Developers' Concepts and Artefact Vocabulary in a Financ...
Industry -  Relating Developers' Concepts and Artefact Vocabulary in a Financ...Industry -  Relating Developers' Concepts and Artefact Vocabulary in a Financ...
Industry - Relating Developers' Concepts and Artefact Vocabulary in a Financ...ICSM 2011
 
ERA - Tracking Technical Debt
ERA - Tracking Technical DebtERA - Tracking Technical Debt
ERA - Tracking Technical DebtICSM 2011
 
ICSM'01 Most Influential Paper - Rainer Koschke
ICSM'01 Most Influential Paper - Rainer KoschkeICSM'01 Most Influential Paper - Rainer Koschke
ICSM'01 Most Influential Paper - Rainer KoschkeICSM 2011
 
Industry - Estimating software maintenance effort from use cases an indu...
Industry - Estimating software maintenance effort from use cases an      indu...Industry - Estimating software maintenance effort from use cases an      indu...
Industry - Estimating software maintenance effort from use cases an indu...ICSM 2011
 
Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...
Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...
Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...ICSM 2011
 
Reliability and Quality - Predicting post-release defects using pre-release f...
Reliability and Quality - Predicting post-release defects using pre-release f...Reliability and Quality - Predicting post-release defects using pre-release f...
Reliability and Quality - Predicting post-release defects using pre-release f...ICSM 2011
 
ERA - Measuring Maintainability of Spreadsheets in the Wild
ERA - Measuring Maintainability of Spreadsheets in the Wild ERA - Measuring Maintainability of Spreadsheets in the Wild
ERA - Measuring Maintainability of Spreadsheets in the Wild ICSM 2011
 
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...ICSM 2011
 
Industry - The Evolution of Information Systems. A Case Study on Document Man...
Industry - The Evolution of Information Systems. A Case Study on Document Man...Industry - The Evolution of Information Systems. A Case Study on Document Man...
Industry - The Evolution of Information Systems. A Case Study on Document Man...ICSM 2011
 
Postdoc Symposium - Abram Hindle
Postdoc Symposium - Abram HindlePostdoc Symposium - Abram Hindle
Postdoc Symposium - Abram HindleICSM 2011
 
Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...
Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...
Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...ICSM 2011
 

Destacado (20)

Components - Graph Based Detection of Library API Limitations
Components - Graph Based Detection of Library API LimitationsComponents - Graph Based Detection of Library API Limitations
Components - Graph Based Detection of Library API Limitations
 
Faults and Regression Testing - Fault interaction and its repercussions
Faults and Regression Testing - Fault interaction and its repercussionsFaults and Regression Testing - Fault interaction and its repercussions
Faults and Regression Testing - Fault interaction and its repercussions
 
ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...
ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...
ERA - Measuring Disruption from Software Evolution Activities Using Graph-Bas...
 
Impact analysis - A Seismology-inspired Approach to Study Change Propagation
Impact analysis - A Seismology-inspired Approach to Study Change PropagationImpact analysis - A Seismology-inspired Approach to Study Change Propagation
Impact analysis - A Seismology-inspired Approach to Study Change Propagation
 
ERA - Clustering and Recommending Collections of Code Relevant to Task
ERA - Clustering and Recommending Collections of Code Relevant to TaskERA - Clustering and Recommending Collections of Code Relevant to Task
ERA - Clustering and Recommending Collections of Code Relevant to Task
 
Lionel Briand ICSM 2011 Keynote
Lionel Briand ICSM 2011 KeynoteLionel Briand ICSM 2011 Keynote
Lionel Briand ICSM 2011 Keynote
 
Richard Kemmerer Keynote icsm11
Richard Kemmerer Keynote icsm11Richard Kemmerer Keynote icsm11
Richard Kemmerer Keynote icsm11
 
Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...
Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...
Migration and Refactoring - Identifying Overly Strong Conditions in Refactori...
 
Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...
Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...
Impact Analysis - ImpactScale: Quantifying Change Impact to Predict Faults in...
 
Industry - Relating Developers' Concepts and Artefact Vocabulary in a Financ...
Industry -  Relating Developers' Concepts and Artefact Vocabulary in a Financ...Industry -  Relating Developers' Concepts and Artefact Vocabulary in a Financ...
Industry - Relating Developers' Concepts and Artefact Vocabulary in a Financ...
 
ERA - Tracking Technical Debt
ERA - Tracking Technical DebtERA - Tracking Technical Debt
ERA - Tracking Technical Debt
 
ICSM'01 Most Influential Paper - Rainer Koschke
ICSM'01 Most Influential Paper - Rainer KoschkeICSM'01 Most Influential Paper - Rainer Koschke
ICSM'01 Most Influential Paper - Rainer Koschke
 
Industry - Estimating software maintenance effort from use cases an indu...
Industry - Estimating software maintenance effort from use cases an      indu...Industry - Estimating software maintenance effort from use cases an      indu...
Industry - Estimating software maintenance effort from use cases an indu...
 
Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...
Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...
Dynamic Analysis - SCOTCH: Improving Test-to-Code Traceability using Slicing ...
 
Reliability and Quality - Predicting post-release defects using pre-release f...
Reliability and Quality - Predicting post-release defects using pre-release f...Reliability and Quality - Predicting post-release defects using pre-release f...
Reliability and Quality - Predicting post-release defects using pre-release f...
 
ERA - Measuring Maintainability of Spreadsheets in the Wild
ERA - Measuring Maintainability of Spreadsheets in the Wild ERA - Measuring Maintainability of Spreadsheets in the Wild
ERA - Measuring Maintainability of Spreadsheets in the Wild
 
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...
 
Industry - The Evolution of Information Systems. A Case Study on Document Man...
Industry - The Evolution of Information Systems. A Case Study on Document Man...Industry - The Evolution of Information Systems. A Case Study on Document Man...
Industry - The Evolution of Information Systems. A Case Study on Document Man...
 
Postdoc Symposium - Abram Hindle
Postdoc Symposium - Abram HindlePostdoc Symposium - Abram Hindle
Postdoc Symposium - Abram Hindle
 
Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...
Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...
Postdoc symposium - A Logic Meta-Programming Foundation for Example-Driven Pa...
 

Similar a ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Saurabh Kaushik
 
Analyzing Arguments during a Debate using Natural Language Processing in Python
Analyzing Arguments during a Debate using Natural Language Processing in PythonAnalyzing Arguments during a Debate using Natural Language Processing in Python
Analyzing Arguments during a Debate using Natural Language Processing in PythonAbhinav Gupta
 
The CLUES database: automated search for linguistic cognates
The CLUES database: automated search for linguistic cognatesThe CLUES database: automated search for linguistic cognates
The CLUES database: automated search for linguistic cognatesMark Planigale
 
Word vectorization(embedding) with nnlm
Word vectorization(embedding) with nnlmWord vectorization(embedding) with nnlm
Word vectorization(embedding) with nnlmhyunsung lee
 
sa-mincut-aditya.ppt
sa-mincut-aditya.pptsa-mincut-aditya.ppt
sa-mincut-aditya.pptaashnareddy1
 
2023 EMNLP day_san.pptx
2023 EMNLP day_san.pptx2023 EMNLP day_san.pptx
2023 EMNLP day_san.pptxSan Kim
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text miningLokesh Ramaswamy
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text miningLokesh Ramaswamy
 
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...IT Arena
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Gender and language (linguistics, social network theory, Twitter!)
Gender and language (linguistics, social network theory, Twitter!)Gender and language (linguistics, social network theory, Twitter!)
Gender and language (linguistics, social network theory, Twitter!)Tyler Schnoebelen
 
Gender, language, and Twitter: Social theory and computational methods
Gender, language, and Twitter: Social theory and computational methodsGender, language, and Twitter: Social theory and computational methods
Gender, language, and Twitter: Social theory and computational methodsIdibon1
 
Sat lessons power point dt6 10.05.2011
Sat lessons power point dt6 10.05.2011Sat lessons power point dt6 10.05.2011
Sat lessons power point dt6 10.05.2011VJN_88_
 
Sld-Natural-Language-Processing-for-large-volumes-of-human-text-data-Sozzi-Br...
Sld-Natural-Language-Processing-for-large-volumes-of-human-text-data-Sozzi-Br...Sld-Natural-Language-Processing-for-large-volumes-of-human-text-data-Sozzi-Br...
Sld-Natural-Language-Processing-for-large-volumes-of-human-text-data-Sozzi-Br...hajinouha0
 
Personalised Terms Derivative- Semantic Stemming
Personalised Terms Derivative- Semantic StemmingPersonalised Terms Derivative- Semantic Stemming
Personalised Terms Derivative- Semantic Stemmingnitin jha
 
Story generation-Sarah Saneei
Story generation-Sarah SaneeiStory generation-Sarah Saneei
Story generation-Sarah SaneeiSRah Sanei
 

ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

  • 1. A Comparison of Stemmers on Source Code Identifiers for Software Search
      • Andrew Wiese, Valerie Ho, Emily Hill
      • Montclair State University
      • Thursday, October 6, 2011
  • 2. Problem: Source Code Search
      • Challenge: Query words may not exactly match source code words & can hurt search
      • Example: “add item” query should match
          • add, adds, adding, added
          • item, items
      • Stemming used by Information Retrieval (IR) systems to strip suffixes
          • reduce all words to root form, or stem
          • a.k.a. word conflation
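A minimal sketch of the matching problem on this slide, assuming a deliberately naive suffix stripper. `toy_stem` and `matches` are hypothetical names introduced for illustration; this is not Porter, KStem, or any other stemmer evaluated in the talk.

```python
# Toy suffix stripper for illustration only (an assumption, not a stemmer from the slides).
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, identifier_words: list[str]) -> bool:
    """True when every stemmed query word also occurs among the stemmed code words."""
    query_stems = {toy_stem(w) for w in query.split()}
    code_stems = {toy_stem(w) for w in identifier_words}
    return query_stems <= code_stems


# Without stemming, "add item" shares no exact word with ["adding", "items"];
# after conflation both sides reduce to {"add", "item"}.
print(matches("add item", ["adding", "items"]))
```

A real stemmer handles far more suffixes and exceptions; the point here is only that conflation lets the query and the identifier words meet at a shared stem.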
  • 3. What makes stemming source code different from traditional IR?
      • Word choice is more restrictive in naming identifiers than in natural language (NL) documents
          • NL: stem, stems, stemmer, stemming, stemmed
          • Code: stem, stemmer
      • Classes that encapsulate actions have names with nominalized verbs:
          • play → player
          • compile → compiler
      • Traditional IR prefers light stemmers like Porter’s
          • tends not to stem across parts of speech
          • e.g., the noun ‘player’ will not stem to the verb ‘play’
  • 4. Stemming Challenges
      • Understemming
          • stemmer assigns different stems to words in the same concept
          • reduces number of relevant results in search (i.e., reduces recall)
      • Overstemming
          • stemmer assigns the same stem for words with different meanings (e.g., business conflated with busy, university with universe)
          • increases number of irrelevant results (i.e., reduces precision)
      • Stemmers categorized by type of error
          • Light stemmers: understem
          • Heavy stemmers: overstem
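The effect of the two error types on search quality can be sketched with the standard precision and recall definitions. The identifier names and result sets below are hypothetical, chosen only to mirror the business/busy example on this slide.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / relevant (0.0 when a set is empty)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth for a query about "business".
relevant = {"BusinessRule", "BusinessUnit"}

# Understemming: a related form is missed, so recall drops.
understemmed = {"BusinessRule"}

# Overstemming: "busy" is conflated with "business", so precision drops.
overstemmed = {"BusinessRule", "BusinessUnit", "BusyWaitLock", "isBusy"}

print(precision_recall(understemmed, relevant))  # perfect precision, reduced recall
print(precision_recall(overstemmed, relevant))   # reduced precision, perfect recall
```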
  • 5. A Brief History of Stemming
      • Light Stemmers (tend not to stem across parts of speech)
          • Porter (1980): rule-based, simple & efficient
              • Most popular stemmer in IR & SE
          • Snowball (2001): minor rule improvements
          • KStem (1993): morphology-based
              • based on word’s structure & hand-tuned dictionary
              • shown in experiments to outperform Porter’s
      • Heavy Stemmers
          • Lovins (1968): rule-based
          • Paice (1990): rule-based
          • MStem: morphological (PC-Kimmo), specialized for source code using word frequencies
  • 6. Our Contribution
      • Compare performance of 5 stemmers on source code identifiers
      • Evaluation 1: compare conflated word classes
          • started from 100 most frequently occurring words in 9,000 open source Java programs
          • analyzed by 2 human Java programmers in terms of accuracy & completeness
      • Evaluation 2: compare effect of using 5 stemmers vs. not stemming on 8 search tasks
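The slides do not say how words were extracted from identifiers before counting. As a sketch under that caveat, one common approach is to split on camelCase and underscores and then count word frequencies; `split_identifier`, `most_frequent_words`, and the regular expression are assumptions introduced here.

```python
import re
from collections import Counter

# Assumed splitting rule: acronym runs, capitalized or lowercase words; separators are skipped.
WORD_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def split_identifier(identifier: str) -> list[str]:
    """Split an identifier such as 'XMLParser' or 'add_item' into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(identifier)]


def most_frequent_words(identifiers: list[str], n: int) -> list[tuple[str, int]]:
    """Return the n most frequent words across all identifiers, with their counts."""
    counts = Counter(w for ident in identifiers for w in split_identifier(ident))
    return counts.most_common(n)


print(split_identifier("XMLParser"))
print(most_frequent_words(["addItem", "itemCount", "add_item"], 2))
```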
  • 7. Stemmer Word Classes Comparison
      • accurate: word class contains no unrelated words
      • complete: word class not missing related words (rely on greediness & diversity of stemmers)
      • context sensitive (CS): multiple senses or disagreement
      • [Figure: bar chart of No. Accurate & Complete (y-axis, 0 to 100) with bars for None, Context Sensitive, Porter, Paice, Snowball, KStem, and MStem; bar labels read 58%, 53%, 37%, 32%, 29%]
  • 8. Word Classes Example
      • Stemmer comparison for 2 examples
      • Underlined words in all stemmer classes
      • Table I: Stemmer word class comparisons for 4 examples (underlined words are in the word classes for all stemmers). Columns: Word (A & C), Stemmer, Word Class
      • element (MStem)
          • Porter, Snowball: element, elemental, elemente, elements
          • KStem: element
          • MStem: element, elemental, elements
          • Paice: el, ela, ele, element, elemental, elementary, elemente, elementen, elements, elen, eles, eli, elif, elise, elist, ell, elle, ellen, eller, els, else, elseif, elses, elsif
      • import (KStem)
          • Porter: import, importable, importance, important, imported, importer, importers, importing, imports
          • Snowball: import, importable, importance, important, importantly, imported, importer, importers, importing, imports
          • KStem: import, importable, imported, importer, importers, importing, imports
          • MStem: import, importable, importance, important, importantly, imported, importer, importers, importing, imports
          • Paice: import, importable, importance, important, importantly, importar, imported, importer, importers, importing, imports
      • add (CS)
          • Porter, Snowball: add, adde, addes, adds
          • KStem: add, addable, added, addes, adding, adds
          • MStem: add, addable, added, adder, adding, addition, additional, additionally, additions, additive, additivity, adds
          • Paice: ad, ada, add, addable, adde, added, adder, addes, adding, adds, ade, ads
      • name
          • Porter, Snowball: name, named, namely, names, naming
          • KStem: name, nameable, named, namer, names, …
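The accurate and complete criteria from slide 7 can be sketched as two set comparisons against a human-judged set of related words. The reference set below is an assumption: it takes MStem's class for "element" as the judged-correct one, on the reading that the "(MStem)" annotation in Table I marks the accurate & complete class. `judge` is a hypothetical helper name.

```python
def judge(word_class: set[str], related: set[str]) -> tuple[bool, bool]:
    """Return (accurate, complete) for a stemmer's word class.

    accurate: the class contains no unrelated words.
    complete: the class is not missing any related word.
    """
    accurate = word_class <= related
    complete = related <= word_class
    return accurate, complete


# Assumed human-judged reference for "element" (see lead-in).
related = {"element", "elemental", "elements"}

# Word classes taken from Table I.
classes = {
    "Porter": {"element", "elemental", "elemente", "elements"},
    "KStem": {"element"},
    "MStem": {"element", "elemental", "elements"},
}

for stemmer, word_class in classes.items():
    print(stemmer, judge(word_class, related))
```

Under this reference set, KStem's class is accurate but incomplete (an understemming symptom), while Porter's is complete but inaccurate because it admits "elemente".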
  • 9. Stemming and Source Code Search
      • search technique: tf-idf
      • search tasks: 8 with 48 queries from prior study [Shepherd, et al. ’07]
      • Paice: overstemming & understemming mistakes improved results for 2 tasks (e.g., textfield report element)
      • [Figure: Area Under the Curve (y-axis, 0.5 to 1.0) for NoStem, Porter, Snowball, KStem, MStem, and Paice]
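A minimal sketch of how a stemmer plugs into tf-idf ranking. The slide names tf-idf but not its exact weighting, so the formula here (term frequency times log of N over document frequency) is an assumption, as are the toy stemmer, the `rank` function, and the three hypothetical methods in `docs`.

```python
import math
from collections import Counter
from typing import Callable


def toy_stem(word: str) -> str:
    """Hypothetical light suffix stripper; a stand-in, not one of the evaluated stemmers."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def rank(query: str, docs: dict[str, list[str]],
         stem: Callable[[str], str]) -> list[tuple[str, float]]:
    """Rank documents by the summed tf-idf of the stemmed query terms."""
    stemmed = {name: Counter(stem(w.lower()) for w in words)
               for name, words in docs.items()}
    query_terms = {stem(w.lower()) for w in query.split()}
    n_docs = len(stemmed)
    scores: dict[str, float] = {}
    for name, counts in stemmed.items():
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in stemmed.values() if term in c)
            if df == 0 or total == 0:
                continue
            score += (counts[term] / total) * math.log(n_docs / df)
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical methods, each represented by the words of its identifiers.
docs = {
    "addItems": ["adds", "items", "cart"],
    "removeItem": ["remove", "item", "cart"],
    "drawBorder": ["draw", "border", "panel"],
}

print(rank("add item", docs, toy_stem))      # stemming lets "addItems" match both terms
print(rank("add item", docs, lambda w: w))   # without stemming, "addItems" matches nothing
```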
  • 10. Conclusion
      • Morphological stemmers appear to be more accurate & complete than rule-based
      • In search, stemming more consistently produces relevant results than not stemming
      • Heavy stemmers like MStem & Paice appear to be more effective in searching source code than light stemmers like Porter
      • Future work: more examples (less frequent & more domain-specific), more human judgements, more search tasks, other SE tasks beyond search