Processing Text

         Shilpa Shukla
       Graduate Student
School of Information, UT Austin
Indexing Process
Text Processing

● Goal: transform documents into index terms or
  features.
● Why do text processing?
   ○ Exact search is too restrictive
   ○ E.g. "computer hardware" doesn't match
     "Computer hardware"
● Easy to handle this example by converting to
  lowercase
● But search engines go much further!
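
A minimal sketch of the lowercase fix from this slide, assuming Python; the helper name `normalize` is hypothetical and only illustrates the idea of case folding before comparison.

```python
def normalize(text: str) -> str:
    """Case-fold the text so that matching ignores capitalization."""
    return text.casefold()


query = "computer hardware"
document = "Computer hardware"

# Exact string comparison fails because of the capital "C".
print(query == document)

# After case folding, the two strings match.
print(normalize(query) == normalize(document))
```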
Outline of presentation
● Text statistics
   ○ meaning of text often captured by occurrences and
     co-occurrences of words
   ○ understanding of text statistics is fundamental
● Text transformation
   ○ Tokenization
   ○ Stopping
   ○ Stemming
   ○ Phrases & N-grams
● Document structure
   ○ Web pages have structure (headings, titles, tags)
     that can be exploited to improve search
Text Statistics
● Luhn observed in 1958: significance of a word
  depends on its frequency in the document
● Statistical models of word occurrences are
  therefore very important in IR
● Most obvious statistical feature: distribution of
  word frequencies is skewed
   ○ only a few words have high frequencies ("of",
     "the" alone account for 10% of all occurrences)
   ○ most words have low frequencies
● This is nicely captured by Zipf's Law
Zipf's law: The rank r of a word times its
probability of occurrence Pr is approximately a constant
                  r * Pr = c
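
One way to check the law on a collection is to rank words by frequency and print r * Pr for each rank. The sketch below is only illustrative: the function name `zipf_table` and the sample text are assumptions, and a sample this small will not show a stable constant; a large corpus is needed for that.

```python
import re
from collections import Counter


def zipf_table(text, top=5):
    """Return (rank, word, frequency, rank * probability) for the top words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        prob = freq / total          # Pr: probability of occurrence
        rows.append((rank, word, freq, rank * prob))
    return rows


# Hypothetical sample text; replace with a real collection.
sample = "the cat and the dog and the bird saw the cat"
for rank, word, freq, r_times_p in zipf_table(sample):
    print(f"{rank:>2}  {word:<6} {freq:>3}  r*Pr = {r_times_p:.3f}")
```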
Text Transformation

● Tokenization
     ■ splitting words apart
● Stopping
     ■ ignoring some words
● Stemming
     ■ allowing similar words to match each other
       (like "run" and "running")
● Phrases and N-grams
     ■ storing sequences of words
Tokenizing
● Process of forming words (called tokens) from the
  sequence of characters in a document
● Simple for English but not for all languages (e.g.
  Chinese)
● Earlier IR systems: sequence of 3+ alphanumeric
  characters separated by space or special character
  was considered a word
● Example:
   ○ Input: "Bigcorp's 2007 bi-annual report showed profits rose 10%."
   ○ Tokens: "bigcorp 2007 annual report showed profits rose"
● Leads to too much information loss
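
The early rule described on this slide can be sketched in a few lines. This is an illustration under assumptions, not the code of any particular system: the function name `early_tokenize` and the `min_len` parameter are hypothetical.

```python
import re


def early_tokenize(text, min_len=3):
    """Split on non-alphanumeric characters, lowercase, drop short tokens."""
    candidates = re.split(r"[^A-Za-z0-9]+", text)
    return [t.lower() for t in candidates if len(t) >= min_len]


tokens = early_tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")
print(tokens)
# ['bigcorp', '2007', 'annual', 'report', 'showed', 'profits', 'rose']
```

The possessive "s", the prefix "bi" and the figure "10" all disappear, which is the information loss noted above.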
(Some) Tokenizing Problems
  Problem             Examples
  Small words         xp, world war II
  Hyphens             e-bay, mazda rx-7
  Capital letters     Bush, Apple
  Apostrophes         can't, 80's, kid's
  Numbers             nokia 3250, 288358
  Periods             I.B.M., Ph.D., ischool.utexas.edu
Steps in Tokenizing
● First: Identify parts of the document to be tokenized using a
  tokenizer and parser designed for a specific language.
● Second: Tokenize the relevant parts of the document
    ○ Defer complex decisions to other components
       ■ Identification of word variants - Stemmer
       ■ Recognizing that a string is a name or a date - Information
         Extractor
   ○ Retain capitalization and punctuation until information
     extraction has been done
● Examples of rules used with TREC
   ○ Apostrophes in words ignored
       ■ o’connor → oconnor, bob’s → bobs
   ○ Periods in abbreviations ignored
          ■ I.B.M. → ibm, Ph.D. → ph d
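
A rough sketch that reproduces the four examples above. The slide does not spell out the exact rules, so the heuristic here is an assumption: runs of single letters followed by periods are collapsed, and any other period becomes a space. The name `trec_style_normalize` is hypothetical.

```python
import re

# Assumed heuristic: two or more single letters, each followed by a period.
ABBREVIATION = re.compile(r"\b(?:[A-Za-z]\.){2,}")


def trec_style_normalize(token):
    """Drop apostrophes, collapse letter-dot abbreviations, lowercase."""
    token = re.sub(r"['’]", "", token)                 # o’connor -> oconnor
    token = ABBREVIATION.sub(
        lambda m: m.group(0).replace(".", ""), token)  # I.B.M. -> IBM
    token = token.replace(".", " ")                    # Ph.D. -> "Ph D "
    return " ".join(token.lower().split())


for raw in ["o’connor", "bob’s", "I.B.M.", "Ph.D."]:
    print(raw, "->", trec_style_normalize(raw))
```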
Stopping
● Gets rid of stopwords
   ○ determiners like a, an, the
   ○ prepositions like on, below, over
● Reasons to eliminate stopwords
   ○ Nearly all of the most frequent words fall in this
      category.
   ○ Do not convey relevant information on their own
● Stopping decreases index size, increases retrieval
  efficiency and generally improves effectiveness.
● Caution: Removing too many words might affect
  effectiveness
       ■ e.g. "Take That", "The Who"
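
The caution can be made concrete with a small sketch. The stopword list below is hypothetical and deliberately aggressive. The fallback in `stop_query`, which keeps the original query when every word would be removed, is one possible policy and not something the slides prescribe.

```python
# Hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "on", "below", "over", "take", "that", "who"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]


def stop_query(tokens, stopwords=STOPWORDS):
    """Apply stopping, but keep the query intact if nothing would survive."""
    kept = remove_stopwords(tokens, stopwords)
    return kept if kept else list(tokens)


print(remove_stopwords("the report on profits".split()))  # ['report', 'profits']
print(remove_stopwords("The Who".split()))                # [] : the query vanishes
print(stop_query("The Who".split()))                      # ['The', 'Who']
```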
Stopping continued

● Stopword list can be manually prepared from high-
  frequency words or based on a standard list.
● Lists are customized for applications, domains, and
  even parts of documents
   ○ e.g., “click” is a good stopword for anchor text
● Best policy is to index all words in documents and decide
  at query time which words to use
Stemming
● Captures the relationships between different variations
  of a word by reducing all the forms (inflection, derivation)
  in which a word can occur to a common stem
● Examples
      ■ is, be, was
      ■ ran, run
      ■ tweet, tweets
● Crucial for highly inflected languages (e.g. Arabic)
● There are three types of stemmers
      ■ Algorithm based: uses knowledge of word
        suffixes. e.g. Porter stemmer
      ■ Dictionary based: uses a pre-created dictionary
        of related terms
      ■ Hybrid approach: e.g. Krovetz stemmer
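
To show what "uses knowledge of word suffixes" means, here is a toy suffix stripper. It is not the Porter stemmer; the suffix list, the `min_stem` threshold and the doubled-consonant rule are simplifying assumptions chosen for illustration.

```python
# Toy suffix list, checked longest first. Not the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip one common suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "run", "tweets", "tweet", "ran", "was"]:
    print(w, "->", toy_stem(w))
```

Irregular forms such as "ran" or "was" are left unchanged by suffix rules alone, which is where a dictionary-based or hybrid stemmer helps.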
Phrases & N-grams
● Phrases are important as they are
   ○ More precise than single words
       ■ e.g. "World Wide Web"
   ○ Less ambiguous
       ■ e.g. "green bush", "bush"
● Ranking issue
● Text processing issue - recognizing phrases
● Three possible approaches for recognizing phrases
   ○ Parts Of Speech (POS) tagger
   ○ Store word positions in indexes and use proximity
     operators in queries (not covered here)
   ○ N-gram
Recognizing Phrases
● POS tagger
   ○ uses syntactic structure of sentence
        ■ sequences of nouns or
        ■ adjectives followed by nouns
   ○ too slow for large databases
● N-grams
   ○ uses a simpler definition of phrase
   ○ phrase is just a sequence of N words
        ■ 1 word - unigram
        ■ 2 words - bigram
        ■ 3 words - trigram
        ■ N words - N-gram
   ○ fits the Zipf distribution better than words alone
   ○ improves retrieval effectiveness, which is why it is used
   ○ takes up a lot of memory
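
A short sketch of word N-gram generation, assuming the text has already been tokenized; the function name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "world wide web search".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A text of T tokens yields T - N + 1 N-grams for each N, so indexing several values of N multiplies the number of stored terms, which is the memory cost mentioned above.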
Document Structure and Markup

● Some parts of a document are more important
● Document parser recognizes structure using markup
   ○ Title, Heading, Bold text
   ○ Anchor tags
   ○ Metadata
   ○ Links - used in ranking algorithms
Information Retrieval

From Wikipedia, the free encyclopedia

Information retrieval (IR) is the area of study concerned with
searching for documents, for information within documents, and for
metadata about documents, as well as that of searching relational
databases and the World Wide Web. There is overlap in the usage
of the terms data retrieval, document retrieval, information retrieval,
and text retrieval, but each also has its own body of literature, theory,
praxis, and technologies. IR is interdisciplinary, based on computer
science, mathematics, library science, information science,
information architecture, cognitive psychology, linguistics, and
statistics.




            Part of a Web page from Wikipedia
<html>
<head>

<title>Information retrieval - Wikipedia, the free encyclopedia</title>

…

<body>

    <h1 id="firstHeading" class="firstHeading">Information retrieval</h1>
<p><b>Information retrieval</b> (<b>IR</b>) is the area of study concerned with searching for documents, for
<a href="/wiki/Information" title="Information">information</a> within documents, and for
<a href="/wiki/Metadata_(computing)" title="Metadata (computing)" class="mw-redirect">metadata</a> about documents, as well as that of
searching <a href="/wiki/Relational_database" title="Relational database">relational databases</a> and the
<a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a>.
...
</body>
</html>




           HTML source for example Wikipedia page
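
A document parser for markup like this can be sketched with Python's standard `html.parser` module. The class name `StructureParser`, the choice of fields, and the shortened sample page are assumptions made for illustration; a production parser would need to handle far messier HTML.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collect the text found inside title, heading, bold and anchor tags."""

    FIELDS = {"title", "h1", "b", "a"}

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in self.FIELDS}
        self._open = []  # tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the most recently opened tag of this name.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        text = data.strip()
        if text:
            for tag in set(self._open):
                self.fields[tag].append(text)


# Shortened, hypothetical version of the page above.
SAMPLE = (
    "<html><head><title>Information retrieval - Wikipedia, "
    "the free encyclopedia</title></head><body>"
    "<h1>Information retrieval</h1>"
    "<p><b>Information retrieval</b> (<b>IR</b>) is the area of study "
    'concerned with searching for documents, for '
    '<a href="/wiki/Information">information</a> within documents.</p>'
    "</body></html>"
)

parser = StructureParser()
parser.feed(SAMPLE)
parser.close()
for field in sorted(parser.fields):
    print(field, "->", parser.fields[field])
```

Keeping the text of each field separate lets a ranking function weight a match in the title or a heading differently from a match in body text.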
Questions??

   Thanks!
