3. Text Processing
● Goal: transform documents into index terms or features.
● Why do text processing?
○ Exact search is too restrictive
○ E.g. "computer hardware" doesn't match
"Computer hardware"
● Easy to handle this example by converting to
lowercase
● But search engines go much further!
4. Outline of presentation
● Text statistics
○ meaning of text often captured by occurrences and
co-occurrences of words
○ understanding of text statistics is fundamental
● Text transformation
○ Tokenization
○ Stopping
○ Stemming
○ Phrases & N-grams
● Document structure
○ Web pages have structure (headings, titles, tags)
that can be exploited to improve search
5. Text Statistics
● Luhn observed in 1958: significance of a word
depends on its frequency in the document
● Statistical models of word occurrences are
therefore very important in IR
● Most obvious statistical feature: distribution of
word frequencies is skewed
○ only a few words have high frequencies ("of",
"the" alone account for 10% of all occurrences)
○ most words have low frequencies
● This is nicely captured by Zipf's Law
6. Zipf's Law
● The rank r of a word (ordered by decreasing frequency) times its probability of occurrence P_r is a constant:
r × P_r = c
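A minimal sketch in Python (not from the slides) illustrating the law: count word frequencies, rank words by frequency, and inspect r × P_r. The toy text and the helper name zipf_table are illustrative assumptions; a corpus this small will not show the constant cleanly.

from collections import Counter

def zipf_table(text):
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    # Rank words by decreasing frequency; rank starts at 1.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        p_r = freq / total
        print(f"{rank:>3}  {word:<10}  P_r={p_r:.4f}  r*P_r={rank * p_r:.4f}")

# On a realistically sized English corpus, r * P_r stays close to a constant
# (roughly 0.1 for English).
zipf_table("the cat sat on the mat and the dog sat on the rug")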
7. Text Transformation
● Tokenization
■ splitting text into words (tokens)
● Stopping
■ ignoring some words
● Stemming
■ allowing similar words to match each other
(like "run" and "running")
● Phrases and N-grams
■ storing sequences of words
8. Tokenizing
● Process of forming words, called tokens, from a sequence of characters
● Simple for English but not for all languages (e.g.
Chinese)
● In earlier IR systems, any sequence of 3+ alphanumeric characters delimited by a space or special character was considered a word
● Example:
○ Input: "Bigcorp's 2007 bi-annual report showed profits rose 10%."
○ Output: "bigcorp 2007 annual report showed profits rose"
● Leads to too much information loss
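A minimal sketch (an approximation of the rule described above, not any specific system's code) showing how the early "3+ alphanumeric characters" tokenizer produces exactly this loss:

import re

def old_style_tokenize(text):
    # Lowercase, then keep only runs of 3 or more alphanumeric characters.
    return re.findall(r"[a-z0-9]{3,}", text.lower())

print(old_style_tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
# -> ['bigcorp', '2007', 'annual', 'report', 'showed', 'profits', 'rose']
# "bi", "10%", and the possessive "'s" are all dropped.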
9. (Some) Tokenizing Problems
Problem          Examples
Small words      xp, world war II
Hyphens          e-bay, mazda rx-7
Capital letters  Bush, Apple
Apostrophes      can't, 80's, kid's
Numbers          nokia 3250, 288358
Periods          I.B.M., Ph.D., ischool.utexas.edu
10. Steps in Tokenizing
● First: Identify parts of the document to be tokenized using a
tokenizer and parser designed for a specific language.
● Second: Tokenize the relevant parts of the document
○ Defer complex decisions to other components
■ Identification of word variants - Stemmer
■ Recognizing that a string is a name or a date - Information Extractor
○ Retain capitalization and punctuation until information extraction has been done
● Examples of rules used with TREC
○ Apostrophes in words ignored
■ o’connor → oconnor, bob’s → bobs
○ Periods in abbreviations ignored
■ I.B.M. → ibm, Ph.D. → ph d
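A minimal sketch (my own approximation, not the actual TREC tokenizer) of the two rules above; the split_abbreviation heuristic is just one plausible reading of the I.B.M. / Ph.D. examples:

def strip_apostrophes(token):
    # o'connor -> oconnor, bob's -> bobs (handles straight and curly apostrophes)
    return token.lower().replace("'", "").replace("\u2019", "")

def split_abbreviation(token):
    # Split on periods; if every piece is a single letter, merge the pieces.
    parts = [p for p in token.lower().split(".") if p]
    if parts and all(len(p) == 1 for p in parts):
        return ["".join(parts)]       # i.b.m. -> ['ibm']
    return parts                      # ph.d.  -> ['ph', 'd']

print(strip_apostrophes("O'Connor"), strip_apostrophes("Bob's"))     # oconnor bobs
print(split_abbreviation("I.B.M."), split_abbreviation("Ph.D."))     # ['ibm'] ['ph', 'd']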
11. Stopping
● Gets rid of stopwords
○ articles like a, an, the
○ prepositions like on, below, over
● Reasons to eliminate stopwords
○ Nearly all of the most frequent words fall in this
category.
○ Do not convey relevant information on their own
● Stopping decreases index size, increases retrieval efficiency, and generally improves effectiveness.
● Caution: removing too many words can hurt effectiveness
■ e.g. the queries "Take That" and "The Who" consist almost entirely of stopwords
12. Stopping continued
● Stopword list can be manually prepared from high-
frequency words or based on a standard list.
● Lists are customized for applications, domains, and
even parts of documents
e.g., “click” is a good stopword for anchor text
● Best policy is to index all words in documents, make
decisions about which words to use at query time
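A minimal sketch of that query-time policy: index every word, and only drop stopwords when interpreting the query; the fallback rule (keep everything if stopping would empty the query) is an illustrative assumption, not the slides' prescription.

STOPWORDS = {"a", "an", "the", "on", "of", "to", "who"}

def query_terms(query):
    terms = query.lower().split()
    kept = [t for t in terms if t not in STOPWORDS]
    # If stopping would wipe out the query (e.g. "the who"), keep all terms.
    return kept if kept else terms

print(query_terms("the history of the web"))   # -> ['history', 'web']
print(query_terms("the who"))                  # -> ['the', 'who']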
13. Stemming
● Captures the relationships between different variations
of a word reducing all the forms (inflection, derivation)
in which a word can occur to a common stem
● Examples
■ is, be, was
■ ran, run
■ tweet, tweets
● Crucial for highly inflected languages (e.g. Arabic)
● There are three types of stemmers
■ Algorithm based: uses knowledge of word
suffixes. e.g. Porter stemmer
■ Dictionary based: uses a pre-created dictionary
of related terms
■ Hybrid approach: e.g. Krovetz stemmer
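A minimal sketch of algorithm-based stemming using NLTK's Porter stemmer; this assumes the nltk package is installed and is not part of the slides:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "running", "runs", "tweet", "tweets", "ran"]:
    print(word, "->", stemmer.stem(word))
# "running" and "runs" are reduced to "run", "tweets" to "tweet", but the
# irregular form "ran" is left unchanged - suffix stripping alone cannot
# conflate it with "run", which is one motivation for dictionary-based and
# hybrid stemmers such as Krovetz.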
14. Phrases & N-grams
● Phrases are important as they are
○ More precise than single words
■ e.g "World Wide Web"
○ Less ambiguous
■ e.g. "green bush", "bush"
● Two issues: ranking with phrases, and recognizing phrases during text processing
● Three possible approaches for recognizing phrases
○ Parts Of Speech (POS) tagger
○ Store word positions in indexes and use proximity
operators in queries (not covered here)
○ N-gram
15. Recognizing Phrases
● POS tagger
○ uses syntactic structure of sentence
■ sequences of nouns or
■ adjectives followed by nouns
○ too slow for large databases
● N-grams
○ uses a simpler definition of phrase
○ phrase is just a sequence of N words
■ 1 word - unigram
■ 2 words - bigram
■ 3 words - trigram
■ N words - N-gram
○ fits the Zipf distribution better than words alone
○ improves retrieval effectiveness, hence widely used
○ but takes up a lot of index space
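A minimal sketch of N-gram extraction, treating a phrase as any sequence of N consecutive words (no parser or POS tagger needed):

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the world wide web is big".split()
print(ngrams(tokens, 2))   # bigrams
# -> [('the', 'world'), ('world', 'wide'), ('wide', 'web'), ('web', 'is'), ('is', 'big')]
print(ngrams(tokens, 3))   # trigrams, e.g. ('world', 'wide', 'web')
# Indexing every such sequence is what makes N-gram indexes so large.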
16. Document Structure and Markup
● Some parts of a document are more important than others
● Document parser recognizes structure using markup
○ Title, Heading, Bold text
○ Anchor tags
○ Metadata
○ Links - used in ranking algorithms
17. Information Retrieval
From Wikipedia, the free encyclopedia
Information retrieval (IR) is the area of study concerned with
searching for documents, for information within documents, and for
metadata about documents, as well as that of searching relational
databases and the World Wide Web. There is overlap in the usage
of the terms data retrieval, document retrieval, information retrieval,
and text retrieval, but each also has its own body of literature, theory,
praxis, and technologies. IR is interdisciplinary, based on computer
science, mathematics, library science, information science,
information architecture, cognitive psychology, linguistics, and
statistics.
Part of a Web page from Wikipedia
18. <html>
<head>
<title>Information retrieval - Wikipedia, the free encyclopedia</title>
…
<body>
<h1 id="firstHeading" class="firstHeading">Information retrieval</h1>
<p><b>Information retrieval</b> (<b>IR</b>) is the area of study concerned with searching for documents, for <a
href="/wiki/Information" title="Information">information</a> within documents, and for <a href="/wiki/Metadata_
(computing)" title="Metadata (computing)" class="mw-redirect">metadata</a> about documents, as well as that of
searching <a href="/wiki/Relational_database" title="Relational database">relational databases</a> and the <a
href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a>.
...
</body>
</html>
HTML source for example Wikipedia page
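A minimal sketch (not a production document parser) using Python's built-in html.parser to pull out the parts a search engine treats as more important - title, headings, bold text, and anchor text - from HTML like the Wikipedia source above. The class name StructureParser and the shortened sample string are illustrative assumptions, and nested tags are handled only crudely.

from html.parser import HTMLParser

class StructureParser(HTMLParser):
    IMPORTANT = {"title", "h1", "h2", "h3", "b", "a"}

    def __init__(self):
        super().__init__()
        self.current = None
        self.fields = {}          # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.IMPORTANT:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields.setdefault(self.current, []).append(data.strip())

html_doc = """<html><head><title>Information retrieval - Wikipedia</title></head>
<body><h1>Information retrieval</h1>
<p><b>Information retrieval</b> (<b>IR</b>) is the area of study concerned with
searching for <a href="/wiki/Information">information</a> within documents.</p>
</body></html>"""

parser = StructureParser()
parser.feed(html_doc)
print(parser.fields)
# {'title': ['Information retrieval - Wikipedia'], 'h1': ['Information retrieval'],
#  'b': ['Information retrieval', 'IR'], 'a': ['information']}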