DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
FULL SYLLABUS - ALL UNIT QUESTIONS & ANSWERS
Course Code:C413 Year / Semester: IV / VII
Sub Code :CS6007 Sub Name :Information Retrieval
Faculty In-charge : A.Anandh
1. Define Information Retrieval.
Information Retrieval (IR) deals with the representation, storage and organization of
unstructured data. Information retrieval is the process of searching within a document
collection for a particular information need (a query).
2. How is AI applied in IR systems?
Four main roles are investigated:
1. Information characterization
2. Search formulation in information seeking
3. System Integration
4. Support functions
3. Give the historical view of Information Retrieval.
• Boolean model, statistics of language (1950’s)
• Vector space model, probabilistic indexing, relevance feedback (1960’s)
• Probabilistic querying (1970’s)
• Fuzzy set/logic, evidential reasoning (1980’s)
• Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)
4. Compare IR with Web Search.
• Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from a large collection (usually stored on
computers).
Web search is often not informational -- it might be navigational (give me the url of the site I want
to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop,
download a file, or find a map).
5. Define Heaps' Law.
Heaps' law is an empirical rule that describes the growth of the vocabulary as a function of the text size:

V = K * n^b

where V is the vocabulary size (the number of distinct words), n is the total number of words in the
text, K is a constant (typically 10 ≤ K ≤ 100), and b is a constant with 0 < b < 1 (typically around 0.5).
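As a rough sketch of Heaps' law in code (the constants K and b below are illustrative values, not fitted to any real corpus):

```python
# Minimal sketch of Heaps' law: V = K * n**b.
# K = 44.0 and b = 0.49 are illustrative, not fitted to a real corpus.
def heaps_vocabulary_size(n: int, K: float = 44.0, b: float = 0.49) -> int:
    """Estimate vocabulary size V for a text containing n total words."""
    return round(K * n ** b)

# Vocabulary grows sublinearly: doubling the text does not double the vocabulary.
for n in (10_000, 100_000, 1_000_000):
    print(n, heaps_vocabulary_size(n))
```

The sublinear exponent b is what makes index vocabulary manageable even for very large collections.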
6. What is open source software?
✓ Open source software is software whose source code is available for modification or
enhancement by anyone.
✓ There is usually no license cost; the software is free of charge.
✓ The source code is open and can be modified freely.
✓ Open standards.
7. Give any two advantages of using AI in IR.
✓ Artificial intelligence methods are employed throughout the standard information retrieval
process and for novel value added services.
✓ Text pre-processing steps for indexing, such as stemming, originate from artificial
intelligence.
✓ Neural networks have been applied widely in IR.
8. What are the applications of IR?
• Indexing
• Ranked retrieval
• Web search
• Query processing
9. What are the components of IR?(Nov/ Dec 2016)
• The document subsystem
• The indexing subsystem
• The vocabulary subsystem
• The searching subsystem
• The server-system interface
• The matching subsystem
10. How to introduce AI into IR systems?
• User simply enters a query, suggests what needs to be done, and the system executes the
query to return results.
• First signs of AI. System actually starts suggesting improvements to user.
• Full Automation. User queries are entered and the rest is done by the system.
11. What are the areas of AI for information retrieval?
• Natural language processing
• Knowledge representation
• Machine learning
• Computer Vision
• Reasoning under uncertainty
• Cognitive theory
12. Give the functions of information retrieval system.
• To identify the information(sources) relevant to the areas of interest of the target users
community
• To analyze the contents of the sources(documents)
• To represent the contents of the analyzed sources in a way that will be suitable for matching
user’s queries
• To analyze user’s queries and to represent them in a form that will be suitable for matching with
the database
• To match the search statement with the stored database
• To retrieve the information that is relevant
• To make necessary adjustments in the system based on feedback from the users.
13. List the issues in information retrieval system.
• Assisting the user in clarifying and analyzing the problem and determining information
needs.
• Knowing how people use and process information.
• Assembling a package of information that enables the user to come closer to a solution
of his problem.
• Knowledge representation.
• Procedures for processing knowledge/information.
• The human-computer interface.
• Designing integrated workbench systems.
14. What are some open source search frameworks?
• Google Search API
• Apache Lucene
• blekko API
• Carrot2
• Egothor
• Nutch
15. Define relevance.
Relevance appears to be a subjective quality, unique to the individual and a given
document, supporting the assumption that relevance can only be judged by the information
user. Its subjectivity and fluidity make it difficult to use as a measuring tool for system performance.
16. What is meant by stemming?
Stemming is a technique used to find the root/stem of a word. It is used to improve the
effectiveness of IR and text mining. Stemming usually refers to a crude heuristic process that chops off
the ends of words in the hope of achieving this goal correctly most of the time, and often includes the
removal of derivational affixes.
17. Define indexing & document indexing.
Indexing is the association of descriptors (keywords, concepts, metadata) with documents in view of future
retrieval. Document indexing is the process of associating or tagging documents with different "search"
terms: assign to each document (respectively, query) a descriptor represented by a set of features,
usually weighted keywords, derived from the document (respectively, query) content.
18. List Information retrieval models.(Nov/ Dec 2016)
• Boolean model
• Vector space model
• Statistical language model
16. Define web search and web search engine.
Web search is often not informational -- it might be navigational (give me the url of the site
I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop,
download a file, or find a map).
Web search engines crawl the Web, downloading and indexing pages in order to allow
full-text search. There are many general purpose search engines; unfortunately none of them come close
to indexing the entire Web. There are also thousands of specialized search services that index
specific content or specific sites.
17. What are the components of search engine?
Generally there are three basic components of a search engine as listed below:
1. Web Crawler
2. Database
3. Search Interfaces
18. Define web crawler.
This is the part of the search engine which combs through the pages on the internet and gathers the
information for the search engine. It is also known as spider or bots. It is a software component that
traverses the web to gather information.
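A minimal sketch of a crawler's traversal logic, using an in-memory dict of page links in place of real HTTP fetching and HTML parsing (the pages and links are invented for illustration):

```python
from collections import deque

# Minimal crawler sketch over an in-memory "web" (a dict of page -> links),
# standing in for real HTTP fetching and link extraction.
def crawl(web: dict, seed: str) -> list:
    """Breadth-first traversal starting from a seed page; returns visit order."""
    seen, order, frontier = {seed}, [], deque([seed])
    while frontier:
        page = frontier.popleft()
        order.append(page)  # a real crawler would download and index the page here
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(crawl(web, "a"))  # ['a', 'b', 'c']
```

The `seen` set is what keeps the spider from revisiting pages, which matters on the real Web where link graphs contain many cycles.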
19. What are search engine processes?
Indexing Process
• Text acquisition
• Text transformation
• Index creation
Query Process
• User interaction
• Ranking
• Evaluation
20. How to characterize the web?
Web can be characterized by three forms
• Search engines -AltaVista
• Web directories -Yahoo
• Hyperlink search-Web Glimpse
21. What are the challenges of web?
• Distributed data
• Volatile data
• Large volume
• Unstructured and redundant data
• Data quality
• Heterogeneous data
22. Zipf's law
Zipf's law describes the skewed distribution of frequent and rare words observed in text collections
such as the web: the collection frequency cf_i of the i-th most frequent term is proportional to 1/i.
Thus the most frequent term ("the") occurs cf_1 times, the second most frequent term ("of") occurs
about cf_1/2 times, the third most frequent term ("and") occurs about cf_1/3 times, and so on.
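A small illustration of this rank-frequency relationship on a toy word list (the text is invented for demonstration):

```python
from collections import Counter

# Zipf's law sketch: the collection frequency cf_i of the i-th most
# frequent term is roughly cf_1 / i.
def zipf_expected_frequency(cf1: float, rank: int) -> float:
    return cf1 / rank

text = ("the cat sat on the mat and the dog sat on the log " * 10).split()
counts = Counter(text).most_common()
cf1 = counts[0][1]  # frequency of the most frequent term
for rank, (term, cf) in enumerate(counts[:3], start=1):
    print(rank, term, cf, round(zipf_expected_frequency(cf1, rank), 1))
```

Real corpora follow the law only approximately, but the characteristic long tail of rare terms appears even in toy data.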
23. Jaccard coefficient
▪ A commonly used measure of the overlap of two sets.
▪ Let A and B be two sets. The Jaccard coefficient is
  JACCARD(A, B) = |A ∩ B| / |A ∪ B|
▪ JACCARD(A, A) = 1
▪ JACCARD(A, B) = 0 if A ∩ B = ∅
▪ A and B do not have to be the same size.
▪ It always assigns a number between 0 and 1.
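A minimal sketch of the Jaccard coefficient, computed as |A ∩ B| / |A ∪ B| over Python sets:

```python
def jaccard(A: set, B: set) -> float:
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    if not A and not B:
        return 1.0  # convention for two empty sets
    return len(A & B) / len(A | B)

print(jaccard({"ides", "of", "march"}, {"ides", "of", "march"}))  # 1.0
print(jaccard({"a", "b"}, {"c"}))                                 # 0.0
```

In IR this is typically applied to the term sets of a query and a document to get a crude, length-insensitive overlap score.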
24. Bag of words model
▪ We do not consider the order of words in a document.
▪ "John is quicker than Mary" and "Mary is quicker than John" are represented the same way.
▪ This is called a bag of words model.
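The point can be shown in a few lines: counting words with order discarded gives both example sentences the same representation.

```python
from collections import Counter

# Bag-of-words sketch: word order is discarded, only term counts remain.
def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

a = bag_of_words("John is quicker than Mary")
b = bag_of_words("Mary is quicker than John")
print(a == b)  # True
```

Losing word order is the price paid for a representation that vector space and probabilistic models can work with directly.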
25. Draw the architecture of a search engine
26. Compare web search and IR
1. Languages
   Web Search: Documents in many different languages. Search engines usually use full-text
   indexing; no additional subject analysis.
   IR: Databases usually cover only one language, or documents written in different languages
   are indexed with the same vocabulary.
2. File Types
   Web Search: Several file types, some hard to index because of a lack of textual information.
   IR: Usually all indexed documents have the same format (e.g. PDF), or only bibliographic
   information is provided.
27. Search Engine Classification
Search engines can be classified according to:
▪ the programming language in which they are implemented;
▪ how they store the index (inverted file, database, other file structure);
▪ searching capabilities (Boolean operators, fuzzy search, use of stemming);
▪ way of ranking, and the types of files they are capable of indexing (HTML, PDF, plain text);
▪ the possibility of on-line indexing and/or making incremental indexes.
28. Give some examples of search engine
Nutch, Lucene, ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE,
Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Omega,
OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS,
WebGlimpse, XML Query Engine, XMLSearch, Zebra.
29. Define Precision and Recall
Precision and recall are measures of retrieval performance.

               relevant documents retrieved
Recall    = --------------------------------------
             relevant documents in the collection

               relevant documents retrieved
Precision = --------------------------------------
                total documents retrieved
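The two ratios can be sketched directly from sets of document IDs (the IDs below are made up for illustration):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Compute (precision, recall) from retrieved and relevant document IDs."""
    hits = len(retrieved & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved docs are relevant; 3 of the 6 relevant docs were found.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
print(p, r)  # 0.75 0.5
```

Note the trade-off the example exposes: retrieving more documents can only raise recall, but usually lowers precision.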
30. What is peer-to-peer search?
A peer that joins the network does not only use resources, but also contributes resources
back. Hence a peer-to-peer network can potentially scale beyond what is possible in client-
server set-ups. Peer-to-peer (P2P) networks have been identified as promising architectural
concepts for integrating search facilities across Digital Library collections.
31. What are the performance measures for a search engine?
Precision and recall are the two most basic measures of the performance of information
retrieval systems.
32. Explain the difference between data retrieval and information retrieval.

Parameter  | Data Retrieval  | Information Retrieval
Example    | Database query  | WWW search
Matching   | Exact           | Partial match, best match
Inference  | Deduction       | Induction
Model      | Deterministic   | Probabilistic
33. Explain the types of natural language technology used in information retrieval.
Two types:
1. Natural language interfaces make the task of communicating with the information source
easier, allowing a system to respond to a range of inputs.
2. Natural language text processing allows a system to scan the source texts, either to retrieve
particular information or to derive knowledge structures that may be
used in accessing information from the texts.
34. What is a search engine?
A search engine is a document retrieval system designed to help find information
stored in a computer system, such as on the WWW. The search engine allows one to
ask for content meeting specific criteria and retrieves a list of items that match those
criteria.
35. What is conflation or stemming?
Stemming is the process of reducing inflected words to their stem, base or
root form, generally a written word form. The process of stemming is often called
conflation.
36. What is the invisible web?
Many dynamically generated sites are not indexable by search engines; this
phenomenon is known as the invisible web.
37. What is proprietary software?
Proprietary software is computer software which is the legal property of one
party. The terms of use for other parties are defined by contracts or licensing agreements.
These terms may include various privileges to share, alter, disassemble, and use the
software and its code.
38. What is closed software?
Closed software is a term for software whose license does not allow for the release or
distribution of the software's source code. Generally it means only the binaries of a
computer program are distributed, and the license provides no access to the program's source
code. The source code of such programs is usually regarded as a trade secret of the company.
Access to the source code by third parties commonly requires the party to sign a
non-disclosure agreement.
39. List the advantages of open source.
▪ The right to use the software in any way.
▪ There is usually no license cost; the software is free of charge.
▪ The source code is open and can be modified freely.
▪ Open standards.
▪ It provides higher flexibility.
40. List the disadvantages of open source.
▪ There is no guarantee that development will happen.
▪ It is sometimes difficult to know that a project exists, and its current status.
▪ No secured follow-up development strategy.
41. What do you mean by the Apache License?
The Apache License is a free software license written by the Apache Software
Foundation (ASF). The name Apache is a registered trademark and may only be used
with the trademark holder's express permission. Apache Lucene, distributed under this
license, is a high-performance, full-featured text search engine library written entirely in Java.
42. Explain the features of GPL version 2.
▪ It gives permission to copy and distribute the program's unmodified source code.
▪ It allows modifying the program's source code and distributing the modified source code.
▪ Users may distribute compiled versions of the program, both modified and unmodified.
▪ All modified copies must be distributed under the GPLv2.
▪ All compiled versions of the program must be accompanied by the relevant source code.
43. List out any four search engines.
Google, Bing, ASK, Alta Vista
Part B (2 x 13 = 26)
1. Discuss the characteristics of web in detail. (13)
In characterizing the structure and content of the Web, it is necessary to establish precise semantics for
Web concepts.
Measuring the Web
• The Internet, and the Web in particular, is dynamic in nature, so measuring it is a difficult task.
• Web explosion is due in no small part to the extended application of an axiom known as
Moore's Law. While ostensibly a prediction about semi-conductor innovation rates, this bit of
prophecy from Intel co-founder Gordon Moore has come to represent the doubling not just of
processing power, but of computing power in general.
Modeling the Web
• Heaps' and Zipf's laws are also valid on the Web. Normally the vocabulary grows faster and
the word distribution is more biased, but there are no such experiments on large Web
collections to measure these parameters.
2. Describe the various impacts of the Web on IR. (13)
Impact of the web
The first impact of the web on search is related to the characteristics of the document collection itself.
o The web is composed of pages distributed over millions of sites and connected through
hyperlinks
o This requires collecting all documents and storing copies of them in a central
repository, prior to indexing.
o This new phase in the IR process, introduced by the web, is called crawling
The second impact of the web on search is related to
o The size of the collection
o The volume of user queries submitted on a daily basis
o As a consequence, performance and scalability have become critical characteristics of the IR
system.
The third impact: in a very large collection, predicting relevance is much harder than before
o Fortunately the web also includes new sources of evidence
o Ex. hyperlinks and user clicks in documents in the answer set
The fourth impact derives from the fact that the web is also a medium to do business.
o The search problem has been extended beyond the seeking of text information to
also encompass other user needs, e.g. the price of a book, the phone number of a hotel
The fifth impact of the web on search is web spam
o Web spam: abusive availability of commercial information disguised in the form of
informational content.
o This difficulty is so large that today we talk of adversarial web retrieval.
3. Compare in detail Information Retrieval and Web Search with examples. (13)

1. Languages
   Web Search: Documents in many different languages. Search engines usually use full-text
   indexing; no additional subject analysis.
   IR: Databases usually cover only one language, or documents written in different languages
   are indexed with the same vocabulary.
2. File Types
   Web Search: Several file types, some hard to index because of a lack of textual information.
   IR: Usually all indexed documents have the same format (e.g. PDF), or only bibliographic
   information is provided.
3. Document length
   Web Search: Wide range from very short to very long. Longer documents are often divided
   into parts.
   IR: Document length varies, but not to such a high degree as with Web documents.
4. Document structure
   Web Search: HTML documents are semi-structured.
   IR: Structured documents allow complex field searching.
5. Spam
   Web Search: Search engines have to decide which documents are suitable for indexing.
   IR: Suitable document types are defined in the process of database design.
5. Demonstrate the role of Artificial Intelligence in Information Retrieval Systems. (14)
6. Explain in detail the components of Information Retrieval.
8. Describe the components of a search engine with neat diagram.
• The main components of a search engine are the crawler, indexer, search index, query engine,
and search interface.
• A web crawler is a software program that traverses web pages, downloads them for indexing,
and follows the hyperlinks that are referenced on the downloaded pages.
• A web crawler is also known as a spider, a wanderer or a software robot.
Fig: Crawling the web
• The second component is the indexer which is responsible for creating the search index from
the web pages it receives from the crawler.
• The third component is the search index .
• The search index is a data repository containing all the information the search engine needs to
match and retrieve web pages.
• The type of data structure used to organize the index is known as an inverted file.
• It is very much like an index at the back of a book.
• It contains all the words appearing in the web pages crawled, listed in alphabetical order (this
is called the index file), and for each word it has a list of references to the web pages in which
the word appears (this is called the posting list ).
• The search index will also store information pertaining to hyperlinks in a separate link
database, which allows the search engine to perform hyperlink analysis, which is used as part
of the ranking process of web pages.
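The inverted file described above (index file of terms plus posting lists of page references) can be sketched in a few lines; the documents are invented for illustration:

```python
from collections import defaultdict

# Sketch of an inverted file: a sorted index of terms, each with a
# posting list of the document IDs in which the term appears.
def build_inverted_index(docs: dict) -> dict:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # sort terms (the index file) and doc IDs (the posting lists)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
print(index["quick"])  # posting list for "quick": [1, 3]
print(index["dog"])    # [2, 3]
```

Answering a query then reduces to looking up each query term and intersecting or merging the posting lists, which is why the structure resembles an index at the back of a book.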
• The fourth component is the query engine.
• It is the interface between the search index, the user and the web.
• The algorithmic details of commercial search engines are kept as trade secrets.
• The query engine processes a user query in two steps.
• First step is retrieval of potential results from the index.
• Second step is the ranking of the results based on their “relevance” to the query.
• The fifth component is the search interface
• Once the query is processed, the query engine sends the results list to the search interface,
which displays the results on the user’s screen.
• From the usability point of view, it is important that users can distinguish between sponsored
links, which are ads, and organic results, which are ranked by the query engine.
9. Explain in detail the components of IR with neat sketch.
An information retrieval process begins when a user enters a query into the system. Queries are
formal statements of information need.
User queries are matched against the database information. Depending on the application the data
objects may be, for example, text documents, images, audio, mind maps or video
Most IR systems compute a numeric score on how well each object in the database matches the query,
and rank the objects according to this value.
The top ranking objects are then shown to the user. The process may then be iterated if the user
wishes to refine the query.
Three Major Components
1. Document Subsystem
a) Acquisition
10). Discuss the query likelihood model in detail and describe the approach for IR using this model.
A language model is a function that puts a probability over strings drawn from some vocabulary.
A language model M over an alphabet Σ satisfies

Σ_{s ∈ Σ*} P(s) = 1

i.e. the probabilities of all strings over the alphabet sum to one.
To compare two models for a data set, the likelihood ratio is used: the probability of the data
according to one model divided by the probability of the data according to the other model.
▪ Each document is treated as (the basis for) a language model.
▪ Given a query q
▪ Rank documents based on P(d|q); by Bayes' rule, P(d|q) = P(q|d) P(d) / P(q)
▪ P(q) is the same for all documents, so it can be ignored
▪ P(d) is the prior – often treated as the same for all d
▪ But we can give a prior to “high-quality” documents, e.g., those with high PageRank.
▪ P(q|d) is the probability of q given d.
▪ So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is
equivalent.
▪ In the LM approach to IR, we attempt to model the query generation process.
▪ Then we rank documents by the probability that a query would be observed as a random
sample from the respective document model.
▪ That is, we rank according to P(q|d).
Mixture Model
▪ A document model alone assigns zero probability to any query containing a term that does
not appear in the document.
▪ To avoid this, the document model is smoothed by mixing it with the collection model:
  P(t|d) = λ P(t|Md) + (1 − λ) P(t|Mc), where 0 < λ < 1
▪ This mixes the probability of the term in the document with its general frequency in the
whole collection.
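A toy sketch of query-likelihood ranking with Jelinek-Mercer-style smoothing, where each term's probability mixes the document model with the collection model (the documents, query, and λ = 0.5 are all illustrative):

```python
from collections import Counter

# Query-likelihood sketch with linear (Jelinek-Mercer) smoothing:
# P(q|d) = product over query terms t of  lam*P(t|Md) + (1-lam)*P(t|Mc)
def query_likelihood(query: list, doc: list, collection: list, lam: float = 0.5) -> float:
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)            # maximum-likelihood P(t|Md)
        p_coll = coll_tf[t] / len(collection)   # collection model P(t|Mc)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

docs = [["click", "go", "the", "shears"], ["metal", "shears", "rust"]]
collection = [t for d in docs for t in d]
for d in docs:
    print(d, query_likelihood(["shears", "rust"], d, collection))
```

The smoothing term is what keeps a document from scoring zero just because it is missing one query term; the document containing both "shears" and "rust" still ranks higher.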
11. Explain about relevance feedback and query expansions.
▪ Interactive relevance feedback: improve initial retrieval results by telling the IR system
which docs are relevant / irrelevant
▪ Best known relevance feedback method: Rocchio feedback
▪ Query expansion: improve retrieval results by adding synonyms / related terms to the
query
▪ Two ways of improving recall:
▪ Relevance feedback and Query expansion
▪ The user issues a (short, simple) query.
▪ The search engine returns a set of documents.
▪ User marks some docs as relevant, some as irrelevant.
▪ Search engine computes a new representation of the information need. Hope: better than
the initial query.
▪ Search engine runs new query and returns new results.
▪ New results have (hopefully) better recall.
▪ The Rocchio algorithm implements relevance feedback in the vector space model.
▪ Rocchio chooses the modified query that is closest to the centroid of the relevant
documents and farthest from the centroid of the nonrelevant documents:

  q_m = α q_0 + β (1/|Dr|) Σ_{dj ∈ Dr} dj − γ (1/|Dnr|) Σ_{dj ∈ Dnr} dj

▪ Dr: set of relevant docs; Dnr: set of nonrelevant docs
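A minimal sketch of the Rocchio update over plain term-weight vectors; the weights α = 1.0, β = 0.75, γ = 0.15 are commonly cited defaults, and the vectors are toy values:

```python
# Rocchio relevance-feedback sketch in the vector space model:
# q_m = alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant)
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(q0)
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    cr, cn = centroid(relevant), centroid(nonrelevant)
    # negative component weights are conventionally clipped to zero
    return [max(0.0, alpha * q + beta * r - gamma * n) for q, r, n in zip(q0, cr, cn)]

q0 = [1.0, 0.0, 0.0]                              # original query vector
relevant = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]     # docs the user marked relevant
nonrelevant = [[0.0, 0.0, 1.0]]                   # docs marked nonrelevant
print([round(x, 3) for x in rocchio(q0, relevant, nonrelevant)])  # [1.0, 0.75, 0.225]
```

The modified query gains weight on terms common in the relevant documents and loses weight on terms from the nonrelevant ones, which is exactly how the feedback moves the query toward the relevant region of the vector space.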
Query Expansion
▪ Query expansion is another method for increasing recall.
▪ We use “global query expansion” to refer to “global methods for query reformulation”.
▪ In global query expansion, the query is modified based on some global resource, i.e. a
resource that is not query-dependent.
▪ Main information we use: (near-)synonymy
▪ A publication or database that collects (near-)synonyms is called a thesaurus.
We will look at two types of thesauri: manually created and automatically created.
12. Explain briefly about Information Retrieval.
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within large
collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only a
few people engaged in: reference librarians, paralegals, and similar professional
searchers. Now the world has changed, and hundreds of millions of people engage in
information retrieval every day when they use a web search engine or search their
email. Information retrieval is fast becoming the dominant form of information access,
overtaking traditional database-style searching. IR can also cover other kinds of data
and information problems beyond that specified in the core definition above. The
term unstructured data refers to data which does not have clear, semantically overt,
easy-for-a-computer structure. It is the opposite of structured data, the canonical
example of which is a relational database, of the sort companies usually use to maintain
product inventories and personnel records. In reality, almost no data is truly unstructured,
given the latent linguistic structure of human languages. But even accepting that the intended
notion of structure is overt structure, most text has structure, such as headings and
paragraphs and footnotes, which is commonly represented in documents by explicit markup
(such as the coding underlying web pages). IR is also used to facilitate semi-structured
search such as finding a document where the title contains Java and the body contains threading.
The field of information retrieval also covers supporting users in browsing
or filtering document collections or further processing a set of retrieved documents.
Given a set of documents, clustering is the task of coming up with a good grouping of
the documents based on their contents. It is similar to arranging books on a bookshelf
according to their topic. Given a set of topics, standing information needs, or other
categories (such as suitability of texts for different age groups), classification is the task
of deciding which class(es), if any, each of a set of documents belongs to. It is often
approached by first manually classifying some documents and then hoping to be able to
classify new documents automatically.
Information retrieval systems can also be distinguished by the scale at which
they operate, and it is useful to distinguish three prominent scales. In web search, the
system has to provide search over billions of documents stored on millions of
computers. Distinctive issues include the need to gather documents for indexing, being able to
build systems that work efficiently at this enormous scale, and handling particular
aspects of the web, such as the exploitation of hypertext and not being fooled by site
providers manipulating page content in an attempt to boost their search engine
rankings, given the commercial importance of the web.
In the last few years, consumer operating systems have integrated information
retrieval; examples include Apple's Mac OS X Spotlight and Windows Vista's Instant Search.
Email programs usually not only provide search but also text classification: they at least
provide a spam (junk mail) filter, and commonly also provide either manual or
automatic means for classifying mail so that it can be placed directly into particular
folders. Distinctive issues here include handling the broad range of document types
on a typical personal computer, and making the search system maintenance-free and
sufficiently lightweight in terms of start-up, processing, and disk space usage that
it can run on one machine without annoying its owner. In between is the space of
enterprise, institutional, and domain-specific search, e.g. a database of patents,
or research articles on biochemistry.
13. Explain about the history of IR
The idea of using computers to search for relevant pieces of information was
popularized in the article "As We May Think" by Vannevar Bush in 1945. It would appear
that Bush was inspired by patents for a 'statistical machine' - filed by Emanuel
Goldberg in the 1920s and '30s - that searched for documents stored on film. The first
description of a computer searching for information was given by Holmstrom in
1948, detailing an early mention of the Univac computer.
Automated information retrieval systems were introduced in the 1950s: one
even featured in the 1957 romantic comedy, Desk Set. In the 1960s, the first large
information retrieval research group was formed by Gerard Salton at Cornell.
By the 1970s several different retrieval techniques had been shown to perform well
on small text corpora such as the Cranfield collection (several thousand documents).
Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
In 1992, the US Department of Defense along with the National Institute of
Standards and Technology (NIST), cosponsored the Text Retrieval Conference
(TREC) as part of the TIPSTER text program. The aim of this was to support research
within the information retrieval community by supplying the infrastructure that was needed
for evaluation of text retrieval methodologies on a very large text collection. This
catalyzed research on methods that scale to huge corpora. The introduction of web
search engines has boosted the need for very large scale retrieval systems
even further.
Timeline:
1950: The term "information retrieval" was coined by Calvin Mooers.
1951: Philip Bagley conducted the earliest experiment in computerized
document retrieval in a master thesis at MIT.
1955: Allen Kent joined from Western Reserve University published a paper in
American Documentation describing the precision and recall measures as well as
detailing a proposed "framework" for evaluating an IR system which included
statistical sampling methods for determining the number of relevant documents not
retrieved.
1959: Hans Peter Luhn published "Auto-encoding of documents for information
retrieval."
1963: Joseph Becker and Robert M. Hayes published text on information retrieval. Becker,
Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories.
New York, Wiley (1963).
1964:
• Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic
Classification, and continued work on computational linguistics as it applies to IR.
• The National Bureau of Standards sponsored a symposium titled "Statistical
Association Methods for Mechanized Documentation." Several highly
significant papers were presented, including G. Salton's first published reference
(we believe) to the SMART system.
mid-1960s:
• National Library of Medicine developed MEDLARS Medical Literature Analysis
and Retrieval System, the first major machine-readable database and
batch-retrieval system.
• Project Intrex at MIT.
1965: J. C. R. Licklider published Libraries of the Future.
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS
system and published the first edition of his text on information retrieval.
1968: Gerard Salton published Automatic Information Organization and Retrieval.
John W. Sammon, Jr.'s RADC Tech report "Some Mathematics of Information
Storage and Retrieval..." outlined the vector model.
1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE
Transactions on Computers) was the first proposal for a visualization interface to an IR
system.
1970s
Early 1970s: First online systems: NLM's AIM-TWX, MEDLINE; Lockheed's Dialog;
SDC's ORBIT.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of
hierarchic clustering in information retrieval", which articulated the "cluster
hypothesis."
1975: Three highly influential publications by Salton fully articulated his vector
processing framework and term discrimination model: A Theory of Indexing (Society
for Industrial and Applied Mathematics) A Theory of Term Importance in Automatic
Text Analysis (JASIS v. 26) A Vector Space Model for Automatic Indexing (CACM
18:11)
1978: The First ACM SIGIR conference.
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy
emphasis on probabilistic models.
1979: Tamas Doszkocs implemented the CITE natural language user interface for
MEDLINE at the National Library of Medicine. The CITE system supported free-form
query input, ranked output and relevance feedback.
1980s
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK
(Anomalous State of Knowledge) viewpoint for information retrieval. This was
an important concept, though their automated analysis tool proved ultimately
disappointing.
1983: Salton (and Michael J. McGill) published Introduction to Modern Information
Retrieval (McGraw-Hill), with heavy emphasis on vector space models.
mid-1980s: Efforts to develop end-user versions of commercial IR systems.
1989: First World Wide Web proposals by Tim Berners-Lee at CERN.
1992: First TREC conference.
1997: Publication of Korfhage's Information Storage and Retrieval with emphasis on
visualization and multi-reference point systems.
late 1990s: Web search engines implement many features formerly found
only in experimental IR systems. Search engines become the most common and
maybe the best instantiation of IR models.
2000s-present:
More applications, especially Web search and interactions with other fields
like Learning to rank, Scalability (e.g., MapReduce), Real-time search.
14. Explain about the Components of IR.
The following figure shows the architecture of an IR system.
Components:
• Text operations
• Indexing
• Searching
• Ranking
• User interface
• Query operations
Text operations:
Text operations form index words (tokens).
• Stop word removal, stemming
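A toy sketch of these two text operations; the stop-word list and suffix rules below are invented stand-ins for a real stop list and a real stemmer such as Porter's algorithm:

```python
# Toy text operations: stop word removal plus crude suffix stripping.
# STOP_WORDS and SUFFIXES are illustrative, not a real stop list or stemmer.
STOP_WORDS = {"the", "is", "are", "a", "of", "and"}
SUFFIXES = ("ing", "ies", "es", "s", "ed")

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def text_operations(text: str) -> list:
    """Produce index tokens: lowercase, drop stop words, stem the rest."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(text_operations("The engines are indexing the documents"))  # ['engin', 'index', 'document']
```

Even this crude version shows the payoff: "indexing" and "documents" collapse toward forms that will match queries like "index" or "document".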
Indexing:
Indexing constructs an inverted index of word to document pointers.
Searching:
Searching retrieves documents that contain a given query token from the
inverted index.
Ranking:
Ranking scores all retrieved documents according to a relevance metric.
User Interface:
User Interface manages interaction with the user:
• Query input and document output.
• Relevance feedback.
• Visualization of results.
Query Operations:
Query Operations transform the query to improve retrieval:
• Query expansion using a thesaurus.
• Query transformation using relevance feedback.
First of all, before the retrieval process can even be initiated, it is necessary to
define the text database. This is usually done by the manager of the database, who
specifies the following: (a) the documents to be used, (b) the operations to be
performed on the text, and (c) the text model (i.e., the text structure and what
elements can be retrieved). The text operations transform the original documents and
generate a logical view of them.
Once the logical view of the documents is defined, the database manager
builds an index of the text. An index is a critical data structure because it
allows fast searching over large volumes of data. Different index structures might be
used, but the most popular one is the inverted file. The resources (time and storage
space) spent on defining the text database and building the index are amortized by
querying the retrieval system many times.
Given that the document database is indexed, the retrieval process can be
initiated. The user first specifies a user need which is then parsed and transformed
by the same text operations applied to the text. Then, query operations might be
applied before the actual query, which provides a system representation for the user
need, is generated. The query is then processed to obtain the retrieved documents.
Fast query processing is made possible by the index structure previously built.
Before being sent to the user, the retrieved documents are ranked according to
a likelihood of relevance. The user then examines the set of ranked documents in the
search for useful information. At this point, he might pinpoint a subset of the
documents seen as definitely of interest and initiate a user feedback cycle. In such a
cycle, the system uses the documents selected by the user to change the query
formulation. Hopefully, this modified query is a better representation of the real
user need.
15. What are the Issues in IR?
1. To process large document collections quickly. The amount of online data has grown
at least as quickly as the speed of computers, and we would now like to be able to
search collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to
perform the query Romans NEAR countrymen with grep, where NEAR might be defined as
"within 5 words" or "within the same sentence".
3. To allow ranked retrieval: in many cases you want the best answer to an information
need among many documents that contain certain words.
The Big Issues
Information retrieval researchers have focused on a few key issues that remain
just as important in the era of commercial web search engines working with billions
of web pages as they were when tests were done in the 1960s on document
collections containing about 1.5 megabytes of text. One of these issues is relevance.
Relevance is a fundamental concept in information retrieval. Loosely
speaking, a relevant document contains the information that a person was looking
for when she submitted a query to the search engine. Although this sounds simple,
there are many factors that go into a person's decision as to whether a
particular document is relevant. These factors must be taken into account when
designing algorithms for comparing text and ranking documents. Simply comparing
the text of a query with the text of a document and looking for an exact match, as
might be done in a database system or using the grep utility in Unix, produces very
poor results in terms of relevance. One obvious reason for this is that language can be
used to express the same concepts in many different ways, often with very
different words. This is referred to as the vocabulary mismatch problem in
information retrieval. It is also important to distinguish between topical relevance
and user relevance. A text document is topically relevant to a query if it is on the
same topic.
User relevance takes these additional features of the story into account. To address
the issue of relevance, researchers propose retrieval models and test how well they
work. A retrieval model is a formal representation of the process of matching a query
and a document. It is the basis of the ranking algorithm that is used in a search engine
to produce the ranked list of documents. A good retrieval model will find documents
that are likely to be considered relevant by the person who submitted the query. Some
retrieval models focus on topical relevance, but a search engine deployed in a real
environment must use ranking algorithms that incorporate user relevance. An
interesting feature of the retrieval models used in information retrieval is that they
typically model the statistical properties of text rather than the linguistic structure.
This means, for example, that the ranking algorithms are typically far more
concerned with the counts of word occurrences than whether the word is a noun or an
adjective. More advanced models do incorporate linguistic features, but they tend to
be of secondary importance. The use of word frequency information to represent text
started with another information retrieval pioneer, H.P. Luhn, in the 1950s. This
view of text did not become popular in other fields of computer science, such as
natural language processing, until the 1990s.
Another core issue for information retrieval is evaluation. Since the
quality of a document ranking depends on how well it matches a person's
expectations, it was necessary early on to develop evaluation measures and
experimental procedures for acquiring relevance data and using it to compare ranking
algorithms. Cyril Cleverdon led the way in developing evaluation methods in the
early 1960s, and two of the measures he used, precision and recall, are still popular.
Precision is a very intuitive measure, and is the proportion of retrieved documents
that are relevant. Recall is the proportion of relevant documents that are retrieved.
When the recall measure is used, there is an assumption that all the relevant
documents for a given query are known. Such an assumption is clearly problematic in
a web search environment, but with smaller test collections of documents, this
measure can be useful. A test collection for information retrieval experiments
consists of a collection of text documents, a sample of typical queries, and a list of
relevant documents for each query (the relevance judgments). The best-known test
collections are those associated with the TREC evaluation forum. Evaluation of
retrieval models and search engines is a very active area, with much of the current
focus on using large volumes of log data from user interactions, such as clickthrough
data, which records the documents that were clicked on during a search session.
Clickthrough and other log data is strongly correlated with relevance, so it can be
used to evaluate search, but search engine companies still use relevance judgments in
addition to log data to ensure the validity of their results.
The third core issue for information retrieval is the emphasis on users and
their information needs. This should be clear given that the evaluation of search is
user centered. That is, the users of a search engine are the ultimate judges of
quality. This has led to numerous studies on how people interact with search engines
and, in particular, to the development of techniques to help people express their
information needs. An information need is the underlying cause of the query that a
person submits to a search engine. In contrast to a request to a database system, such
as for the balance of a bank account, text queries are often poor descriptions of what
the user actually wants. A one-word query such as "cats" could be a request for
information on where to buy cats or for a description of the Broadway musical. Despite
their lack of specificity, however, one-word queries are very common in web search. Techniques
such as query suggestion, query expansion, and relevance feedback use interaction
and context to refine the initial query in order to produce better ranked lists.
UNIT 2
1. Define an inverted index.
• An inverted index (also referred to as postings file or inverted file) is an index
data structure storing a mapping from content, such as words or numbers, to its
locations in a database file, or in a document or a set of documents.
• Its purpose is to allow fast full text searches, at a cost of increased processing
when a document is added to the database.
Term Document frequency → Postings lists
approach 1 → 3
breakthrough 1 → 1
drug 2 → 1 → 2
for 3 → 1 → 3 → 4
hopes 1 → 4
new 3 → 2 → 3 → 4
(dictionary on the left, postings lists on the right)
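The dictionary-and-postings layout above can be sketched in Python; this is a minimal in-memory version (the sample `docs` collection reuses the four documents from Part B, question 1):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a dictionary-and-postings structure: each term maps to its
    document frequency and a sorted postings list of docIDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: (len(ids), sorted(ids)) for term, ids in postings.items()}

# The four-document collection from Part B, question 1
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}
index = build_inverted_index(docs)
print(index["drug"])  # → (2, [1, 2])
print(index["for"])   # → (3, [1, 3, 4])
```

Adding a document only requires updating the postings sets of its terms, which is the "increased processing at insert time" the definition mentions.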
2. Discuss the process of stemming. Give example.
• Stemming is the process of reducing terms to their “roots” before indexing.
• “Stemming” suggests crude affix chopping
o It’s language dependent
o E.g., automate(s), automatic, automation - all reduced to automat.
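The "crude affix chopping" idea can be illustrated with a deliberately simple suffix stripper (this is only a sketch; real systems use the Porter or Snowball algorithms, which apply ordered rewrite rules rather than a flat suffix list):

```python
def crude_stem(word):
    """Crude suffix-stripping stemmer: remove the first matching suffix,
    longest first, as long as a stem of at least 3 letters remains."""
    for suffix in ("ing", "ion", "ics", "ic", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The example from the answer: all variants reduce to "automat"
print([crude_stem(w) for w in ("automate", "automates", "automatic", "automation")])
# → ['automat', 'automat', 'automat', 'automat']
```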
3. Compare information retrieval and web search.
Languages:
Web search – documents in many different languages.
Information retrieval – databases usually cover only one language, or documents
written in different languages are indexed with the same vocabulary.
Document structure:
Web search – HTML documents are semi-structured.
Information retrieval – structured documents allow complex field searching.
4. What do you mean by information retrieval models?
A retrieval model can be a description of either the computational process or the
human process of retrieval: The process of choosing documents for retrieval; the process
by which information needs are first articulated and then refined.
5. What is cosine similarity?
This metric is frequently used to determine the similarity between two documents.
Because it measures the cosine of the angle between the two term-weight vectors, it
normalizes for document length, which makes it more suitable than raw term-overlap
counts when documents share many words.
6. What is language model based IR?
A language model is a probabilistic mechanism for generating text. Language
models estimate the probability distribution of various natural language phenomena.
7. Define unigram language.
A unigram (1-gram) language model makes the strong independence assumption
that words are generated independently from a multinomial distribution.
8. What are the characteristics of relevance feedback?
It shields the user from the details of the query reformulation process.
It breaks down the whole searching task into a sequence of small steps which are
easier to grasp. It provides a controlled process designed to emphasize some terms
and de-emphasize others.
9.What are the assumptions of vector space model?
The degree of matching can be used to rank-order documents; this rank-ordering
corresponds to how well a document satisfies a user's information need.
10. What are the disadvantages of Boolean model?
It is not simple to translate an information need into a Boolean expression. Exact
matching may lead to retrieval of too many documents. The retrieved documents are not
ranked. The model does not use term weights.
11. Explain Luhn’s ideas
Luhn's basic idea was to use various properties of texts, including statistical
ones; it was critical in opening up the handling of input by computers for IR.
Automatic input joined the already automated output.
12. Define Latent semantic Indexing.
Latent Semantic Indexing is a technique that projects queries and documents into a
space with “latent” Semantic dimensions. It is statistical method for automatic indexing
and retrieval that attempts to solve the major problems of the current technology. It is
intended to uncover latent semantic structure in the data that is hidden. It creates a
semantic space where in terms and documents that are associated are placed near one
another.
13. State Bayes' Rule.
Bayes' rule relates conditional probabilities: P(A|B) = P(B|A) P(A) / P(B). In IR,
it is used to estimate the probability that a document is relevant given the terms
it contains.
14. How do you calculate the term weighting in document and query?
A common scheme is tf-idf: the weight of term t in document d is
wt,d = tft,d × log10(N/dft), where tft,d is the frequency of t in d, N is the number
of documents and dft is the document frequency of t. Query term weights are computed
the same way from the query's term frequencies.
15. What is Zone index?
Document titles and abstracts are generally treated as zones. A separate
inverted index can be built for each zone of a document.
16. List down the major retrieval models
• Boolean Exact Match
• Vector space Best Match
– Basic vector space
– Extended Boolean model
– Latent Semantic Indexing (LSI)
• Probabilistic models Best Match
– Basic probabilistic model
– Bayesian inference networks
– Language models
• Citation analysis models
– Hubs & authorities (Kleinberg, IBM Clever) Best Match
– Page rank (Google) Exact Match
17. Initial stages of text processing
Tokenization
• Cut character sequence into word tokens
Deal with “John’s”, a state-of-the-art solution
Normalization
• Map text and query term to same form
You want U.S.A. and USA to match
Stemming
• We may wish different forms of a root to match
authorize, authorization
Stop words
• We may omit very common words (or not)
the, a, to, of
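The four stages listed above can be sketched as a small pipeline (a minimal illustration; the stop list is a tiny subset, and normalization here just lowercases and strips punctuation so that "U.S.A." and "USA" map to the same form):

```python
STOP_WORDS = {"the", "a", "to", "of"}  # tiny illustrative stop list

def preprocess(text):
    """Tokenize on whitespace, normalize case and punctuation,
    then remove stop words."""
    result = []
    for token in text.split():
        token = token.lower().strip(",;:!?\"'")
        token = token.replace(".", "")   # normalization: "U.S.A." -> "usa"
        if token and token not in STOP_WORDS:
            result.append(token)
    return result

print(preprocess("You want U.S.A. and USA to match"))
# → ['you', 'want', 'usa', 'and', 'usa', 'match']
```

Stemming (question 2 above) would be applied as a further step on each surviving token.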
18. What is meant by Boolean Retrieval Model?
Boolean retrieval model is being able to ask a query that is a Boolean expression:
• Boolean Queries are queries using AND, OR and NOT to join query terms
Views each document as a set of words
Is precise: document matches condition or not.
• Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
Many search systems you still use are Boolean:
• Email, library catalog, Mac OS X Spotlight
19. Define inverse document frequency (idf)
Document frequency dft is the number of documents in which term t occurs. It is an
inverse measure of the informativeness of term t:
idft = log10(N/dft), where N is the total number of documents in the collection.
Part-B
1. Draw the term-document incidence matrix for this document collection. Draw the
inverted index representation for this collection.
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
Term‐document incidence matrix
It is an m × n matrix, where m is the number of distinct terms (words), one per row,
and n is the total number of documents, one per column.
Term Document1 Document2 Document3 Document4
approach 0 0 1 0
breakthrough 1 0 0 0
drug 1 1 0 0
for 1 0 1 1
hopes 0 0 0 1
new 0 1 1 1
of 0 0 1 0
patients 0 0 0 1
schizophrenia 1 1 1 1
treatment 0 0 1 0
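The matrix above can be reproduced with a few lines of Python (a minimal sketch; terms and documents follow the collection given in the question):

```python
def incidence_matrix(docs):
    """Term-document incidence matrix: one row per (sorted) term, one
    0/1 entry per document, 1 if the term occurs in that document."""
    doc_ids = sorted(docs)
    vocab = sorted({t for text in docs.values() for t in text.split()})
    return {term: [1 if term in docs[d].split() else 0 for d in doc_ids]
            for term in vocab}

docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}
matrix = incidence_matrix(docs)
for term, row in matrix.items():
    print(f"{term:14s} {row}")
```

Each printed row matches the corresponding row of the table above, e.g. `for` gives `[1, 0, 1, 1]`.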
Inverted index representation for this collection
Within a document collection, we assume that each document has a
unique serial number, the document identifier (docID)
a) List the normalized tokens for each document; b) sort the terms
alphabetically
Term docID Term docID
breakthrough 1 approach 3
drug 1 breakthrough 1
for 1 drug 1
schizophrenia 1 drug 2
new 2 for 1
schizophrenia 2 for 3
drug 2 for 4
new 3
=>
hopes 4
approach 3 new 2
for 3 new 3
treatment 3 new 4
of 3 of 3
schizophrenia 3 patients 4
new 4 schizophrenia 1
hopes 4 schizophrenia 2
for 4 schizophrenia 3
schizophrenia 4 schizophrenia 4
patients 4 treatment 3
c) Merge multiple occurrences of the same term
Record the frequency of occurrences of the term in the document Group instances of
the same term and split dictionary and postings
Term Document frequency → Postings lists
approach 1 → 3
breakthrough 1 → 1
drug 2 → 1 → 2
for 3 → 1 → 3 → 4
hopes 1 → 4
new 3 → 2 → 3 → 4
of 1 → 3
patients 1 → 4
schizophrenia 4 → 1 → 2 → 3 → 4
treatment 1 → 3
(The following incidence matrix belongs to question 2 below; it covers the two
Julius Caesar documents.)
Term Document1 Document2
ambitious 0 1
be 0 1
brutus 1 1
capitol 1 0
caesar 1 1
did 1 0
enact 1 0
hath 0 1
I 1 0
i' 1 0
it 0 1
julius 1 0
killed 1 0
let 0 1
me 1 0
noble 0 1
so 0 1
the 1 1
told 0 1
you 0 1
was 1 1
with 0 1
2. Draw the term-document incidence matrix for this document collection. Draw the
inverted index representation for this collection.
Doc 1: “I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.”
Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was
ambitious:”
Term‐document incidence matrix
It is an m × n matrix, where m is the number of distinct terms (words), one per row,
and n is the total number of documents, one per column.
Inverted index representation
Within a document collection, we assume that each document has a unique serial
number, the document identifier (docID)
a) List of normalized tokens b) Sort the terms
for each document alphabetically
Term docID Term docID
I 1 ambitious 2
did 1 be 2
enact 1 brutus 1
julius 1 brutus 2
caesar 1 caesar 1
I 1 caesar 2
was 1 caesar 2
killed 1 capitol 1
i' 1 did 1
the 1 enact 1
capitol 1 hath 2
brutus 1 I 1
killed 1
=>
I 1
me 1 i' 1
so 2 it 2
let 2 julius 1
it 2 killed 1
be 2 killed 1
with 2 let 2
caesar 2 me 1
the 2 noble 2
noble 2 so 2
brutus 2 the 1
hath 2 the 2
told 2 told 2
you 2 you 2
caesar 2 was 1
was 2 was 2
ambitious 2 with 2
c) Merge multiple occurrences of the same term
Record the frequency of occurrences of the term in the document Group
instances of the same term and split dictionary and postings
Term Document frequency → Postings lists
ambitious 1 → 2
be 1 → 2
brutus 2 → 1 → 2
caesar 2 → 1 → 2
capitol 1 → 1
did 1 → 1
enact 1 → 1
hath 1 → 2
I 1 → 1
i' 1 → 1
it 1 → 2
julius 1 → 1
killed 1 → 1
let 1 → 2
me 1 → 1
noble 1 → 2
so 1 → 2
the 2 → 1 → 2
told 1 → 2
you 1 → 2
was 2 → 1 → 2
3. What is search engine? Explain with diagrammatic illustration the components of a search
engine.
Search engine is a program that searches for and identifies items in a database that
correspond to keywords or characters specified by the user, used especially for
finding particular sites on the World Wide Web.
Search engine major functions:
Indexing process - builds the structures that enable searching
Query process - uses those structures and a person’s query to produce a
ranked list of documents
I. Indexing Process
The major components of the indexing process are text acquisition, text transformation, and index
creation.
Indexing Process
I.1. Text acquisition
• The task of the text acquisition component is to identify and make available the documents that
will be searched.
• It often requires building a collection by crawling or scanning the Web, a corporate intranet, a
desktop, or other sources of information.
• It creates a document data store, which contains the text and metadata for all the documents.
• Metadata is information about a document that is not part of the text content, such as the document
type (e.g., email or web page), document structure, and other features, such as document length.
I.2. Text transformation
• The text transformation component transforms documents into index terms or features.
• Index terms are the parts of a document that are stored in the index and used in searching.
• The simplest index term is a word, but not every word may be used for searching.
• A “feature” is more often used in the field of machine learning to refer to a part of a text document
that is used to represent its content, which also describes an index term.
• Examples of other types of index terms or features are phrases, names of people, dates, and links
in a web page.
• Index terms are sometimes referred to as “terms.” The set of all the terms that are indexed for a
document collection is called the index vocabulary.
I.3. Index creation
• The index creation component takes the output of the text transformation component and creates the
indexes or data structures that enable fast searching. It must be efficient in terms of time and space.
• Indexes must also be able to be efficiently updated when new documents are acquired. Inverted
indexes are the most common form of index.
• An inverted index contains a list for every index term of the documents that contain that index
term.
II. Query Process
The major components of the query process are user interaction, ranking, and evaluation.
Query Process
II.1. User interaction
• The user interaction component provides the interface between the person doing the searching and
the search engine.
• One task for this component is accepting the user’s query and transforming it into index terms.
Another task is to take the ranked list of documents from the search engine and organize it into the
results shown to the user.
• Example: generating the snippets used to summarize documents. The document data store is one of the
sources of information used in generating the results.
• This component also provides a range of techniques for refining the query so that it better represents
the information need.
II.2. Ranking
• The ranking component is the core of the search engine.
• It takes the transformed query from the user interaction component and generates a ranked list of
documents using scores based on a retrieval model.
• Ranking must be both efficient, since many queries may need to be processed in a short time, and
effective, since the quality of the ranking determines whether the search engine accomplishes the goal
of finding relevant information.
• The efficiency of ranking depends on the indexes, and the effectiveness depends on the retrieval
model.
II.3. Evaluation
• The task of the evaluation component is to measure and monitor effectiveness and efficiency.
• It records and analyzes user behaviour using log data.
• The results of evaluation are used to tune and improve the ranking component.
• Most of the evaluation component is not part of the online search engine, apart from logging user and
system data.
• Evaluation is primarily an offline activity, but it is a critical part of any search application.
4. Explain in detail about Boolean retrieval model
• The Boolean retrieval model is being able to ask a query that is a Boolean expression:
– Boolean Queries are queries using AND, OR and NOT to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
Types of Retrieval Models
Exact Match Vs Best Match Retrieval
Exact Match
• Query specifies precise retrieval criteria
• Every document either matches or fails to match query
• Result is a set of documents
– Usually in no particular order
– Often in reverse-chronological order
Best Match
• Query describes retrieval criteria for desired documents
• Every document matches a query to some degree
• Result is a ranked list of documents, “best” first
Term-document incidence matrices
• So we have a 0/1 vector for each term.
Brutus AND Caesar BUT NOT Calpurnia
1 if play contains word, 0 otherwise
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➔
bitwise AND.
– 110100 AND
– 110111 AND
– 101111 =
– 100100
Boolean IR Algorithm
D: Set of words present in a document
Each term is either present(1) or absent (0)
Q: A Boolean expression
Terms are index terms
Operators are AND, OR, and NOT
F: Boolean algebra over sets of terms and sets of documents
R: A document is predicted as related to a query expression, if it satisfies the query expression.
Each query term specifies a set of documents containing the term
AND( ^) – The intersection of 2 sets
OR(V) – The union of two sets
NOT (~) – Set inverse
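The set operations above can be sketched directly with Python sets. This is a minimal illustration; the postings below reproduce the 0/1 vectors from the incidence-matrix example (Brutus = 110100, Caesar = 110111, Calpurnia = 010000, over docIDs 1–6):

```python
def boolean_and(postings, *terms):
    """AND: intersect the postings sets of the given terms."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_not(postings, all_docs, term):
    """NOT: complement of a term's postings within the collection."""
    return all_docs - postings.get(term, set())

all_docs = {1, 2, 3, 4, 5, 6}
postings = {
    "brutus":    {1, 2, 4},        # 110100
    "caesar":    {1, 2, 4, 5, 6},  # 110111
    "calpurnia": {2},              # 010000
}
# Brutus AND Caesar BUT NOT Calpurnia  ->  100100  ->  docs {1, 4}
result = boolean_and(postings, "brutus", "caesar") & boolean_not(postings, all_docs, "calpurnia")
print(sorted(result))  # → [1, 4]
```

The result matches the bitwise AND worked out above: 110100 AND 110111 AND 101111 = 100100.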
Advantages:
• Easy to understand
• Clean formalism
• Predictable, easy to explain
• Boolean model can be extended to include ranking.
• Structured queries
• Works well when searchers knows exactly what is wanted
Disadvantages:
• Most people find it difficult to create good Boolean queries
– Difficulty increases with size of collection
6. Explain in detail about vector space model
• This model is the best known and most widely used model.
• Its advantage is that it provides a simple and appealing framework for implementing
term weighting, ranking and relevance feedback.
• The vector model proposes a framework in which partial matching is possible.
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimensionality = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
Document Collection
• A collection of n documents can be represented in the vector space model by a term-
document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means
the term has no significance in the document or it simply doesn’t exist in the document.
Term frequency tf
▪ The term frequency tft,d of term t in document d is defined as the number of times that t
occurs in d.
▪ We want to use tf when computing query-document match scores. But how?
▪ Raw term frequency is not what we want:
▪ A document with 10 occurrences of the term is more relevant than a document with
1 occurrence of the term.
▪ But not 10 times more relevant.
▪ Relevance does not increase proportionally with term frequency.
Document frequency
▪ Rare terms are more informative than frequent terms
▪ Consider a term in the query that is rare in the collection (e.g., arachnocentric)
▪ A document containing this term is very likely to be relevant to the query arachnocentric
▪ → We want a high weight for rare terms like arachnocentric.
▪ Frequent terms are less informative than rare terms
▪ Consider a query term that is frequent in the collection (e.g., high, increase, line)
▪ A document containing such a term is more likely to be relevant than a document that
doesn’t
▪ But it’s not a sure indicator of relevance.
▪ → For frequent terms, we want high positive weights for words like high, increase, and line
▪ But lower weights than for rare terms.
▪ We will use document frequency (df) to capture this.
idf weight:
▪ The idf weight scales down the term weights of terms with high collection
frequency, where collection frequency is defined as the total number of occurrences
of a term in the collection.
▪ To reduce the tf weight of a term by a factor that grows with its collection
frequency, document frequency dft is defined as the number of documents in the
collection that contain term t.
Collection vs. Document frequency
▪ The collection frequency of t is the number of occurrences of t in the collection, counting
multiple occurrences.
▪ The idf (inverse document frequency) of t is defined as log(N/dft); the logarithm
is used instead of N/dft to "dampen" the effect of idf.
Thus the idf of a rare term is high, whereas the idf of a frequent term is likely
to be low:
idft = log10(N/dft)
Tf-idf weighting:
• The combination of term frequency and inverse document frequency produce a composite
weight for each term in each document.
• The tf-idf weighting scheme assigns to term t a weight in document d given by:
tf-idft,d = tft,d × log10(N/dft)
• Highest when t occurs many times within a small number of documents
• Lower when the term occurs fewer times in a document, or occurs in many documents.
• Lowest when the term occurs in virtually all documents.
• Each document may also be viewed as a vector with one component corresponding to each
term in the dictionary, together with a weight for each component that is given by equation
above. For dictionary terms that do not occur in a document, this weight is zero.
• The score of document d is the sum, over all query terms, of the number of times
each of the query terms occurs in d.
• This can be refined by adding, instead of the raw occurrences of each term t in d,
the tf-idf weight of each term in d.
Cosine Similarity:
Cosine similarity is a measure of similarity between two non-zero vectors of an inner
product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is
less than 1 for any other angle.
The documents are ranked by computing the distance between the points representing the
documents and query. More commonly a similarity measure is used so that the documents with the
highest scores are the most similar to the query.
The numerator of this measure is the sum of the products of the term weights for the
matching query and document terms.
• Retrieval based on similarity between query and documents.
• Output documents are ranked according to similarity to query.
• Similarity based on occurrence frequencies of keywords in query and document.
• Automatic relevance feedback can be supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.
• A similarity measure is a function that computes the degree of similarity between two
vectors.
• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed relevance.
– It is possible to enforce a certain threshold so that the size of the retrieved set can be
controlled.
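The tf-idf weighting and cosine ranking described above can be sketched as follows. This is a minimal illustration, not an optimized implementation; it reuses the four-document collection from Part B and assumes log base 10 as in the idf definition above:

```python
import math

docs = {
    1: "breakthrough drug for schizophrenia".split(),
    2: "new schizophrenia drug".split(),
    3: "new approach for treatment of schizophrenia".split(),
    4: "new hopes for schizophrenia patients".split(),
}
vocab = sorted({t for d in docs.values() for t in d})
n_docs = len(docs)
df = {t: sum(t in d for d in docs.values()) for t in vocab}

def tfidf_vector(tokens):
    """One weight per vocabulary term: w = tf * log10(N / df)."""
    return [tokens.count(t) * math.log10(n_docs / df[t]) for t in vocab]

def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = tfidf_vector("schizophrenia drug".split())
scores = {d: cosine(query, tfidf_vector(toks)) for d, toks in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))  # → [2, 1, 3, 4]
```

Note that "schizophrenia" occurs in every document, so its idf (and hence its weight) is zero; the ranking is driven entirely by "drug", and the shorter document 2 outscores document 1 because cosine normalizes for length.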
7. Explain in detail about similarity calculation based on inner product
• Similarity between vectors for the document di and query q can be computed
as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj • q = Σi=1..t wij × wiq
where wij is the weight of term i in document j and wiq is the weight of term i in
the query.
• For binary vectors, the inner product is the number of matched query terms in
the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the
matched terms.
Properties of Inner Product
• The inner product is unbounded.
• Favors long documents with a large number of unique terms.
• Measures how many terms matched but not how many terms are not matched.
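The two cases described above (binary and weighted vectors) can be sketched in a few lines; the vectors here are made-up illustrative weights, not taken from a real collection:

```python
def inner_product(d, q):
    """Unnormalized similarity: the dot product of two weight vectors."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary vectors: the score is the number of matched query terms
d_binary = [1, 0, 1, 1, 0]
q_binary = [1, 1, 0, 1, 0]
print(inner_product(d_binary, q_binary))  # → 2

# Weighted vectors: the score is the sum of products of matched weights
d_weighted = [0.5, 0.0, 0.8]
q_weighted = [1.0, 0.3, 0.5]
print(inner_product(d_weighted, q_weighted))
```

Because nothing divides by the vector norms, the score grows with document length, which is exactly the "favors long documents" property noted above.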
8. For example consider four documents and the term document matrix for collection of
documents is
Document 3 for example is represented by the vector (1,1,0,2,0,1,0,1,0,0,1)
Queries are represented the same way as documents.
A query Q is represented by a vector of t weights.
Q=(q1,q2,……,qt),
where qj is the weight of the jth term in the query.
For example the Query was “tropical fish” then the vector representation of the query
would be (0,0,0,1,0,0,0,0,0,0,1).
9. Explain in detail about text and Information pre-processing
Before the documents in a collection are used for retrieval, some preprocessing tasks
are usually performed. For traditional text documents the tasks are stopword removal,
stemming, and the handling of digits, hyphens, punctuation and letter case.
For web pages, additional tasks such as HTML tag removal and identification of main content
blocks also require careful considerations.
• Stop word removal
• Stemming
• Text pre-processing
• Web Page pre-processing
1.Stopword Removal
o Stopwords are frequently occurring and insignificant words in a language that help
construct sentences but do not represent any content of the documents.
• Articles, prepositions, conjunctions and some pronouns are natural candidates.
Common stopwords in English include:
• a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the,
these, this, to, was, what, where, who, will, with
• Such words should be removed before documents are indexed and stored.
• Stopwords in the query are also removed before retrieval is performed.
2. Stemming:
• In many languages, a word has various syntactic forms depending on the context in
which it is used.
• For example, in English, nouns have plural forms, verbs have gerund forms (formed
by adding "ing"), and verbs used in the past tense differ from those in the present
tense.
• These are considered as syntactic variations of the same root form.
• Such variations cause low recall for a retrieval system because a relevant document may
contain a variation of a query word but not the exact word itself.
• This problem can be partially dealt with by stemming.
• Stemming refers to the process of reducing words to their stems or roots.
• A stem is a portion of a word that is left after removing its prefixes and suffixes.
• In English, most variants of a word are generated by the introduction of suffixes rather than
prefixes.
• Thus stemming in English usually means suffix removal or stripping.
• For example, "computer", "computing" and "compute" are reduced to "comput";
"walks", "walking" and "walker" are reduced to "walk".
• Stemming enables different variations of the word to be considered in retrieval, which
improves the recall.
• There are several stemming algorithms, also known as stemmers.
• Stemming increases recall and reduces the size of the indexing structure. However it may
hurt precision because many irrelevant documents may be considered relevant.
• For example, both “cop” and “cope” are reduced to the stem “cop”. However, if one is
looking for documents about police, a document that contains only “cope” is unlikely to be
relevant.
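A toy suffix-stripping stemmer illustrates the idea; real systems use the Porter stemmer or similar, and this crude suffix list is an assumption for illustration only.

```python
# A toy suffix-stripping stemmer (much cruder than the Porter stemmer;
# shown only to illustrate reducing variants to a common stem).
SUFFIXES = ["ing", "ers", "er", "es", "s"]

def crude_stem(word):
    word = word.lower()
    for suf in SUFFIXES:
        # Strip the first matching suffix, keeping a stem of at least 3 letters.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# "walks", "walking", "walker" all reduce to the same stem:
stems = {crude_stem(w) for w in ["walks", "walking", "walker"]}
# stems == {"walk"}
```

At retrieval time, both document terms and query terms are stemmed with the same function, so the variants match.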
3.Text preprocessing:
a.Digits:
Numbers and terms that contain digits are removed in traditional IR systems, except for some
specific types, e.g., dates, times, and other pre-specified types expressed with regular expressions.
In search engines, however, they are usually indexed.
b.Hyphens:
o Breaking hyphens is usually applied to deal with inconsistency of usage. For
example, some people write “state-of-the-art”, but others write “state of the art”.
o If the hyphens in the first case are removed, we eliminate the inconsistency problem.
However, some words may have a hyphen as an integral part of the word, e.g., “Y-21”.
Two types of removal:
1. Each hyphen is replaced with a space.
2. Each hyphen is simply removed without leaving a space.
Thus “state-of-the-art” may become “state of the art” or “stateoftheart”.
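The two removal strategies can be sketched in one line each:

```python
# The two hyphen-removal strategies described above.
term = "state-of-the-art"

with_space = term.replace("-", " ")    # strategy 1: "state of the art"
without_space = term.replace("-", "")  # strategy 2: "stateoftheart"
```

A robust system may index both forms so that either query variant matches.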
c) Punctuation marks: handled similarly to hyphens.
d) Case of letters:
All letters are usually converted to either upper or lower case.
Web Page pre-processing:
1. Identifying different text fields:
In HTML, there are different text fields, e.g., title, metadata and body. Identifying them
allows the retrieval system to treat terms in different fields differently. For example, in search
engines, terms that appear in the title field of a page are regarded as more important than terms that
appear in other fields and are assigned higher weights, because the title is usually a concise
description of the page.
In the body text, emphasized terms (e.g., under header tags <h1>, <h2>, ..., the bold tag <b>, etc.) are
also given higher weights.
2. Identifying anchor text:
Anchor text associated with a hyperlink is treated specially in search engines because it
often represents a more accurate description of the information contained in the page pointed to
by the link. When the hyperlink points to an external page, the anchor text is especially valuable
because it is a summary description of the page given by people other than the author/owner of the
page, and is thus more trustworthy.
3.Removing HTML tags:
The removal of HTML tags can be dealt with similarly to punctuation. One issue needs
careful consideration because it affects proximity queries and phrase queries: HTML is inherently a
visual presentation language.
4. Identifying main content block:
A typical web page, especially a commercial page, contains a large amount of information
that is not part of the main content of the page. For example, it may contain banner ads, navigation
bars, copyright notices, etc., which can lead to poor results for search and mining.
Two techniques for finding main content block in webpages
a) Partitioning based on visual cues:
o This method uses visual information to help find the main content blocks in the
page. Visual or rendering information for each HTML element in a page can be
obtained from the web browser.
• For example, Internet Explorer provides an API that can output the X and Y
coordinates of each element.
• A machine learning model can then be built based on the location and appearance
features for identifying the main content blocks of pages.
b) Tree matching
• This method is based on the observation that in most commercial web sites, pages are
generated using some fixed templates.
• The method thus aims to find these hidden templates. Since HTML has a nested
structure, it is easy to build a tag tree for each page.
• Tree matching of multiple pages from the same site can be performed to find such
templates.
9. Explain in detail about Probabilistic Approach based Information Retrieval
• Given a user information need (represented as a query) and a collection of documents
(transformed into document representations), a system must determine how well the
documents satisfy the query
• Boolean or vector space models of IR: query-document matching done in a formally defined
but semantically imprecise calculus of index terms
• An IR system has an uncertain understanding of the user query, and makes an uncertain
guess of whether a document satisfies the query
• Probability theory provides a principled foundation for such reasoning under uncertainty
• Probabilistic models exploit this foundation to estimate how likely it is that a document is
relevant to a query
Probabilistic IR Models
▪ Classical probabilistic retrieval model
▪ Probability ranking principle
▪ Binary Independence Model, BestMatch25 (Okapi)
▪ Bayesian networks for text retrieval
▪ Language model approach to IR
▪ Important recent work, competitive performance
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR
Basic Probability Theory
▪ For events A and B
▪ Joint probability P(A, B) of both events occurring
▪ Conditional probability P(A|B) of event A occurring given that event
B has occurred
▪ Chain rule gives the fundamental relationship between joint and
conditional probabilities:
P(A, B) = P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
10. Vector model example
D1: “Shipment of gold delivered in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Q: “gold silver truck”
❖ Let us assume a basic term-vector model in which we:
1. do not take into account WHERE the terms occur in documents (documents consist
of passages and passages consist of sentences);
2. remove stopwords;
3. do not reduce terms to root terms (no stemming);
4. use raw frequencies for terms and queries (unnormalized data).
Now we need to find similarity measures by using Similarity Methods.
❑ Inner Product (Dot Product)
➢ SC (Q,D1)=(0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) +
(0)(0) + (0)(0)=0.0309
➢ SC (Q,D2)=0.4862
➢ SC (Q,D3)=0.0620
Document ID Similarity score
D1 0.0309
D2 0.4862
D3 0.0620
Now, we rank the documents in a descending order according to their similarity score, as the
following:
Document ID Similarity score
D2 0.4862 (most relevant)
D3 0.0620
D1 0.0309
We can use threshold to retrieve documents above the value of that threshold
❑ Cosine
❖ For cosine method we must calculate the length of each documents and the length of
query as the following:
▪ Length of D1 = sqrt(0.477^2+0.477^2+0.176^2+0.176^2)= 0.7195
▪ Length of D2 = sqrt(0.176^2+0.477^2+0.954^2+0.176^2)= 1.095
▪ Length of D3 = sqrt(0.176^2+0.176^2+0.176^2+0.176^2)= 0.352
▪ Length of Q = sqrt(0.1761^2+0.477^2+0.1761^2)= 0.538
❖ Inner product for each document is:
▪ D1= 0.0309
▪ D2=0.4862
▪ D3=0.0620
❖ Then the similarity values are:
▪ cosSim(D1,Q) = 0.0309 / (0.719 * 0.538) = 0.0801
▪ cosSim(D2,Q) = 0.4862 / (1.095 * 0.538) = 0.8246
▪ cosSim(D3,Q) = 0.0620 / (0.352 * 0.538) = 0.3271
❖ Now, we rank the documents in a descending order according to their similarity score, as
the following
The cosine coefficient used here is:
cosSim(Di, Q) = (Σk dik * qk) / ( sqrt(Σk dik^2) * sqrt(Σk qk^2) )
Document ID Similarity score
D2 0.8246 (most relevant)
D3 0.3271
D1 0.0801
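The worked example above can be recomputed end to end. The sketch below assumes tf × log10(N/df) weighting and the stopword set {a, of, in}, which reproduce the section's numbers up to rounding.

```python
import math

# Recomputing the worked example: tf-idf weights with idf = log10(N/df),
# stopwords {a, of, in} removed, no stemming, raw term frequencies.
docs = {
    "D1": "shipment of gold delivered in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"
stop = {"a", "of", "in"}

def tokens(text):
    return [t for t in text.lower().split() if t not in stop]

# Document frequency of each term over the collection.
df = {}
for text in docs.values():
    for t in set(tokens(text)):
        df[t] = df.get(t, 0) + 1

N = len(docs)

def tfidf(text):
    # Raw term frequency times idf; assumes every term occurs in the collection.
    vec = {}
    for t in tokens(text):
        vec[t] = vec.get(t, 0) + 1
    return {t: tf * math.log10(N / df[t]) for t, tf in vec.items()}

q = tfidf(query)
dots, cosines = {}, {}
for d, text in docs.items():
    v = tfidf(text)
    dot = sum(w * q.get(t, 0.0) for t, w in v.items())
    norm = math.sqrt(sum(w * w for w in v.values())) * math.sqrt(sum(w * w for w in q.values()))
    dots[d], cosines[d] = dot, dot / norm

# dots    ≈ {D1: 0.031, D2: 0.486, D3: 0.062}
# cosines ≈ {D1: 0.080, D2: 0.825, D3: 0.327}  → ranking D2 > D3 > D1
```

Both measures produce the same ranking here, but cosine normalizes away document length, which matters when documents differ greatly in size.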
11. Explain in detail about Relevance feedback and Query expansion
▪ The user issues a (short, simple) query.
▪ The search engine returns a set of documents.
▪ User marks some docs as relevant, some as irrelevant.
▪ Search engine computes a new representation of the information need. Hope: better
than the initial query.
▪ Search engine runs new query and returns new results.
▪ New results have (hopefully) better recall.
▪ The centroid is the center of mass of a set of points.
▪ Recall that we represent documents as points in a high-dimensional space.
▪ Thus: we can compute centroids of documents.
▪ Definition:
μ(D) = (1/|D|) Σd∈D v(d)
where D is a set of documents and v(d) is the vector we use to
represent document d.
Rocchio algorithm
▪ The Rocchio algorithm implements relevance feedback in the vector space model.
Rocchio chooses the query qopt that maximizes
qopt = arg maxq [ sim(q, μ(Dr)) − sim(q, μ(Dnr)) ]
Dr : set of relevant docs; Dnr : set of nonrelevant docs
▪ Intent: qopt is the vector that separates relevant and nonrelevant docs maximally.
▪ Making some additional assumptions, we can rewrite the optimal query vector as:
qopt = μ(Dr) + [ μ(Dr) − μ(Dnr) ]
We move the centroid of the relevant documents by the difference between the two centroids.
In practice, the modified query actually used is:
qm = α q0 + β (1/|Dr|) Σdj∈Dr dj − γ (1/|Dnr|) Σdj∈Dnr dj
qm: modified query vector; q0: original query vector;
Dr and Dnr : sets of known relevant and irrelevant documents respectively; α, β, and γ: weights
▪ New query moves towards relevant documents and away from nonrelevant
documents.
▪ Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.
▪ Set negative term weights to 0.
▪ “Negative weight” for a term doesn’t make sense in the vector space model.
Positive vs. negative relevance feedback
▪ Positive feedback is more valuable than negative feedback.
▪ For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback.
▪ Many systems only allow positive feedback.
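The Rocchio update can be sketched over sparse dict vectors; the weights α = 1.0, β = 0.75, γ = 0.25 follow the suggestion above, negative weights are clipped to 0, and the example vectors are hypothetical.

```python
# Rocchio feedback sketch over sparse dict vectors (hypothetical example data).
def centroid(doc_vectors):
    # Component-wise mean of a set of sparse vectors.
    acc = {}
    for vec in doc_vectors:
        for term, w in vec.items():
            acc[term] = acc.get(term, 0.0) + w
    return {t: w / len(doc_vectors) for t, w in acc.items()}

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    mu_r = centroid(relevant) if relevant else {}
    mu_nr = centroid(nonrelevant) if nonrelevant else {}
    qm = {}
    for t in set(q0) | set(mu_r) | set(mu_nr):
        w = (alpha * q0.get(t, 0.0)
             + beta * mu_r.get(t, 0.0)
             - gamma * mu_nr.get(t, 0.0))
        if w > 0:  # negative weights make no sense in the vector space model
            qm[t] = w
    return qm

q0 = {"jaguar": 1.0}
rel = [{"jaguar": 0.5, "cat": 1.0}]
nonrel = [{"jaguar": 0.5, "car": 1.0}]
qm = rocchio(q0, rel, nonrel)
# "cat" gains weight; "car" gets a negative weight and is clipped out
```

The modified query now pulls toward the relevant centroid and away from the nonrelevant one.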
Relevance feedback: Problems
▪ Relevance feedback is expensive.
▪ Relevance feedback creates long modified queries.
▪ Long queries are expensive to process.
▪ Users are reluctant to provide explicit feedback.
▪ It’s often hard to understand why a particular document was retrieved after applying
relevance feedback.
▪ The search engine Excite had full relevance feedback at one point, but abandoned it
later.
Pseudo-relevance feedback
▪ Pseudo-relevance feedback automates the “manual” part of true relevance feedback.
▪ Pseudo-relevance algorithm:
▪ Retrieve a ranked list of hits for the user’s query
▪ Assume that the top k documents are relevant.
▪ Do relevance feedback (e.g., Rocchio)
▪ Works very well on average
▪ But it can go horribly wrong for some queries; several iterations can cause query drift.
Pseudo-relevance feedback at TREC4
▪ Cornell SMART system
▪ Results show number of relevant documents out of top 100 for 50 queries (so total
number of documents is 5000):
▪ Results contrast two length normalization schemes (L vs. l) and pseudo-relevance
feedback (PsRF).
▪ The pseudo-relevance feedback method used added only 20 terms to the query.
(Rocchio will add many more.)
▪ This demonstrates that pseudo-relevance feedback is effective on average.
Query expansion
▪ Query expansion is another method for increasing recall.
▪ We use “global query expansion” to refer to “global methods for query reformulation”.
▪ In global query expansion, the query is modified based on some global resource, i.e. a
resource that is not query-dependent.
▪ Main information we use: (near-)synonymy
▪ A publication or database that collects (near-)synonyms is called a thesaurus.
We will look at two types of thesauri: manually created and automatically created.
Types of user feedback
▪ User gives feedback on documents.
▪ More common in relevance feedback
▪ User gives feedback on words or phrases.
▪ More common in query expansion
Types of query expansion
▪ Manual thesaurus (maintained by editors, e.g., PubMed)
▪ Automatically derived thesaurus (e.g., based on co-occurrence statistics)
▪ Query-equivalence based on query log mining (common on the web as in the “palm”
example)
Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs
are relevant / nonrelevant
Best known relevance feedback method: Rocchio feedback
Query expansion: improve retrieval results by adding synonyms / related terms to the query
Sources for related terms: Manual thesauri, automatic thesauri, query logs
Indirect relevance feedback
▪ On the web, DirectHit introduced a form of indirect relevance feedback.
▪ DirectHit ranked documents that users looked at more often higher.
▪ Clicked on links are assumed likely to be relevant
▪ Assuming the displayed summaries are good, etc.
▪ Globally: Not necessarily user or query specific.
▪ This is the general area of clickstream mining
12. Explain in detail about language model
A language model is a function that puts a probability over strings drawn from some
vocabulary. A language model M over an alphabet Σ satisfies:
Σs∈Σ* P(s) = 1
The full set of strings that can be generated is called the language of the automaton.
Likelihood Ratio
To compare two models for a data set, divide the probability of the data according to one model by
the probability of the data according to the other model.
Unigram language model
Estimate each term independently; this simply throws away all conditioning context.
Puni(t1t2t3t4) = P(t1) P(t2) P(t3) P(t4)
Bigram language models
A more complex model: condition on the previous term.
Pbi(t1t2t3t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
Unigram models are more efficient to estimate and apply than higher order models.
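The two estimates above can be sketched with maximum-likelihood counts over a toy text (the text echoes the "I wish" automaton example in this section):

```python
# MLE estimates for the unigram and bigram models just described.
from collections import Counter

text = "i wish i wish i wish i wish".split()

uni = Counter(text)                    # unigram counts
bi = Counter(zip(text, text[1:]))      # bigram counts

def p_uni(t):
    # P(t) = count(t) / total tokens
    return uni[t] / len(text)

def p_bi(t, prev):
    # P(t | prev) = count(prev, t) / count(prev)
    return bi[(prev, t)] / uni[prev]

# Unigram: P(wish) = 4/8 = 0.5; Bigram: P(wish | i) = 4/4 = 1.0
```

The bigram model captures that "wish" always follows "i" here, which the unigram model cannot express.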
Using language models (LMs) for IR
❶ LM = language model
❷ We view the document as a generative model that generates the query.
❸ What we need to do:
❹ Define the precise generative model we want to use
❺ Estimate parameters (different parameters for each document’s model)
❻ Smooth to avoid zeros
❼ Apply to query and find document most likely to have generated the query
❽ Present most likely document(s) to user
❾ Note that steps ❹–❼ are pretty much what we did in Naive Bayes.
What is a language model?
We can view a finite state automaton as a deterministic language model.
I wish I wish I wish I wish . . . Cannot generate: “wish I wish” or “I wish I”.
Our basic model: each document was generated by a different automaton like this except that these
automata are probabilistic
Using language models in IR
▪ Each document is treated as (the basis for) a language model.
▪ Given a query q
▪ Rank documents based on P(d|q)
▪ P(q) is the same for all documents, so ignore
▪ P(d) is the prior – often treated as the same for all d
▪ But we can give a prior to “high-quality” documents, e.g., those with high
PageRank.
▪ P(q|d) is the probability of q given d.
▪ So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is
equivalent.
▪ In the LM approach to IR, we attempt to model the query generation process.
▪ Then we rank documents by the probability that a query would be observed as a random
sample from the respective document model.
▪ That is, we rank according to P(q|d).
▪ Next: how do we compute P(q|d)?
How to compute P(q|d)
▪ We will make the same conditional independence assumption as for Naive Bayes:
P(q|Md) = Π1≤k≤|q| P(tk|Md)
(|q|: length of q; tk : the token occurring at position k in q)
▪ This is equivalent to:
P(q|Md) = Πdistinct t in q P(t|Md)^tft,q
(tft,q: term frequency, i.e. # occurrences, of t in q)
▪ Multinomial model (omitting the constant factor)
Parameter estimation
▪ Missing piece: Where do the parameters P(t|Md). come from?
▪ Start with maximum likelihood estimates, as we did for Naive Bayes:
P(t|Md) = tft,d / |d|
(|d|: length of d; tft,d : # occurrences of t in d)
▪ As in Naive Bayes, we have a problem with zeros.
▪ A single t with P(t|Md) = 0 will make P(q|Md) zero.
▪ We would give a single term “veto power”.
▪ For example, for query [Michael Jackson top hits] a document about “top songs” (but not
using the word “hits”) would have P(t|Md) = 0. – That’s bad.
▪ We need to smooth the estimates to avoid zeros.
Mixture model
▪ P(t|d) = λP(t|Md) + (1 - λ)P(t|Mc)
▪ Mixes the probability from the document with the general collection frequency of the word.
▪ High value of λ: “conjunctive-like” search – tends to retrieve documents containing all
query words.
▪ Low value of λ: more disjunctive, suitable for long queries
▪ Correctly setting λ is very important for good performance.
What we model: The user has a document in mind and generates the query from this document.
▪ The equation represents the probability that the document that the user had in mind was in
fact this one.
Example
▪ Collection: d1 and d2
▪ d1 : Jackson was one of the most talented entertainers of all time
▪ d2: Michael Jackson anointed himself King of Pop
▪ Query q: Michael Jackson
▪ Use mixture model with λ = 1/2
▪ P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
▪ P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
▪ Ranking: d2 >d1
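The calculation above can be checked directly. This sketch assumes simple whitespace tokenization and maximum-likelihood estimates, as in the example:

```python
# Reproducing the mixture-model example with lambda = 1/2:
# P(q|d) = product over query terms of [ lam*P(t|Md) + (1-lam)*P(t|Mc) ].
d1 = "jackson was one of the most talented entertainers of all time".split()
d2 = "michael jackson anointed himself king of pop".split()
collection = d1 + d2
query = ["michael", "jackson"]
lam = 0.5

def p_mix(q, d):
    score = 1.0
    for t in q:
        p_doc = d.count(t) / len(d)                    # P(t|Md), MLE
        p_col = collection.count(t) / len(collection)  # P(t|Mc), MLE
        score *= lam * p_doc + (1 - lam) * p_col
    return score

# p_mix(query, d1) ≈ 0.003 and p_mix(query, d2) ≈ 0.013, so d2 ranks above d1
```

Smoothing with the collection model is what keeps d1 from scoring zero despite never containing "michael".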
Exercise 2
▪ Collection: d1 and d2
▪ d1 : Xerox reports a profit but revenue is down
▪ d2: Lucene narrows quarter loss but revenue decreases further
▪ Query q: revenue down
▪ Use mixture model with λ = 1/2
▪ P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
▪ P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
▪ Ranking: d1 > d2
13. Explain about the Web Characteristics
The essential feature that led to the explosive growth of the web (decentralized
content publishing with essentially no central control of authorship) turned out to be the
biggest challenge for web search engines in their quest to index and retrieve this content.
Web page authors created content in dozens of (natural) languages and thousands of
dialects, thus demanding many different forms of stemming and other linguistic operations.
Trust of the Web: The democratization of content creation on the web meant a new level of
granularity in opinion on virtually any subject. This meant that the web contained truth, lies,
contradictions and suppositions on a grand scale. This gives rise to the question: which
web page does one trust? In a simplistic approach, one might argue that some publishers are
trustworthy and others not, begging the question of how a search engine is to assign such a
measure of trust to each website or webpage.
Size: How big is the web? There is no easy answer.
Static vs. Dynamic: Static web pages are those whose content does not vary from one
request for that page to the next; for example, a professor who manually updates his web
pages serves static content. Dynamic pages are typically mechanically generated by an
application server in response to a query to a database.
Fig: Dynamic web page generation
The web graph: We can view the static Web consisting of static HTML pages together
with the hyperlinks between them as a directed graph in which each web page is a
node and each hyperlink a directededge.
Fig :Two nodes of web graph joined by link
Figure shows two nodes A and B from the web graph, each corresponding to a web page,
with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the
web graph. This text is generally encapsulated in the href attribute of the <a> (for anchor)
tag that encodes the hyperlink in the HTML code of page A, and is referred to as anchor text.
As one might suspect, this directed graph is not strongly connected: there are pairs of pages
such that one cannot proceed from one page of the pair to the other by following
hyperlinks. We refer to the hyperlinks into a page as in-links and those out of a page as
out-links. The number of in-links to a page (also known as its in-degree) has averaged from
roughly 8 to 15 in a range of studies. We similarly define the out-degree of a web page to be
the number of links out of it.
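In-degree and out-degree can be computed directly from an adjacency-list representation of the web graph; the four-page graph here is a made-up illustration.

```python
# In-degree and out-degree on a toy directed web graph
# (adjacency lists: page -> pages it links to; hypothetical example).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

# Out-degree: number of outgoing edges per page.
out_degree = {page: len(targets) for page, targets in links.items()}

# In-degree: count how often each page appears as a link target.
in_degree = {page: 0 for page in links}
for targets in links.values():
    for t in targets:
        in_degree[t] = in_degree.get(t, 0) + 1

# out_degree["A"] == 2, in_degree["C"] == 3, in_degree["D"] == 0
```

On the real web, the in-degree distribution follows the power law described below, with most pages having few in-links and a few pages having very many.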
Fig: Sample web graph.
There is ample evidence that these links are not randomly distributed; the distribution is
widely reported to be a power law, in which the total number of web pages with in-degree i
is proportional to 1/i^α.
The directed graph connecting web pages has a bowtie shape: there are three major
categories of web pages that are sometimes referred to as IN, OUT and SCC (Strongly
Connected Component). A web surfer can pass from any page in IN to any page in SCC,
by following hyperlinks. Likewise, a surfer can pass from page in SCC to any page in
OUT. Finally, the surfer can surf from any page in SCC to any other page in SCC.
However, it is not possible to pass from a page in SCC to any page in IN, or from a page
in OUT to a page in SCC. The remaining pages form into tubes that are small sets of
pages outside SCC that lead directly from IN to OUT, and tendrils that either lead
nowhere from IN, or from nowhere toOUT.
Fig: Bowtie Structure of the web
Spam:
Web search engines were an important means for connecting advertisers to
prospective buyers. A user searching for Maui golf real estate is not merely seeking news or
entertainment on the subject of housing on golf courses on the island of Maui, but instead
likely to be seeking to purchase such a property. Sellers of such property and their agents,
therefore, have a strong incentive to create web pages that rank highly on this query. In a
search engine whose scoring was based on term frequencies, a web page with numerous
repetitions of maui golf real estate would rank highly. This led to the first generation of
spam: the manipulation of web page content for the purpose of appearing high
up in search results for selected keywords.
Spammers resorted to such tricks as rendering these repeated terms in the same
colour as the background. Despite these words being consequently invisible to the human
user, a search engine indexer would parse the invisible words out of the
HTML representation of the web page and index these words as being present in the page.
14. Explain in detail about sparse vectors
• Vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4.
• However, most documents and queries do not contain most words, so vectors are sparse (i.e.
most entries are 0).
• Need efficient methods for storing and computing with sparse vectors.
• Store vectors as linked lists of non-zero-weight tokens paired with a weight.
• Space proportional to number of unique tokens (n) in document.
• Requires linear search of the list to find (or change) the weight of a specific token.
• Requires quadratic time in the worst case to compute the vector for a document:
Σi=1..n i = n(n+1)/2 = O(n^2)
Sparse Vectors as Trees
• Index the tokens in a document in a balanced binary tree, with weights stored with the tokens at the
leaves.
• Space overhead for tree structure: ~2n nodes.
• O(log n) time to find or update weight of a specific token.
• O(n log n) time to construct vector.
• Need software package to support such data structures.
Sparse Vectors as Hashtables
• Store tokens in a hashtable, with the token string as key and the weight as value.
• Storage overhead for hashtable ~1.5n.
• Table must fit in main memory.
• Constant time to find or update weight of a specific token (ignoring collisions).
• O(n) time to construct vector (ignoring collisions).
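The hashtable representation can be sketched with Python dicts: constant-time weight lookup, and a dot product that touches only non-zero entries (the example weights echo the tf-idf example earlier and are illustrative).

```python
# Sparse vectors stored as hash tables (dicts): only non-zero weights are kept.
def sparse_dot(u, v):
    # Iterate over the smaller vector for efficiency; missing terms weigh 0.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

d = {"gold": 0.176, "fire": 0.477, "shipment": 0.176}
q = {"gold": 0.176, "silver": 0.477, "truck": 0.176}

score = sparse_dot(d, q)  # only "gold" overlaps, so score ≈ 0.031
```

Storage is proportional to the number of non-zero entries rather than the full vocabulary size.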
Implementation Based on Inverted Files
Fig: Balanced binary tree over the tokens bit, film, memory, variable, with weights stored at the leaves.
• In practice, document vectors are not stored directly; an inverted organization provides much
better efficiency.
• The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-
based data structure (trie, B-tree).
• Critical issue is logarithmic or constant-time access to token information.
Inverted Index: Basic Structure
• Term list: a list of all terms
• Document node: a structure that contains information such as term frequency, document ID, and
others
• Posting list: for each term, a list containing document node for each document in which the term
appears
Creating an Inverted Index
Create an empty index term list I;
For each document, D, in the document set V:
    For each (non-zero) token, T, in D:
        If T is not already in I:
            Insert T into I;
        Find the location for T in I;
        If (T, D) is in the posting list for T:
            Increase its term frequency for T;
        Else:
            Create (T, D);
            Add it to the posting list for T;
Computing IDF
Let N be the total number of documents;
For each token, T, in I:
    Determine the total number of documents, M, in which T occurs
    (the length of T’s posting list);
    Set the IDF for T to log(N/M);
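The two pseudocode procedures above translate almost line for line into Python; the document IDs and tokens here are illustrative.

```python
import math

# Build an inverted index (term -> postings as {doc_id: tf}) and compute IDF.
def build_index(docs):
    # docs: {doc_id: list of tokens}
    index = {}
    for doc_id, toks in docs.items():
        for t in toks:
            postings = index.setdefault(t, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index

def compute_idf(index, n_docs):
    # IDF(t) = log(N / M), where M is the length of t's postings list.
    return {t: math.log10(n_docs / len(postings))
            for t, postings in index.items()}

docs = {
    "D1": ["gold", "shipment", "gold"],
    "D2": ["silver", "truck"],
    "D3": ["gold", "truck"],
}
index = build_index(docs)
idf = compute_idf(index, len(docs))
# index["gold"] == {"D1": 2, "D3": 1}; idf["silver"] == log10(3/1) ≈ 0.477
```

A production index would also sort postings by document ID and store them compressed, but the structure is the same.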
Retrieval with an Inverted Index
Fig: Inverted index structure. The term list stores each index term with its document frequency (system: 3, computer: 2, database: 4, science: 1); each term points to a postings list of (Dj, tfj) pairs, e.g. (D2, 4), (D5, 2), (D1, 3), (D7, 4).
• Tokens that are not in both the query and the document do not affect cosine similarity.
– Product of token weights is zero and does not contribute to the dot product.
• Usually the query is fairly short, and therefore its vector is extremely sparse.
Use inverted index to find the limited set of documents that contain at least one of the query words.
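Putting the pieces together, retrieval with an inverted index can proceed term-at-a-time: only documents containing at least one query term are ever touched, exploiting query sparsity. The index weights below are illustrative.

```python
# Term-at-a-time scoring with an inverted index.
def retrieve(index, query_weights):
    # index: term -> {doc_id: doc_term_weight}; query_weights: term -> weight.
    scores = {}
    for term, qw in query_weights.items():
        for doc_id, dw in index.get(term, {}).items():
            # Accumulate the dot-product contribution of this term.
            scores[doc_id] = scores.get(doc_id, 0.0) + qw * dw
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = {
    "gold":   {"D1": 0.176, "D3": 0.176},
    "silver": {"D2": 0.954},
    "truck":  {"D2": 0.176, "D3": 0.176},
}
query = {"gold": 0.176, "silver": 0.477, "truck": 0.176}

ranking = retrieve(index, query)
# D2 scores highest: 0.477*0.954 + 0.176*0.176 ≈ 0.486
```

Dividing each accumulated score by the document and query lengths would turn these inner products into the cosine scores discussed earlier.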
15. Latent semantic indexing – Explain
▪ Term-document matrices are very large
▪ But the number of topics that people talk about is small (in some sense)
➢ Clothes, movies, politics, …
▪ Can we represent the term-document space by a lower dimensional latent space?
▪ Develop a class of operations from linear algebra, known as matrix decomposition.
▪ Examine the application of low-rank approximations to indexing and retrieving documents, a
technique referred to as latent semantic indexing.
▪ The value of latent semantic indexing for retrieval has not been firmly established; it remains an intriguing approach to clustering.
▪ Let C be an M × N matrix with real-valued entries.
▪ For a term-document matrix, all entries are non-negative.
▪ The rank of the matrix is the number of linearly independent rows (or columns ) in it.
▪ Rank (C) ≤ min (M,N)
Eigenvalues & Eigenvectors
▪ Eigenvectors (for a square M × M matrix S):
S v = λ v
where v is a (right) eigenvector and λ is its eigenvalue.
▪ How many eigenvalues are there at most? The equation (S − λI) v = 0
only has a non-zero solution if det(S − λI) = 0.
▪ This is an Mth-order equation in λ which can have at most M distinct solutions
(the roots of the characteristic polynomial); they can be complex even though S is
real.
▪ Example: S = diag(30, 20, 1) has eigenvalues 30, 20, 1 with
corresponding eigenvectors v1 = (1, 0, 0)ᵀ, v2 = (0, 1, 0)ᵀ, v3 = (0, 0, 1)ᵀ.
On each eigenvector, S acts as a multiple of the identitymatrix: but as a different multiple on each.
Matrix-vector multiplication
▪ Thus a matrix-vector multiplication such as Sx (S, x as above) can be rewritten in
terms of the eigenvalues/vectors: if x = a1 v1 + a2 v2 + a3 v3, then
S x = a1 λ1 v1 + a2 λ2 v2 + a3 λ3 v3.
▪ Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors.
For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal.
All eigenvalues of a real symmetric matrix are real.
All eigenvalues of a positive semidefinite matrix are non-negative.
Any vector (say x = (2, 4, 6)ᵀ) can be viewed as a combination of the eigenvectors: x = 2v1 + 4v2 + 6v3
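The decomposition can be verified numerically. This sketch assumes S = diag(30, 20, 1), consistent with the eigenvalues listed earlier, so the eigenvectors are the standard basis vectors.

```python
# Verifying the eigen-decomposition numerically for the (assumed) diagonal
# example S = diag(30, 20, 1): S acts as a different multiple along each
# eigenvector of x = 2*v1 + 4*v2 + 6*v3.
S = [[30, 0, 0],
     [0, 20, 0],
     [0,  0, 1]]
x = [2, 4, 6]

def matvec(M, v):
    # Plain matrix-vector product.
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

Sx = matvec(S, x)
# Sx == [60, 80, 6] == 2*30*v1 + 4*20*v2 + 6*1*v3
```

Each coordinate of x is scaled by the eigenvalue of the corresponding eigenvector, which is exactly the behavior latent semantic indexing exploits when it discards the directions with small eigenvalues.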
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibEl Habib NFAOUI
 
3 Understanding Search
3 Understanding Search3 Understanding Search
3 Understanding Searchmasiclat
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineSalford Systems
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptxHabtamu100
 

Similar a CS6007 IR SYLLABUS (20)

Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
 
Unit 1
Unit 1Unit 1
Unit 1
 
CS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdfCS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdf
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
 
An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured Data
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas Workshop
 
Web mining
Web miningWeb mining
Web mining
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
 
Evaluation of Web Scale Discovery Services
Evaluation of Web Scale Discovery ServicesEvaluation of Web Scale Discovery Services
Evaluation of Web Scale Discovery Services
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
3 Understanding Search
3 Understanding Search3 Understanding Search
3 Understanding Search
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptx
 

Último

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 

Último (20)

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 

CS6007 IR SYLLABUS

✓ Open source software is software whose source code is available for modification or enhancement by anyone.
✓ There is usually no license cost; it is free of cost.
✓ The source code is open and can be modified freely.
✓ Open standards.

7. Give any two advantages of using AI in IR.
✓ Artificial intelligence methods are employed throughout the standard information retrieval process and for novel value-added services.
✓ Text pre-processing steps for indexing, such as stemming, come from artificial intelligence.
✓ Neural networks have been applied widely in IR.

8. What are the applications of IR?
• Indexing
• Ranked retrieval
• Web search
• Query processing

9. What are the components of IR? (Nov/Dec 2016)
• The document subsystem
• The indexing subsystem
• The vocabulary subsystem
• The searching subsystem
• The user-system interface
• The matching subsystem

10. How is AI introduced into IR systems?
• The user simply enters a query, suggests what needs to be done, and the system executes the query to return results.
• First signs of AI: the system starts suggesting improvements to the user.
• Full automation: user queries are entered and the rest is done by the system.

11. What are the areas of AI for information retrieval?
• Natural language processing
• Knowledge representation
• Machine learning
• Computer vision
• Reasoning under uncertainty
• Cognitive theory

12. Give the functions of an information retrieval system.
• To identify the information (sources) relevant to the areas of interest of the target user community
• To analyze the contents of the sources (documents)
• To represent the contents of the analyzed sources in a way that is suitable for matching users' queries
• To analyze users' queries and represent them in a form suitable for matching with the database
• To match the search statement with the stored database
• To retrieve the information that is relevant
• To make necessary adjustments in the system based on feedback from the users
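The matching function listed above can be illustrated with a minimal sketch. The documents and the term-overlap (OR-style) match below are hypothetical illustrations, not the full matching model of an IR system:

```python
# Minimal sketch of the analyse/represent/match/retrieve functions above.
# Documents are analysed into term sets; a query is represented the same
# way; matching returns every document sharing at least one query term.

def analyse(text):
    # Content analysis step: lowercase and split into terms.
    return set(text.lower().split())

def retrieve(query, collection):
    query_terms = analyse(query)
    # Match the search statement against each stored representation
    # (OR semantics: any shared term counts as a match).
    return [doc_id for doc_id, text in collection.items()
            if query_terms & analyse(text)]

collection = {
    "d1": "information retrieval deals with unstructured data",
    "d2": "databases store structured data",
    "d3": "web search engines crawl and index pages",
}

print(retrieve("information retrieval", collection))  # → ['d1']
```

A real system would add the remaining functions (indexing for speed, ranking, and relevance feedback) on top of this skeleton.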
13. List the issues in an information retrieval system.
• Assisting the user in clarifying and analyzing the problem and determining information needs.
• Knowing how people use and process information.
• Assembling a package of information that enables the user to come closer to a solution of the problem.
• Knowledge representation.
• Procedures for processing knowledge/information.
• The human-computer interface.
• Designing integrated workbench systems.

14. What are some open source search frameworks?
• Google Search API
• Apache Lucene
• blekko API
• Carrot2
• Egothor
• Nutch

15. Define relevance.
Relevance appears to be a subjective quality, unique between the individual and a given document, supporting the assumption that relevance can only be judged by the information user. This subjectivity and fluidity make relevance difficult to use as a measuring tool for system performance.

16. What is meant by stemming?
Stemming is a technique used to find the root/stem of a word; it is used to improve the effectiveness of IR and text mining. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

17. Define indexing and document indexing.
Indexing is the association of descriptors (keywords, concepts, metadata) to documents in view of future retrieval. Document indexing is the process of associating or tagging documents with different "search" terms: assign to each document (respectively, query) a descriptor represented as a set of features, usually weighted keywords, derived from the document (respectively, query) content.

18. List information retrieval models. (Nov/Dec 2016)
• Boolean model
• Vector space model
• Statistical language model

16. Define web search and web search engine.
Web search is often not informational -- it might be navigational (give me the URL of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map).
Web search engines crawl the Web, downloading and indexing pages in order to allow full-text search. There are many general-purpose search engines; unfortunately, none of them come close to indexing the entire Web. There are also thousands of specialized search services that index specific content or specific sites.

17. What are the components of a search engine?
Generally there are three basic components of a search engine, as listed below:
1. Web crawler
2. Database
3. Search interfaces

18. Define web crawler.
This is the part of the search engine which combs through the pages on the internet and gathers information for the search engine. It is also known as a spider or bot. It is a software component that traverses the web to gather information.

19. What are the search engine processes?
Indexing process
• Text acquisition
• Text transformation
• Index creation
Query process
• User interaction
• Ranking
• Evaluation

20. How to characterize the web?
The web can be characterized in three forms:
• Search engines - AltaVista
• Web directories - Yahoo
• Hyperlink search - WebGlimpse

21. What are the challenges of the web?
• Distributed data
• Volatile data
• Large volume
• Unstructured and redundant data
• Data quality
• Heterogeneous data

22. Zipf's law
The distribution of frequent and rare words on the web is described by Zipf's law: the collection frequency of the i-th most frequent term is proportional to 1/i. If the most frequent term (the) occurs cf1 times, the second most frequent term (of) occurs about cf1/2 times, the third most frequent term (and) occurs about cf1/3 times, and so on.
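Zipf's law as stated above can be sketched numerically. The cf1 value below is an assumed, illustrative count, not a figure from any particular collection:

```python
# Zipf's law sketch: collection frequency of the i-th most frequent term
# is approximately cf1 / i, where cf1 is the count of the top term.

def zipf_frequency(cf1, rank):
    return cf1 / rank

cf1 = 60_000  # assumed count of the most frequent term, e.g. "the"
for rank, term in enumerate(["the", "of", "and"], start=1):
    print(term, round(zipf_frequency(cf1, rank)))
# "of" is predicted to occur about cf1/2 times, "and" about cf1/3 times.
```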
23. Jaccard coefficient
▪ A commonly used measure of the overlap of two sets.
▪ Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|.
▪ JACCARD(A, A) = 1
▪ JACCARD(A, B) = 0 if A ∩ B = ∅
▪ A and B do not have to be the same size.
▪ It always assigns a number between 0 and 1.

24. Bag of words model
▪ We do not consider the order of words in a document.
▪ "John is quicker than Mary" and "Mary is quicker than John" are represented the same way.
▪ This is called a bag of words model.

25. Draw the architecture of a search engine.

26. Compare web search and IR.
1. Languages. Web search: documents in many different languages; search engines usually use full-text indexing, with no additional subject analysis. IR: databases usually cover only one language, or documents written in different languages are indexed with the same vocabulary.
2. File types. Web search: several file types, some hard to index because of a lack of textual information. IR: usually all indexed documents have the same format (e.g. PDF), or only bibliographic information is provided.

27. Search engine classification
Search engines can be classified according to:
▪ The programming language in which they are implemented
▪ How they store the index (inverted file, database, other file structure)
▪ Searching capabilities (Boolean operators, fuzzy search, use of stemming)
▪ Way of ranking and the type of files they can index (HTML, PDF, plain text)
▪ Possibility of on-line indexing and/or making incremental indexes

28. Give some examples of search engines.
Nutch, Lucene, ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE, Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS, WebGlimpse, XML Query Engine, XMLSearch, Zebra.

29. Define precision and recall.
Recall and precision are measures of retrieval performance:
Recall = (relevant documents retrieved) / (relevant documents in collection)
Precision = (relevant documents retrieved) / (documents retrieved)

30. What is peer-to-peer search?
A peer that joins the network does not only use resources, but also contributes resources back. Hence a peer-to-peer network can potentially scale beyond what is possible in client-server set-ups. Peer-to-peer (P2P) networks have been identified as promising architectural concepts for integrating search facilities across Digital Library collections.

31. What are the performance measures for a search engine?
Precision and recall are the two most basic measures of the performance of information retrieval systems.

32. Explain the difference between data retrieval and information retrieval.
• Example. Data retrieval: database query. Information retrieval: WWW search.
• Matching. Data retrieval: exact. Information retrieval: partial match, best match.
• Inference. Data retrieval: deduction. Information retrieval: induction.
• Model. Data retrieval: deterministic. Information retrieval: probabilistic.

33. Explain the types of natural language technology used in information retrieval.
Two types:
1. Natural language interfaces make the task of communicating with the information source easier, allowing a system to respond to a range of inputs.
2. Natural language text processing allows a system to scan the source texts, either to retrieve particular information or to derive knowledge structures that may be used in accessing information from the texts.

34. What is a search engine?
A search engine is a document retrieval system designed to help find information stored in a computer system, such as on the WWW. The search engine allows one to ask for content meeting specific criteria and retrieves a list of items that match those criteria.

35. What is conflation or stemming?
Stemming is the process of reducing inflected words to their stem, base or root form, generally a written word form. The process of stemming is often called conflation.

36. What is the invisible web?
Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the invisible web.

37. What is proprietary software?
Proprietary software is computer software which is the legal property of one party. The terms of use for other parties are defined by contracts or licensing agreements. These terms may include various privileges to share, alter, disassemble, and use the software and its code.

38. What is closed software?
Closed software is a term for software whose license does not allow for the release or distribution of the software's source code. Generally it means only the binaries of a computer program are distributed, and the license provides no access to the program's source code. The source code of such programs is usually regarded as a trade secret of the company. Access to the source code by third parties commonly requires the party to sign a non-disclosure agreement.

39. List the advantages of open source.
▪ The right to use the software in any way.
▪ There is usually no license cost; it is free of cost.
▪ The source code is open and can be modified freely.
▪ Open standards.
▪ It provides higher flexibility.

40. List the disadvantages of open source.
▪ There is no guarantee that development will happen.
▪ It is sometimes difficult to know that a project exists, and its current status.
▪ No secured follow-up development strategy.

41. What do you mean by Apache License?
The Apache License is a free software license written by the Apache Software Foundation (ASF). The name Apache is a registered trademark and may only be used with the trademark holder's express permission. Apache Lucene, distributed under this license, is a high-performance, full-featured text search engine library written entirely in Java.

42. Explain the features of GPL version 2.
▪ It gives permission to copy and distribute the program's unmodified source code.
▪ It allows modifying the program's source code and distributing the modified source code.
▪ Users may distribute compiled versions of the program, both modified and unmodified.
▪ All modified copies are distributed under the GPLv2.
▪ All compiled versions of the program are accompanied by the relevant source code.

43. List out any four search engines.
Google, Bing, Ask, AltaVista

Part B (2 x 13 = 26)

1. Discuss the characteristics of the web in detail. (13)
In characterizing the structure and content of the Web, it is necessary to establish precise semantics for Web concepts.

Measuring the Web
• The Internet, and the web in particular, is dynamic in nature, so measuring it is a difficult task.
• The web explosion is due in no small part to the extended application of an axiom known as Moore's Law.
While ostensibly a prediction about semiconductor innovation rates, this bit of prophecy from Intel co-founder Gordon Moore has come to represent the doubling not just of processing power, but of computing power in general.

Modeling the Web
  • 9. • The Heap's and Zipf’s laws are also valid in the web. Normally the vocabulary grows faster and the word distribution should be more biased. But there are no such experiments on large Web collections to measure these parameters. 2. Describe the various impact of WEB on IR (13) Impact of the web The first impact of the web on search is related to the characteristics of the document collection itself. o The web is composed of pages distributed over millions of sites and connected through hyperlinks o This requires collecting ll documents and storing copies of them in a central repository,prior to indexing. o This new phase in the IR process,introduced by the web is called crawling The second impact of the web on search is related to o The size of the collection o The volume of user queries submitted on a daily basis o As a consequence,performance and scalability have critical characteristics of the IR system. The third impact in a very large collection, predicting relevance is much harder than before o Fortunately the web also includes new sources of evidence o Ex. hyperlinks and user clicks in documents in the answer set The fourth impact derives from the fact that the web is also a medium to do business. o Search problem has been extended beyond the seeking of text information to alsoencompass other user needsEx.price of a book, the phone number of a hotel The fifth impact of the web on search is web spam o Web spam: abusive availability of commercial information disguised in the form of informational content. o This difficulty is so large that today we talk of adverbial web retrieval. 3. Compare in detail Information Retrieval and Web Search with examples.(13) Sl.No Differentiator Web Search IR 1 Languages Documents in many different languages. Usually search engines use full text indexing; no additional subject analysis. Databases usually cover only one language or indexing of documents written in different languages with the same vocabulary. 
2. File types. Web Search: several file types, some hard to index because of a lack of textual information. IR: usually all indexed documents have the same format (e.g. PDF) or only bibliographic information is provided.
3. Document length. Web Search: wide range from very short to very long; longer documents are often divided into parts. IR: document length varies, but not to such a high degree as with web documents.
4. Document structure. Web Search: HTML documents are semi-structured. IR: structured documents allow complex field searching.
5. Spam. Web Search: search engines have to decide which documents are suitable for indexing. IR: suitable document types are defined in the process of database design.
5. Demonstrate the role of Artificial Intelligence in Information Retrieval Systems. (14)
6. Explain in detail the components of Information Retrieval.
8. Describe the components of a search engine with neat diagram.
• The main components of a search engine are the crawler, indexer, search index, query engine, and search interface.
• A web crawler is a software program that traverses web pages, downloads them for indexing, and follows the hyperlinks that are referenced on the downloaded pages.
• A web crawler is also known as a spider, a wanderer or a software robot.
Fig: Crawling the web
• The second component is the indexer, which is responsible for creating the search index from the web pages it receives from the crawler.
• The third component is the search index.
• The search index is a data repository containing all the information the search engine needs to match and retrieve web pages.
• The type of data structure used to organize the index is known as an inverted file.
• It is very much like an index at the back of a book.
• It contains all the words appearing in the web pages crawled, listed in alphabetical order (this is called the index file), and for each word it has a list of references to the web pages in which the word appears (this is called the posting list).
• The search index will also store information pertaining to hyperlinks in a separate link database, which allows the search engine to perform hyperlink analysis, used as part of the ranking process of web pages.
• The fourth component is the query engine.
• It is the interface between the search index, the user and the web.
• Algorithmic details of commercial search engines are kept as trade secrets.
• The query engine processes a user query in two steps.
• The first step is retrieval of potential results from the index.
• The second step is the ranking of the results based on their "relevance" to the query.
• The fifth component is the search interface.
• Once the query is processed, the query engine sends the results list to the search interface, which displays the results on the user's screen.
• From the usability point of view, it is important that users can distinguish between sponsored links, which are ads, and organic results, which are ranked by the query engine.

9. Explain in detail the components of IR with neat sketch.
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information need.
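The index file / posting list structure described above can be sketched in a few lines of Python; the toy document collection here is an assumption for illustration, not part of any real search engine.

```python
# Minimal sketch of an inverted file: each term maps to the sorted
# list of docIDs (the posting list) in which it appears.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {docID: text}. Returns {term: sorted list of docIDs}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings are kept sorted so lists can be merged efficiently.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}
index = build_inverted_index(docs)
print(index["drug"])           # [1, 2]
print(index["schizophrenia"])  # [1, 2, 3, 4]
```

The dictionary here is the set of keys; the values are the posting lists.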
User queries are matched against the database information. Depending on the application, the data objects may be, for example, text documents, images, audio, mind maps or video.
Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top-ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
Three Major Components
1. Document Subsystem
a) Acquisition

10. Discuss the query likelihood model in detail and describe the approach for IR using this model.
A language model is a function that puts a probability over strings drawn from some vocabulary. A language model M over an alphabet Σ satisfies:
∑_{s ∈ Σ*} P(s) = 1
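A hedged sketch of ranking by query likelihood P(q|d), using a unigram mixture of the document model and the collection model; the toy texts and λ = 0.5 are assumptions for illustration.

```python
# Query-likelihood scoring: log P(q|d) under a smoothed unigram model.
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Mix the document model with the collection model to avoid
    zero probabilities for query terms absent from the document."""
    doc_tf = Counter(doc.split())
    col_tf = Counter(" ".join(collection).split())
    doc_len = sum(doc_tf.values())
    col_len = sum(col_tf.values())
    score = 0.0
    for t in query.split():
        p = lam * doc_tf[t] / doc_len + (1 - lam) * col_tf[t] / col_len
        if p == 0:  # term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score

docs = ["new schizophrenia drug", "new approach for treatment"]
scores = {d: query_likelihood("schizophrenia drug", d, docs) for d in docs}
best = max(scores, key=scores.get)
print(best)  # new schizophrenia drug
```

The document actually containing the query terms receives the higher likelihood, which is the ranking behaviour described in this answer.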
To compare two models for a data set, the likelihood ratio is used: divide the probability of the data according to one model by the probability of the data according to the other model.
▪ Each document is treated as (the basis for) a language model.
▪ Given a query q, rank documents based on P(d|q).
▪ P(q) is the same for all documents, so it can be ignored.
▪ P(d) is the prior, often treated as the same for all d.
▪ But we can give a prior to "high-quality" documents, e.g., those with high PageRank.
▪ P(q|d) is the probability of q given d.
▪ So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
▪ In the LM approach to IR, we attempt to model the query generation process.
▪ Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
▪ That is, we rank according to P(q|d).
Mixture Model
▪ A document model alone assigns zero probability to query terms that do not occur in the document, so the document model is smoothed by mixing it with a collection-wide model:
▪ P(t|d) = λ P(t|Md) + (1 − λ) P(t|Mc), with 0 < λ < 1.

11. Explain about relevance feedback and query expansions.
▪ Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs are relevant / irrelevant.
▪ Best-known relevance feedback method: Rocchio feedback.
▪ Query expansion: improve retrieval results by adding synonyms / related terms to the query.
▪ Two ways of improving recall: relevance feedback and query expansion.
▪ The user issues a (short, simple) query.
▪ The search engine returns a set of documents.
▪ The user marks some docs as relevant, some as irrelevant.
▪ The search engine computes a new representation of the information need, hopefully better than the initial query.
▪ The search engine runs the new query and returns new results.
▪ The new results have (hopefully) better recall.
▪ The Rocchio algorithm implements relevance feedback in the vector space model.
▪ Rocchio chooses the query q⃗_opt that maximizes
q⃗_opt = argmax_q⃗ [ sim(q⃗, μ(Dr)) − sim(q⃗, μ(Dnr)) ]
where μ(D) denotes the centroid of a document set D.
▪ Dr: set of relevant docs; Dnr: set of nonrelevant docs.
Query Expansion
▪ Query expansion is another method for increasing recall.
▪ We use "global query expansion" to refer to "global methods for query reformulation".
▪ In global query expansion, the query is modified based on some global resource, i.e. a resource that is not query-dependent.
▪ Main information we use: (near-)synonymy.
▪ A publication or database that collects (near-)synonyms is called a thesaurus.
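The practical Rocchio update (query plus weighted centroids of the relevant and nonrelevant sets) can be sketched as follows; α = 1.0, β = 0.75, γ = 0.15 are commonly cited defaults, and the toy vectors are assumptions for illustration.

```python
# Rocchio relevance-feedback update in the vector space model.
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q and each document are dicts mapping term -> weight."""
    new_q = {t: alpha * w for t, w in q.items()}
    for doc_set, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not doc_set:
            continue
        for d in doc_set:  # add coeff * centroid of the set
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coeff * w / len(doc_set)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.5, "cat": 0.8}]
nonrel = [{"jaguar": 0.4, "car": 0.9}]
print(rocchio(q, rel, nonrel))
```

Terms from relevant documents (here "cat") are pulled into the query, while terms from nonrelevant documents (here "car") are pushed out.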
We will look at two types of thesauri: manually created and automatically created.

12. Explain briefly about Information Retrieval.
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching.
IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term unstructured data refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly unstructured, if one counts the latent linguistic structure of human languages. And even accepting that the intended notion of structure is overt structure, most text has structure, such as headings and paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages).
IR is also used to facilitate semi-structured search such as finding a document where the title contains Java and the body contains threading.
The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents.
It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then hoping to be able to classify new documents automatically.
Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues include the need to gather documents for indexing, being able to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings, given the commercial importance of the web.
At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval, such as Apple's Mac OS X Spotlight or Windows Vista's Instant Search. Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders. Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance-free and sufficiently lightweight in terms of start-up, processing, and disk space usage that it can run on one machine without annoying its owner.
In between is the space of enterprise, institutional, and domain-specific search, covering collections such as a database of patents or research articles on biochemistry.
13. Explain about the history of IR.
The idea of using computers to search for relevant pieces of information was popularized in the article "As We May Think" by Vannevar Bush in 1945. It would appear that Bush was inspired by patents for a 'statistical machine', filed by Emanuel Goldberg in the 1920s and '30s, that searched for documents stored on film. The first description of a computer searching for information was given by Holmstrom in 1948, detailing an early mention of the Univac computer.
Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
In 1992, the US Department of Defense along with the National Institute of Standards and Technology (NIST) cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of this was to support the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.
Timeline:
1950: The term "information retrieval" was coined by Calvin Mooers.
1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master's thesis at MIT.
1955: Allen Kent and colleagues at Western Reserve University published a paper in American Documentation describing the precision and recall measures, as well as detailing a proposed "framework" for evaluating an IR system which included statistical sampling methods for determining the number of relevant documents not retrieved.
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
1963: Joseph Becker and Robert M. Hayes published a text on information retrieval: Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
1964:
• Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR.
• The National Bureau of Standards sponsored a symposium titled "Statistical Association Methods for Mechanized Documentation." Several highly significant papers appeared, including G. Salton's first published reference (we believe) to the SMART system.
mid-1960s:
• The National Library of Medicine developed MEDLARS (Medical Literature Analysis and Retrieval System), the first major machine-readable database and batch-retrieval system.
• Project Intrex at MIT.
1965: J. C. R. Licklider published Libraries of the Future.
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval.
1968: Gerard Salton published Automatic Information Organization and Retrieval. John W. Sammon, Jr.'s RADC tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model.
1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers) was the first proposal for a visualization interface to an IR system.
1970s
Early 1970s: First online systems: NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis."
1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
A Theory of Indexing (Society for Industrial and Applied Mathematics)
A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
A Vector Space Model for Automatic Indexing (CACM 18:11)
1978: The first ACM SIGIR conference.
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths), with heavy emphasis on probabilistic models.
1979: Tamas Doszkocs implemented the CITE natural language user interface for MEDLINE at the National Library of Medicine. The CITE system supported free-form query input, ranked output and relevance feedback.
1980s
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. This was an important concept, though their automated analysis tool proved ultimately disappointing.
1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector space models.
mid-1980s: Efforts to develop end-user versions of commercial IR systems.
1989: First World Wide Web proposals by Tim Berners-Lee at CERN.
1992: First TREC conference.
1997: Publication of Korfhage's Information Storage and Retrieval, with emphasis on visualization and multi-reference point systems.
late 1990s: Web search engines implement many features formerly found only in experimental IR systems. Search engines become the most common, and maybe the best, instantiation of IR models.
2000s-present: More applications, especially web search, and interactions with other fields: learning to rank, scalability (e.g., MapReduce), real-time search.

14. Explain about the Components of IR.
The following figure shows the architecture of an IR system.
Components:
• Text operations
• Indexing
• Searching
• Ranking
• User Interface
• Query operations
Text operation:
Text Operations form index words (tokens).
• Stop word removal, stemming
Indexing: Indexing constructs an inverted index of word-to-document pointers.
Searching: Searching retrieves documents that contain a given query token from the inverted index.
Ranking: Ranking scores all retrieved documents according to a relevance metric.
User Interface: The User Interface manages interaction with the user:
• Query input and document output.
• Relevance feedback.
• Visualization of results.
Query Operations: Query Operations transform the query to improve retrieval:
• Query expansion using a thesaurus.
• Query transformation using relevance feedback.
First of all, before the retrieval process can even be initiated, it is necessary to define the text database. This is usually done by the manager of the database, who specifies the following: (a) the documents to be used, (b) the operations to be performed on the text, and (c) the text model (i.e., the text structure and what elements can be retrieved). The text operations transform the original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager builds an index of the text. An index is a critical data structure because it allows fast searching over large volumes of data. Different index structures might be used, but the most popular one is the inverted file. The resources (time and storage space) spent on defining the text database and building the index are amortized by querying the retrieval system many times.
Given that the document database is indexed, the retrieval process can be initiated. The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text. Then, query operations might be applied before the actual query, which provides a system representation for the user need, is generated. The query is then processed to obtain the retrieved documents.
Fast query processing is made possible by the index structure previously built.
Before being sent to the user, the retrieved documents are ranked according to a likelihood of relevance. The user then examines the set of ranked documents in the search for useful information. At this point, he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle. In such a cycle, the system uses the documents selected by the user to change the query formulation. Hopefully, this modified query is a better representation of the real user need.
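The text-operations stage described above (tokenization, stop-word removal, stemming) can be sketched as follows; the stop list and the crude suffix-stripping rules are illustrative assumptions, not a full Porter stemmer.

```python
# Sketch of the text-operations stage of an IR pipeline.
import re

STOP_WORDS = {"the", "a", "an", "of", "for", "to", "and", "in"}

def crude_stem(term):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in ("ation", "ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def text_operations(text):
    """Lowercase, tokenize, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(text_operations("The automation of indexing for new documents"))
# ['autom', 'index', 'new', 'document']
```

The output tokens are the index words that the indexing component would then insert into the inverted file.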
15. What are the Issues in IR?
1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence".
3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.
The Big Issues
Information retrieval researchers have focused on a few key issues that remain just as important in the era of commercial web search engines working with billions of web pages as they were when tests were done in the 1960s on document collections containing about 1.5 megabytes of text.
One of these issues is relevance. Relevance is a fundamental concept in information retrieval. Loosely speaking, a relevant document contains the information that a person was looking for when she submitted a query to the search engine. Although this sounds simple, there are many factors that go into a person's decision as to whether a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents. Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system or using the grep utility in Unix, produces very poor results in terms of relevance. One obvious reason for this is that language can be used to express the same concepts in many different ways, often with very different words. This is referred to as the vocabulary mismatch problem in information retrieval.
It is also important to distinguish between topical relevance and user relevance. A text document is topically relevant to a query if it is on the same topic.
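The NEAR operator mentioned above can be sketched over word positions; interpreting NEAR as "within k words" (k = 5, as in the example) and using whitespace tokenization are assumptions for illustration.

```python
# Sketch of a proximity (NEAR) match using word positions.
def positions(term, text):
    """All word offsets at which term occurs in text."""
    return [i for i, w in enumerate(text.lower().split()) if w == term]

def near(term1, term2, text, k=5):
    """True if some occurrences of the terms lie within k words."""
    pos1, pos2 = positions(term1, text), positions(term2, text)
    return any(abs(p1 - p2) <= k for p1 in pos1 for p2 in pos2)

speech = "friends romans countrymen lend me your ears"
print(near("romans", "countrymen", speech))  # True
print(near("romans", "ears", speech, k=3))   # False
```

A real system would answer such queries from a positional inverted index rather than rescanning the text.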
User relevance takes additional features of the document and the searcher's context into account. To address the issue of relevance, researchers propose retrieval models and test how well they work. A retrieval model is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the person who submitted the query. Some retrieval models focus on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that incorporate user relevance.
An interesting feature of the retrieval models used in information retrieval is that they typically model the statistical properties of text rather than the linguistic structure. This means, for example, that the ranking algorithms are typically far more concerned with the counts of word occurrences than with whether the word is a noun or an adjective. More advanced models do incorporate linguistic features, but they tend to be of secondary importance. The use of word frequency information to represent text started with another information retrieval pioneer, H. P. Luhn, in the 1950s. This view of text did not become popular in other fields of computer science, such as natural language processing, until the 1990s.
Another core issue for information retrieval is evaluation. Since the quality of a document ranking depends on how well it matches a person's expectations, it was necessary early on to develop evaluation measures and
experimental procedures for acquiring relevance data and using it to compare ranking algorithms. Cyril Cleverdon led the way in developing evaluation methods in the early 1960s, and two of the measures he used, precision and recall, are still popular. Precision is a very intuitive measure: it is the proportion of retrieved documents that are relevant. Recall is the proportion of relevant documents that are retrieved. When the recall measure is used, there is an assumption that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but with smaller test collections of documents, this measure can be useful. A test collection for information retrieval experiments consists of a collection of text documents, a sample of typical queries, and a list of relevant documents for each query (the relevance judgments). The best-known test collections are those associated with the TREC evaluation forum.
Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session. Clickthrough and other log data are strongly correlated with relevance, so they can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.
The third core issue for information retrieval is the emphasis on users and their information needs. This should be clear given that the evaluation of search is user-centered. That is, the users of a search engine are the ultimate judges of quality. This has led to numerous studies on how people interact with search engines and, in particular, to the development of techniques to help people express their information needs. An information need is the underlying cause of the query that a person submits to a search engine.
In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what the user actually wants. A one-word query such as "cats" could be a request for information on where to buy cats or for a description of the Broadway musical. Despite their lack of specificity, however, one-word queries are very common in web search. Techniques such as query suggestion, query expansion, and relevance feedback use interaction and context to refine the initial query in order to produce better ranked lists.
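The precision and recall measures described above are simple to compute from a retrieved list and a set of known relevant documents; the toy judgment data here is an assumption.

```python
# Precision = relevant retrieved / retrieved;
# Recall    = relevant retrieved / relevant.
def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]   # system output, in rank order
relevant = {"d1", "d3", "d7"}          # relevance judgments
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.6666666666666666
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).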
UNIT 2
1. Define an inverted index.
• An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.
• Its purpose is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.
Term (document frequency) → postings list
approach (1) → 3
breakthrough (1) → 1
drug (2) → 1 → 2
for (3) → 1 → 3 → 4
hopes (1) → 4
new (3) → 2 → 3 → 4
(The terms form the dictionary; the docID lists form the postings.)
2. Discuss the process of stemming. Give example.
• Stemming is the process of reducing terms to their "roots" before indexing.
• "Stemming" suggests crude affix chopping.
o It is language dependent.
o E.g., automate(s), automatic, automation are all reduced to automat.
3. Compare information retrieval and web search.
Languages. Web Search: documents in many different languages. IR: databases usually cover only one language, or documents written in different languages are indexed with the same vocabulary.
Document structure. Web Search: HTML documents are semi-structured. IR: structured documents allow complex field searching.
4. What do you mean by information retrieval models?
A retrieval model can be a description of either the computational process or the human process of retrieval: the process of choosing documents for retrieval; the process by which information needs are first articulated and then refined.
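Several of the short answers in this unit concern the vector space model; as a sketch, tf-idf weighting combined with cosine similarity looks as follows (the toy collection is an assumption for illustration).

```python
# tf-idf vectors and cosine scoring in the vector space model.
import math
from collections import Counter

def tf_idf_vector(doc, docs):
    """Term -> tf × log10(N/df) weights for one document."""
    n = len(docs)
    tf = Counter(doc.split())
    return {t: f * math.log10(n / sum(1 for d in docs if t in d.split()))
            for t, f in tf.items()}

def cosine(v1, v2):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = ["breakthrough drug for schizophrenia",
        "new schizophrenia drug",
        "new approach for treatment of schizophrenia",
        "new hopes for schizophrenia patients"]
q_vec = tf_idf_vector("breakthrough drug", docs)
scores = [cosine(q_vec, tf_idf_vector(d, docs)) for d in docs]
print(max(range(4), key=scores.__getitem__))  # doc 0 scores highest
```

Note that "schizophrenia" appears in every document, so its idf (and hence its weight) is zero: it does not help distinguish documents.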
5. What is cosine similarity?
Cosine similarity measures the similarity between two documents (or between a query and a document) as the cosine of the angle between their term vectors. Because it normalizes for document length, it is preferred over raw term-overlap counts when comparing documents of different sizes.
6. What is language model based IR?
A language model is a probabilistic mechanism for generating text. Language models estimate the probability distribution of various natural language phenomena.
7. Define unigram language model.
A unigram (1-gram) language model makes the strong independence assumption that words are generated independently from a multinomial distribution.
8. What are the characteristics of relevance feedback?
It shields the user from the details of the query reformulation process. It breaks down the whole searching task into a sequence of small steps which are easier to grasp. It provides a controlled process designed to emphasize some terms and de-emphasize others.
9. What are the assumptions of the vector space model?
The degree of matching can be used to rank-order documents; this rank-ordering corresponds to how well a document satisfies a user's information needs.
10. What are the disadvantages of the Boolean model?
It is not simple to translate an information need into a Boolean expression. Exact matching may lead to retrieval of too many documents. The retrieved documents are not ranked. The model does not use term weights.
11. Explain Luhn's ideas.
Luhn's basic idea to use various properties of texts, including statistical ones, was critical in opening up the handling of input by computers for IR. Automatic input joined the already automated output.
12. Define Latent Semantic Indexing.
Latent Semantic Indexing is a technique that projects queries and documents into a space with "latent" semantic dimensions. It is a statistical method for automatic indexing
and retrieval that attempts to solve the major problems of the current technology. It is intended to uncover latent semantic structure in the data that is hidden. It creates a semantic space wherein terms and documents that are associated are placed near one another.
13. State Bayes' Rule.
P(A|B) = P(B|A) P(A) / P(B)
14. How do you calculate the term weighting in document and query term weight?
A standard scheme is tf-idf: w(t,d) = tf(t,d) × log10(N / df(t)), where tf(t,d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.
15. What is a Zone index?
Document titles and abstracts are generally treated as zones. We build a separate inverted index for each zone of a document.
16. List down the major retrieval models
• Boolean (Exact Match)
• Vector space (Best Match)
– Basic vector space
– Extended Boolean model
– Latent Semantic Indexing (LSI)
• Probabilistic models (Best Match)
– Basic probabilistic model
– Bayesian inference networks
– Language models
• Citation analysis models (Best Match)
– Hubs & authorities (Kleinberg, IBM Clever)
– PageRank (Google)
17. Initial stages of text processing
Tokenization
• Cut the character sequence into word tokens, e.g. dealing with "John's" or "a state-of-the-art solution"
Normalization
• Map text and query terms to the same form: you want U.S.A. and USA to match
Stemming
• We may wish different forms of a root to match: authorize, authorization
Stop words
• We may omit very common words (or not): the, a, to, of
18. What is meant by Boolean Retrieval Model?
The Boolean retrieval model means being able to ask a query that is a Boolean expression:
• Boolean queries use AND, OR and NOT to join query terms. The model views each document as a set of words and is precise: a document either matches the condition or it does not.
• It is perhaps the simplest model to build an IR system on, and was the primary commercial retrieval tool for three decades. Many search systems you still use are Boolean: email, library catalogs, Mac OS X Spotlight.
19. Define inverse document frequency (idf)
Document frequency df_t refers to the number of documents in which term t occurs. idf_t is an inverse measure of the informativeness of term t:
idf_t = log10(N / df_t), where N is the number of documents.
Part-B
1. Draw the term-document incidence matrix for this document collection. Draw the inverted index representation for this collection.
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
Term-document incidence matrix
It is an m×n matrix where m is the number of distinct terms (rows) and n is the total number of documents (columns).
Term: Doc1 Doc2 Doc3 Doc4
approach: 0 0 1 0
breakthrough: 1 0 0 0
drug: 1 1 0 0
for: 1 0 1 1
hopes: 0 0 0 1
new: 0 1 1 1
of: 0 0 1 0
patients: 0 0 0 1
schizophrenia: 1 1 1 1
treatment: 0 0 1 0
Inverted index representation for this collection
Within a document collection, we assume that each document has a unique serial number, the document identifier (docID).
a) List the (term, docID) pairs in order of appearance:
breakthrough 1, drug 1, for 1, schizophrenia 1, new 2, schizophrenia 2, drug 2, new 3, approach 3, for 3, treatment 3, of 3, schizophrenia 3, new 4, hopes 4, for 4, schizophrenia 4, patients 4
b) Sort the pairs alphabetically by term:
approach 3, breakthrough 1, drug 1, drug 2, for 1, for 3, for 4, hopes 4, new 2, new 3, new 4, of 3, patients 4, schizophrenia 1, schizophrenia 2, schizophrenia 3, schizophrenia 4, treatment 3
c) Merge multiple occurrences of the same term, record the frequency of occurrences of the term in the documents, and split into dictionary and postings:
Term (document frequency): approach 1, breakthrough 1, drug 2, for 3, hopes 1, new 3, of 1, patients 1, schizophrenia 4, treatment 1
The corresponding postings lists:
approach → 3
breakthrough → 1
drug → 1 → 2
for → 1 → 3 → 4
hopes → 4
new → 2 → 3 → 4
of → 3
patients → 4
schizophrenia → 1 → 2 → 3 → 4
treatment → 3
2. Draw the term-document incidence matrix for this document collection. Draw the inverted index representation for this collection.
Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:"
Term-document incidence matrix
It is an m×n matrix where m is the number of distinct terms (rows) and n is the total number of documents (columns).
Term: Doc1 Doc2
ambitious: 0 1
be: 0 1
brutus: 1 1
capitol: 1 0
caesar: 1 1
did: 1 0
enact: 1 0
hath: 0 1
I: 1 0
i': 1 0
it: 0 1
julius: 1 0
killed: 1 0
let: 0 1
me: 1 0
noble: 0 1
so: 0 1
the: 1 1
told: 0 1
you: 0 1
was: 1 1
with: 0 1
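Matrices like the ones in these worked answers can also be produced programmatically; a minimal sketch, in which simple whitespace tokenization (no punctuation handling) is an assumption:

```python
# Build a term-document incidence matrix: 1 if the term occurs
# in the document, 0 otherwise.
def incidence_matrix(docs):
    """Return (sorted terms, {term: [0/1 per document]})."""
    tokenized = [set(d.lower().split()) for d in docs]
    terms = sorted(set().union(*tokenized))
    return terms, {t: [1 if t in d else 0 for d in tokenized]
                   for t in terms}

docs = ["new schizophrenia drug",
        "new hopes for schizophrenia patients"]
terms, matrix = incidence_matrix(docs)
print(matrix["schizophrenia"])  # [1, 1]
print(matrix["drug"])           # [1, 0]
```

For large collections this dense 0/1 matrix is far too sparse to store, which is exactly why the inverted index representation is preferred.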
• 28. Inverted index representation
Within a document collection, we assume that each document has a unique serial number, the document identifier (docID).
a) List of normalized tokens in document order (term, docID):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
b) Sort the terms for each document alphabetically:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
c) Merge multiple occurrences of the same term: record the frequency of occurrences of the term in the document, group instances of the same term, and split the dictionary and postings.
Term Document frequency → Postings lists
ambitious 1 → 2
  • 29. be 1 → 2 brutus 2 → 1 → 2 capitol 1 → 1 caesar 2 → 1 → 2 did 1 → 1 enact 1 → 1 hath 1 → 2 I 1 → 1 i' 1 → 1 it 1 → 2 julius 1 → 1 killed 1 → 1 let 1 → 2 me 1 → 1 noble 1 → 2 so 1 → 2 the 2 → 1 → 2 told 1 → 2 you 1 → 2 was 2 → 1 → 2 3. What is search engine? Explain with diagrammatic illustration the components of a search engine. Search engine is a program that searches for and identifies items in a database that correspond to keywords or characters specified by the user, used especially for finding particular sites on the World Wide Web Search engine major functions: Indexing process - builds the structures that enable searching Query process - uses those structures and a person’s query to produce a ranked list of documents I. Indexing Process The major components of the indexing process are text acquisition, text transformation, and index creation. Indexing Process I.1. Text acquisition • The task of the text acquisition component is to identify and make available the documents that will be searched. • It often requires building a collection by crawling or scanning the Web, a corporate intranet, a desktop, or other sources of information.
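The sort-and-merge inverted-index construction worked through in steps a)-c) above can be sketched in Python (a toy in-memory version; real indexers merge disk-resident runs):

```python
from collections import defaultdict

# Collect (term, docID) pairs, sort them, then merge duplicates into postings.
docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

# Step a/b: list all (term, docID) pairs, then sort by term (and docID)
pairs = sorted((term, doc_id) for doc_id, text in docs.items()
               for term in text.split())

# Step c: merge repeats within a document, keeping sorted postings lists
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

df = {term: len(plist) for term, plist in postings.items()}
print(postings["brutus"], df["brutus"])
```

For example, `brutus` ends up with postings `[1, 2]` and document frequency 2, matching the merged table above.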
  • 30. • It creates a document data store, which contains the text and metadata for all the documents. • Metadata is information about a document that is not part of the text content, such as the document type (e.g., email or web page), document structure, and other features, such as document length. I.2. Text transformation • The text transformation component transforms documents into index terms or features. • Index terms are the parts of a document that are stored in the index and used in searching. • The simplest index term is a word, but not every word may be used for searching. • A “feature” is more often used in the field of machine learning to refer to a part of a text document that is used to represent its content, which also describes an index term. • Examples of other types of index terms or features are phrases, names of people, dates, and links in a web page. • Index terms are sometimes referred to as “terms.” The set of all the terms that are indexed for a document collection is called the index vocabulary. I.3. Index creation • The index creation component takes the output of the text transformation component and creates the indexes or data structures that enable fast searching. It must be efficient in terms of time and space. • Indexes must also be able to be efficiently updated when new documents are acquired. Inverted indexes are the most common form of index. • An inverted index contains a list for every index term of the documents that contain that index term. II. Query Process The major components of the query process are user interaction, ranking, and evaluation. Query Process II.1. User interaction • The user interaction component provides the interface between the person doing the searching and the search engine. • One task for this component is accepting the user’s query and transforming it into index terms. Another task is to take the ranked list of documents from the search engine and organize it into the results shown to the user.
  • 31. • Example: generating the snippets used to summarize documents. The document data store is one of the sources of information used in generating the results. • This component also provides a range of techniques for refining the query so that it better represents the information need. II.2. Ranking • The ranking component is the core of the search engine. • It takes the transformed query from the user interaction component and generates a ranked list of documents using scores based on a retrieval model. • Ranking must be both efficient, since many queries may need to be processed in a short time, and effective, since the quality of the ranking determines whether the search engine accomplishes the goal of finding relevant information. • The efficiency of ranking depends on the indexes, and the effectiveness depends on the retrieval model. II.3. Evaluation • The task of the evaluation component is to measure and monitor effectiveness and efficiency. • It records and analyzes user behaviour using log data. • The results of evaluation are used to tune and improve the ranking component. • Most of the evaluation component is not part of the online search engine, apart from logging user and system data. • Evaluation is primarily an offline activity, but it is a critical part of any search application 4. Explain in detail about Boolean retrieval model • The Boolean retrieval model is being able to ask a query that is a Boolean expression: – Boolean Queries are queries using AND, OR and NOT to join query terms • Views each document as a set of words • Is precise: document matches condition or not. – Perhaps the simplest model to build an IR system on • Primary commercial retrieval tool for 3 decades. • Many search systems you still use are Boolean: – Email, library catalog, Mac OS X Spotlight
• 32. Types of Retrieval Models: Exact Match vs Best Match Retrieval
Exact match
• The query specifies precise retrieval criteria.
• Every document either matches or fails to match the query.
• The result is a set of documents – usually in no particular order, often in reverse-chronological order.
Best match
• The query describes retrieval criteria for the desired documents.
• Every document matches the query to some degree.
• The result is a ranked list of documents, “best” first.
Term-document incidence matrices
• So we have a 0/1 vector for each term: 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia
• 33. • To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them:
110100 AND 110111 AND 101111 = 100100
Boolean IR algorithm
D: set of words present in a document. Each term is either present (1) or absent (0).
Q: a Boolean expression. Terms are index terms; operators are AND, OR, and NOT.
F: Boolean algebra over sets of terms and sets of documents.
R: a document is predicted as relevant to a query expression if it satisfies the query expression.
Each query term specifies a set of documents containing the term:
AND (^) – the intersection of two sets
OR (V) – the union of two sets
NOT (~) – the set inverse (complement)
Advantages:
• Easy to understand
• Clean formalism
• Predictable, easy to explain
• The Boolean model can be extended to include ranking
• Structured queries
• Works well when the searcher knows exactly what is wanted
Disadvantages:
• Most people find it difficult to create good Boolean queries – the difficulty increases with the size of the collection
6. Explain in detail about vector space model
• This model is the best known and most widely used model.
• Its advantage is that it is a simple and appealing framework for implementing term weighting, ranking and relevance feedback.
• The vector model proposes a framework in which partial matching is possible.
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space. Dimensionality = t = |vocabulary|
• Each term i, in a document or query j, is given a real-valued weight wij.
• Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
Document collection
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.
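The bitwise evaluation of Brutus AND Caesar BUT NOT Calpurnia shown at the start of this answer can be reproduced with Python integers (a sketch; the Calpurnia row 010000 is inferred from its complement 101111 in the worked example):

```python
# Term incidence vectors from the worked example, one bit per play.
brutus    = int("110100", 2)
caesar    = int("110111", 2)
calpurnia = int("010000", 2)   # complement is 101111

WIDTH = 6                      # number of documents (plays)
mask = (1 << WIDTH) - 1        # keep only WIDTH bits after complementing

answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, f"0{WIDTH}b"))
```

The printed bit string marks the plays that satisfy the query, matching the 100100 result above.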
• 34. Term frequency tf
▪ The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.
▪ We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want:
▪ A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term, but not 10 times more relevant. Relevance does not increase proportionally with term frequency.
Document frequency
▪ Rare terms are more informative than frequent terms.
▪ Consider a term in the query that is rare in the collection (e.g., arachnocentric). A document containing this term is very likely to be relevant to the query arachnocentric. → We want a high weight for rare terms like arachnocentric.
▪ Frequent terms are less informative than rare terms.
▪ Consider a query term that is frequent in the collection (e.g., high, increase, line). A document containing such a term is more likely to be relevant than a document that doesn’t, but it is not a sure indicator of relevance. → For frequent terms like high, increase, and line we still want positive weights, but lower weights than for rare terms.
▪ We will use document frequency (df) to capture this.
idf weight:
▪ We want to scale down the weights of terms with high collection frequency, defined as the total number of occurrences of a term in the collection.
▪ To reduce the tf weight of a term by a factor that grows with its collection frequency, the document frequency dft is defined as the number of documents in the collection that contain the term t.
Collection vs. document frequency
▪ The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
▪ The idf (inverse document frequency) of t is defined as
idft = log10(N / dft)
▪ log10(N/dft) is used instead of N/dft to “dampen” the effect of idf. Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
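A quick numeric check of idft = log10(N/dft), using the document-frequency table from the four-document Part-B example (a toy illustration; `idf` is our own helper name):

```python
import math

# idf_t = log10(N / df_t): rare terms get high weights, ubiquitous terms get 0.
def idf(df, N):
    return math.log10(N / df)

N = 4  # four documents in the example collection
print(round(idf(4, N), 3))  # schizophrenia: occurs in every document
print(round(idf(1, N), 3))  # approach: occurs in exactly one document
```

A term occurring in all N documents gets idf 0, so it contributes nothing to ranking; the rarest terms get the highest weights.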
• 35. Tf-idf weighting:
• The combination of term frequency and inverse document frequency produces a composite weight for each term in each document.
• The tf-idf weighting scheme assigns to term t in document d the weight
wt,d = tft,d × log10(N / dft)
• This weight is highest when t occurs many times within a small number of documents; lower when the term occurs fewer times in a document, or occurs in many documents; and lowest when the term occurs in virtually all documents.
• Each document may also be viewed as a vector with one component corresponding to each term in the dictionary, together with a weight for each component given by the equation above. For dictionary terms that do not occur in a document, this weight is zero.
• The score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d.
• This can be refined further by adding, instead of the raw occurrences of each term t in d, the tf-idf weight of each term in d.
Cosine similarity:
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle.
The documents are ranked by computing the distance between the points representing the documents and the query. More commonly, a similarity measure is used, so that the documents with the highest scores are the most similar to the query. The numerator of this measure is the sum of the products of the term weights for the matching query and document terms.
• Retrieval is based on similarity between query and documents.
• Output documents are ranked according to similarity to the query.
• Similarity is based on occurrence frequencies of keywords in query and document.
• Automatic relevance feedback can be supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.
• A similarity measure is a function that computes the degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
• 36. – It is possible to rank the retrieved documents in the order of presumed relevance.
– It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
7. Explain in detail about similarity calculation based on inner product
• Similarity between the vector for document dj and query q can be computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj • q = Σ (i = 1 to t) wij · wiq
where wij is the weight of term i in document j and wiq is the weight of term i in the query.
• For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Properties of the inner product
• The inner product is unbounded.
• It favors long documents with a large number of unique terms.
• It measures how many terms matched, but not how many terms are not matched.
(The worked examples on this slide are figures omitted from the extracted text.)
8. For example, consider four documents and the term-document matrix for the collection of documents:
• 37. Document 3, for example, is represented by the vector (1,1,0,2,0,1,0,1,0,0,1).
Queries are represented the same way as documents. A query Q is represented by a vector of t weights:
Q = (q1, q2, …, qt), where qj is the weight of the jth term in the query.
For example, if the query were “tropical fish”, then the vector representation of the query would be (0,0,0,1,0,0,0,0,0,0,1).
9. Explain in detail about text and information pre-processing
Before the documents in a collection are used for retrieval, some preprocessing tasks are usually performed. For traditional text documents the tasks are stopword removal, stemming, and the handling of digits, hyphens, punctuation and letter case. For web pages, additional tasks such as HTML tag removal and identification of main content blocks also require careful consideration.
• Stop word removal
• Stemming
• Text pre-processing
• Web page pre-processing
1. Stopword removal
o Stopwords are frequently occurring and insignificant words in a language that help construct sentences but do not represent any content of the documents.
• Articles, prepositions, conjunctions and some pronouns are natural candidates. Common stopwords in English include:
• a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, where, who, will, with
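Two of the preprocessing tasks listed above, stopword removal and stemming, can be sketched as follows. This is a deliberately naive suffix stripper for illustration only (our own toy rules, not a real Porter stemmer, and the stopword list is abbreviated):

```python
# Toy preprocessing: drop stopwords, then strip a few common suffixes.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "are", "for"}

def naive_stem(word):
    # Strip the first matching suffix, but only from long-enough words.
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    return [naive_stem(w) for w in text.lower().split() if w not in STOPWORDS]

print(preprocess("a computer for computing"))
print(preprocess("The walker is walking for walks"))
```

As in the stemming discussion below, “computer” and “computing” both reduce to “comput”, and “walks”, “walking” and “walker” all reduce to “walk”.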
• 38. • Such words should be removed before documents are indexed and stored.
• Stopwords in the query are also removed before retrieval is performed.
2. Stemming:
• In many languages, a word has various syntactic forms depending on the context in which it is used.
• For example, in English, nouns have plural forms, verbs have gerund forms (by adding “ing”), and verbs used in the past tense are different from the present tense.
• These are considered syntactic variations of the same root form.
• Such variations cause low recall for a retrieval system because a relevant document may contain a variation of a query word but not the exact word itself.
• This problem can be partially dealt with by stemming.
• Stemming refers to the process of reducing words to their stems or roots.
• A stem is the portion of a word that is left after removing its prefixes and suffixes.
• In English, most variants of a word are generated by the introduction of suffixes rather than prefixes.
• Thus stemming in English usually means suffix removal, or stripping.
• For example, “computer”, “computing” and “compute” are reduced to “comput”; “walks”, “walking” and “walker” are reduced to “walk”.
• Stemming enables different variations of a word to be considered in retrieval, which improves recall.
• There are several stemming algorithms, also known as stemmers.
• Stemming increases recall and reduces the size of the indexing structure. However, it may hurt precision because many irrelevant documents may be considered relevant.
• For example, both “cop” and “cope” are reduced to the stem “cop”. However, if one is looking for documents about police, a document that contains only “cope” is unlikely to be relevant.
3. Text preprocessing:
a. Digits: Numbers and terms that contain digits are removed in traditional IR systems, except for some specific types, e.g., dates, times and other pre-specified types expressed as regular expressions. In search engines, however, they are usually indexed.
b. Hyphens:
o Breaking hyphens is usually applied to deal with inconsistency of usage. For example, some people write “state-of-the-art”, but others write “state of the art”.
o If the hyphens in the first case are removed, we eliminate the inconsistency problem. However, some words may have a hyphen as an integral part of the word, e.g., “Y-21”.
Two types of removal:
1. each hyphen is replaced with a space;
2. each hyphen is simply removed without leaving a space,
so that “state-of-the-art” may be replaced with “state of the art” or “stateoftheart”.
c) Punctuation marks: handled similarly to hyphens.
d) Case of letters: all letters are usually converted to either upper or lower case.
Web page pre-processing:
• 39. 1. Identifying different text fields: In HTML, there are different text fields, e.g., title, metadata and body. Identifying them allows the retrieval system to treat terms in different fields differently. For example, in search engines, terms that appear in the title field of a page are regarded as more important than terms that appear in other fields and are assigned higher weights, because the title is usually a concise description of the page. In the body text, emphasized terms (e.g., under header tags <h1>, <h2>, …, the bold tag <b>, etc.) are also given higher weights.
2. Identifying anchor text: Anchor text associated with a hyperlink is treated specially in search engines because the anchor text often represents a more accurate description of the information contained in the page pointed to by its link. When the hyperlink points to an external page, it is especially valuable because it is a summary description of the page given by people other than the author/owner of the page, and is thus more trustworthy.
3. Removing HTML tags: The removal of HTML tags can be dealt with similarly to punctuation. One issue needs careful consideration, as it affects proximity queries and phrase queries: HTML is inherently a visual presentation language.
4. Identifying main content block: A typical web page, especially a commercial page, contains a large amount of information that is not part of the main content of the page. For example, it may contain banner ads, navigation bars, copyright notices, etc., which can lead to poor results for search and mining.
Two techniques for finding the main content block in web pages:
a) Partitioning based on visual cues:
o This method uses visual information to help find main content blocks in the page. Visual or rendering information of each HTML element in a page can be obtained from the web browser.
• For example, Internet Explorer provides an API that can output the X and Y coordinates of each element.
• A machine learning model can then be built based on the location and appearance features for identifying main content blocks of pages.
b) Tree matching
• This method is based on the observation that in most commercial web sites, pages are generated using some fixed templates.
• This method thus aims to find such hidden templates. Since HTML has a nested structure, it is easy to build a tag tree for each page.
• Tree matching of multiple pages from the same site can then be performed to find such templates.
9. Explain in detail about Probabilistic Approach based Information Retrieval
• Given a user information need (represented as a query) and a collection of documents (transformed into document representations), a system must determine how well the documents satisfy the query.
• 40. • In the Boolean or vector space models of IR, query-document matching is done in a formally defined but semantically imprecise calculus of index terms.
• An IR system has an uncertain understanding of the user query, and makes an uncertain guess of whether a document satisfies the query.
• Probability theory provides a principled foundation for such reasoning under uncertainty.
• Probabilistic models exploit this foundation to estimate how likely it is that a document is relevant to a query.
Probabilistic IR models
▪ Classical probabilistic retrieval model
▪ Probability ranking principle
▪ Binary Independence Model, BestMatch25 (Okapi)
▪ Bayesian networks for text retrieval
▪ Language model approach to IR
▪ Important recent work, competitive performance
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.
Basic probability theory
▪ For events A and B:
▪ Joint probability P(A, B) of both events occurring
▪ Conditional probability P(A|B) of event A occurring given that event B has occurred
▪ The chain rule gives the fundamental relationship between joint and conditional probabilities:
P(A, B) = P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
• 41. 10. Vector model example
D1: “Shipment of gold delivered in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Q: “gold silver truck”
❖ Let us assume we deal with a basic term vector model in which we:
1. do not take into account WHERE the terms occur in documents (documents consist of passages and passages consist of sentences);
2. remove stopwords;
3. do not reduce terms to root terms (stemming);
4. use raw frequencies for terms and queries (unnormalized data).
Stop Words
  • 42. Now we need to find similarity measures by using Similarity Methods. ❑ Inner Product (Dot Product) ➢ SC (Q,D1)=(0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0)=0.0309 ➢ SC (Q,D2)=0.4862 ➢ SC (Q,D3)=0.0620 Document ID Similarity score D1 0.0309 D2 0.4862 D3 0.0620 Now, we rank the documents in a descending order according to their similarity score, as the following: Document ID Similarity score
• 43. D2 0.4862 (most relevant)
D3 0.0620
D1 0.0309
We can use a threshold to retrieve only the documents whose score is above that threshold.
❑ Cosine
❖ For the cosine method we must calculate the length of each document and the length of the query:
▪ Length of D1 = sqrt(0.477^2 + 0.477^2 + 0.176^2 + 0.176^2) = 0.7195
▪ Length of D2 = sqrt(0.176^2 + 0.477^2 + 0.954^2 + 0.176^2) = 1.095
▪ Length of D3 = sqrt(0.176^2 + 0.176^2 + 0.176^2 + 0.176^2) = 0.352
▪ Length of Q = sqrt(0.1761^2 + 0.477^2 + 0.1761^2) = 0.538
❖ The inner product for each document is:
▪ D1 = 0.0309
▪ D2 = 0.4862
▪ D3 = 0.0620
❖ Then the similarity values are:
▪ cosSim(D1,Q) = 0.0309 / (0.7195 × 0.538) = 0.0801
▪ cosSim(D2,Q) = 0.4862 / (1.095 × 0.538) = 0.8246
▪ cosSim(D3,Q) = 0.0620 / (0.352 × 0.538) = 0.3271
❖ Now, we rank the documents in descending order of their similarity score:
Document ID Similarity score
D2 0.8246 (most relevant)
The cosine measure itself is
cosSim(d, q) = ( Σ k=1..t dk · qk ) / ( sqrt(Σ k=1..t dk^2) · sqrt(Σ k=1..t qk^2) )
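The whole worked example can be reproduced end to end (a sketch under the example's assumptions: stopwords already removed, raw tf, idf = log10(3/df); function names are our own). The slide's figures agree with this up to rounding:

```python
import math
from collections import Counter

# The three documents with stopwords (of, in, a) already removed, as above.
docs = {
    "D1": "shipment gold delivered fire",
    "D2": "delivery silver arrived silver truck",
    "D3": "shipment gold arrived truck",
}
query = "gold silver truck"
N = len(docs)

# Document frequency of each term, then tf*idf weights per text.
df = Counter(t for text in docs.values() for t in set(text.split()))
def weights(text):
    return {t: tf * math.log10(N / df[t])
            for t, tf in Counter(text.split()).items() if t in df}

def cosine(v, w):
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norm = lambda x: math.sqrt(sum(val * val for val in x.values()))
    return dot / (norm(v) * norm(w))

scores = {d: cosine(weights(text), weights(query)) for d, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Running this reproduces the ranking D2 > D3 > D1 with scores close to 0.8246, 0.3271 and 0.0801.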
• 44. D3 0.3271
D1 0.0801
11. Explain in detail about Relevance feedback and Query expansion
▪ The user issues a (short, simple) query.
▪ The search engine returns a set of documents.
▪ The user marks some docs as relevant, some as irrelevant.
▪ The search engine computes a new representation of the information need. Hope: better than the initial query.
▪ The search engine runs the new query and returns new results.
▪ The new results have (hopefully) better recall.
▪ The centroid is the center of mass of a set of points.
▪ Recall that we represent documents as points in a high-dimensional space.
▪ Thus: we can compute centroids of documents.
▪ Definition:
μ(D) = (1/|D|) Σ d∈D v(d)
where D is a set of documents and v(d) is the vector we use to represent document d.
Rocchio algorithm
▪ The Rocchio algorithm implements relevance feedback in the vector space model.
▪ Rocchio chooses the query qopt that maximizes
qopt = arg max q [ sim(q, μ(Dr)) − sim(q, μ(Dnr)) ]
Dr : set of relevant docs; Dnr : set of nonrelevant docs
▪ Intent: qopt is the vector that separates relevant and nonrelevant docs maximally.
▪ Making some additional assumptions, we can rewrite the optimal query vector as:
qopt = μ(Dr) + [ μ(Dr) − μ(Dnr) ]
We move the centroid of the relevant documents by the difference between the two centroids.
▪ In practice, the modified query used is:
qm = α q0 + β (1/|Dr|) Σ dj∈Dr dj − γ (1/|Dnr|) Σ dj∈Dnr dj
qm: modified query vector; q0: original query vector;
  • 45. Dr and Dnr : sets of known relevant and irrelevant documents respectively; α, β, and γ: weights ▪ New query moves towards relevant documents and away from nonrelevant documents. ▪ Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ. ▪ Set negative term weights to 0. ▪ “Negative weight” for a term doesn’t make sense in the vector space model. Positive vs. negative relevance feedback ▪ Positive feedback is more valuable than negative feedback. ▪ For example, setβ = 0.75, γ = 0.25 to give higher weight to positive feedback. ▪ Many systems only allow positive feedback. Relevance feedback: Problems ▪ Relevance feedback is expensive. ▪ Relevance feedback creates long modified queries. ▪ Long queries are expensive to process. ▪ Users are reluctant to provide explicit feedback. ▪ It’s often hard to understand why a particular document was retrieved after applying relevance feedback. ▪ The search engine Excite had full relevance feedback at one point, but abandoned it later. Pseudo-relevance feedback ▪ Pseudo-relevance feedback automates the “manual” part of true relevance feedback. ▪ Pseudo-relevance algorithm: ▪ Retrieve a ranked list of hits for the user’s query ▪ Assume that the top k documents are relevant. ▪ Do relevance feedback (e.g., Rocchio) ▪ Works very well on average ▪ But can go horribly wrong for some queries.Several iterations can cause query drift. Pseudo-relevance feedback at TREC4 ▪ Cornell SMART system ▪ Results show number of relevant documents out of top 100 for 50 queries (so total number of documents is 5000): ▪ Results contrast two length normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). ▪ The pseudo-relevance feedback method used added only 20 terms to the query. (Rocchio will add many more.) ▪ This demonstrates that pseudo-relevance feedback is effective on average. Query expansion ▪ Query expansion is another method for increasing recall. 
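The Rocchio update described above, which also drives pseudo-relevance feedback once the top-k documents are assumed relevant, can be sketched as follows (toy dense vectors and our own helper names; real systems work on sparse tf-idf vectors):

```python
# Rocchio: q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr),
# with negative components clipped to 0 as noted above.
def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    cr, cnr = centroid(relevant), centroid(nonrelevant)
    return [max(0.0, alpha * q + beta * r - gamma * nr)
            for q, r, nr in zip(q0, cr, cnr)]

q0 = [1.0, 0.0, 1.0]
rel = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # judged relevant
nonrel = [[0.0, 0.0, 1.0]]                 # judged nonrelevant
print(rocchio(q0, rel, nonrel))
```

Note the default beta > gamma, reflecting the observation above that positive feedback is more valuable than negative feedback.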
▪ We use “global query expansion” to refer to “global methods for query reformulation”. ▪ In global query expansion, the query is modified based on some global resource, i.e. a resource that is not query-dependent. ▪ Main information we use: (near-)synonymy ▪ A publication or database that collects (near-)synonyms is called a thesaurus. We will look at two types of thesauri: manually created and automatically created.
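An automatically derived thesaurus of the kind mentioned above can be approximated from co-occurrence statistics. A toy sketch (our own data and helper names; real systems use much larger corpora and association measures rather than raw counts):

```python
from collections import Counter, defaultdict

# Terms that frequently co-occur in the same document become
# candidate expansion terms for each other.
docs = [
    "car insurance quote",
    "car insurance claim",
    "car automobile repair",
    "automobile insurance policy",
]

cooc = defaultdict(Counter)
for text in docs:
    terms = set(text.split())
    for t in terms:
        for u in terms - {t}:
            cooc[t][u] += 1

def expand(query_term, k=1):
    # Add the k terms that co-occur most often with the query term.
    related = [t for t, _ in cooc[query_term].most_common(k)]
    return [query_term] + related

print(expand("car"))
```

Here “car” gets expanded with “insurance”, its most frequent co-occurring term, which is exactly the global, query-independent behavior described above.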
• 46. Types of user feedback
▪ User gives feedback on documents: more common in relevance feedback.
▪ User gives feedback on words or phrases: more common in query expansion.
Types of query expansion
▪ Manual thesaurus (maintained by editors, e.g., PubMed)
▪ Automatically derived thesaurus (e.g., based on co-occurrence statistics)
▪ Query-equivalence based on query log mining (common on the web, as in the “palm” example)
Interactive relevance feedback: improve initial retrieval results by telling the IR system which docs are relevant / nonrelevant. The best known relevance feedback method is Rocchio feedback.
Query expansion: improve retrieval results by adding synonyms / related terms to the query. Sources for related terms: manual thesauri, automatic thesauri, query logs.
Indirect relevance feedback
▪ On the web, DirectHit introduced a form of indirect relevance feedback.
▪ DirectHit ranked documents higher that users look at more often.
▪ Clicked-on links are assumed likely to be relevant (assuming the displayed summaries are good, etc.).
▪ Globally: not necessarily user- or query-specific.
▪ This is the general area of clickstream mining.
12. Explain in detail about language model
A language model is a function that puts a probability over strings drawn from some vocabulary. A language model M over an alphabet Σ assigns probabilities to strings such that
Σ s∈Σ* P(s) = 1
The full set of strings that can be generated is called the language of the automaton.
Likelihood ratio
To compare two models for a data set, divide the probability of the data according to one model by the probability of the data according to the other model.
Unigram language model
Estimates each term independently, simply throwing away all conditioning context:
Puni(t1t2t3t4) = P(t1) P(t2) P(t3) P(t4)
• 47. Bigram language models
A more complex model that conditions on the previous term:
Pbi(t1t2t3t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
Unigram models are more efficient to estimate and apply than higher-order models.
Using language models (LMs) for IR
❶ LM = language model
❷ We view the document as a generative model that generates the query.
❸ What we need to do:
❹ Define the precise generative model we want to use
❺ Estimate parameters (different parameters for each document’s model)
❻ Smooth to avoid zeros
❼ Apply to the query and find the document most likely to have generated the query
❽ Present the most likely document(s) to the user
❾ Note that x – y is pretty much what we did in Naive Bayes.
What is a language model?
We can view a finite state automaton as a deterministic language model, generating, for example: I wish I wish I wish I wish . . .
It cannot generate: “wish I wish” or “I wish I”.
Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.
Using language models in IR
▪ Each document is treated as (the basis for) a language model.
▪ Given a query q, rank documents based on P(d|q).
▪ P(q) is the same for all documents, so ignore it.
▪ P(d) is the prior – often treated as the same for all d.
▪ But we can give a prior to “high-quality” documents, e.g., those with high PageRank.
▪ P(q|d) is the probability of q given d.
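The unigram and bigram likelihoods above can be sketched with assumed toy probability tables (`P1` and `P2` are made-up numbers, not from the source):

```python
# Unigram: product of independent term probabilities.
# Bigram: first term's unigram probability times conditional bigram steps.
P1 = {"frog": 0.2, "said": 0.1, "likes": 0.3}          # P(t)
P2 = {("frog", "said"): 0.05, ("said", "likes"): 0.2}  # P(t2 | t1)

def p_unigram(terms):
    prob = 1.0
    for t in terms:
        prob *= P1[t]
    return prob

def p_bigram(terms):
    prob = P1[terms[0]]
    for prev, cur in zip(terms, terms[1:]):
        prob *= P2[(prev, cur)]
    return prob

seq = ["frog", "said", "likes"]
print(p_unigram(seq), p_bigram(seq))
```

The unigram model needs only |V| parameters, while the bigram model needs up to |V|² conditional probabilities, which is why unigram models are the usual choice in IR.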
• 48. ▪ So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
▪ In the LM approach to IR, we attempt to model the query generation process.
▪ Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
▪ That is, we rank according to P(q|d).
▪ Next: how do we compute P(q|d)?
How to compute P(q|d)
▪ We will make the same conditional independence assumption as for Naive Bayes:
P(q|Md) = Π 1≤k≤|q| P(tk|Md)
(|q|: length of q; tk: the token occurring at position k in q)
▪ This is equivalent to:
P(q|Md) = Π distinct t in q P(t|Md)^tft,q
(tft,q: term frequency, i.e., number of occurrences, of t in q)
▪ Multinomial model (omitting the constant factor)
Parameter estimation
▪ Missing piece: where do the parameters P(t|Md) come from?
▪ Start with maximum likelihood estimates (as we did for Naive Bayes):
P̂(t|Md) = tft,d / |d|
(|d|: length of d; tft,d: number of occurrences of t in d)
▪ As in Naive Bayes, we have a problem with zeros: a single t with P(t|Md) = 0 will make P(q|Md) zero.
▪ We would give a single term “veto power”.
▪ For example, for the query [Michael Jackson top hits], a document about “top songs” (but not using the word “hits”) would have P(t|Md) = 0. – That’s bad.
▪ We need to smooth the estimates to avoid zeros.
Mixture model
▪ P(t|d) = λP(t|Md) + (1 − λ)P(t|Mc)
▪ This mixes the probability from the document with the general collection frequency of the word.
▪ High value of λ: “conjunctive-like” search – tends to retrieve documents containing all query words.
▪ Low value of λ: more disjunctive; suitable for long queries.
▪ Correctly setting λ is very important for good performance.
What we model: the user has a document in mind and generates the query from this document.
▪ The equation represents the probability that the document that the user had in mind was in fact this one.
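The mixture model above can be sketched directly (toy documents of our own, not from the source; `score` multiplies the smoothed per-term probabilities over the query):

```python
from collections import Counter

# P(t|d) = lam * P(t|Md) + (1 - lam) * P(t|Mc), multiplied over query terms.
docs = {
    "d1": "apple pie recipe with fresh apple",
    "d2": "banana bread recipe easy",
}
collection = " ".join(docs.values()).split()
cf = Counter(collection)                    # collection term frequencies

def score(query, doc_id, lam=0.5):
    tokens = docs[doc_id].split()
    tf = Counter(tokens)
    prob = 1.0
    for t in query.split():
        p_doc = tf[t] / len(tokens)         # MLE from the document model
        p_col = cf[t] / len(collection)     # collection model (smoothing)
        prob *= lam * p_doc + (1 - lam) * p_col
    return prob

print(score("apple recipe", "d1"), score("apple recipe", "d2"))
```

Even though "apple" never occurs in d2, its score stays nonzero thanks to the collection component, so no single term has veto power.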
• 49. Example
▪ Collection: d1 and d2
▪ d1: Jackson was one of the most talented entertainers of all time
▪ d2: Michael Jackson anointed himself King of Pop
▪ Query q: Michael Jackson
▪ Use the mixture model with λ = 1/2
▪ P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
▪ P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
▪ Ranking: d2 > d1
Exercise 2
▪ Collection: d1 and d2
▪ d1: Xerox reports a profit but revenue is down
▪ d2: Lucene narrows quarter loss but revenue decreases further
▪ Query q: revenue down
▪ Use the mixture model with λ = 1/2
▪ P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
▪ P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
▪ Ranking: d1 > d2
13. Explain about the Web Characteristics
The essential feature that led to the explosive growth of the web, decentralized content publishing with essentially no central control of authorship, turned out to be the biggest challenge for web search engines in their quest to index and retrieve this content. Web page authors created content in dozens of (natural) languages and thousands of dialects, thus demanding many different forms of stemming and other linguistic operations.
Trust of the Web: The democratization of content creation on the web meant a new level of granularity in opinion on virtually any subject. This meant that the web contained truth, lies, contradictions and suppositions on a grand scale. This gives rise to the question: which web page does one trust? In a simplistic approach, one might argue that some publishers are trustworthy and others not, begging the question of how a search engine is to assign such a measure of trust to each website or web page.
Size: How big is the web? There is no easy answer.
Static vs Dynamic: Static web pages are those whose content does not vary from one request for that page to the next; for example, a professor who manually updates his web pages produces static content.
Dynamic pages are typically mechanically generated by an application server in response to a query to a database.
  • 50. Fig: Dynamic web page generation
The web graph: We can view the static web, consisting of static HTML pages together with the hyperlinks between them, as a directed graph in which each web page is a node and each hyperlink a directed edge.
Fig: Two nodes of the web graph joined by a link
The figure shows two nodes A and B from the web graph, each corresponding to a web page, with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the web graph. The text of the link is generally enclosed within the <a> (for anchor) tag that encodes the hyperlink in the HTML code of page A, and is referred to as anchor text. As one might suspect, this directed graph is not strongly connected: there are pairs of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks. We refer to the hyperlinks into a page as in-links and those out of a page as out-links. The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15 in a range of studies. We similarly define the out-degree of a web page to be the number of links out of it.
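The in-degree/out-degree bookkeeping described above can be sketched on a toy edge list; the page names A, B, C are illustrative, not from the figure.

```python
# Represent the web graph as a list of directed edges (src links to dst)
# and count in-links and out-links per page.

from collections import defaultdict

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

in_deg, out_deg = defaultdict(int), defaultdict(int)
for src, dst in edges:
    out_deg[src] += 1   # src gains an out-link
    in_deg[dst] += 1    # dst gains an in-link

print(in_deg["C"], out_deg["A"])   # C has 2 in-links, A has 2 out-links
```

A real crawler would build this structure from parsed <a href="..."> tags rather than a hand-written edge list.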
  • 51. Fig: Sample web graph
There is ample evidence that these links are not randomly distributed; the distribution is widely reported to be a power law, in which the total number of web pages with in-degree i is proportional to 1/i^α.
The directed graph connecting web pages has a bowtie shape: there are three major categories of web pages that are sometimes referred to as IN, OUT and SCC (Strongly Connected Component). A web surfer can pass from any page in IN to any page in SCC by following hyperlinks. Likewise, a surfer can pass from any page in SCC to any page in OUT. Finally, the surfer can surf from any page in SCC to any other page in SCC. However, it is not possible to pass from a page in SCC to any page in IN, or from a page in OUT to a page in SCC. The remaining pages form tubes (small sets of pages outside SCC that lead directly from IN to OUT) and tendrils, which either lead nowhere from IN or come from nowhere to OUT.
Fig: Bowtie structure of the web
Spam: Web search engines became an important means for connecting advertisers to prospective buyers. A user searching for Maui golf real estate is not merely seeking news or entertainment on the subject of housing on golf courses on the island of Maui, but is likely seeking to purchase such a property. Sellers of such property and their agents therefore have a strong incentive to create web pages that rank highly on this query. In a search engine whose scoring was based on term frequencies, a web page with numerous repetitions of maui golf real estate would rank highly. This led to the first generation of
  • 52. spam, in which web page content was manipulated for the purpose of appearing high up in search results for selected keywords. Spammers resorted to such tricks as rendering these repeated terms in the same colour as the background. Despite these words being consequently invisible to the human user, a search engine indexer would parse the invisible words out of the HTML representation of the web page and index these words as being present in the page.
14. Explain in detail about sparse vectors
• The vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4.
• However, most documents and queries do not contain most words, so vectors are sparse (i.e. most entries are 0).
• Need efficient methods for storing and computing with sparse vectors.
Sparse Vectors as Lists
• Store vectors as linked lists of non-zero-weight tokens paired with a weight.
• Space proportional to the number of unique tokens (n) in the document.
• Requires linear search of the list to find (or change) the weight of a specific token.
• Requires quadratic time in the worst case to compute the vector for a document:
      Σ i=1..n i = n(n+1)/2 = O(n^2)
Sparse Vectors as Trees
• Index the tokens of a document in a balanced binary tree or trie, with the weights stored with the tokens at the leaves.
  Fig: Balanced binary tree over tokens (bit: 2, film: 1, memory: 1, variable: 2) with weights at the leaves
• Space overhead for the tree structure: ~2n nodes.
• O(log n) time to find or update the weight of a specific token.
• O(n log n) time to construct the vector.
• Need a software package to support such data structures.
Sparse Vectors as Hashtables
• Store tokens in a hashtable, with the token string as key and the weight as value.
• Storage overhead for the hashtable: ~1.5n.
• The table must fit in main memory.
• Constant time to find or update the weight of a specific token (ignoring collisions).
• O(n) time to construct the vector (ignoring collisions).
Implementation Based on Inverted Files
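Before turning to inverted files, the hashtable representation just described can be sketched with a Python dict, which gives exactly the expected-constant-time lookup discussed above. The sample texts and the helper names `to_sparse`, `dot` and `cosine` are invented for illustration.

```python
# Sparse vectors as hashtables: store only non-zero token weights,
# so space is proportional to the number of unique tokens and a
# weight lookup is expected constant time.

from collections import Counter
import math

def to_sparse(text):
    """Build a sparse term-frequency vector as a dict (token -> weight)."""
    return dict(Counter(text.split()))

def dot(u, v):
    """Iterate over the shorter vector; absent tokens contribute 0."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0) for t, w in u.items())

def cosine(u, v):
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot(u, v) / (norm(u) * norm(v))

a = to_sparse("memory bit memory film")
b = to_sparse("film variable film")
print(a["memory"], dot(a, b))   # 2 2
```

Only tokens present in both vectors contribute to the dot product, which is the observation that motivates the inverted-file organization described next.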
  • 53. • In practice, document vectors are not stored directly; an inverted organization provides much better efficiency.
• The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree).
• The critical issue is logarithmic or constant-time access to token information.
Inverted Index: Basic Structure
• Term list: a list of all terms.
• Document node: a structure that contains information such as term frequency, document ID, and others.
• Posting list: for each term, a list containing a document node for each document in which the term appears.
Creating an Inverted Index
Create an empty index term list I;
For each document, D, in the document set V
    For each (non-zero) token, T, in D:
        If T is not already in I
            Insert T into I;
        Find the location for T in I;
        If (T, D) is in the posting list for T
            increase its term frequency for T;
        Else
            Create (T, D);
            Add it to the posting list for T;
Computing IDF
Let N be the total number of documents;
For each token, T, in I:
    Determine the total number of documents, M, in which T occurs (the length of T's posting list);
    Set the IDF for T to log(N/M);
Retrieval with an Inverted Index
  Fig: Inverted index with a term list (system, computer, database, science), document frequencies (df), and postings lists of (Dj, tfj) entries
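The index-construction and IDF pseudocode above can be sketched directly in Python. This is a minimal sketch; the three-document collection is invented for illustration.

```python
# Build an inverted index (term -> postings) from a toy document set,
# then compute IDF = log(N/M) per the pseudocode above.

import math
from collections import defaultdict

docs = {
    "D1": "database system design",
    "D2": "computer system architecture",
    "D3": "database query system",
}

# Postings stored as {doc_id: term frequency}; inserting a term that is
# not yet in the index and bumping tf are both handled by the defaultdict.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in text.split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

N = len(docs)
idf = {t: math.log(N / len(postings)) for t, postings in index.items()}

print(sorted(index["system"]))    # ['D1', 'D2', 'D3']
print(round(idf["database"], 3))  # log(3/2) ~ 0.405
```

A term occurring in every document ("system" here) gets IDF log(3/3) = 0, matching the intuition that such a term carries no discriminating power.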
  • 54. • Tokens that are not in both the query and the document do not affect cosine similarity.
– The product of the token weights is zero and does not contribute to the dot product.
• Usually the query is fairly short, and therefore its vector is extremely sparse.
• Use the inverted index to find the limited set of documents that contain at least one of the query words.
15. Latent semantic indexing – Explain
▪ Term-document matrices are very large.
▪ But the number of topics that people talk about is small (in some sense).
➢ Clothes, movies, politics, …
▪ Can we represent the term-document space by a lower-dimensional latent space?
▪ We develop a class of operations from linear algebra, known as matrix decompositions.
▪ We examine the application of low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing.
▪ Latent semantic indexing has not been conclusively established as superior to standard retrieval, but it remains an intriguing approach to clustering.
▪ Let C be an M × N matrix with real-valued entries.
▪ For a term-document matrix, all entries are non-negative.
▪ The rank of a matrix is the number of linearly independent rows (or columns) in it.
▪ Rank(C) ≤ min(M, N)
Eigenvalues & Eigenvectors
▪ For a square M × M matrix S, a (right) eigenvector v with eigenvalue λ satisfies
      S v = λ v,  v ≠ 0
▪ How many eigenvalues are there at most? (S − λI)v = 0 only has a non-zero solution if det(S − λI) = 0. This is an Mth-order equation in λ, which can have at most M distinct solutions (the roots of the characteristic polynomial); they can be complex even though S is real.
▪ Example: the matrix
      S = | 30  0  0 |
          |  0 20  0 |
          |  0  0  1 |
  has eigenvalues 30, 20, 1 with corresponding eigenvectors (1,0,0), (0,1,0), (0,0,1).
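The eigenvalue example can be hand-checked numerically. This sketch assumes the example matrix is the diagonal matrix diag(30, 20, 1), which is consistent with the stated eigenvalues; the helper `matvec` is written out so no linear-algebra library is needed.

```python
# Verify S v = lambda v for the diagonal example matrix:
# the standard basis vectors are eigenvectors with eigenvalues 30, 20, 1.

S = [[30, 0, 0],
     [0, 20, 0],
     [0, 0, 1]]

def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

v1, v2, v3 = [1, 0, 0], [0, 1, 0], [0, 0, 1]
print(matvec(S, v1))   # [30, 0, 0] = 30 * v1
print(matvec(S, v2))   # [0, 20, 0] = 20 * v2
```

For a diagonal matrix the eigenvalues are simply the diagonal entries, which is why this example is easy to verify by inspection.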
  • 55. ▪ On each eigenvector, S acts as a multiple of the identity matrix, but as a different multiple on each.
Matrix-vector multiplication
▪ Any vector, say x = (2, 4, 6), can be viewed as a combination of the eigenvectors: x = 2v1 + 4v2 + 6v3.
▪ Thus a matrix-vector multiplication such as Sx (S, x as on the previous slide) can be rewritten in terms of the eigenvalues/eigenvectors:
      Sx = S(2v1 + 4v2 + 6v3) = 2Sv1 + 4Sv2 + 6Sv3 = 2λ1v1 + 4λ2v2 + 6λ3v3
▪ Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/eigenvectors.
▪ For symmetric matrices, eigenvectors corresponding to distinct eigenvalues are orthogonal.
▪ All eigenvalues of a real symmetric matrix are real.
▪ All eigenvalues of a positive semidefinite matrix are non-negative.
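The claim that S acts on x through its eigenvalues can also be checked numerically. This sketch assumes the diag(30, 20, 1) example matrix from the previous slide and x = (2, 4, 6) = 2v1 + 4v2 + 6v3.

```python
# Check Sx = 2*lambda1*v1 + 4*lambda2*v2 + 6*lambda3*v3 for the
# diagonal example: the direct product must equal the eigen-expansion.

S = [[30, 0, 0], [0, 20, 0], [0, 0, 1]]
x = [2, 4, 6]

def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

direct = matvec(S, x)
via_eigen = [2 * 30, 4 * 20, 6 * 1]   # each coefficient scaled by its eigenvalue
print(direct, via_eigen)              # [60, 80, 6] [60, 80, 6]
```

The two results coincide: multiplying by S just rescales each eigen-coordinate of x by the corresponding eigenvalue, which is the observation that low-rank approximation (and hence LSI) exploits.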