2. Main points
Introduction
Text operations and Indexing
Performance evaluation
Search engines as IR tools
Metasearch engines
IR Applications
Some current research topics in IRS
Current conferences in information retrieval
3. Introduction
Information Retrieval (IR) is the discipline that deals with the retrieval of
unstructured data, especially textual documents, in response to a
query.
[Figure: IR system architecture. Components: User Interface, User need, Text Operations, Indexing (inverted file), Documents, Index, Similarity Computation (Searching), Retrieved docs, Ranking, Ranked docs.]
4. Text operations and Indexing
Text operations: reduce the complexity of the document
representation
Example: Q = “List of the European countries”
Index terms: List, Europe, country
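The reduction from the query to its three index terms can be sketched as follows. This is a minimal illustration, not a production pipeline: the stopword set and the `NORMAL_FORMS` lookup are assumptions made up for this example, standing in for a full stopword list and a real stemmer or lemmatizer.

```python
import re

# Assumed, illustrative resources: a real system would use a full stopword
# list and a proper stemmer or lemmatizer instead of these tiny tables.
STOPWORDS = {"of", "the", "a", "an", "in", "and"}
NORMAL_FORMS = {"european": "europe", "countries": "country"}


def text_operations(text):
    """Lowercase, tokenize, drop stopwords, and map words to a normal form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [NORMAL_FORMS.get(t, t) for t in tokens if t not in STOPWORDS]


index_terms = text_operations("List of the European countries")
print(index_terms)  # ['list', 'europe', 'country']
```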
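Indexing: builds a data structure over the text (an inverted file) to speed up
searching. A simple alternative is to search the whole text sequentially,
which becomes slow as the collection grows.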
Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6
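A vocabulary with occurrence lists like the one above can be built in a single pass over the text. The sketch below assumes that an occurrence is the 1-based character position where a word starts; the sample sentence is hypothetical and is not the text behind the numbers on the slide.

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each word to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[a-z]+", text.lower()):
        index[match.group()].append(match.start() + 1)
    return dict(index)


# Hypothetical sample text, used only to show the shape of the structure.
sample = "The house has a garden. The garden has flowers."
inverted = build_inverted_index(sample)
print(inverted["garden"])  # [17, 29]
```

Looking up a query term is then a dictionary access instead of a scan over the whole text.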
6. Popular search engines
Google
Yahoo
Bing
…
Google search engine
Google orders its search results by priority
The priority ranking uses the “PageRank” algorithm
Google queries can use Boolean-style operators such as:
exclusion (-aa), alternatives (aa OR bb)
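Over an inverted index, these two operators map naturally onto set operations. The sketch below is only an illustration of that idea and says nothing about how Google evaluates queries; the `POSTINGS` table and the function names are hypothetical.

```python
# Hypothetical inverted index: term -> set of IDs of documents containing it.
POSTINGS = {
    "aa": {1, 2, 5},
    "bb": {2, 3},
    "cc": {1, 3, 4},
}


def docs_with(term):
    """Documents that contain the term (empty set for unknown terms)."""
    return POSTINGS.get(term, set())


def either(term_a, term_b):
    """aa OR bb: documents containing at least one of the two terms."""
    return docs_with(term_a) | docs_with(term_b)


def excluding(candidates, term):
    """-aa: drop every candidate document that contains the term."""
    return candidates - docs_with(term)


alternatives = either("aa", "bb")            # {1, 2, 3, 5}
filtered = excluding(docs_with("cc"), "aa")  # {3, 4}
```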
7. PageRank algorithm
PageRank is an algorithm used by the Google search
engine to rank websites in its search
results.
Simplified example, for a page B that is linked from pages C, D, E and F:
PR(B) = PR(E) + PR(F) + PR(D) + PR(C)
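The formula above is a simplification. In the commonly published formulation, each linking page q passes on its rank divided by its number of outgoing links L(q), and a damping factor d (often 0.85) is applied: PR(p) = (1 - d)/N + d * sum of PR(q)/L(q) over all pages q linking to p, where N is the number of pages. A minimal iterative sketch under those assumptions follows; the six-page link graph is hypothetical and was chosen so that C, D, E and F link to B.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank. Assumes every page has at least one outgoing link."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            # A page splits its damped rank evenly among the pages it links to.
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["C"],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["B"],
    "F": ["B"],
}
ranks = pagerank(LINKS)
top_page = max(ranks, key=ranks.get)  # "B", the page with the most in-links
```

Pages without outgoing links would need extra handling (for example, spreading their rank over all pages), which this sketch leaves out.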
8. Googlebot : Google’s Web Crawler
Googlebot is Google’s web crawling robot, which finds
and retrieves pages on the web and hands them off to
the Google indexer.
Googlebot finds pages in two ways:
Through an add URL form, www.google.com/addurl.html
Through links found while crawling the web.
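The second route, following links from page to page, can be sketched as a breadth-first traversal. This is a generic illustration and not Googlebot's actual implementation: the `WEB` dictionary is a hypothetical in-memory stand-in for fetching pages over the network, and `crawl` and `fetch_links` are names invented for this example.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)  # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return fetched


order = crawl(["a.example/"], lambda url: WEB.get(url, []))
```

The seed list plays the role of the submitted URLs; everything else is discovered through links.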
12. Metasearch engines
A metasearch engine is a search tool that sends user
requests to several other search engines and/or
databases and either aggregates the results into a single list or
displays them according to their source.
Metasearch engines enable users to enter search criteria
once and access several search engines simultaneously.
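One simple way to aggregate results into a single list is to interleave the ranked lists round-robin and drop duplicates. The sketch below assumes each engine has already returned a ranked list of URLs; the engine names and results are hypothetical, and real metasearch engines may use more elaborate score-based fusion.

```python
def merge_results(results_by_engine):
    """Interleave ranked lists round-robin, keeping the first copy of each URL."""
    merged, seen = [], set()
    longest = max((len(r) for r in results_by_engine.values()), default=0)
    for position in range(longest):
        for engine, results in results_by_engine.items():
            if position < len(results) and results[position] not in seen:
                seen.add(results[position])
                merged.append((results[position], engine))
    return merged


# Hypothetical results, standing in for what each engine would return.
results = {
    "engine_a": ["x.example", "y.example", "z.example"],
    "engine_b": ["y.example", "w.example"],
}
single_list = merge_results(results)
```

Keeping the source engine next to each URL also supports the other display mode, grouping results by where they came from.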
15. Some current research topics in IRS
Visual Indexing
Indexing of video, images, and audio
Visual content extraction
Machine learning in information retrieval
Web information retrieval (including blogs)
Mobile computing related information retrieval issues
Performance measures
Query languages and optimization
16. What is MapReduce ?
MapReduce is a programming model for processing
large data sets. It consists of two jobs.
The first is the map job, which takes a set of data
and converts it into another set of data, where
individual elements are broken down into tuples
(key/value pairs).
The second is the reduce job, which takes the output from a map as input
and combines those data tuples into a smaller set of
tuples.
18. Programming Model
Map(k1,v1) → list(k2,v2)
Reduce(k2, list(v2)) → list(v3)
Ex: 5 files of (city, temperature) records
File 1:
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
19. Programming Model (continued..)
We want to find the maximum temperature for each
city across all of the data files.
Break this into 5 map tasks.
Each mapper works on one file and returns the maximum
temperature for each city in that file.
All five of these output streams are fed into the
reduce tasks, which combine the input results and
output a single value for each city, producing the final
result.
20. Programming Model (continued..)
Map output (four mappers shown):
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
Reduce output:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
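The maximum-temperature example can be simulated in a single process as a rough sketch. It assumes that the four map outputs listed above belong to the four files whose contents are not shown, so only File 1 is actually run through the mapper. The function names are illustrative and do not belong to any MapReduce framework.

```python
from collections import defaultdict

# File 1 as given in the example. The other four files are not shown,
# so their map outputs are taken as given (an assumption of this sketch).
FILE_1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
          ("Toronto", 4), ("Rome", 33), ("New York", 18)]
OTHER_MAP_OUTPUTS = [
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]


def map_task(records):
    """One mapper: emit (city, max temperature) for a single file."""
    best = {}
    for city, temp in records:
        best[city] = max(temp, best.get(city, temp))
    return list(best.items())


def shuffle(map_outputs):
    """Group values by key, as happens between the map and reduce phases."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for city, temp in output:
            grouped[city].append(temp)
    return grouped


def reduce_task(city, temps):
    """One reducer call: a single maximum per city."""
    return city, max(temps)


map_outputs = [map_task(FILE_1)] + OTHER_MAP_OUTPUTS
result = dict(reduce_task(c, t) for c, t in shuffle(map_outputs).items())
print(result)
```

Here `map_task` plays the role of Map(k1,v1) → list(k2,v2) applied to one file, and `reduce_task` the role of Reduce(k2, list(v2)) → list(v3), with city names as keys k2 and temperatures as values v2.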
21. MapReduce uses
MapReduce is useful in a wide range of applications,
including distributed pattern-based searching, distributed
sorting, web link-graph reversal, term-vector per host,
web access log stats, inverted index construction,
document clustering, and machine learning.
Moreover, the MapReduce model has been adapted to
several computing environments like multi-core systems,
desktop grids, dynamic cloud environments, and mobile
environments.
At Google, MapReduce was used to completely
regenerate Google's index of the World Wide Web. It
replaced the old ad hoc programs that updated the index
and ran the various analyses.
22. Current conferences in information retrieval
3rd Spanish Conference on Information Retrieval: 2014, June 20, Spain
The European Conference on Information Retrieval: 2014, April 17, Netherlands
7th International Workshop on Information Filtering and Retrieval: 2013, Dec 6, Italy
Digital libraries: video recordings, PPT slides, presentations, audio recordings, … The electronic content may be stored locally or accessed remotely via computer networks.
Enterprise search: how an organization helps people find the information they need, in any format, from anywhere inside the company: in databases, document management systems, on paper, wherever. Having powerful search tools available does not mean that content should be left unorganized.
Desktop search: everything on the PC, plus internet browsing and mail.