1. SIMS 202
Information Organization and Retrieval
Prof. Marti Hearst and Prof. Ray Larson
UC Berkeley SIMS
Tues/Thurs 9:30-11:00am
Fall 2000
2. Last Time
Web Search
– Directories vs. Search engines
– How web search differs from other search
» Type of data searched over
» Type of searches done
» Type of searchers doing search
– Web queries are short
» This probably means people are often using search
engines to find starting points
» Once at a useful site, they must follow links or use
site search
– Web search ranking combines many features
3. What about Ranking?
Lots of variation here
– Pretty messy in many cases
– Details usually proprietary and fluctuating
Combining subsets of:
– Term frequencies
– Term proximities
– Term position (title, top of page, etc)
– Term characteristics (boldface, capitalized, etc)
– Link analysis information
– Category information
– Popularity information
Most use a variant of vector space ranking to
combine these
Here’s how it might work:
– Make a vector of weights for each feature
– Multiply this by the counts for each feature
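A minimal sketch of that weighted combination in Python; the feature names and weight values below are made up for illustration, not any engine's real numbers:

# Combine ranking features by taking the dot product of a fixed weight
# vector with each document's feature counts. All values are illustrative.
FEATURE_WEIGHTS = {
    "term_frequency":   1.0,   # raw term counts
    "term_in_title":    3.0,   # term position
    "term_in_boldface": 1.5,   # term characteristics
    "term_proximity":   2.0,
    "inlink_count":     2.5,   # link analysis
    "popularity":       0.5,
}

def score(feature_counts):
    return sum(weight * feature_counts.get(name, 0)
               for name, weight in FEATURE_WEIGHTS.items())

# A page where the query term appears 4 times, once in the title,
# and 12 other pages link in:
print(score({"term_frequency": 4, "term_in_title": 1, "inlink_count": 12}))  # 37.0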
4. From a description of the Northern Light search engine, by Mark Krellenstein
http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
5. High-Precision Ranking
Proximity search can help get high-precision results if the query has more than one term
– Hearst ’96 paper:
» Combine Boolean and passage-level proximity
» Showed significant improvements when retrieving the top 5, 10, 20, and 30 documents
» Results reproduced by Mitra et al. 98
» Google uses something similar
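A toy sketch of passage-level proximity filtering in Python: every query term must occur (Boolean AND) and all of them must co-occur within a fixed window of words. This shows only the general idea, not the exact method from either paper:

# Keep a document only if all query terms co-occur within a window of words.
def passes_proximity(doc_words, query_terms, window=50):
    positions = {t: [i for i, w in enumerate(doc_words) if w == t]
                 for t in query_terms}
    if any(not p for p in positions.values()):
        return False                               # Boolean AND: every term must occur
    for start in range(len(doc_words)):            # slide a window over the document
        if all(any(start <= i < start + window for i in positions[t])
               for t in query_terms):
            return True
    return False

doc = "cheap used cars for sale trucks and vans".split()
print(passes_proximity(doc, ["used", "cars"], window=5))   # True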
7. Spam
Email Spam:
– Undesired content
Web Spam:
– Content is disguised as something it is
not, in order to
» Be retrieved more often than it otherwise
would
» Be retrieved in contexts that it otherwise
would not be retrieved in
8. Web Spam
What are the types of Web spam?
– Add extra terms to get a higher ranking
» Repeat “cars” thousands of times
– Add irrelevant terms to get more hits
» Put a dictionary in the comments field
» Put extra terms in the same color as the background
of the web page
– Add irrelevant terms to get different types of
hits
» Put “sex” in the title field in sites that are selling
cars
– Add irrelevant links to boost your link analysis
ranking
There is a constant “arms race” between
web search companies and spammers
9. Commercial Issues
General internet search is often
commercially driven
– Commercial sector sometimes hides things –
harder to track than research
– On the other hand, most CTOs for search
engine companies used to be researchers, and
so help us out
– Commercial search engine information changes
monthly
– Sometimes motivations are commercial rather
than technical
» Goto.com uses payments to determine ranking order
» iwon.com gives out prizes
11. Web Search Architecture
Preprocessing
– Collection gathering phase
» Web crawling
– Collection indexing phase
Online
– Query servers
– This part not talked about in the
readings
12. From a description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
13. Standard Web Search Engine Architecture
[Architecture diagram: crawl the web; check for duplicates and store the documents (assigning DocIds); create an inverted index; search engine servers answer user queries against the inverted index and show results to the user]
15. Inverted Indexes for Web Search Engines
Inverted indexes are still used, even
though the web is so huge
Some systems partition the indexes across
different machines; each machine handles
different parts of the data
Other systems duplicate the data across
many machines; queries are distributed
among the machines
Most do a combination of these
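A small sketch of the document-partitioned approach, using hypothetical Partition objects: each one indexes a different slice of the collection, a query fans out to every partition, and the partial hit lists are merged:

from collections import defaultdict

class Partition:
    def __init__(self):
        self.index = defaultdict(set)          # term -> set of doc ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        return self.index.get(term, set())

partitions = [Partition() for _ in range(3)]   # three "machines"

def add_document(doc_id, text):
    partitions[doc_id % len(partitions)].add(doc_id, text)   # partition by doc id

def search(term):
    hits = set()
    for p in partitions:                       # fan the query out to every partition
        hits |= p.search(term)
    return hits

add_document(1, "restoring classic cars")
add_document(2, "used cars for sale")
print(search("cars"))                          # {1, 2}

Replication is the mirror image: the same whole index is copied onto many machines and each incoming query is routed to just one copy.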
16. In this example, the data
for the pages is
partitioned across
machines. Additionally,
each partition is allocated
multiple machines to
handle the queries.
Each row can handle 120
queries per second
Each column can handle
7M pages
To handle more queries,
add another row.
From a description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
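A back-of-the-envelope sizing sketch using the figures quoted above (120 queries/second per row, 7M pages per column); the target workload numbers are invented:

import math

QPS_PER_ROW = 120                 # from the slide
PAGES_PER_COLUMN = 7_000_000      # from the slide

target_qps = 1_000                # hypothetical workload
target_pages = 100_000_000

rows = math.ceil(target_qps / QPS_PER_ROW)          # 9 rows of machines
cols = math.ceil(target_pages / PAGES_PER_COLUMN)   # 15 columns of machines
print(rows, cols, rows * cols)                      # 9 15 135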
17. Cascading Allocation of CPUs
A variation on this that produces a cost savings:
– Put high-quality/common pages on many
machines
– Put lower quality/less common pages on
fewer machines
– Query goes to high quality machines
first
– If no hits found there, go to other
machines
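A minimal sketch of this cascading lookup, assuming each tier object wraps a group of index servers and exposes a search() method (all names are hypothetical):

def cascaded_search(query, tiers):
    # Try tiers in order, from high-quality/common pages to low-quality/rare ones.
    for tier in tiers:
        hits = tier.search(query)
        if hits:                    # stop at the first tier that returns results
            return hits
    return []

# e.g. results = cascaded_search("classic cars", [popular_tier, long_tail_tier])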
18. Web Crawlers
How do the web search engines get all
of the items they index?
Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
19. Web Crawlers
How do the web search engines get all of
the items they index?
More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
» Take the first page off of the queue
» If this page has not yet been processed:
Record the information found on this page
– Positions of words, links going out, etc
Add each link on the current page to the queue
Record that this page has been processed
In what order should the links be followed?
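The loop above translates almost directly into Python; fetch() and extract_links() are hypothetical helpers, and a real crawler would also need politeness delays, robots.txt checks, and error handling:

from collections import deque

def crawl(seed_urls, fetch, extract_links):
    queue = deque(seed_urls)            # put the known sites on a queue
    processed = set()
    collection = {}
    while queue:                        # repeat until the queue is empty
        url = queue.popleft()           # take the first page off the queue
        if url in processed:            # only handle pages not yet processed
            continue
        page = fetch(url)
        collection[url] = page          # record word positions, outgoing links, etc.
        for link in extract_links(page):
            queue.append(link)          # add each link on the page to the queue
        processed.add(url)              # record that this page has been processed
    return collection

Because popleft() takes pages in first-in, first-out order, this version follows links breadth-first; the next slides contrast that with depth-first order.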
20. Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
Structure to be traversed
21. Page Visit Order
Breadth-first search
(animated in the original presentation; see the URL on the previous slide)
22. Page Visit Order
Depth-first search
(animated in the original presentation; see the URL on the previous slide)
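A small sketch contrasting the two visit orders on a made-up link graph; only which end of the frontier gets popped differs:

from collections import deque

links = {                               # hypothetical link structure
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def visit(start, depth_first=False):
    frontier, seen, order = deque([start]), set(), []
    while frontier:
        page = frontier.pop() if depth_first else frontier.popleft()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        frontier.extend(links[page])
    return order

print(visit("A"))                      # breadth-first: A B C D E F
print(visit("A", depth_first=True))    # depth-first:   A C F B E D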
26. Web Crawling Issues
Keep out signs
– A file called robots.txt tells the crawler which
directories are off limits
Freshness
– Figure out which pages change often
– Recrawl these often
Duplicates, virtual hosts, etc
– Hash the contents of each page
– Compare new pages' hashes against the table of hashes already seen (a small sketch follows this slide)
Lots of problems
– Server unavailable
– Incorrect HTML
– Missing links
– Infinite loops
Web crawling is difficult to do robustly!
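A minimal sketch of the hash-based duplicate check mentioned above; exact hashing only catches byte-identical copies, so near-duplicates need more elaborate techniques:

import hashlib

seen_hashes = set()                    # the "hash table" of pages already stored

def is_duplicate(page_text):
    digest = hashlib.md5(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True                    # an identical page was crawled before
    seen_hashes.add(digest)
    return False

print(is_duplicate("Welcome to our site"))   # False, first copy
print(is_duplicate("Welcome to our site"))   # True, exact duplicate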
27. Cha-Cha
Cha-Cha searches an intranet
– Sites associated with an organization
Instead of hand-edited categories
– Computes shortest path from the root
for each hit
– Organizes search results according to
which subdomain the pages are found in
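A sketch of that result organization in Python, assuming the crawler recorded each page's breadth-first depth from the root (which equals its shortest-path distance); the URLs and depths below are made up:

from collections import defaultdict
from urllib.parse import urlparse

hits = [                               # (url, link distance from the root server)
    ("http://www.sims.berkeley.edu/courses/", 2),
    ("http://www.berkeley.edu/students/", 1),
    ("http://www.sims.berkeley.edu/", 1),
]

def organize(hits):
    groups = defaultdict(list)
    for url, depth in hits:
        groups[urlparse(url).hostname].append((depth, url))   # group by subdomain
    return {host: sorted(pages) for host, pages in groups.items()}

for host, pages in organize(hits).items():
    print(host, pages)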
28. Cha-Cha Web Crawling Algorithm
Start with a list of servers to crawl
– for UCB, simply start with www.berkeley.edu
Restrict crawl to certain domain(s)
– *.berkeley.edu
Obey No Robots standard
Follow hyperlinks only
– do not read local filesystems
» links are placed on a queue
» traversal is breadth-first
See first lecture or the technical papers for
more information
29. Summary
Web search differs from traditional IR
systems
– Different kind of collection
– Different kinds of users/queries
– Different economic motivations
Ranking combines many features in a
difficult-to-specify manner
– Link analysis and proximity of terms seem especially important
– This is in contrast to the term-frequency
orientation of standard search
» Why?
30. Summary (cont.)
Web search engine architecture
– Similar in many ways to standard IR
– Indexes usually duplicated across
machines to handle many queries quickly
Web crawling
– Used to create the collection
– Can be guided by quality metrics
– Is very difficult to do robustly