1. SIMS 202
Information Organization and Retrieval
Prof. Marti Hearst and Prof. Ray Larson
UC Berkeley SIMS
Tues/Thurs 9:30-11:00am
Fall 2000
2. Last Time
Web Search
– Directories vs. Search engines
– How web search differs from other search
» Type of data searched over
» Type of searches done
» Type of searchers doing search
– Web queries are short
» This probably means people are often using search
engines to find starting points
» Once at a useful site, they must follow links or use
site search
– Web search ranking combines many features
3. What about Ranking?
Lots of variation here
– Pretty messy in many cases
– Details usually proprietary and fluctuating
Combining subsets of:
– Term frequencies
– Term proximities
– Term position (title, top of page, etc)
– Term characteristics (boldface, capitalized, etc)
– Link analysis information
– Category information
– Popularity information
Most use a variant of vector space ranking to
combine these
Here’s how it might work:
– Make a vector of weights for each feature
– Multiply this by the counts for each feature
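A minimal sketch of that weighted combination in Python; the feature names and weight values below are made up for illustration, not any engine's real numbers:

# Combine ranking features by taking the dot product of a fixed weight
# vector with each document's feature counts. All values are illustrative.
FEATURE_WEIGHTS = {
    "term_frequency":   1.0,   # raw term counts
    "term_in_title":    3.0,   # term position
    "term_in_boldface": 1.5,   # term characteristics
    "term_proximity":   2.0,
    "inlink_count":     2.5,   # link analysis
    "popularity":       0.5,
}

def score(feature_counts):
    return sum(weight * feature_counts.get(name, 0)
               for name, weight in FEATURE_WEIGHTS.items())

# A page where the query term appears 4 times, once in the title,
# and 12 other pages link in:
print(score({"term_frequency": 4, "term_in_title": 1, "inlink_count": 12}))  # 37.0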
4. From a description of the Northern Light search engine, by Mark Krellenstein
http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
5. High-Precision Ranking
Proximity search can help get high-precision results if the query has more than one term
– Hearst ’96 paper:
» Combine Boolean and passage-level proximity
» Showed significant improvements when retrieving the top 5, 10, 20, and 30 documents
» Results reproduced by Mitra et al. 98
» Google uses something similar
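A toy sketch of passage-level proximity filtering in Python: every query term must occur (Boolean AND) and all of them must co-occur within a fixed window of words. This shows only the general idea, not the exact method from either paper:

# Keep a document only if all query terms co-occur within a window of words.
def passes_proximity(doc_words, query_terms, window=50):
    positions = {t: [i for i, w in enumerate(doc_words) if w == t]
                 for t in query_terms}
    if any(not p for p in positions.values()):
        return False                               # Boolean AND: every term must occur
    for start in range(len(doc_words)):            # slide a window over the document
        if all(any(start <= i < start + window for i in positions[t])
               for t in query_terms):
            return True
    return False

doc = "cheap used cars for sale trucks and vans".split()
print(passes_proximity(doc, ["used", "cars"], window=5))   # True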
7. Spam
Email Spam:
– Undesired content
Web Spam:
– Content is disguised as something it is
not, in order to
» Be retrieved more often than it otherwise
would
» Be retrieved in contexts that it otherwise
would not be retrieved in
8. Web Spam
What are the types of Web spam?
– Add extra terms to get a higher ranking
» Repeat “cars” thousands of times
– Add irrelevant terms to get more hits
» Put a dictionary in the comments field
» Put extra terms in the same color as the background
of the web page
– Add irrelevant terms to get different types of
hits
» Put “sex” in the title field in sites that are selling
cars
– Add irrelevant links to boost your link analysis
ranking
There is a constant “arms race” between
web search companies and spammers
9. Commercial Issues
General internet search is often
commercially driven
– Commercial sector sometimes hides things –
harder to track than research
– On the other hand, most CTOs for search
engine companies used to be researchers, and
so help us out
– Commercial search engine information changes
monthly
– Sometimes motivations are commercial rather
than technical
» Goto.com uses payments to determine ranking order
» iwon.com gives out prizes
11. Web Search Architecture
Preprocessing
– Collection gathering phase
» Web crawling
– Collection indexing phase
Online
– Query servers
– This part not talked about in the
readings
12. From a description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
13. Standard Web Search Engine Architecture
[Architecture diagram: crawl the web; check for duplicates and store the documents (assigning DocIds); create an inverted index; search engine servers answer user queries against the inverted index and show results to the user]
15. Inverted Indexes for Web Search Engines
Inverted indexes are still used, even
though the web is so huge
Some systems partition the indexes across
different machines; each machine handles
different parts of the data
Other systems duplicate the data across
many machines; queries are distributed
among the machines
Most do a combination of these
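A small sketch of the document-partitioned approach, using hypothetical Partition objects: each one indexes a different slice of the collection, a query fans out to every partition, and the partial hit lists are merged:

from collections import defaultdict

class Partition:
    def __init__(self):
        self.index = defaultdict(set)          # term -> set of doc ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        return self.index.get(term, set())

partitions = [Partition() for _ in range(3)]   # three "machines"

def add_document(doc_id, text):
    partitions[doc_id % len(partitions)].add(doc_id, text)   # partition by doc id

def search(term):
    hits = set()
    for p in partitions:                       # fan the query out to every partition
        hits |= p.search(term)
    return hits

add_document(1, "restoring classic cars")
add_document(2, "used cars for sale")
print(search("cars"))                          # {1, 2}

Replication is the mirror image: the same whole index is copied onto many machines and each incoming query is routed to just one copy.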
16. In this example, the data
for the pages is
partitioned across
machines. Additionally,
each partition is allocated
multiple machines to
handle the queries.
Each row can handle 120
queries per second
Each column can handle
7M pages
To handle more queries,
add another row.
From a description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
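A back-of-the-envelope sizing sketch using the figures quoted above (120 queries/second per row, 7M pages per column); the target workload numbers are invented:

import math

QPS_PER_ROW = 120                 # from the slide
PAGES_PER_COLUMN = 7_000_000      # from the slide

target_qps = 1_000                # hypothetical workload
target_pages = 100_000_000

rows = math.ceil(target_qps / QPS_PER_ROW)          # 9 rows of machines
cols = math.ceil(target_pages / PAGES_PER_COLUMN)   # 15 columns of machines
print(rows, cols, rows * cols)                      # 9 15 135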
17. Cascading Allocation of CPUs
A variation on this that produces a cost savings:
– Put high-quality/common pages on many
machines
– Put lower quality/less common pages on
fewer machines
– Query goes to high quality machines
first
– If no hits found there, go to other
machines
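A minimal sketch of this cascading lookup, assuming each tier object wraps a group of index servers and exposes a search() method (all names are hypothetical):

def cascaded_search(query, tiers):
    # Try tiers in order, from high-quality/common pages to low-quality/rare ones.
    for tier in tiers:
        hits = tier.search(query)
        if hits:                    # stop at the first tier that returns results
            return hits
    return []

# e.g. results = cascaded_search("classic cars", [popular_tier, long_tail_tier])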
18. Web Crawlers
How do the web search engines get all
of the items they index?
Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
19. Web Crawlers
How do the web search engines get all of
the items they index?
More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
» Take the first page off of the queue
» If this page has not yet been processed:
Record the information found on this page
– Positions of words, links going out, etc
Add each link on the current page to the queue
Record that this page has been processed
In what order should the links be followed?
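The loop above translates almost directly into Python; fetch() and extract_links() are hypothetical helpers, and a real crawler would also need politeness delays, robots.txt checks, and error handling:

from collections import deque

def crawl(seed_urls, fetch, extract_links):
    queue = deque(seed_urls)            # put the known sites on a queue
    processed = set()
    collection = {}
    while queue:                        # repeat until the queue is empty
        url = queue.popleft()           # take the first page off the queue
        if url in processed:            # only handle pages not yet processed
            continue
        page = fetch(url)
        collection[url] = page          # record word positions, outgoing links, etc.
        for link in extract_links(page):
            queue.append(link)          # add each link on the page to the queue
        processed.add(url)              # record that this page has been processed
    return collection

Because popleft() takes pages in first-in, first-out order, this version follows links breadth-first; the next slides contrast that with depth-first order.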
20. Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
Structure to be traversed
21. Page Visit Order
Breadth-first search
(animated in the original presentation; see the URL on the previous slide)
22. Page Visit Order
Depth-first search
(animated in the original presentation; see the URL on the previous slide)
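A small sketch contrasting the two visit orders on a made-up link graph; only which end of the frontier gets popped differs:

from collections import deque

links = {                               # hypothetical link structure
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def visit(start, depth_first=False):
    frontier, seen, order = deque([start]), set(), []
    while frontier:
        page = frontier.pop() if depth_first else frontier.popleft()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        frontier.extend(links[page])
    return order

print(visit("A"))                      # breadth-first: A B C D E F
print(visit("A", depth_first=True))    # depth-first:   A C F B E D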
26. Web Crawling Issues
Keep out signs
– A file called robots.txt tells the crawler which
directories are off limits
Freshness
– Figure out which pages change often
– Recrawl these often
Duplicates, virtual hosts, etc
– Hash the contents of each page
– Compare new pages' hashes against the table of hashes already seen (a small sketch follows this slide)
Lots of problems
– Server unavailable
– Incorrect HTML
– Missing links
– Infinite loops
Web crawling is difficult to do robustly!
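A minimal sketch of the hash-based duplicate check mentioned above; exact hashing only catches byte-identical copies, so near-duplicates need more elaborate techniques:

import hashlib

seen_hashes = set()                    # the "hash table" of pages already stored

def is_duplicate(page_text):
    digest = hashlib.md5(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True                    # an identical page was crawled before
    seen_hashes.add(digest)
    return False

print(is_duplicate("Welcome to our site"))   # False, first copy
print(is_duplicate("Welcome to our site"))   # True, exact duplicate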
27. Cha-Cha
Cha-Cha searches an intranet
– Sites associated with an organization
Instead of hand-edited categories
– Computes shortest path from the root
for each hit
– Organizes search results according to
which subdomain the pages are found in
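A sketch of that result organization in Python, assuming the crawler recorded each page's breadth-first depth from the root (which equals its shortest-path distance); the URLs and depths below are made up:

from collections import defaultdict
from urllib.parse import urlparse

hits = [                               # (url, link distance from the root server)
    ("http://www.sims.berkeley.edu/courses/", 2),
    ("http://www.berkeley.edu/students/", 1),
    ("http://www.sims.berkeley.edu/", 1),
]

def organize(hits):
    groups = defaultdict(list)
    for url, depth in hits:
        groups[urlparse(url).hostname].append((depth, url))   # group by subdomain
    return {host: sorted(pages) for host, pages in groups.items()}

for host, pages in organize(hits).items():
    print(host, pages)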
28. Cha-Cha Web Crawling Algorithm
Start with a list of servers to crawl
– for UCB, simply start with www.berkeley.edu
Restrict crawl to certain domain(s)
– *.berkeley.edu
Obey No Robots standard
Follow hyperlinks only
– do not read local filesystems
» links are placed on a queue
» traversal is breadth-first
See first lecture or the technical papers for
more information
29. Summary
Web search differs from traditional IR
systems
– Different kind of collection
– Different kinds of users/queries
– Different economic motivations
Ranking combines many features in a
difficult-to-specify manner
– Link analysis and proximity of terms seem especially important
– This is in contrast to the term-frequency
orientation of standard search
» Why?
30. Summary (cont.)
Web search engine architecture
– Similar in many ways to standard IR
– Indexes usually duplicated across
machines to handle many queries quickly
Web crawling
– Used to create the collection
– Can be guided by quality metrics
– Is very difficult to do robustly