Searching the Web: General and
Scientific Information Access
Steve Lawrence and C. Lee Giles, NEC Research Institute



ABSTRACT

The World Wide Web has revolutionized the way people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government, and health care. There are many avenues for improvement of the Web; for example, in the areas of locating and organizing information. Current techniques for access to both general and scientific information on the Web leave much room for improvement; search engines do not provide comprehensive indices of the Web and have difficulty in accurately ranking the relevance of results. Scientific information on the Web is very disorganized. We discuss the effectiveness of Web search engines, including results showing that the major Web search engines cover only a fraction of the “publicly indexable Web.” Current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search. The creation of digital libraries incorporating autonomous citation indexing is discussed for improved access to scientific information on the Web.

The World Wide Web is revolutionizing the way people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government, and health care. The amount of publicly available information on the Web is increasing rapidly [1]. The Web is a gigantic digital library, a searchable 15 billion word encyclopedia [2]. It has stimulated research and development in information retrieval and dissemination, and fostered search engines such as AltaVista. These new developments are not limited to the Web, and can enhance access to virtually all forms of digital libraries.

The revolution the Web has brought to information access is due not so much to the availability of information (huge amounts of information have long been available in libraries and elsewhere) as to the increased efficiency of accessing information, which can make previously impractical tasks practical. There are many avenues for improvement in the efficiency of accessing information on the Web, for example, in the areas of locating and organizing information.

This article discusses general and scientific information access on the Web, and many of our comments are applicable to digital libraries in general. The effectiveness of Web search engines is discussed, including results that show that the major search engines cover only a fraction of the “publicly indexable Web” (the part of the Web that is considered for indexing by the major engines, which excludes pages hidden behind search forms, pages with authorization requirements, etc.). Current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search.

The amount of scientific information and the number of electronic journals on the Internet continue to increase. Researchers are increasingly making their work available online. This article also discusses the creation of digital libraries of the scientific literature, incorporating autonomous citation indexing. The autonomous creation of citation indices is possible today, and can improve access to scientific information on the Web or in other digital libraries of scientific articles.

WEB SEARCH

One of the key aspects of the World Wide Web that makes it a valuable information resource is that the full text of documents can be searched using Web search engines such as AltaVista and HotBot. Just how effective are the Web search engines? The following sections discuss the effectiveness of current engines and current research into improved techniques.

THE COMPREHENSIVENESS AND RECENCY OF THE WEB SEARCH ENGINES

This section considers the effectiveness of the major Web search engines in terms of comprehensiveness and recency. We provide results on the size of the Web, the coverage of each search engine, and the freshness of the search engine databases. These results show that none of the search engines covers more than about one third of the publicly indexable Web, and that the freshness of the various databases varies significantly.

Typical quotes regarding the coverage and recency of the major search engine databases include: “If you can’t find it using AltaVista search, it’s probably not out there” [3], “[With AltaVista] you can find new information just about as quickly as it’s available on the Web” [3], and “HotBot is the first search robot capable of indexing and searching the entire Web” [4]. However, the World Wide Web is a distributed, dynamic, and rapidly growing [1] information resource that presents difficulties to traditional information retrieval technologies. Traditional information retrieval software was designed for different environments, and has typically been used for indexing a static collection of directly accessible documents. The nature of the Web raises questions such as: can the centralized architecture of the search engines keep up with the increasing number of documents on the Web? Can they update their databases regularly enough to detect modified, deleted, and relocated information? The answers to these questions affect the best methodology to use when searching the Web, and the future of Web search technology.

We performed a study of the comprehensiveness and recency of the major Web search engines in December 1997 by analyzing the responses of AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light to 575 queries made by employees at the NEC Research Institute [1]. Search engines rank documents differently and can return documents that do not contain the query terms (e.g., pages with morphological variants or synonyms). Therefore, we only considered queries for which we could download the full text of every document that each engine reported as matching the query. Documents were only counted if they could be downloaded and contained the query terms. We handled other important details such as the normalization of URLs, and capitalization and morphology (full details can be found in [1]).
Search engine                        HotBot   AltaVista   Northern Light   Excite   Infoseek   Lycos
Coverage WRT estimated Web size      34%      28%         20%              14%      10%        3%
Percentage of dead links returned    5.3%     2.5%        5.0%             2.0%     2.6%       1.6%

Table 1. Estimated coverage of each engine with respect to the estimated size of the Web, and the percentage of invalid links returned by each engine (from 575 queries performed December 15–17, 1997).
Table 1 shows the estimated coverage of the search engines, which varies by an order of magnitude. This variation is much greater than would be expected from considering the number of pages that each engine reports to have indexed. The variation may be explained by differences in indexing or retrieval technology between the engines (e.g., an engine would appear to be smaller if it only indexed part of the text on some pages), or differences in the kinds of pages indexed (our study used mostly scientific queries, which may not be covered as well if an engine focuses more on well-connected, “popular” pages). Note that the results in the table are specific to the particular queries performed (typical queries made by scientists), and to the state of the engine databases at the time they were performed.
We estimated a lower bound on the size of the publicly indexable Web of 320 million pages. To produce this estimate, we analyzed the overlap between pairs of engines [1]. Consider two engines a and b. Under the assumption that each engine samples the Web independently, the quantity n_o/n_b, where n_o is the number of documents returned by both engines and n_b is the number of documents returned by engine b, is an estimate of the fraction p_a of the indexable Web covered by engine a. The size of the indexable Web can then be estimated as s_a/p_a, where s_a is the number of pages indexed by engine a. This technique is limited because the engines do not choose pages to sample independently; they all allow pages to be registered, and they are typically biased toward indexing more popular or well-connected pages. To estimate the size of the Web we used the overlap between the largest two engines, where the independence assumption is more valid (the larger engines can index more of the nonregistered and less popular pages). Some dependence between the sampling of the largest two engines remains, and therefore this estimate is a lower bound. Using this estimate of the size of the Web, we found that no engine indexes more than about one third of the indexable Web. We also found that combining the results of the six engines returned approximately 3.5 times more documents on average than using only one engine.
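As a concrete illustration, the overlap estimate can be computed in a few lines (a minimal sketch; the numbers below are invented for illustration and are not the study’s data):

```python
def estimate_web_size(n_overlap, n_b, s_a):
    """Estimate the size of the indexable Web from the overlap of two engines.

    n_overlap: number of documents returned by both engines a and b
    n_b:       number of documents returned by engine b
    s_a:       total number of pages engine a reports indexing

    Assuming the engines sample the Web independently, n_overlap / n_b
    estimates p_a, the fraction of the indexable Web covered by engine a,
    and s_a / p_a then estimates the size of the indexable Web. In practice
    this is a lower bound, because the independence assumption is imperfect.
    """
    p_a = n_overlap / n_b
    return s_a / p_a

# Illustrative numbers only: if engine a indexes 110 million pages and
# covers one third of the documents returned by engine b, the estimate is
# about 330 million pages, the same order as the lower bound given above.
print(estimate_web_size(n_overlap=200, n_b=600, s_a=110_000_000))
```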
Recall that the queries used in the study were from the employees of the NEC Research Institute. Most of the employees are scientists, and scientists tend to search for less “popular” or harder-to-find information. This is beneficial when estimating the size of the Web as above. However, the search engines are typically biased toward indexing more “popular” information. Therefore, the coverage of the search engines is typically better for more popular information.
There are a number of possible reasons why the major search engines do not provide comprehensive indices of the Web: the engines may be limited by network bandwidth, disk storage, computational power, scalability of their indexing and retrieval technology, or a combination of these items (despite claims to the contrary [5]). Because Web pages are continually added and modified, a truly comprehensive index would have to index all pages simultaneously, which is not currently possible. Furthermore, there may be many pages with no links to them, making it difficult for the search engines to know that the pages exist.

We also looked at the percentage of dead links returned by the search engines, which is related to how often the engines update their databases. Intuitively, a trade-off may exist between the comprehensiveness and freshness of a search engine; it should be possible to check for modified documents and update the index more rapidly if the index is smaller. Some evidence of such a trade-off was found — the most comprehensive engine had the largest percentage of dead links, and the least comprehensive engine had the smallest percentage of dead links. Table 1 shows the percentage of invalid links for each search engine. However, we found that the rating of the engines in terms of the percentage of dead links varies greatly over time. This provides evidence that the search engines may not be very regular in their indexing processes; for example, an engine might suspend the processing of new pages for a period of time during upgrades.

How can this knowledge of the effectiveness of the search engines be used to improve Web search? The coverage investigations indicate that the coverage of the Web engines is much lower than commonly believed, and that the engines tend to index different sets of pages. This suggests that when searching for less popular information, it can be very useful to combine the results of multiple engines. The freshness investigations indicate that it is difficult to predict ahead of time which search engine will be best when looking for recent information. Therefore, it can also be very useful to combine the results of multiple engines when looking for recent information. There are other ways to compare the search engines besides comprehensiveness and recency, such as how well the engines rank the relevance of results (discussed in the next section), and the features of the query interface.

RESEARCH IN WEB SEARCH

Research into technology for searching the Web is abundant, which is not surprising considering that the existence of full-text search engines is one of the major differences between the Web and previous means of accessing information. The following sections look specifically at some of the recent research: improved methods for ranking pages that utilize the graph structure of the Web, a metasearch technique that can improve the efficiency of Web search by downloading matching pages in order to extract query term context and analyze the pages, and “softbots” that can be used to locate pages that may not be indexed by any of the engines.
Page Relevance — A common complaint against search engines is that they return too many pages, and that many of them have low relevance to the query. This has been used as an argument for not providing comprehensive indices of the Web (“people are already overloaded with too much information”). However, a search engine could be more comprehensive while still returning the same set of pages first. One of the main problems is that the search engines do not rank the relevance of results very well. Research search engines such as Google [6] and LASER [7] promise improved ranking of results. These engines make greater use of HTML structure and the graph formed by hyperlinks in order to determine page relevance than do the major Web search engines. For example, Google uses a ranking algorithm called PageRank that iteratively uses information from the number of pages pointing to each page (which is related to the popularity of the pages). Google also uses the text in links to a page as descriptors of the page (the links often contain better descriptions of the pages than the pages themselves). Another engine with a novel ranking measure is Direct Hit (http://www.directhit.com), which is typically good for common queries. Direct Hit ranks results for a given query according to the number of times previous users have clicked on the pages (i.e., more popular pages are ranked higher).
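A minimal sketch of the iterative idea behind PageRank follows (the damping factor, iteration count, and toy graph are our own illustrative choices, not details from [6]; dangling-page handling is also omitted for brevity):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages by the scores of the pages linking to them.

    links: dict mapping each page to the list of pages it links to.
    Each iteration, every page passes a share of its current score to the
    pages it links to, so pages pointed to by many (or highly ranked)
    pages accumulate higher scores.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # sketch: pages with no outlinks simply leak score
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# A page linked to by several others ("c" below) ends up ranked highest.
print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```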
Kleinberg [8] has presented a method for locating two types of useful pages: authorities, which are highly referenced pages, and hubs, which are pages that contain links to many authorities. The underlying principle is the following: good hub pages point to many good authority pages, and a good authority page is pointed to by many good hub pages. An iterative process can be used to find hubs and authorities [8]. Future search engines may use this method to classify hub and authority pages, and to rank the pages within these classes.
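The mutually reinforcing definition lends itself to a short iterative sketch (the normalization and iteration count are our own choices; see [8] for the actual algorithm and its convergence properties):

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links: dict mapping each page to the list of pages it links to.
    Good hubs point to good authorities, and good authorities are pointed
    to by good hubs, so the two score vectors reinforce each other.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to p.
        for p in pages:
            auth[p] = sum(hub[q] for q in pages if p in links.get(q, []))
        # Hub score: sum of authority scores of the pages p links to.
        for p in pages:
            hub[p] = sum(auth[t] for t in links.get(p, []))
        # Normalize so the scores do not grow without bound.
        for scores in (hub, auth):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```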
                                                                     dant, and techniques that trade recall (the fraction of all
Metasearch — Limitations of the search services have led to the introduction of metasearch engines [9]. A metasearch engine searches the Web by making requests to multiple search engines such as AltaVista or HotBot. The primary advantages of current metasearch engines are the ability to combine the results of multiple search engines and the ability to provide a consistent user interface for searching these engines.

The idea of querying and collating results from multiple databases is not new. Companies like PLS, Lexis-Nexis, DIALOG, and Verity have long offered systems that integrate the results of multiple heterogeneous databases [9]. Many other Web metasearch services exist, such as the popular and useful MetaCrawler service [9]. Services similar to MetaCrawler include SavvySearch and Infoseek Express.

Metasearch engines can introduce their own deficiencies; for example, they can have difficulty ranking the combined list of results. If one engine returns many low-relevance documents, these documents may make it more difficult to find relevant pages in the list. Most of the metasearch engines on the Web also limit the number of results that can be obtained, and typically do not support all of the features of the query language of each engine.
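One simple way a metasearch engine can combine results, and partially sidestep the ranking difficulty, is to prefer documents returned by several engines. A minimal sketch (the engine wrappers are hypothetical placeholders; a real metasearch engine must also translate the query into each engine’s syntax and parse each engine’s result format):

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines):
    """Query several search engines in parallel and merge their results.

    engines: list of callables, each taking a query string and returning a
    ranked list of result URLs (hypothetical wrappers around real services).
    Duplicate URLs are merged; URLs returned by more engines rank first,
    with ties broken by the best rank any engine assigned.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    votes = {}
    for results in result_lists:
        for rank, url in enumerate(results):
            count, best_rank = votes.get(url, (0, rank))
            votes[url] = (count + 1, min(best_rank, rank))
    return sorted(votes, key=lambda url: (-votes[url][0], votes[url][1]))
```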
The NEC Research Institute has been developing an experimental metasearch engine called Inquirus [10]. Inquirus was motivated by problems with current metasearch engines, as well as the poor precision, limited coverage, limited availability, limited user interfaces, and out-of-date databases of the major Web search engines. Rather than work with the list of documents and summaries returned by search engines, as current metasearch engines typically do, Inquirus works by downloading and analyzing the individual documents. Inquirus makes improvements over existing engines in a number of areas, such as more useful document summaries incorporating query term context; identification of both pages that no longer exist and pages that no longer contain the query terms; improved detection of duplicate pages; progressive display of results; improved document ranking using proximity information (because Inquirus has the full text of all pages, it avoids the ranking problem of standard metasearch engines); dramatically improved precision for certain queries through specific expressive forms; and quick jump links and highlighting when viewing the full documents.

One of the fundamental features of Inquirus is that it analyzes each document and displays the local context around the query terms. The benefit of displaying the local context, rather than an abstract or query-insensitive summary of the document, is that the user may be able to more readily determine whether the document answers his or her specific query (without repeatedly clicking and waiting for pages to download). A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially for Web search, where the database is very large, diverse, and poorly organized.

A study by Tombros (1997) shows that query-sensitive summaries can improve the efficiency of search. Tombros considered the use of query-sensitive summaries and performed a user study which showed that users working with query-sensitive summaries had a higher success rate. Query-sensitive summaries allowed users to perform relevance judgments more accurately and rapidly, and greatly reduced the need to refer to the full text of documents.
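The local-context idea described above is simple enough to sketch (a simplified illustration of the general technique, not the actual Inquirus implementation; the window size and snippet limit are our own choices):

```python
def local_context(text, query_terms, window=60, max_snippets=5):
    """Extract snippets of text surrounding each occurrence of a query term.

    Returns windows of `window` characters on either side of each match,
    so a user can judge relevance by scanning the snippets instead of
    downloading and reading the whole document.
    """
    snippets = []
    lowered = text.lower()
    for term in query_terms:
        start = 0
        while True:
            i = lowered.find(term.lower(), start)
            if i == -1:
                break
            snippet = text[max(0, i - window): i + len(term) + window]
            snippets.append("..." + snippet.strip() + "...")
            start = i + len(term)
    return snippets[:max_snippets]
```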
One interesting feature of Inquirus is the Specific Expressive Forms (SEF) search technique. The Web is highly redundant, and techniques that trade recall (the fraction of all relevant documents returned) for improved precision (the fraction of returned documents that are relevant) are often useful. The SEF search technique transforms queries in the form of a question into specific forms for expressing the answer. For example, the query “What does NASDAQ stand for?” is transformed into the queries “NASDAQ stands for”, “NASDAQ is an abbreviation”, and “NASDAQ means”. Clearly, the information may be contained in a different form than these three possibilities; however, if the information does exist in one of these forms, there is a higher likelihood that finding these phrases will provide the answer to the query. For many queries the answer might exist on the Web, but not in any of the specific forms used. However, our experiments indicate that the method works well enough to be effective for certain queries.
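A sketch of this kind of query transformation (the template list simply mirrors the NASDAQ example above; the actual Inquirus rules are more extensive and are not detailed in this article):

```python
import re

def sef_transform(question):
    """Rewrite a 'What does X stand for?' question into phrase queries.

    Each returned phrase is a specific form in which the answer is likely
    to be expressed, trading recall for precision.
    """
    match = re.match(r"what does (.+) stand for\??", question, re.IGNORECASE)
    if not match:
        return [question]  # fall back to the ordinary query
    term = match.group(1)
    return [f'"{term} stands for"',
            f'"{term} is an abbreviation"',
            f'"{term} means"']

print(sef_transform("What does NASDAQ stand for?"))
```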
Inquirus is surprisingly efficient. Inquirus downloads search engine responses and Web pages in parallel, and typically returns the first result faster than the average response time of a search engine.

In summary, metasearch techniques can improve the efficiency of Web search by combining the results of multiple search engines, and by implementing functionality that is not provided by the underlying engines (e.g., extracting query term context and filtering dead links). The Inquirus metasearch prototype at the NEC Research Institute has shown that downloading and analyzing pages in real time is feasible. Inquirus, like other meta engines and various Web tools, relies on the underlying search engines, which provide important and valuable services. Wide-scale use of this or any metasearch engine would require an amicable arrangement with the underlying search engines. Such arrangements may include passing through ads or micropayment systems.

IMPROVING WEB SEARCH

Users tend to make queries that result in poor precision. About 70 percent of queries to Infoseek contain only one term (Harry Motro, Infoseek CEO, CNBC, May 7, 1998). About 40 percent of queries made by the employees of the NEC Research Institute to the Inquirus engine contain only one term. In information retrieval, there is typically a trade-off between precision and recall. Simple (e.g., single-term) queries can return thousands or millions of documents. Unfortunately, ranking the relevance of these documents is a difficult problem, and the desired documents may not appear near the top of the list. One way to improve the precision of results is to use more query terms, and to tell the search engines that relevant documents must contain certain terms (required terms). Other ways include using phrases or proximity (e.g., searching for specific phrases rather than single terms), using constraints offered by some search engines such as date ranges and geographic restrictions, or using the refinement features offered by some engines (e.g., AltaVista offers a refine function, and Infoseek allows subsequent searches within the result set of previous searches).
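For illustration, such higher-precision queries can be assembled programmatically (a minimal sketch; the “+” required-term and quoted-phrase syntax shown was supported by several engines of the period, such as AltaVista, but each engine’s query language differs):

```python
def build_query(required_terms=(), phrases=(), optional_terms=()):
    """Build a higher-precision query using required terms and phrases.

    Prefixing a term with '+' marks it as required, and quoting a group
    of words searches for the exact phrase.
    """
    parts = ["+" + term for term in required_terms]
    parts += ['"' + phrase + '"' for phrase in phrases]
    parts += list(optional_terms)
    return " ".join(parts)

# e.g., '+annealing "simulated annealing" optimization'
print(build_query(required_terms=["annealing"],
                  phrases=["simulated annealing"],
                  optional_terms=["optimization"]))
```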
Another alternative is to combine available search engines with automated online searching. One example is the Internet “softbot” [11]. The softbot transforms queries into goals, and uses a planning algorithm (with extensive knowledge of the information sources) to generate a sequence of actions that satisfies the goal. AHOY! is a successful softbot that locates homepages for individuals [11]. Shakes et al. performed a study in which they searched for the homepages of 582 researchers; AHOY! was able to locate more homepages than MetaCrawler (which located more homepages than HotBot or AltaVista). AHOY! also provided greatly improved precision.
More comprehensive and relevant results may also be possible using a search engine that specializes in a particular area; for example, Excite NewsTracker specializes in indexing news sites, and OpenText Pinstripe specializes in indexing business sites. Because there are fewer pages to index, these engines may be able to be more comprehensive within their area, and may also be able to update their indices more regularly. When searching for popular information, directories constructed by hand, such as Yahoo’s directory, can be very useful because fewer low-relevance results are returned.
In summary, there exist several ways of improving on the major Web search engines, depending on the type of information desired. For harder-to-find information, metasearch and softbots can improve coverage. If the topic being queried is covered by one of the more specialized engines, these engines can be used, and they often provide more comprehensive and up-to-date indices within their specialty compared to the general Web search engines.

SCIENTIFIC INFORMATION RETRIEVAL

Immediate access to scientific literature has long been desired by scientists, and the Web search engines have made a large and growing body of scientific literature and other information resources accessible within seconds. Advances in computing and communications, and the rapid rise of the Web, have led to the increasingly widespread availability of online research articles, as well as a simple-to-use Web version of the Institute for Scientific Information’s® (ISI) Science Citation Index® — the Web of Science®. The Web is changing the way researchers locate and access scientific publications. Many print journals now provide access to the full text of articles on the Web, and the number of online journals was about 1000 in 1996 [12]. Researchers are increasingly making their work available on their homepages or in technical report archives.

AVAILABILITY

Much of the scientific literature is copyrighted by the authors or publishers, and is not generally available on the “publicly indexable Web.” However, the amount of scientific material available on the publicly indexable Web is growing. Some journals owned by societies such as the IEEE (the largest technical/scientific society) and the ACM permit papers to be placed on the authors’ Web sites as long as the proper copyright notices are posted. Some private publishers, MIT Press for example, are doing the same. Some publishers permit prepublication Web access, but do not allow posting of the final version of papers. We predict that more and more papers will be available on the publicly indexable Web in the future.

We used six major Web search engines to search for the papers in a recent issue of Neural Computation, after the table of contents was released but before we obtained our copy of the journal. We found that about 50 percent of the papers were available on the homepages of the authors. As mentioned before, the coverage of any one search engine is limited. The simplest means of improving the chances of finding a particular scientist or paper on the publicly indexable Web is to combine the results of multiple engines, as is done with metasearch engines such as MetaCrawler.

Although more and more scientific papers are being made available on the publicly indexable Web, these papers are spread throughout researcher and institution homepages, technical report archives, and journal sites. The Web search engines do not make it easy to locate these papers because they typically do not index Postscript or PDF documents, which account for a large percentage of the available articles. The next section introduces a technique for organizing and indexing this literature.

DIGITAL LIBRARIES AND CITATION INDEXING

The Web offers the possibility of providing easy and efficient services for organizing and accessing scientific information. A citation index is one such service. Citation indices [13] index the citations in an article, linking the article with the cited works. Citation indices were originally designed for literature search, allowing a researcher to find subsequent articles that cite a given article. Citation indices are also valuable for other purposes, including the evaluation of articles, authors, and so on, and the analysis of research trends. The most popular citation indices of academic research are produced by the ISI. One such index, the Science Citation Index, is intended to be a practical, cost-effective tool for indexing the significant scientific journals. Unfortunately, the ISI databases are expensive and not available to all researchers. Much of the expense is due to the manual effort required during indexing.

The rise of the Internet and the Web has led to proposals for online digital libraries that incorporate citation indexing. For example, Cameron proposed a “universal, [Internet-based,] bibliographic and citation database linking every scholarly work ever written” [14]. Such a database would be highly “comprehensive and up-to-date,” making it a powerful tool for academic literature research, and for the production of statistics as with traditional citation indices. However, Cameron’s proposal presents significant implementation difficulties, and requires authors or institutions to provide citation information in a specific format.
Searching for “simulated annealing” in Machine Learning [small test index] (13828 documents, 278202 citations total).

1218 citations found

Click on the [Context] links to see the citing documents and the context of the citations.

Citations (self)   Article
196                S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, 1983, pp. 671–680. [Context] [Check]
49 (1)             D. S. Johnson et al., “Optimization by simulated annealing: An experimental evaluation,” Technical report, Bell Labs preprint. [Context] [Check]
39                 E. Aarts and J. Korst, “Simulated Annealing and Boltzmann Machines,” John Wiley and Sons, 1989. [Context] [Check]

[... section deleted ...]

Figure 1. An ACI system can group variant forms of citations to the same paper (citations can be written in many different formats), and rank search results by the number of citations.
The NEC Research Institute is working on a digital library of scientific publications that creates a citation index autonomously (using Autonomous Citation Indexing, ACI), without the requirement of any additional effort on the part of the authors or institutions, and without any manual assistance [15]. An ACI system autonomously extracts citations, identifies identical citations that occur in different formats, and identifies the context of citations in the body of articles. As with traditional citation indices like the Science Citation Index, ACI allows literature search using citation links, and the ranking of papers, journals, authors, and so on by the number of citations. Compared to traditional citation indexing systems, ACI has both disadvantages and advantages. The disadvantages include lower accuracy (which is expected to become less of a disadvantage over time). However, the advantages are significant: no manual effort is required for indexing, resulting in a corresponding reduction in cost and increase in availability, and literature search can be based on the context of citations — given a particular paper of interest, an ACI system can display the context of how the paper is cited in subsequent publications. The context of citations can be very useful for efficient literature search and evaluation. ACI has the potential for broader coverage of the literature because human indexers are not required, and can provide more timely feedback and evaluation by indexing items such as conference proceedings and technical reports. Overall, ACI can improve scientific communication, and facilitates an increased rate of scientific dissemination and feedback.
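Identifying identical citations that occur in different formats is central to ACI. The following crude sketch conveys the idea (the normalization rules here are our own simplifications; a real system such as CiteSeer uses far more careful matching [15]):

```python
import re

def citation_key(citation):
    """Map variant forms of a citation to a crude normalized key.

    Different papers cite the same work with different punctuation,
    author formats, and orderings; lowercasing, dropping punctuation and
    short/common tokens, and sorting the remaining distinctive words lets
    variants of the same citation fall into the same group.
    """
    words = re.findall(r"[a-z]+", citation.lower())
    stopwords = {"and", "by", "the", "of", "in", "vol", "pp"}
    tokens = sorted(w for w in set(words) if w not in stopwords and len(w) > 3)
    return " ".join(tokens[:8])

a = ('S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by '
     'simulated annealing," Science, vol. 220, 1983, pp. 671-680.')
b = ('Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., (1983) '
     '"Optimization by simulated annealing," Science 220, 671-680.')
print(citation_key(a) == citation_key(b))  # True: variants share one key
```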
ACI is ideal for operation on the Web — new articles can be automatically located and indexed when they are posted on the Web or announced on mailing lists, and an efficient interface for browsing the articles, citations, and the context of the citations can be created. Part of the benefit of autonomous citation indexing is due to the ability to format and organize information on demand using a Web interface to the citation index. Figure 1 shows an example of the output from the NEC Research Institute’s prototype autonomous citation indexing digital library system, CiteSeer. This example shows the results of a search for citations containing the phrase “simulated annealing” in a small test database of the machine learning literature (only a subset of the machine learning literature on the Web). Searching for citations to papers by a given author can also be performed (including secondary authors). The [Context] links show the context of the individual citations. The [Check] links show the individual citations in each group and can be used to check for errors in the citation grouping. Figure 2 shows an example of how an ACI system can extract the context of citations to a given paper and display them for easy browsing. Note that finding and extracting the context of citations to a given paper could previously be done by using traditional citation indices and manually locating and searching the citing papers — the difference is that the automation and Web interface make the task far more efficient, and thus practical, where it may not have been before.

Digital libraries incorporating citation indexing can be used to organize the scientific literature, and help with literature search and evaluation. A “universal citation database” which accurately indexes all literature would be ideal, but is currently impractical because of the limited availability of articles in electronic form and the lack of standardization in citation practices. However, CiteSeer shows that it is possible to organize and index the subset of the literature available on the Web, and to autonomously process freeform citations with reasonable accuracy. As long as there is a significant portion of publishing through the Web, be it the publicly indexable Web or the subscription-only Web, there is great value in being able to prepare citation indices from the machine-readable material. Citation indices may appear that index both parts of the Web. Access to the full text of articles may be open or by subscription, depending on how the Web and the publication business evolve. Citation indices for subscription-only data may be offered by the publisher, or prepared by a third party that has an agreement with the publisher.

THE FUTURE OF WEB SEARCH AND DIGITAL LIBRARIES

What is the future of the Web, Web search, and digital libraries? Improvements in technology will enable new applications. Computational and storage resources will continue to improve. Bandwidth is likely to increase significantly as technology advances and the following positive spiral continues: more people become connected to the Internet as it becomes easier to use and more popular, and as new access mechanisms are introduced (e.g., cable modems and digital subscriber lines); this provides incentive for the infrastructure companies to invest more in the backbone, improving bandwidth; more investment in the backbone improves access, so more people want to be connected.
S. Kirkpatrick, C. D. Gelatt Jr., and M. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983), 671–680.

 This paper is cited in the following contexts:

M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 177 - Appeared: IEEE ICASSP, San Francisco, March 1992, vol. III, pp. 45–48. - GIBBS RANDOM FIELD: TEMPERATURE AND PARAMETER ANALYSIS - Rosalind W. Picard - M.I.T. Lab, E15-392: 20 Ames Street, Cambridge, MA 02139 - picard@media.mit.edu [Details] [Full Text] [Related Articles] [ftp://whitechapel.media.mit.edu/pub/tech-reports/TR-177.ps.Z]

 ...... Simulated annealing is a popular nonlinear optimization technique where a cost function is substituted for E(x), and consequently
 minimized. There is a key observation in the simulated annealing literature that prompts the study of temperature presented in this
 paper. Kirkpatrick, et al. [3] observed that “more optimization” occurs at certain temperatures than at others. These favored
 temperatures are analogous to the physical idea of a “critical temperature,” a point that marks transition between different “phases” of
 the data. The reason for considering these physical......

...... region, we have shown in earlier work that a similar kind of point, which we call a “transition” temperature, T, does occur [8]. By measuring the specific heat of the binary process it can be shown to correspond to the same region where the “most optimization” occurs in simulated annealing [3]. For GRF analysis, this region is where the energy fluctuation peaks, and where small changes in the parameters become more significant. In [8] the transition temperature for n = 2 was estimated to be at 1/T = 1.7. This analysis suggests that attempts to estimate parameters should take......

 [3] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, Optimization by simulated annealing, Science 220 (4598) 671–680, 1983.

Technical Report No. 9805, Department of Statistics, University of Toronto - Annealed Importance Sampling - Radford M. Neal - Department of Statistics and Department of Computer Science, University of Toronto, Toronto, Ontario, Canada - http://www.cs.utoronto.ca/radford/ - radford@stat.utoronto.ca - 18 Feb. 1998 [Details] [Full Text] [Related Articles] [ftp://ftp.cs.utoronto.ca/pub/radford/ais.ps.Z]

...... respect to these transitions. Because such a chain will move between modes only rarely, it will take a long time to reach equilibrium, and will exhibit high autocorrelations for functions of the state variables out to long time lags. The method of simulated annealing was introduced by Kirkpatrick, Gelatt, and Vecchi (1983) as a way of handling multiple modes in an optimization context. It employs a sequence of distributions, with probabilities or probability densities given by p0(x) to pn(x), in which each pj differs only slightly from pj+1. The distribution p0 is the one of interest. The......

 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., (1983) “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680.

                                                               [...section deleted...]

Figure 2. An example of how an autonomous citation indexing system can show the context of citations to a given paper. The sentences containing the citations are automatically highlighted.


Will the fraction of the Web covered by the major search engines increase? Some search engines are focusing on indexing the Web pages that satisfy the majority of searches, as opposed to trying to catalog all of the Web. However, there are still some engines that aim to index the Web comprehensively. Improvements in indexing technology and computational resources will allow the creation of larger indices. Nevertheless, it is unlikely to become economically practical for a single search engine to index close to all of the publicly indexable Web in the near future. However, it is predicted that the cost of indexing and storage will decline over time relative to the increase in the size of the indexable Web [6], resulting in favorable scaling properties for centralized text search engines. In the meantime, an increased number of specialized search services may arise that cover specific types of information.

The use of more expensive and better algorithms (e.g., as in Google) will produce improved page rankings. More information retrieval techniques aimed at the large, diverse, low signal-to-noise ratio database of the Web will be developed. One interesting possibility is the use of machine learning to create query transformations similar to those used in the SEF technique discussed earlier.

Metasearch techniques, which combine the results of multiple engines, are likely to continue to be useful when searching for hard-to-find information, or when comprehensive results are desired. The major Web search engines are also likely to continue to focus on performing queries as quickly as possible; therefore, metasearch engines that perform additional client-side processing (e.g., query term context summaries) may become increasingly popular as these products become more powerful, address problems with data fusion from different sources, and learn to deal better with the constantly evolving search services. Improvements in bandwidth should improve the feasibility of metasearch techniques.

Digital libraries incorporating ACI should become more widely available, bringing the benefits of citation indexing to groups who cannot afford the commercial services, and improving the dissemination and retrieval of scientific literature.

SUMMARY

The Web is revolutionizing information access; however, current techniques for access to both general and scientific information on the Web leave much room for improvement. The Web search engines are limited in terms of coverage, recency, how well they rank query results, and the query options they support. Access to the growing body of scientific literature on the publicly indexable Web is limited by the lack of organization, and because the major search engines do not index Postscript or PDF documents. We have discussed several fruitful research directions that will improve access to general and scientific information, and greatly enhance the utility of the Web: improved ranking methods, metasearch engines, softbots, and autonomous citation indexing. It is not clear how availability will evolve, because this depends on how the Web emerges as a business platform for publishers. Nevertheless, improved ways to do basic searching and specialized citation searching are likely to evolve and replace present methods, and will greatly increase the utility of the Web over what is available today.

ACKNOWLEDGMENTS

We thank H. Stone and the reviewers for very useful comments and suggestions.

REFERENCES

[1] S. Lawrence and C. L. Giles, “Searching the World Wide Web,” Science, vol. 280, no. 5360, 1998, pp. 98–100.
[2] J. Barrie and D. Presti, “The World Wide Web as an instructional tool,” Science, vol. 274, 1996, pp. 371–72.
[3] R. Seltzer, E. Ray, and D. Ray, The AltaVista Search Revolution: How to Find Anything on the Internet, McGraw-Hill, 1997.
[4] Inktomi, http://www.inktomi.com/new/press/bellsouth.html, 1997.
[5] S. Steinberg, “Seek and ye shall find (maybe),” Wired, vol. 4, no. 5, 1996.
[6] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Proc. 7th Int’l. WWW Conf., Brisbane, Australia, 1998.
[7] J. Boyan, D. Freitag, and T. Joachims, “A machine learning architecture for optimizing Web search engines,” Proc. AAAI Wksp. Internet-Based Info. Sys., 1996.
[8] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” Proc. ACM-SIAM Symp. Discrete Algorithms, 1998.
[9] E. Selberg and O. Etzioni, “Multi-service search and comparison using the MetaCrawler,” Proc. 1995 WWW Conf., 1995.
[10] S. Lawrence and C. L. Giles, “Context and page analysis for improved Web search,” IEEE Internet Comp., vol. 2, no. 4, 1998, pp. 38–46.
[11] O. Etzioni and D. Weld, “A softbot-based interface to the Internet,” Commun. ACM, vol. 37, no. 7, 1994, pp. 72–76.
[12] G. Taubes, Science, vol. 271, 1996, p. 764.
[13] E. Garfield, Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, New York: Wiley, 1979.
[14] R. D. Cameron, “A universal citation database as a catalyst for reform in scholarly communication,” First Monday, vol. 2, no. 4, 1997.
[15] C. L. Giles, K. Bollacker, and S. Lawrence, “CiteSeer: An automatic citation indexing system,” I. Witten, R. Akscyn, and F. M. Shipman III, Eds., Digital Libraries 98 — The Third ACM Conf. Digital Libraries, Pittsburgh, PA, 1998, pp. 89–98.

BIOGRAPHIES

STEVE LAWRENCE (lawrence@research.nj.nec.com) is a research scientist at the NEC Research Institute, Princeton, New Jersey. His research interests include information retrieval and dissemination, machine learning, artificial intelligence, neural networks, face recognition, speech recognition, time series prediction, and natural language. His awards include an NEC Research Institute excellence award, ATERB and APRA priority scholarships, a QUT university medal and award for excellence, QEC and Telecom Australia Engineering prizes, and three successive prizes in the annual Australian Mathematics Competition. He received a B.Sc. in computing and a B.Eng. in electronic systems from the Queensland University of Technology, Australia, and a Ph.D. from the University of Queensland, Australia.

C. LEE GILES [F] (giles@research.nj.nec.com) is a senior research scientist in computer science at the NEC Research Institute, Princeton, New Jersey. Currently he is an adjunct professor at the Institute for Advanced Computer Studies at the University of Maryland. His research interests are in novel applications of neural and machine learning, agents and AI in the Web, and computing. He is on the editorial boards of IEEE Intelligent Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Neural Networks, the Journal of Computational Intelligence in Finance, Journal of Parallel and Distributed Computing, Neural Networks, Neural Computation, and Applied Optics.




 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs inventionjournals
 
Deep Web: Databases on the Web
Deep Web: Databases on the WebDeep Web: Databases on the Web
Deep Web: Databases on the WebDenis Shestakov
 

Similar a Improving Web search and access to scientific information (20)

Building efficient and effective metasearch engines
Building efficient and effective metasearch enginesBuilding efficient and effective metasearch engines
Building efficient and effective metasearch engines
 
Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...
 
Web resources
Web resourcesWeb resources
Web resources
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Google Research Paper
Google Research PaperGoogle Research Paper
Google Research Paper
 
A Study Web Data Mining Challenges And Application For Information Extraction
A Study  Web Data Mining Challenges And Application For Information ExtractionA Study  Web Data Mining Challenges And Application For Information Extraction
A Study Web Data Mining Challenges And Application For Information Extraction
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Test
TestTest
Test
 
Smart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web HarvestingSmart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web Harvesting
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexing
 
Paper24
Paper24Paper24
Paper24
 
L017447590
L017447590L017447590
L017447590
 
ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010
 
search
searchsearch
search
 
search
searchsearch
search
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document Clustering
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs
 
Deep Web: Databases on the Web
Deep Web: Databases on the WebDeep Web: Databases on the Web
Deep Web: Databases on the Web
 

Más de Stefanos Anastasiadis

Más de Stefanos Anastasiadis (8)

Web design ing
Web design ingWeb design ing
Web design ing
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
 
The little-joomla-seo-book-v1
The little-joomla-seo-book-v1The little-joomla-seo-book-v1
The little-joomla-seo-book-v1
 
The google best_practices_guide
The google best_practices_guideThe google best_practices_guide
The google best_practices_guide
 
Web search algorithms and user interfaces
Web search algorithms and user interfacesWeb search algorithms and user interfaces
Web search algorithms and user interfaces
 
Integration visualization
Integration visualizationIntegration visualization
Integration visualization
 
Search engines
Search enginesSearch engines
Search engines
 
Ecommerce webinar-oct-2010
Ecommerce webinar-oct-2010Ecommerce webinar-oct-2010
Ecommerce webinar-oct-2010
 

Último

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Último (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Improving Web search and access to scientific information

Web, and can enhance access to virtually all forms of digital libraries.

The revolution the Web has brought to information access is due not so much to the availability of information (huge amounts of information have long been available in libraries and elsewhere), but rather to the increased efficiency of accessing information, which can make previously impractical tasks practical. There are many avenues for improving the efficiency of access to information on the Web, for example in the areas of locating and organizing information.

This article discusses general and scientific information access on the Web; many of our comments are also applicable to digital libraries in general. We discuss the effectiveness of Web search engines, including results showing that the major search engines cover only a fraction of the "publicly indexable Web" (the part of the Web considered for indexing by the major engines, which excludes pages hidden behind search forms, pages with authorization requirements, and so on). We discuss current research into improved searching of the Web, including new techniques for ranking the relevance of results, and new metasearch techniques that can improve the efficiency and effectiveness of Web search.

The amount of scientific information and the number of electronic journals on the Internet continue to increase, and researchers are increasingly making their work available online. This article also discusses the creation of digital libraries of the scientific literature incorporating autonomous citation indexing. The autonomous creation of citation indices is possible today, and can improve access to scientific information on the Web and in other digital libraries of scientific articles.

As the results below show, none of the search engines covers more than about one third of the publicly indexable Web, and the freshness of the various databases varies significantly.
Typical quotes regarding the coverage and recency of the major search engine databases include: "If you can't find it using AltaVista search, it's probably not out there" [3], "[With AltaVista] you can find new information just about as quickly as it's available on the Web" [3], and "HotBot is the first search robot capable of indexing and searching the entire Web" [4]. However, the World Wide Web is a distributed, dynamic, and rapidly growing [1] information resource that presents difficulties for traditional information retrieval technologies, which were designed for different environments and have typically been used to index static collections of directly accessible documents. The nature of the Web raises questions such as: Can the centralized architecture of the search engines keep up with the increasing number of documents on the Web? Can the engines update their databases regularly enough to detect modified, deleted, and relocated information? The answers to these questions affect the best methodology to use when searching the Web, as well as the future of Web search technology.

We performed a study of the comprehensiveness and recency of the major Web search engines in December 1997 by analyzing the responses of AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light to 575 queries made by employees of the NEC Research Institute [1]. Search engines rank documents differently and can return documents that do not contain the query terms (e.g., pages with morphological variants or synonyms). Therefore, we considered only queries for which we could download the full text of every document that each engine reported as matching, and we counted documents only if they could be downloaded and contained the query terms. We also handled other important details such as the normalization of URLs, capitalization, and morphology (full details can be found in [1]).
Table 1. Estimated coverage of each engine with respect to the estimated size of the Web, and the percentage of invalid links returned by each engine (from 575 queries performed December 15-17, 1997):

Search engine                        HotBot   AltaVista   Northern Light   Excite   Infoseek   Lycos
Coverage w.r.t. estimated Web size   34%      28%         20%              14%      10%        3%
Percentage of dead links returned    5.3%     2.5%        5.0%             2.0%     2.6%       1.6%

Table 1 shows the estimated coverage of the search engines, which varies by an order of magnitude. This variation is much greater than would be expected from the number of pages that each engine reports to have indexed. The variation may be explained by differences in indexing or retrieval technology between the engines (e.g., an engine would appear smaller if it indexed only part of the text on some pages), or by differences in the kinds of pages indexed (our study used mostly scientific queries, which may not be covered as well if an engine focuses on well-connected, "popular" pages). Note that the results in the table are specific to the particular queries performed (typical queries made by scientists) and to the state of the engine databases at the time the queries were made.

We estimated a lower bound of 320 million pages on the size of the publicly indexable Web. To produce this estimate, we analyzed the overlap between pairs of engines [1]. Consider two engines a and b. Under the assumption that each engine samples the Web independently, the quantity n_o/n_b, where n_o is the number of documents returned by both engines and n_b is the number of documents returned by engine b, is an estimate of the fraction p_a of the indexable Web covered by engine a. The size of the indexable Web can then be estimated as s_a/p_a, where s_a is the number of pages indexed by engine a.
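To make the arithmetic concrete, the following minimal sketch computes the overlap-based estimate; the counts used are hypothetical for illustration, not the values from the study:

```python
def estimate_web_size(n_overlap, n_b, s_a):
    """Overlap-based estimate of the size of the indexable Web.

    n_overlap -- number of documents returned by both engines a and b
    n_b       -- number of documents returned by engine b
    s_a       -- total number of pages indexed by engine a

    Assumes the two engines sample the Web independently; because they
    do not in practice, the result is a lower bound.
    """
    p_a = n_overlap / n_b  # estimated fraction of the Web covered by engine a
    return s_a / p_a       # estimated size of the indexable Web

# Hypothetical counts for illustration only: the engines overlap on 40
# of engine b's 130 results, and engine a indexes 110 million pages.
print(estimate_web_size(n_overlap=40, n_b=130, s_a=110e6))  # about 357 million pages
```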
This technique is limited because the engines do not choose pages to sample independently: they all allow pages to be registered, and they are typically biased toward indexing more popular or well-connected pages. To estimate the size of the Web, we used the overlap between the two largest engines, where the independence assumption is more valid (the larger engines can index more of the nonregistered and less popular pages). Some dependence between the sampling of even the two largest engines remains, so the estimate is a lower bound. Using this estimate, we found that no engine indexes more than about one third of the indexable Web, and that combining the results of the six engines returned approximately 3.5 times more documents on average than using only one engine.

Recall that the queries used in the study came from employees of the NEC Research Institute. Most of the employees are scientists, and scientists tend to search for less "popular" or harder-to-find information. This is beneficial when estimating the size of the Web as above. However, the search engines are typically biased toward indexing more "popular" information, so their coverage is typically better for more popular information.

There are a number of possible reasons why the major search engines do not provide comprehensive indices of the Web: the engines may be limited by network bandwidth, disk storage, computational power, the scalability of their indexing and retrieval technology, or a combination of these (despite claims to the contrary [5]). Because Web pages are continually added and modified, a truly comprehensive index would have to index all pages simultaneously, which is not currently possible. Furthermore, many pages have no links pointing to them, making it difficult for the search engines to know that they exist.

We also looked at the percentage of dead links returned by the search engines, which is related to how often the engines update their databases. Intuitively, a trade-off may exist between the comprehensiveness and the freshness of a search engine: a smaller index can be checked for modified documents and updated more rapidly. We found some evidence of such a trade-off: the most comprehensive engine had the largest percentage of dead links, and the least comprehensive engine had the smallest (Table 1 shows the percentage of invalid links for each engine). However, the ranking of the engines by percentage of dead links varies greatly over time, which suggests that the engines may not be very regular in their indexing processes; for example, an engine might suspend the processing of new pages during upgrades.
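A dead-link measurement of this kind can be approximated client-side by attempting to fetch each returned URL. The sketch below is not the instrumentation used in the study; it simply counts a link as dead if the request fails or the server answers with an HTTP error:

```python
import urllib.error
import urllib.request

def is_dead_link(url, timeout=10):
    """Return True if the URL cannot be fetched successfully."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return True

# Hypothetical result URLs for illustration.
results = ["http://www.example.com/", "http://www.example.com/missing"]
dead = sum(is_dead_link(url) for url in results)
print(f"{100 * dead / len(results):.1f}% dead links")
```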
How can this knowledge of the effectiveness of the search engines be used to improve Web search? The coverage investigations indicate that the coverage of the Web engines is much lower than commonly believed, and that the engines tend to index different sets of pages; when searching for less popular information, it can therefore be very useful to combine the results of multiple engines. The freshness investigations indicate that it is difficult to predict ahead of time which engine will be the best to use when looking for recent information, so combining the results of multiple engines can be very useful there as well. There are other ways to compare the search engines besides comprehensiveness and recency, such as how well they rank the relevance of results (discussed in the next section) and the features of the query interface.

RESEARCH IN WEB SEARCH

Research into technology for searching the Web is abundant, which is not surprising considering that the existence of full-text search engines is one of the major differences between the Web and previous means of accessing information. The following sections look at some of the recent research: improved methods for ranking pages that utilize the graph structure of the Web, a metasearch technique that improves the efficiency of Web search by downloading matching pages in order to extract query term context and analyze the pages, and "softbots," which can be used to locate pages that may not be indexed by any of the engines.

Page Relevance — A common complaint against search engines is that they return too many pages, and that many of them have low relevance to the query. This has been used as an argument for not providing comprehensive indices of the Web ("people are already overloaded with too much information"). However, a search engine could be more comprehensive while still returning the same set of pages first. One of the main problems is that the search engines do not rank the relevance of results very well. Research search engines such as Google [6] and LASER [7] promise improved ranking of results. These engines make greater use of HTML structure and of the graph formed by hyperlinks to determine page relevance than do the major Web search engines. For example, Google uses a ranking algorithm called PageRank that iteratively uses information from the number of pages pointing to each page (which is related to the popularity of the pages). Google also uses the text in links to a page as descriptors of the page (the links often contain better descriptions of pages than the pages themselves). Another engine with a novel ranking measure is Direct Hit (http://www.directhit.com), which is typically good for common queries: it ranks the results for a given query according to the number of times previous users have clicked on each page (i.e., more popular pages are ranked higher).
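The iterative principle behind PageRank can be illustrated with a short sketch. This is a minimal illustration of the idea (rank flows along links, so a page pointed to by many highly ranked pages accumulates a high score), not Google's actual implementation; the graph and parameter values are hypothetical:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank-style scores for a small link graph.

    links -- dict mapping each page to the list of pages it links to.
    Each iteration, a page distributes its current score among the
    pages it points to.
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Hypothetical four-page graph for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))
```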
Kleinberg [8] has presented a method for locating two types of useful pages: authorities, which are highly referenced pages, and hubs, which are pages that contain links to many authorities. The underlying principle is that good hub pages point to many good authority pages, and a good authority page is pointed to by many good hub pages. An iterative process can be used to find hubs and authorities [8]. Future search engines may use this method to classify hub and authority pages, and to rank the pages within these classes.
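A sketch of the iterative hub/authority computation follows. It is a bare-bones illustration of Kleinberg's mutual-reinforcement principle with a hypothetical link graph, omitting the root-set construction and other details of the full algorithm:

```python
def hits(links, iterations=50):
    """Compute hub and authority scores by mutual reinforcement: a
    page's authority score is the sum of the hub scores of the pages
    linking to it, and its hub score is the sum of the authority
    scores of the pages it links to. Scores are normalized each pass."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q, targets in links.items() if p in targets)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: "a" and "d" act as hubs, "c" as an authority.
graph = {"a": ["b", "c"], "b": ["c"], "d": ["b", "c"]}
hubs, authorities = hits(graph)
```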
Metasearch — Limitations of the search services have led to the introduction of metasearch engines [9]. A metasearch engine searches the Web by making requests to multiple search engines such as AltaVista or HotBot. The primary advantages of current metasearch engines are the ability to combine the results of multiple search engines and the ability to provide a consistent user interface for searching these engines. The idea of querying and collating results from multiple databases is not new; companies like PLS, Lexis-Nexis, DIALOG, and Verity long ago created systems that integrate the results of multiple heterogeneous databases [9]. Many other Web metasearch services exist, such as the popular and useful MetaCrawler service [9]; similar services include SavvySearch and Infoseek Express.

Metasearch engines can introduce their own deficiencies. For example, they can have difficulty ranking the list of results: if one engine returns many low-relevance documents, those documents may make it harder to find relevant pages in the list. Most of the metasearch engines on the Web also limit the number of results that can be obtained, and typically do not support all of the features of each engine's query language.

The NEC Research Institute has been developing an experimental metasearch engine called Inquirus [10]. Inquirus was motivated by problems with current metasearch engines, as well as by the poor precision, limited coverage, limited availability, limited user interfaces, and out-of-date databases of the major Web search engines. Rather than working with the lists of documents and summaries returned by search engines, as current metasearch engines typically do, Inquirus downloads and analyzes the individual documents. Inquirus improves on existing engines in a number of areas: more useful document summaries incorporating query term context; identification of both pages that no longer exist and pages that no longer contain the query terms; improved detection of duplicate pages; progressive display of results; improved document ranking using proximity information (because Inquirus has the full text of all pages, it avoids the ranking problem of standard metasearch engines); dramatically improved precision for certain queries through specific expressive forms; and quick jump links and highlighting when viewing the full documents.

One of the fundamental features of Inquirus is that it analyzes each document and displays the local context around the query terms. The benefit of displaying local context, rather than an abstract or query-insensitive summary of the document, is that the user may be able to more readily determine whether the document answers his or her specific query (without repeatedly clicking and waiting for pages to download). A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially for Web search, where the database is very large, diverse, and poorly organized. A study by Tombros (1997) supports this: in a user study, users working with query-sensitive summaries had a higher success rate, performed relevance judgments more accurately and rapidly, and had a greatly reduced need to refer to the full text of documents.
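The local-context idea is straightforward to sketch. The following is a minimal illustration; the window size and formatting are arbitrary choices, not Inquirus's actual behavior:

```python
import re

def query_term_context(text, terms, window=40):
    """Return short snippets of text surrounding each occurrence of a
    query term, in the spirit of query-sensitive summaries."""
    pattern = re.compile("|".join(re.escape(term) for term in terms),
                         re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        snippets.append("..." + text[start:end] + "...")
    return snippets

page = ("Simulated annealing is a popular nonlinear optimization "
        "technique in which a cost function is minimized by analogy "
        "with the slow cooling of a physical system.")
for snippet in query_term_context(page, ["simulated annealing"]):
    print(snippet)
```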
One interesting feature of Inquirus is the Specific Expressive Forms (SEF) search technique. The Web is highly redundant, and techniques that trade recall (the fraction of all relevant documents returned) for improved precision (the fraction of returned documents that are relevant) are often useful. The SEF technique transforms a query in the form of a question into specific forms for expressing the answer. For example, the query "What does NASDAQ stand for?" is transformed into the queries "NASDAQ stands for", "NASDAQ is an abbreviation", and "NASDAQ means". Clearly the information may be contained in a form other than these three; however, if the information does exist in one of these forms, finding these phrases is more likely to provide the answer to the query. For many queries the answer might exist on the Web but not in any of the specific forms used; nevertheless, our experiments indicate that the method works well enough to be effective for certain queries.
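Using the article's NASDAQ example, a sketch of this kind of transformation might look as follows. The single question pattern shown is illustrative; a real system would recognize many question forms:

```python
import re

def specific_expressive_forms(query):
    """Transform a question into phrase queries that are likely to
    express the answer, trading recall for precision."""
    match = re.match(r"what does (.+) stand for\?", query.strip(),
                     re.IGNORECASE)
    if match:
        term = match.group(1)
        return [f'"{term} stands for"',
                f'"{term} is an abbreviation"',
                f'"{term} means"']
    return [query]  # no transformation known for this question form

print(specific_expressive_forms("What does NASDAQ stand for?"))
# ['"NASDAQ stands for"', '"NASDAQ is an abbreviation"', '"NASDAQ means"']
```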
Inquirus is surprisingly efficient: it downloads search engine responses and Web pages in parallel, and typically returns the first result faster than the average response time of a search engine.

In summary, metasearch techniques can improve the efficiency of Web search by combining the results of multiple search engines, and by implementing functionality that the underlying engines do not provide (e.g., extracting query term context and filtering dead links). The Inquirus prototype at the NEC Research Institute has shown that downloading and analyzing pages in real time is feasible. Inquirus, like other metasearch engines and various Web tools, relies on the underlying search engines, which provide important and valuable services. Wide-scale use of this or any metasearch engine would require an amicable arrangement with the underlying search engines; such arrangements might include passing through advertisements or micropayment systems.

IMPROVING WEB SEARCH

Users tend to make queries that result in poor precision. About 70 percent of queries to Infoseek contain only one term (Harry Motro, Infoseek CEO, CNBC, May 7, 1998), and about 40 percent of the queries made by employees of the NEC Research Institute to the Inquirus engine contain only one term. In information retrieval there is typically a trade-off between precision and recall. Simple (e.g., single-term) queries can return thousands or millions of documents; ranking the relevance of so many documents is a difficult problem, and the desired documents may not appear near the top of the list. One way to improve the precision of results is to use more query terms, and to tell the search engines that relevant documents must contain certain terms (required terms). Other ways include using phrases or proximity (e.g., searching for specific phrases rather than single terms), using constraints offered by some engines such as date ranges and geographic restrictions, or using refinement features (e.g., AltaVista offers a refine function, and Infoseek allows subsequent searches within the result set of previous searches).
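For illustration, a helper of the following kind could assemble such a refined query using syntax conventions common at the time ('+' for required terms, quotes for phrases); the exact syntax varies by engine, so treat this as a sketch:

```python
def refine_query(terms, phrases=(), required=()):
    """Build a more precise query string: required terms are prefixed
    with '+', and phrases are quoted so they must appear verbatim."""
    parts = [f"+{term}" for term in required]
    parts += [f'"{phrase}"' for phrase in phrases]
    parts += list(terms)
    return " ".join(parts)

print(refine_query(["jaguar"]))                                    # vague: jaguar
print(refine_query([], phrases=["jaguar xj6"], required=["car"]))  # +car "jaguar xj6"
```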
Another alternative is to combine the available search engines with automated online searching. One example is the Internet "softbot" [11], which transforms queries into goals and uses a planning algorithm (with extensive knowledge of the information sources) to generate a sequence of actions that satisfies each goal. AHOY! is a successful softbot that locates homepages for individuals [11]. Shakes et al. performed a study in which they searched for the homepages of 582 researchers: AHOY! located more homepages than MetaCrawler (which in turn located more than HotBot or AltaVista), and also provided greatly improved precision.

More comprehensive and relevant results may also be possible using a search engine that specializes in a particular area; for example, Excite NewsTracker specializes in indexing news sites, and OpenText Pinstripe specializes in indexing business sites. Because there are fewer pages to index, such engines may be more comprehensive within their area, and may also be able to update their indices more regularly. When searching for popular information, directories constructed by hand, such as Yahoo's directory, can be very useful because fewer low-relevance results are returned.

In summary, there are several ways of improving on the major Web search engines, depending on the type of information desired. For harder-to-find information, metasearch and softbots can improve coverage. If the topic being queried is covered by one of the more specialized engines, those engines can be used; they often provide more comprehensive and up-to-date indices within their specialty than the general Web search engines do.
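As a simple illustration of the combination step used by metasearch engines, the sketch below merges the result lists of several engines, removing duplicates after normalizing the URLs (the normalization shown is deliberately crude):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Normalize a URL so that trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def merge_results(*result_lists):
    """Union the result lists of several engines, keeping the order in
    which each unique URL was first seen."""
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged

engine_a = ["http://WWW.Example.com/paper.html", "http://foo.org/a"]
engine_b = ["http://www.example.com/paper.html", "http://bar.net/b"]
print(merge_results(engine_a, engine_b))  # three unique URLs
```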
mentation, and requires authors or institutions to provide cita- Many print journals now provide access to the full text of tion information in a specific format. articles on the Web, and the number of online journals was The NEC Research Institute is working on a digital about 1000 in 1996 [12]. Researchers are increasingly making library of scientific publications that creates a citation index their work available on their homepages or in technical autonomously (using Autonomous Citation Indexing, ACI), report archives. without the requirement of any additional effort on the part IEEE Communications Magazine • January 1999 119
DIGITAL LIBRARIES AND CITATION INDEXING

The Web offers the possibility of providing easy and efficient services for organizing and accessing scientific information. A citation index is one such service. Citation indices [13] index the citations in an article, linking the article with the cited works. Citation indices were originally designed for literature search, allowing a researcher to find subsequent articles that cite a given article. They are also valuable for other purposes, including the evaluation of articles, authors, and so on, and the analysis of research trends. The most popular citation indices of academic research are produced by the ISI. One such index, the Science Citation Index, is intended to be a practical, cost-effective tool for indexing the significant scientific journals. Unfortunately, the ISI databases are expensive and not available to all researchers; much of the expense is due to the manual effort required during indexing.

The rise of the Internet and the Web has led to proposals for online digital libraries that incorporate citation indexing. For example, Cameron proposed a "universal, [Internet-based,] bibliographic and citation database linking every scholarly work ever written" [14]. Such a database would be highly "comprehensive and up-to-date", making it a powerful tool for academic literature research, and for the production of statistics as with traditional citation indices. However, Cameron's proposal presents significant implementation difficulties, and requires authors or institutions to provide citation information in a specific format.

The NEC Research Institute is working on a digital library of scientific publications that creates a citation index autonomously (using Autonomous Citation Indexing, ACI), without requiring any additional effort on the part of the authors or institutions, and without any manual assistance [15]. An ACI system autonomously extracts citations, identifies identical citations that occur in different formats, and identifies the context of citations in the body of articles. As with traditional citation indices like the Science Citation Index, ACI allows literature search using citation links, and the ranking of papers, journals, authors, and so on by the number of citations. Compared to traditional citation indexing systems, ACI has both disadvantages and advantages. The main disadvantage is lower accuracy (which is expected to become less of a disadvantage over time). The advantages are significant: no manual effort is required for indexing, with a corresponding reduction in cost and increase in availability, and literature search can be based on the context of citations: given a particular paper of interest, an ACI system can display the context in which the paper is cited in subsequent publications.
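Grouping variant forms of the same citation is one of the core problems in ACI. The sketch below conveys the flavor of the task with a deliberately crude grouping key; CiteSeer's actual matching is far more careful, and the two variants shown are taken from the article's figures:

```python
import re

def citation_key(citation):
    """Reduce a free-form citation string to a crude grouping key:
    lowercase, drop digits and punctuation, and keep the distinctive
    longer words in sorted order."""
    text = re.sub(r"[^a-z ]", " ", citation.lower())
    words = [w for w in text.split() if len(w) > 3]
    return " ".join(sorted(set(words)))

variants = [
    'S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by '
    'simulated annealing," Science, vol. 220, 1983, pp. 671-680.',
    "Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., (1983) "
    '"Optimization by simulated annealing," Science, vol. 220, pp. 671-680.',
]
keys = {citation_key(v) for v in variants}
print(len(keys))  # 1: both variants map to the same group
```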
The context of citations can be very useful for efficient literature search and evaluation. ACI has the potential for broader coverage of the literature because human indexers are not required, and it can provide more timely feedback and evaluation by indexing items such as conference proceedings and technical reports. Overall, ACI can improve scientific communication, and facilitates an increased rate of scientific dissemination and feedback.

ACI is ideal for operation on the Web: new articles can be automatically located and indexed when they are posted on the Web or announced on mailing lists, and an efficient interface for browsing the articles, citations, and the context of the citations can be created. Part of the benefit of autonomous citation indexing is the ability to format and organize information on demand using a Web interface to the citation index. Figure 1 shows example output from CiteSeer, the NEC Research Institute's prototype autonomous citation indexing digital library system: the results of a search for citations containing the phrase "simulated annealing" in a small test database of the machine learning literature (only a subset of the machine learning literature on the Web). Searching for citations to papers by a given author can also be performed (including secondary authors). The [Context] links show the context of the individual citations; the [Check] links show the individual citations in each group and can be used to check for errors in the citation grouping.

[Figure 1. CiteSeer results for a search for "simulated annealing" in a small test index of the machine learning literature (13,828 documents, 278,202 citations; 1,218 citations found). The most-cited matching work is S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, 1983, pp. 671-680, with 196 citations. An ACI system can group variant forms of citations to the same paper (citations can be written in many different formats), and rank search results by the number of citations.]

Figure 2 shows an example of how an ACI system can extract the context of citations to a given paper and display them for easy browsing. Note that finding and extracting the context of citations to a given paper could previously be done by using traditional citation indices and manually locating and searching the citing papers; the difference is that the automation and Web interface make the task far more efficient, and thus practical, where it may not have been before.
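A minimal sketch of context extraction follows: given the body text of a citing article and the marker assigned to a reference, it returns the sentences containing the marker. A real system must also handle other citation styles, such as "(Kirkpatrick et al., 1983)":

```python
import re

def citation_contexts(body_text, marker):
    """Return the sentences of an article body that contain a given
    citation marker such as '[3]'."""
    sentences = re.split(r"(?<=[.!?])\s+", body_text)
    return [s for s in sentences if marker in s]

body = ("The method of simulated annealing was introduced in [3]. "
        "It employs a sequence of distributions. "
        "We follow the cooling schedule of [3] in our experiments.")
for sentence in citation_contexts(body, "[3]"):
    print(sentence)
```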
[Figure 2. An example of how an autonomous citation indexing system can show the context of citations to a given paper (here, Kirkpatrick, Gelatt, and Vecchi's "Optimization by simulated annealing," Science, vol. 220, 1983): excerpts from citing documents are displayed, and the sentences containing the citations are automatically highlighted.]

Digital libraries incorporating citation indexing can be used to organize the scientific literature, and to help with literature search and evaluation. A "universal citation database" that accurately indexes all literature would be ideal, but is currently impractical because of the limited availability of articles in electronic form and the lack of standardization in citation practices. However, CiteSeer shows that it is possible to organize and index the subset of the literature available on the Web, and to autonomously process freeform citations with reasonable accuracy. As long as a significant portion of publishing occurs through the Web, be it the publicly indexable Web or the subscription-only Web, there is great value in being able to prepare citation indices from the machine-readable material. Citation indices may appear that index both parts of the Web. Access to the full text of articles may be open or by subscription, depending on how the Web and the publication business evolve; citation indices for subscription-only data may be offered by the publisher, or prepared by a third party that has an agreement with the publisher.

THE FUTURE OF WEB SEARCH AND DIGITAL LIBRARIES

What is the future of the Web, Web search, and digital libraries? Improvements in technology will enable new applications. Computational and storage resources will continue to improve, and bandwidth is likely to increase significantly as technology advances and the following positive spiral operates: more people become connected to the Internet as it becomes easier to use and more popular, and as new access mechanisms are introduced (e.g., cable modems and digital subscriber lines); this provides incentive for the infrastructure companies to invest more in the backbone, improving bandwidth; better access, in turn, attracts more people to connect.

Will the fraction of the Web covered by the major search engines increase? Some search engines are focusing on indexing the Web pages that satisfy the majority of searches, as opposed to trying to catalog all of the Web, although some engines still aim to index the Web comprehensively. Improvements in indexing technology and computational resources will allow the creation of larger indices. Nevertheless, it is unlikely to become economically practical for a single search engine to index close to all of the publicly indexable Web in the near future. However, it is predicted that the cost of indexing and storage will decline over time relative to the increase in the size of the indexable Web [6], resulting in favorable scaling properties for centralized text search engines. In the meantime, an increased number of specialized search services may arise that cover specific types of information.
rent techniques for access to both general and scientific infor- The use of more expensive and better algorithms (e.g., as mation on the Web leave room for much improvement. The in Google) will produce improved page rankings. More infor- Web search engines are limited in terms of coverage, recency, mation retrieval techniques aimed at the large, diverse, low how well they rank query results, and the query options they signal-to-noise ratio database of the Web will be developed. support. Access to the growing body of scientific literature on One interesting possibility is the use of machine learning in the publicly indexable Web is limited by the lack of organiza- order to create query transformations similar to those used in tion and because the major search engines do not index the SEF technique discussed earlier. Postscript or PDF documents. We have discussed several Metasearch techniques, which combine the results of mul- fruitful research directions that will improve access to general tiple engines, are likely to continue to be useful when and scientific information, and greatly enhance the utility of searching for hard-to-find information, or when comprehen- the Web: improved ranking methods, metasearch engines, sive results are desired. The major Web search engines are softbots, and autonomous citation indexing. It is not clear how also likely to continue to focus on performing queries as availability will evolve, because this depends on how the Web quickly as possible, and therefore metasearch engines that emerges as a business platform for publishers. Nevertheless, perform additional client-side processing (e.g., query term improved ways to do basic searching, and specialized citation context summaries) may become increasingly popular as searching are likely to evolve and replace present methods, these products become more powerful, address problems and will greatly increase the utility of the Web over what is with data fusion from different sources, and learn to deal available today. better with the constantly evolving search services. Improve- ments in bandwidth should improve the feasibility of ACKNOWLEDGMENTS metasearch techniques. We thank H. Stone and the reviewers for very useful com- Digital libraries incorporating ACI should become more ments and suggestions. widely available, bringing the benefits of citation indexing to groups who cannot afford the commercial services, and REFERENCES improving the dissemination and retrieval of scientific litera- [1] S. Lawrence and C. L. Giles, “Searching the World Wide Web,” Science, ture. vol. 280, no. 5360, 1998, pp. 98–100. IEEE Communications Magazine • January 1999 121
[2] J. Barrie and D. Presti, "The World Wide Web as an instructional tool," Science, vol. 274, 1996, pp. 371–72.
[3] R. Seltzer, E. Ray, and D. Ray, The AltaVista Search Revolution: How to Find Anything on the Internet, McGraw-Hill, 1997.
[4] Inktomi, http://www.inktomi.com/new/press/bellsouth.html, 1997.
[5] S. Steinberg, "Seek and ye shall find (maybe)," Wired, vol. 4, no. 5, 1996.
[6] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Proc. 7th Int'l. WWW Conf., Brisbane, Australia, 1998.
[7] J. Boyan, D. Freitag, and T. Joachims, "A machine learning architecture for optimizing Web search engines," Proc. AAAI Wksp. Internet-Based Info. Sys., 1996.
[8] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Proc. ACM-SIAM Symp. Discrete Algorithms, 1998.
[9] E. Selberg and O. Etzioni, "Multi-service search and comparison using the MetaCrawler," Proc. 1995 WWW Conf., 1995.
[10] S. Lawrence and C. L. Giles, "Context and page analysis for improved Web search," IEEE Internet Comp., vol. 2, no. 4, 1998, pp. 38–46.
[11] O. Etzioni and D. Weld, "A softbot-based interface to the Internet," Commun. ACM, vol. 37, no. 7, 1994, pp. 72–76.
[12] G. Taubes, Science, vol. 271, 1996, p. 764.
[13] E. Garfield, Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, New York: Wiley, 1979.
[14] R. D. Cameron, "A universal citation database as a catalyst for reform in scholarly communication," First Monday, vol. 2, no. 4, 1997.
[15] C. L. Giles, K. Bollacker, and S. Lawrence, "CiteSeer: An automatic citation indexing system," in I. Witten, R. Akscyn, and F. M. Shipman III, Eds., Digital Libraries 98 — The Third ACM Conf. Digital Libraries, Pittsburgh, PA, 1998, pp. 89–98.

BIOGRAPHIES

STEVE LAWRENCE (lawrence@research.nj.nec.com) is a research scientist at the NEC Research Institute, Princeton, New Jersey. His research interests include information retrieval and dissemination, machine learning, artificial intelligence, neural networks, face recognition, speech recognition, time series prediction, and natural language. His awards include an NEC Research Institute excellence award, ATERB and APRA priority scholarships, a QUT university medal and award for excellence, QEC and Telecom Australia Engineering prizes, and three successive prizes in the annual Australian Mathematics Competition. He received a B.Sc. in computing and a B.Eng. in electronic systems from the Queensland University of Technology, Australia, and a Ph.D. from the University of Queensland, Australia.

C. LEE GILES [F] (giles@research.nj.nec.com) is a senior research scientist in computer science at the NEC Research Institute, Princeton, New Jersey. Currently he is an adjunct professor at the Institute for Advanced Computer Studies at the University of Maryland. His research interests are in novel applications of neural and machine learning, agents and AI in the Web, and computing. He is on the editorial boards of IEEE Intelligent Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Neural Networks, the Journal of Computational Intelligence in Finance, Journal of Parallel and Distributed Computing, Neural Networks, Neural Computation, and Applied Optics.