Searching the Web: General and
Scientific Information Access
Steve Lawrence and C. Lee Giles, NEC Research Institute



ABSTRACT

The World Wide Web has revolutionized the way people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government, and health care. There are many avenues for improvement of the Web; for example, in the areas of locating and organizing information. Current techniques for access to both general and scientific information on the Web leave much room for improvement; search engines do not provide comprehensive indices of the Web and have difficulty in accurately ranking the relevance of results. Scientific information on the Web is very disorganized. We discuss the effectiveness of Web search engines, including results showing that the major Web search engines cover only a fraction of the “publicly indexable Web.” Current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search. The creation of digital libraries incorporating autonomous citation indexing is discussed for improved access to scientific information on the Web.

The World Wide Web is revolutionizing the way people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government, and health care. The amount of publicly available information on the Web is increasing rapidly [1]. The Web is a gigantic digital library, a searchable 15 billion word encyclopedia [2]. It has stimulated research and development in information retrieval and dissemination, and fostered search engines such as AltaVista. These new developments are not limited to the Web, and can enhance access to virtually all forms of digital libraries.

The revolution the Web has brought to information access is due not so much to the availability of information (huge amounts of information have long been available in libraries and elsewhere) as to the increased efficiency of accessing information, which can make previously impractical tasks practical. There are many avenues for improvement in the efficiency of accessing information on the Web, for example, in the areas of locating and organizing information.

This article discusses general and scientific information access on the Web, and many of our comments are applicable to digital libraries in general. The effectiveness of Web search engines is discussed, including results that show that the major search engines cover only a fraction of the “publicly indexable Web” (the part of the Web that is considered for indexing by the major engines, which excludes pages hidden behind search forms, pages with authorization requirements, etc.). Current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search.

The amount of scientific information and the number of electronic journals on the Internet continue to increase. Researchers are increasingly making their work available online. This article also discusses the creation of digital libraries of the scientific literature, incorporating autonomous citation indexing. The autonomous creation of citation indices is possible today, and can improve access to scientific information on the Web or in other digital libraries of scientific articles.

WEB SEARCH

One of the key aspects of the World Wide Web that makes it a valuable information resource is that the full text of documents can be searched using Web search engines such as AltaVista and HotBot. Just how effective are the Web search engines? The following sections discuss the effectiveness of current engines and current research into improved techniques.

THE COMPREHENSIVENESS AND RECENCY OF THE WEB SEARCH ENGINES

This section considers the effectiveness of the major Web search engines in terms of comprehensiveness and recency. We provide results on the size of the Web, the coverage of each search engine, and the freshness of the search engine databases. These results show that none of the search engines covers more than about one third of the publicly indexable Web, and that the freshness of the various databases varies significantly.

Typical quotes regarding the coverage and recency of the major search engine databases include: “If you can’t find it using AltaVista search, it’s probably not out there” [3], “[With AltaVista] you can find new information just about as quickly as it’s available on the Web” [3], and “HotBot is the first search robot capable of indexing and searching the entire Web” [4]. However, the World Wide Web is a distributed, dynamic, and rapidly growing [1] information resource that presents difficulties to traditional information retrieval technologies. Traditional information retrieval software was designed for different environments, and has typically been used for indexing a static collection of directly accessible documents. The nature of the Web raises questions such as: can the centralized architecture of the search engines keep up with the increasing number of documents on the Web? Can they update their databases regularly enough to detect modified, deleted, and relocated information? The answers to these questions affect the best methodology to use when searching the Web, and the future of Web search technology.

We performed a study of the comprehensiveness and recency of the major Web search engines in December 1997 by analyzing the responses of AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light to 575 queries made by employees at the NEC Research Institute [1]. Search engines rank documents differently and can return documents that do not contain the query terms (e.g., pages with morphological variants or synonyms). Therefore, we only considered queries for which we could download the full text of every document that each engine reported as matching the query. Documents were only counted if they could be downloaded and contained the query terms. We handled other important details such as the normalization of URLs, and capitalization and morphology (full details can be found in [1]).
Search engine                        HotBot   AltaVista   Northern Light   Excite   Infoseek   Lycos
Coverage WRT estimated Web size      34%      28%         20%              14%      10%        3%
Percentage of dead links returned    5.3%     2.5%        5.0%             2.0%     2.6%       1.6%

Table 1. Estimated coverage of each engine with respect to the estimated size of the Web, and the percentage of invalid links returned by each engine (from 575 queries performed December 15–17, 1997).
Table 1 shows the estimated coverage of the search engines, which varies by an order of magnitude. This variation is much greater than would be expected from considering the number of pages that each engine reports to have indexed. The variation may be explained by differences in indexing or retrieval technology between the engines (e.g., an engine would appear to be smaller if it only indexed part of the text on some pages), or differences in the kinds of pages indexed (our study used mostly scientific queries, which may not be covered as well if an engine focuses more on well-connected, “popular” pages). Note that the results in the table are specific to the particular queries performed (typical queries made by scientists), and to the state of the engine databases at the time they were performed.
We estimated a lower bound on the size of the publicly indexable Web of 320 million pages. To produce this estimate, we analyzed the overlap between pairs of engines [1]. Consider two engines a and b. Under the assumption that each engine samples the Web independently, the quantity n_o/n_b, where n_o is the number of documents returned by both engines and n_b is the number of documents returned by engine b, is an estimate of the fraction p_a of the indexable Web covered by engine a. The size of the indexable Web can then be estimated as s_a/p_a, where s_a is the number of pages indexed by engine a. This technique is limited because the engines do not choose pages to sample independently; they all allow pages to be registered, and they are typically biased toward indexing more popular or well-connected pages. To estimate the size of the Web we used the overlap between the largest two engines, where the independence assumption is more valid (the larger engines can index more of the nonregistered and less popular pages). Some dependence between the sampling of the largest two engines remains, and therefore this estimate is a lower bound. Using this estimate of the size of the Web, we found that no engine indexes more than about one third of the indexable Web. We also found that combining the results of the six engines returned approximately 3.5 times more documents on average than using only one engine.
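As a concrete illustration, the overlap estimate can be computed in a few lines (a minimal sketch; the numbers below are invented for illustration and are not the study’s data):

```python
def estimate_web_size(n_overlap, n_b, s_a):
    """Estimate the size of the indexable Web from the overlap of two engines.

    n_overlap: number of documents returned by both engines a and b
    n_b:       number of documents returned by engine b
    s_a:       total number of pages engine a reports indexing

    Assuming the engines sample the Web independently, n_overlap / n_b
    estimates p_a, the fraction of the indexable Web covered by engine a,
    and s_a / p_a then estimates the size of the indexable Web. In practice
    this is a lower bound, because the independence assumption is imperfect.
    """
    p_a = n_overlap / n_b
    return s_a / p_a

# Illustrative numbers only: if engine a indexes 110 million pages and
# covers one third of the documents returned by engine b, the estimate is
# about 330 million pages, the same order as the lower bound given above.
print(estimate_web_size(n_overlap=200, n_b=600, s_a=110_000_000))
```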
Recall that the queries used in the study were from the employees of the NEC Research Institute. Most of the employees are scientists, and scientists tend to search for less “popular” or harder-to-find information. This is beneficial when estimating the size of the Web as above. However, the search engines are typically biased toward indexing more “popular” information. Therefore, the coverage of the search engines is typically better for more popular information.
There are a number of possible reasons why the major search engines do not provide comprehensive indices of the Web: the engines may be limited by network bandwidth, disk storage, computational power, scalability of their indexing and retrieval technology, or a combination of these items (despite claims to the contrary [5]). Because Web pages are continually added and modified, a truly comprehensive index would have to index all pages simultaneously, which is not currently possible. Furthermore, there may be many pages with no links to them, making it difficult for the search engines to know that the pages exist.

We also looked at the percentage of dead links returned by the search engines, which is related to how often the engines update their databases. Intuitively, a trade-off may exist between the comprehensiveness and freshness of a search engine; it should be possible to check for modified documents and update the index more rapidly if the index is smaller. Some evidence of such a trade-off was found — the most comprehensive engine had the largest percentage of dead links, and the least comprehensive engine had the smallest percentage of dead links. Table 1 shows the percentage of invalid links for each search engine. However, we found that the rating of the engines in terms of the percentage of dead links varies greatly over time. This provides evidence that the search engines may not be very regular in their indexing processes; for example, an engine might suspend the processing of new pages for a period of time during upgrades.

How can this knowledge of the effectiveness of the search engines be used to improve Web search? The coverage investigations indicate that the coverage of the Web engines is much lower than commonly believed, and that the engines tend to index different sets of pages. This suggests that when searching for less popular information, it can be very useful to combine the results of multiple engines. The freshness investigations indicate that it is difficult to predict ahead of time which search engine will be best when looking for recent information. Therefore, it can also be very useful to combine the results of multiple engines when looking for recent information. There are other ways to compare the search engines besides comprehensiveness and recency, such as how well the engines rank the relevance of results (discussed in the next section), and the features of the query interface.

RESEARCH IN WEB SEARCH

Research into technology for searching the Web is abundant, which is not surprising considering that the existence of full-text search engines is one of the major differences between the Web and previous means of accessing information. The following sections look specifically at some of the recent research: improved methods for ranking pages that utilize the graph structure of the Web, a metasearch technique that can improve the efficiency of Web search by downloading matching pages in order to extract query term context and analyze the pages, and “softbots” that can be used to locate pages that may not be indexed by any of the engines.
Page Relevance — A common complaint against search engines is that they return too many pages, and that many of them have low relevance to the query. This has been used as an argument for not providing comprehensive indices of the Web (“people are already overloaded with too much information”). However, a search engine could be more comprehensive while still returning the same set of pages first. One of the main problems is that the search engines do not rank the relevance of results very well. Research search engines such as Google [6] and LASER [7] promise improved ranking of results. These engines make greater use of HTML structure and the graph formed by hyperlinks in order to determine page relevance than do the major Web search engines. For example, Google uses a ranking algorithm called PageRank that iteratively uses information from the number of pages pointing to each page (which is related to the popularity of the pages). Google also uses the text in links to a page as descriptors of the page (the links often contain better descriptions of the pages than the pages themselves). Another engine with a novel ranking measure is Direct Hit (http://www.directhit.com), which is typically good for common queries. Direct Hit ranks results for a given query according to the number of times previous users have clicked on the pages (i.e., more popular pages are ranked higher).
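A minimal sketch of the iterative idea behind PageRank follows (the damping factor, iteration count, and toy graph are our own illustrative choices, not details from [6]; dangling-page handling is also omitted for brevity):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages by the scores of the pages linking to them.

    links: dict mapping each page to the list of pages it links to.
    Each iteration, every page passes a share of its current score to the
    pages it links to, so pages pointed to by many (or highly ranked)
    pages accumulate higher scores.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # sketch: pages with no outlinks simply leak score
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# A page linked to by several others ("c" below) ends up ranked highest.
print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```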
Kleinberg [8] has presented a method for locating two types of useful pages: authorities, which are highly referenced pages, and hubs, which are pages that contain links to many authorities. The underlying principle is the following: good hub pages point to many good authority pages, and a good authority page is pointed to by many good hub pages. An iterative process can be used to find hubs and authorities [8]. Future search engines may use this method to classify hub and authority pages, and to rank the pages within these classes.
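The mutually reinforcing definition lends itself to a short iterative sketch (the normalization and iteration count are our own choices; see [8] for the actual algorithm and its convergence properties):

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links: dict mapping each page to the list of pages it links to.
    Good hubs point to good authorities, and good authorities are pointed
    to by good hubs, so the two score vectors reinforce each other.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to p.
        for p in pages:
            auth[p] = sum(hub[q] for q in pages if p in links.get(q, []))
        # Hub score: sum of authority scores of the pages p links to.
        for p in pages:
            hub[p] = sum(auth[t] for t in links.get(p, []))
        # Normalize so the scores do not grow without bound.
        for scores in (hub, auth):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```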
                                                                     dant, and techniques that trade recall (the fraction of all
Metasearch — Limitations of the search services have led to the introduction of metasearch engines [9]. A metasearch engine searches the Web by making requests to multiple search engines such as AltaVista or HotBot. The primary advantages of current metasearch engines are the ability to combine the results of multiple search engines and the ability to provide a consistent user interface for searching these engines.

The idea of querying and collating results from multiple databases is not new. Companies like PLS, Lexis-Nexis, DIALOG, and Verity have long offered systems that integrate the results of multiple heterogeneous databases [9]. Many other Web metasearch services exist, such as the popular and useful MetaCrawler service [9]. Services similar to MetaCrawler include SavvySearch and Infoseek Express.

Metasearch engines can introduce their own deficiencies; for example, they can have difficulty ranking the combined list of results. If one engine returns many low-relevance documents, these documents may make it more difficult to find relevant pages in the list. Most of the metasearch engines on the Web also limit the number of results that can be obtained, and typically do not support all of the features of the query language of each engine.
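One simple way a metasearch engine can combine results, and partially sidestep the ranking difficulty, is to prefer documents returned by several engines. A minimal sketch (the engine wrappers are hypothetical placeholders; a real metasearch engine must also translate the query into each engine’s syntax and parse each engine’s result format):

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines):
    """Query several search engines in parallel and merge their results.

    engines: list of callables, each taking a query string and returning a
    ranked list of result URLs (hypothetical wrappers around real services).
    Duplicate URLs are merged; URLs returned by more engines rank first,
    with ties broken by the best rank any engine assigned.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    votes = {}
    for results in result_lists:
        for rank, url in enumerate(results):
            count, best_rank = votes.get(url, (0, rank))
            votes[url] = (count + 1, min(best_rank, rank))
    return sorted(votes, key=lambda url: (-votes[url][0], votes[url][1]))
```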
The NEC Research Institute has been developing an experimental metasearch engine called Inquirus [10]. Inquirus was motivated by problems with current metasearch engines, as well as the poor precision, limited coverage, limited availability, limited user interfaces, and out-of-date databases of the major Web search engines. Rather than work with the list of documents and summaries returned by search engines, as current metasearch engines typically do, Inquirus works by downloading and analyzing the individual documents. Inquirus makes improvements over existing engines in a number of areas, such as more useful document summaries incorporating query term context; identification of both pages that no longer exist and pages that no longer contain the query terms; improved detection of duplicate pages; progressive display of results; improved document ranking using proximity information (because Inquirus has the full text of all pages, it avoids the ranking problem of standard metasearch engines); dramatically improved precision for certain queries through specific expressive forms; and quick jump links and highlighting when viewing the full documents.

One of the fundamental features of Inquirus is that it analyzes each document and displays the local context around the query terms. The benefit of displaying the local context, rather than an abstract or query-insensitive summary of the document, is that the user may be able to more readily determine whether the document answers his or her specific query (without repeatedly clicking and waiting for pages to download). A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially for Web search, where the database is very large, diverse, and poorly organized.

A study by Tombros (1997) shows that query-sensitive summaries can improve the efficiency of search. Tombros considered the use of query-sensitive summaries and performed a user study which showed that users working with query-sensitive summaries had a higher success rate. Query-sensitive summaries allowed users to perform relevance judgments more accurately and rapidly, and greatly reduced the need to refer to the full text of documents.
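The local-context idea described above is simple enough to sketch (a simplified illustration of the general technique, not the actual Inquirus implementation; the window size and snippet limit are our own choices):

```python
def local_context(text, query_terms, window=60, max_snippets=5):
    """Extract snippets of text surrounding each occurrence of a query term.

    Returns windows of `window` characters on either side of each match,
    so a user can judge relevance by scanning the snippets instead of
    downloading and reading the whole document.
    """
    snippets = []
    lowered = text.lower()
    for term in query_terms:
        start = 0
        while True:
            i = lowered.find(term.lower(), start)
            if i == -1:
                break
            snippet = text[max(0, i - window): i + len(term) + window]
            snippets.append("..." + snippet.strip() + "...")
            start = i + len(term)
    return snippets[:max_snippets]
```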
One interesting feature of Inquirus is the Specific Expressive Forms (SEF) search technique. The Web is highly redundant, and techniques that trade recall (the fraction of all relevant documents returned) for improved precision (the fraction of returned documents that are relevant) are often useful. The SEF search technique transforms queries in the form of a question into specific forms for expressing the answer. For example, the query “What does NASDAQ stand for?” is transformed into the queries “NASDAQ stands for”, “NASDAQ is an abbreviation”, and “NASDAQ means”. Clearly, the information may be contained in a different form than these three possibilities; however, if the information does exist in one of these forms, there is a higher likelihood that finding these phrases will provide the answer to the query. For many queries the answer might exist on the Web, but not in any of the specific forms used. However, our experiments indicate that the method works well enough to be effective for certain queries.
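A sketch of this kind of query transformation (the template list simply mirrors the NASDAQ example above; the actual Inquirus rules are more extensive and are not detailed in this article):

```python
import re

def sef_transform(question):
    """Rewrite a 'What does X stand for?' question into phrase queries.

    Each returned phrase is a specific form in which the answer is likely
    to be expressed, trading recall for precision.
    """
    match = re.match(r"what does (.+) stand for\??", question, re.IGNORECASE)
    if not match:
        return [question]  # fall back to the ordinary query
    term = match.group(1)
    return [f'"{term} stands for"',
            f'"{term} is an abbreviation"',
            f'"{term} means"']

print(sef_transform("What does NASDAQ stand for?"))
```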
Inquirus is surprisingly efficient. Inquirus downloads search engine responses and Web pages in parallel, and typically returns the first result faster than the average response time of a search engine.

In summary, metasearch techniques can improve the efficiency of Web search by combining the results of multiple search engines, and by implementing functionality that is not provided by the underlying engines (e.g., extracting query term context and filtering dead links). The Inquirus metasearch prototype at the NEC Research Institute has shown that downloading and analyzing pages in real time is feasible. Inquirus, like other meta engines and various Web tools, relies on the underlying search engines, which provide important and valuable services. Wide-scale use of this or any metasearch engine would require an amicable arrangement with the underlying search engines. Such arrangements may include passing through ads or micropayment systems.

IMPROVING WEB SEARCH

Users tend to make queries that result in poor precision. About 70 percent of queries to Infoseek contain only one term (Harry Motro, Infoseek CEO, CNBC, May 7, 1998). About 40 percent of queries made by the employees of the NEC Research Institute to the Inquirus engine contain only one term. In information retrieval, there is typically a trade-off between precision and recall. Simple (e.g., single-term) queries can return thousands or millions of documents. Unfortunately, ranking the relevance of these documents is a difficult problem, and the desired documents may not appear near the top of the list. One way to improve the precision of results is to use more query terms, and to tell the search engines that relevant documents must contain certain terms (required terms). Other ways include using phrases or proximity (e.g., searching for specific phrases rather than single terms), using constraints offered by some search engines such as date ranges and geographic restrictions, or using the refinement features offered by some engines (e.g., AltaVista offers a refine function, and Infoseek allows subsequent searches within the result set of previous searches).
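For illustration, such higher-precision queries can be assembled programmatically (a minimal sketch; the “+” required-term and quoted-phrase syntax shown was supported by several engines of the period, such as AltaVista, but each engine’s query language differs):

```python
def build_query(required_terms=(), phrases=(), optional_terms=()):
    """Build a higher-precision query using required terms and phrases.

    Prefixing a term with '+' marks it as required, and quoting a group
    of words searches for the exact phrase.
    """
    parts = ["+" + term for term in required_terms]
    parts += ['"' + phrase + '"' for phrase in phrases]
    parts += list(optional_terms)
    return " ".join(parts)

# e.g., '+annealing "simulated annealing" optimization'
print(build_query(required_terms=["annealing"],
                  phrases=["simulated annealing"],
                  optional_terms=["optimization"]))
```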
Another alternative is to combine available search engines with automated online searching. One example is the Internet “softbot” [11]. The softbot transforms queries into goals, and uses a planning algorithm (with extensive knowledge of the information sources) to generate a sequence of actions that satisfies the goal. AHOY! is a successful softbot that locates homepages for individuals [11]. Shakes et al. performed a study in which they searched for the homepages of 582 researchers; AHOY! was able to locate more homepages than MetaCrawler (which located more homepages than HotBot or AltaVista). AHOY! also provided greatly improved precision.
More comprehensive and relevant results may also be possible using a search engine that specializes in a particular area; for example, Excite NewsTracker specializes in indexing news sites, and OpenText Pinstripe specializes in indexing business sites. Because there are fewer pages to index, these engines may be able to be more comprehensive within their area, and may also be able to update their indices more regularly. When searching for popular information, directories constructed by hand, such as Yahoo’s directory, can be very useful because fewer low-relevance results are returned.
In summary, there exist several ways of improving on the major Web search engines, depending on the type of information desired. For harder-to-find information, metasearch and softbots can improve coverage. If the topic being queried is covered by one of the more specialized engines, these engines can be used, and they often provide more comprehensive and up-to-date indices within their specialty compared to the general Web search engines.

SCIENTIFIC INFORMATION RETRIEVAL

Immediate access to scientific literature has long been desired by scientists, and the Web search engines have made a large and growing body of scientific literature and other information resources accessible within seconds. Advances in computing and communications, and the rapid rise of the Web, have led to the increasingly widespread availability of online research articles, as well as a simple-to-use Web version of the Institute for Scientific Information’s® (ISI) Science Citation Index® — the Web of Science®. The Web is changing the way researchers locate and access scientific publications. Many print journals now provide access to the full text of articles on the Web, and the number of online journals was about 1000 in 1996 [12]. Researchers are increasingly making their work available on their homepages or in technical report archives.

AVAILABILITY

Much of the scientific literature is copyrighted by the authors or publishers, and is not generally available on the “publicly indexable Web.” However, the amount of scientific material available on the publicly indexable Web is growing. Some journals owned by societies such as the IEEE (the largest technical/scientific society) and the ACM permit papers to be placed on the authors’ Web sites as long as the proper copyright notices are posted. Some private publishers, MIT Press for example, are doing the same. Some publishers permit prepublication Web access, but do not allow posting of the final version of papers. We predict that more and more papers will be available on the publicly indexable Web in the future.

We used six major Web search engines to search for the papers in a recent issue of Neural Computation, after the table of contents was released but before we obtained our copy of the journal. We found that about 50 percent of the papers were available on the homepages of the authors. As mentioned before, the coverage of any one search engine is limited. The simplest means of improving the chances of finding a particular scientist or paper on the publicly indexable Web is to combine the results of multiple engines, as is done with metasearch engines such as MetaCrawler.

Although more and more scientific papers are being made available on the publicly indexable Web, these papers are spread throughout researcher and institution homepages, technical report archives, and journal sites. The Web search engines do not make it easy to locate these papers because they typically do not index Postscript or PDF documents, which account for a large percentage of the available articles. The next section introduces a technique for organizing and indexing this literature.

DIGITAL LIBRARIES AND CITATION INDEXING

The Web offers the possibility of providing easy and efficient services for organizing and accessing scientific information. A citation index is one such service. Citation indices [13] index the citations in an article, linking the article with the cited works. Citation indices were originally designed for literature search, allowing a researcher to find subsequent articles that cite a given article. Citation indices are also valuable for other purposes, including the evaluation of articles, authors, and so on, and the analysis of research trends. The most popular citation indices of academic research are produced by the ISI. One such index, the Science Citation Index, is intended to be a practical, cost-effective tool for indexing the significant scientific journals. Unfortunately, the ISI databases are expensive and not available to all researchers. Much of the expense is due to the manual effort required during indexing.

The rise of the Internet and the Web has led to proposals for online digital libraries that incorporate citation indexing. For example, Cameron proposed a “universal, [Internet-based,] bibliographic and citation database linking every scholarly work ever written” [14]. Such a database would be highly “comprehensive and up-to-date,” making it a powerful tool for academic literature research, and for the production of statistics as with traditional citation indices. However, Cameron’s proposal presents significant implementation difficulties, and requires authors or institutions to provide citation information in a specific format.
Searching for “simulated annealing” in Machine Learning [small test index] (13828 documents, 278202 citations total).

1218 citations found

Click on the [Context] links to see the citing documents and the context of the citations.

Citations (self)   Article
196                S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, 1983, pp. 671–680. [Context] [Check]
49 (1)             D. S. Johnson et al., “Optimization by simulated annealing: An experimental evaluation,” Technical report, Bell Labs preprint. [Context] [Check]
39                 E. Aarts and J. Korst, “Simulated Annealing and Boltzmann Machines,” John Wiley and Sons, 1989. [Context] [Check]

[... section deleted ...]

Figure 1. An ACI system can group variant forms of citations to the same paper (citations can be written in many different formats), and rank search results by the number of citations.
The NEC Research Institute is working on a digital library of scientific publications that creates a citation index autonomously (using Autonomous Citation Indexing, ACI), without the requirement of any additional effort on the part of the authors or institutions, and without any manual assistance [15]. An ACI system autonomously extracts citations, identifies identical citations that occur in different formats, and identifies the context of citations in the body of articles. As with traditional citation indices like the Science Citation Index, ACI allows literature search using citation links, and the ranking of papers, journals, authors, and so on by the number of citations. Compared to traditional citation indexing systems, ACI has both disadvantages and advantages. The disadvantages include lower accuracy (which is expected to become less of a disadvantage over time). However, the advantages are significant: no manual effort is required for indexing, resulting in a corresponding reduction in cost and increase in availability, and literature search can be based on the context of citations — given a particular paper of interest, an ACI system can display the context of how the paper is cited in subsequent publications. The context of citations can be very useful for efficient literature search and evaluation. ACI has the potential for broader coverage of the literature because human indexers are not required, and can provide more timely feedback and evaluation by indexing items such as conference proceedings and technical reports. Overall, ACI can improve scientific communication, and facilitates an increased rate of scientific dissemination and feedback.
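Identifying identical citations that occur in different formats is central to ACI. The following crude sketch conveys the idea (the normalization rules here are our own simplifications; a real system such as CiteSeer uses far more careful matching [15]):

```python
import re

def citation_key(citation):
    """Map variant forms of a citation to a crude normalized key.

    Different papers cite the same work with different punctuation,
    author formats, and orderings; lowercasing, dropping punctuation and
    short/common tokens, and sorting the remaining distinctive words lets
    variants of the same citation fall into the same group.
    """
    words = re.findall(r"[a-z]+", citation.lower())
    stopwords = {"and", "by", "the", "of", "in", "vol", "pp"}
    tokens = sorted(w for w in set(words) if w not in stopwords and len(w) > 3)
    return " ".join(tokens[:8])

a = ('S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by '
     'simulated annealing," Science, vol. 220, 1983, pp. 671-680.')
b = ('Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., (1983) '
     '"Optimization by simulated annealing," Science 220, 671-680.')
print(citation_key(a) == citation_key(b))  # True: variants share one key
```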
ACI is ideal for operation on the Web — new articles can be automatically located and indexed when they are posted on the Web or announced on mailing lists, and an efficient interface for browsing the articles, citations, and the context of the citations can be created. Part of the benefit of autonomous citation indexing is due to the ability to format and organize information on demand using a Web interface to the citation index. Figure 1 shows an example of the output from the NEC Research Institute’s prototype autonomous citation indexing digital library system, CiteSeer. This example shows the results of a search for citations containing the phrase “simulated annealing” in a small test database of the machine learning literature (only a subset of the machine learning literature on the Web). Searching for citations to papers by a given author can also be performed (including secondary authors). The [Context] links show the context of the individual citations. The [Check] links show the individual citations in each group and can be used to check for errors in the citation grouping. Figure 2 shows an example of how an ACI system can extract the context of citations to a given paper and display them for easy browsing. Note that finding and extracting the context of citations to a given paper could previously be done by using traditional citation indices and manually locating and searching the citing papers — the difference is that the automation and Web interface make the task far more efficient, and thus practical, where it may not have been before.

Digital libraries incorporating citation indexing can be used to organize the scientific literature, and help with literature search and evaluation. A “universal citation database” which accurately indexes all literature would be ideal, but is currently impractical because of the limited availability of articles in electronic form and the lack of standardization in citation practices. However, CiteSeer shows that it is possible to organize and index the subset of the literature available on the Web, and to autonomously process freeform citations with reasonable accuracy. As long as there is a significant portion of publishing through the Web, be it the publicly indexable Web or the subscription-only Web, there is great value in being able to prepare citation indices from the machine-readable material. Citation indices may appear that index both parts of the Web. Access to the full text of articles may be open or by subscription, depending on how the Web and the publication business evolve. Citation indices for subscription-only data may be offered by the publisher, or prepared by a third party that has an agreement with the publisher.

THE FUTURE OF WEB SEARCH AND DIGITAL LIBRARIES

What is the future of the Web, Web search, and digital libraries? Improvements in technology will enable new applications. Computational and storage resources will continue to improve. Bandwidth is likely to increase significantly as technology advances and the following positive spiral continues: more people become connected to the Internet as it becomes easier to use and more popular, and as new access mechanisms are introduced (e.g., cable modems and digital subscriber lines); this provides incentive for the infrastructure companies to invest more in the backbone, improving bandwidth; more investment in the backbone improves access, so more people want to be connected.
S. Kirkpatrick, C. D. Gelatt Jr., and M. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983), 671–680.

 This paper is cited in the following contexts:

M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 177 - Appeared: IEEE ICASSP, San Francisco, March 1992, vol. III, pp. 45–48. - GIBBS RANDOM FIELD: TEMPERATURE AND PARAMETER ANALYSIS - Rosalind W. Picard - M.I.T. Lab, E15-392: 20 Ames Street, Cambridge, MA 02139 - picard@media.mit.edu [Details] [Full Text] [Related Articles] [ftp://whitechapel.media.mit.edu/pub/tech-reports/TR-177.ps.Z]

 ...... Simulated annealing is a popular nonlinear optimization technique where a cost function is substituted for E(x), and consequently
 minimized. There is a key observation in the simulated annealing literature that prompts the study of temperature presented in this
 paper. Kirkpatrick, et al. [3] observed that “more optimization” occurs at certain temperatures than at others. These favored
 temperatures are analogous to the physical idea of a “critical temperature,” a point that marks transition between different “phases” of
 the data. The reason for considering these physical......

...... region, we have shown in earlier work that a similar kind of point, which we call a “transition” temperature, T, does occur [8]. By measuring the specific heat of the binary process it can be shown to correspond to the same region where the “most optimization” occurs in simulated annealing [3]. For GRF analysis, this region is where the energy fluctuation peaks, and where small changes in the parameters become more significant. In [8] the transition temperature for n = 2 was estimated to be at 1/T = 1.7. This analysis suggests that attempts to estimate parameters should take......

 [3] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, Optimization by simulated annealing, Science 220 (4598) 671–680, 1983.

Technical Report No. 9805, Department of Statistics, University of Toronto - Annealed Importance Sampling - Radford M. Neal - Department of Statistics and Department of Computer Science, University of Toronto, Toronto, Ontario, Canada - http://www.cs.utoronto.ca/radford/ - radford@stat.utoronto.ca - 18 Feb. 1998 [Details] [Full Text] [Related Articles] [ftp://ftp.cs.utoronto.ca/pub/radford/ais.ps.Z]

...... respect to these transitions. Because such a chain will move between modes only rarely, it will take a long time to reach equilibrium, and will exhibit high autocorrelations for functions of the state variables out to long time lags. The method of simulated annealing was introduced by Kirkpatrick, Gelatt, and Vecchi (1983) as a way of handling multiple modes in an optimization context. It employs a sequence of distributions, with probabilities or probability densities given by p0(x) to pn(x), in which each pj differs only slightly from pj+1. The distribution p0 is the one of interest. The......

 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., (1983) “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680.

                                                               [...section deleted...]

Figure 2. An example of how an autonomous citation indexing system can show the context of citations to a given paper. The sentences containing the citations are automatically highlighted.


Will the fraction of the Web covered by the major search engines increase? Some search engines are focusing on indexing the Web pages that satisfy the majority of searches, as opposed to trying to catalog all of the Web. However, there are still some engines that aim to index the Web comprehensively. Improvements in indexing technology and computational resources will allow the creation of larger indices. Nevertheless, it is unlikely to become economically practical for a single search engine to index close to all of the publicly indexable Web in the near future. However, it is predicted that the cost of indexing and storage will decline over time relative to the increase in the size of the indexable Web [6], resulting in favorable scaling properties for centralized text search engines. In the meantime, an increased number of specialized search services may arise that cover specific types of information.

The use of more expensive and better algorithms (e.g., as in Google) will produce improved page rankings. More information retrieval techniques aimed at the large, diverse, low signal-to-noise ratio database of the Web will be developed. One interesting possibility is the use of machine learning to create query transformations similar to those used in the SEF technique discussed earlier.

Metasearch techniques, which combine the results of multiple engines, are likely to continue to be useful when searching for hard-to-find information, or when comprehensive results are desired. The major Web search engines are also likely to continue to focus on performing queries as quickly as possible; therefore, metasearch engines that perform additional client-side processing (e.g., query term context summaries) may become increasingly popular as these products become more powerful, address problems with data fusion from different sources, and learn to deal better with the constantly evolving search services. Improvements in bandwidth should improve the feasibility of metasearch techniques.

Digital libraries incorporating ACI should become more widely available, bringing the benefits of citation indexing to groups who cannot afford the commercial services, and improving the dissemination and retrieval of scientific literature.

SUMMARY

The Web is revolutionizing information access; however, current techniques for access to both general and scientific information on the Web leave much room for improvement. The Web search engines are limited in terms of coverage, recency, how well they rank query results, and the query options they support. Access to the growing body of scientific literature on the publicly indexable Web is limited by the lack of organization, and because the major search engines do not index Postscript or PDF documents. We have discussed several fruitful research directions that will improve access to general and scientific information, and greatly enhance the utility of the Web: improved ranking methods, metasearch engines, softbots, and autonomous citation indexing. It is not clear how availability will evolve, because this depends on how the Web emerges as a business platform for publishers. Nevertheless, improved ways to do basic searching and specialized citation searching are likely to evolve and replace present methods, and will greatly increase the utility of the Web over what is available today.

ACKNOWLEDGMENTS

We thank H. Stone and the reviewers for very useful comments and suggestions.

REFERENCES

[1] S. Lawrence and C. L. Giles, “Searching the World Wide Web,” Science, vol. 280, no. 5360, 1998, pp. 98–100.
[2] J. Barrie and D. Presti, “The World Wide Web as an instructional tool,” Science, vol. 274, 1996, pp. 371–72.
[3] R. Seltzer, E. Ray, and D. Ray, The AltaVista Search Revolution: How to Find Anything on the Internet, McGraw-Hill, 1997.
[4] Inktomi, http://www.inktomi.com/new/press/bellsouth.html, 1997.
[5] S. Steinberg, “Seek and ye shall find (maybe),” Wired, vol. 4, no. 5, 1996.
[6] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Proc. 7th Int’l. WWW Conf., Brisbane, Australia, 1998.
[7] J. Boyan, D. Freitag, and T. Joachims, “A machine learning architecture for optimizing Web search engines,” Proc. AAAI Wksp. Internet-Based Info. Sys., 1996.
[8] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” Proc. ACM-SIAM Symp. Discrete Algorithms, 1998.
[9] E. Selberg and O. Etzioni, “Multi-service search and comparison using the MetaCrawler,” Proc. 1995 WWW Conf., 1995.
[10] S. Lawrence and C. L. Giles, “Context and page analysis for improved Web search,” IEEE Internet Comp., vol. 2, no. 4, 1998, pp. 38–46.
[11] O. Etzioni and D. Weld, “A softbot-based interface to the Internet,” Commun. ACM, vol. 37, no. 7, 1994, pp. 72–76.
[12] G. Taubes, Science, vol. 271, 1996, p. 764.
[13] E. Garfield, Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, New York: Wiley, 1979.
[14] R. D. Cameron, “A universal citation database as a catalyst for reform in scholarly communication,” First Monday, vol. 2, no. 4, 1997.
[15] C. L. Giles, K. Bollacker, and S. Lawrence, “CiteSeer: An automatic citation indexing system,” I. Witten, R. Akscyn, and F. M. Shipman III, Eds., Digital Libraries 98 — The Third ACM Conf. Digital Libraries, Pittsburgh, PA, 1998, pp. 89–98.

BIOGRAPHIES

STEVE LAWRENCE (lawrence@research.nj.nec.com) is a research scientist at the NEC Research Institute, Princeton, New Jersey. His research interests include information retrieval and dissemination, machine learning, artificial intelligence, neural networks, face recognition, speech recognition, time series prediction, and natural language. His awards include an NEC Research Institute excellence award, ATERB and APRA priority scholarships, a QUT university medal and award for excellence, QEC and Telecom Australia Engineering prizes, and three successive prizes in the annual Australian Mathematics Competition. He received a B.Sc. in computing and a B.Eng. in electronic systems from the Queensland University of Technology, Australia, and a Ph.D. from the University of Queensland, Australia.

C. LEE GILES [F] (giles@research.nj.nec.com) is a senior research scientist in computer science at the NEC Research Institute, Princeton, New Jersey. Currently he is an adjunct professor at the Institute for Advanced Computer Studies at the University of Maryland. His research interests are in novel applications of neural and machine learning, agents and AI in the Web, and computing. He is on the editorial boards of IEEE Intelligent Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Neural Networks, the Journal of Computational Intelligence in Finance, Journal of Parallel and Distributed Computing, Neural Networks, Neural Computation, and Applied Optics.




 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs inventionjournals
 
Deep Web: Databases on the Web
Deep Web: Databases on the WebDeep Web: Databases on the Web
Deep Web: Databases on the WebDenis Shestakov
 

Similar a Improving Web search and access to scientific information (20)

Building efficient and effective metasearch engines
Building efficient and effective metasearch enginesBuilding efficient and effective metasearch engines
Building efficient and effective metasearch engines
 
Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...
 
Web resources
Web resourcesWeb resources
Web resources
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Google Research Paper
Google Research PaperGoogle Research Paper
Google Research Paper
 
A Study Web Data Mining Challenges And Application For Information Extraction
A Study  Web Data Mining Challenges And Application For Information ExtractionA Study  Web Data Mining Challenges And Application For Information Extraction
A Study Web Data Mining Challenges And Application For Information Extraction
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Test
TestTest
Test
 
Smart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web HarvestingSmart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web Harvesting
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexing
 
Paper24
Paper24Paper24
Paper24
 
L017447590
L017447590L017447590
L017447590
 
ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010ΟΚΤΩΒΡΙΟΣ 2010
ΟΚΤΩΒΡΙΟΣ 2010
 
search
searchsearch
search
 
search
searchsearch
search
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document Clustering
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs
 
Deep Web: Databases on the Web
Deep Web: Databases on the WebDeep Web: Databases on the Web
Deep Web: Databases on the Web
 

Más de Stefanos Anastasiadis

Más de Stefanos Anastasiadis (8)

Web design ing
Web design ingWeb design ing
Web design ing
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
 
The little-joomla-seo-book-v1
The little-joomla-seo-book-v1The little-joomla-seo-book-v1
The little-joomla-seo-book-v1
 
The google best_practices_guide
The google best_practices_guideThe google best_practices_guide
The google best_practices_guide
 
Web search algorithms and user interfaces
Web search algorithms and user interfacesWeb search algorithms and user interfaces
Web search algorithms and user interfaces
 
Integration visualization
Integration visualizationIntegration visualization
Integration visualization
 
Search engines
Search enginesSearch engines
Search engines
 
Ecommerce webinar-oct-2010
Ecommerce webinar-oct-2010Ecommerce webinar-oct-2010
Ecommerce webinar-oct-2010
 

Último

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Último (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Improving Web search and access to scientific information

Web, and can enhance access to virtually all forms of digital libraries.

The revolution the Web has brought to information access is due not so much to the availability of information (huge amounts of information have long been available in libraries and elsewhere), but rather to the increased efficiency of accessing information, which can make previously impractical tasks practical. There are many avenues for improving the efficiency of access to information on the Web, for example in the areas of locating and organizing information.

This article discusses general and scientific information access on the Web; many of our comments are also applicable to digital libraries in general. We discuss the effectiveness of Web search engines, including results showing that the major search engines cover only a fraction of the "publicly indexable Web" (the part of the Web considered for indexing by the major engines, which excludes pages hidden behind search forms, pages with authorization requirements, and so on). We discuss current research into improved searching of the Web, including new techniques for ranking the relevance of results, and new metasearch techniques that can improve the efficiency and effectiveness of Web search.

The amount of scientific information and the number of electronic journals on the Internet continue to increase, and researchers are increasingly making their work available online. This article also discusses the creation of digital libraries of the scientific literature incorporating autonomous citation indexing. The autonomous creation of citation indices is possible today, and can improve access to scientific information on the Web and in other digital libraries of scientific articles.

As the results below show, none of the search engines covers more than about one third of the publicly indexable Web, and the freshness of the various databases varies significantly.
Typical quotes regarding the coverage and recency of the major search engine databases include: "If you can't find it using AltaVista search, it's probably not out there" [3], "[With AltaVista] you can find new information just about as quickly as it's available on the Web" [3], and "HotBot is the first search robot capable of indexing and searching the entire Web" [4]. However, the World Wide Web is a distributed, dynamic, and rapidly growing [1] information resource that presents difficulties for traditional information retrieval technologies, which were designed for different environments and have typically been used to index static collections of directly accessible documents. The nature of the Web raises questions such as: Can the centralized architecture of the search engines keep up with the increasing number of documents on the Web? Can the engines update their databases regularly enough to detect modified, deleted, and relocated information? The answers to these questions affect the best methodology to use when searching the Web, as well as the future of Web search technology.

We performed a study of the comprehensiveness and recency of the major Web search engines in December 1997 by analyzing the responses of AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light to 575 queries made by employees of the NEC Research Institute [1]. Search engines rank documents differently and can return documents that do not contain the query terms (e.g., pages with morphological variants or synonyms). Therefore, we considered only queries for which we could download the full text of every document that each engine reported as matching, and we counted documents only if they could be downloaded and contained the query terms. We also handled other important details such as the normalization of URLs, capitalization, and morphology (full details can be found in [1]).
Table 1. Estimated coverage of each engine with respect to the estimated size of the Web, and the percentage of invalid links returned by each engine (from 575 queries performed December 15-17, 1997):

Search engine                        HotBot   AltaVista   Northern Light   Excite   Infoseek   Lycos
Coverage w.r.t. estimated Web size   34%      28%         20%              14%      10%        3%
Percentage of dead links returned    5.3%     2.5%        5.0%             2.0%     2.6%       1.6%

Table 1 shows the estimated coverage of the search engines, which varies by an order of magnitude. This variation is much greater than would be expected from the number of pages that each engine reports to have indexed. The variation may be explained by differences in indexing or retrieval technology between the engines (e.g., an engine would appear smaller if it indexed only part of the text on some pages), or by differences in the kinds of pages indexed (our study used mostly scientific queries, which may not be covered as well if an engine focuses on well-connected, "popular" pages). Note that the results in the table are specific to the particular queries performed (typical queries made by scientists) and to the state of the engine databases at the time the queries were made.

We estimated a lower bound of 320 million pages on the size of the publicly indexable Web. To produce this estimate, we analyzed the overlap between pairs of engines [1]. Consider two engines a and b. Under the assumption that each engine samples the Web independently, the quantity n_o/n_b, where n_o is the number of documents returned by both engines and n_b is the number of documents returned by engine b, is an estimate of the fraction p_a of the indexable Web covered by engine a. The size of the indexable Web can then be estimated as s_a/p_a, where s_a is the number of pages indexed by engine a.
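To make the arithmetic concrete, the following minimal sketch computes the overlap-based estimate; the counts used are hypothetical for illustration, not the values from the study:

```python
def estimate_web_size(n_overlap, n_b, s_a):
    """Overlap-based estimate of the size of the indexable Web.

    n_overlap -- number of documents returned by both engines a and b
    n_b       -- number of documents returned by engine b
    s_a       -- total number of pages indexed by engine a

    Assumes the two engines sample the Web independently; because they
    do not in practice, the result is a lower bound.
    """
    p_a = n_overlap / n_b  # estimated fraction of the Web covered by engine a
    return s_a / p_a       # estimated size of the indexable Web

# Hypothetical counts for illustration only: the engines overlap on 40
# of engine b's 130 results, and engine a indexes 110 million pages.
print(estimate_web_size(n_overlap=40, n_b=130, s_a=110e6))  # about 357 million pages
```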
This technique is limited because the engines do not choose pages to sample independently: they all allow pages to be registered, and they are typically biased toward indexing more popular or well-connected pages. To estimate the size of the Web, we used the overlap between the two largest engines, where the independence assumption is more valid (the larger engines can index more of the nonregistered and less popular pages). Some dependence between the sampling of even the two largest engines remains, so the estimate is a lower bound. Using this estimate, we found that no engine indexes more than about one third of the indexable Web, and that combining the results of the six engines returned approximately 3.5 times more documents on average than using only one engine.

Recall that the queries used in the study came from employees of the NEC Research Institute. Most of the employees are scientists, and scientists tend to search for less "popular" or harder-to-find information. This is beneficial when estimating the size of the Web as above. However, the search engines are typically biased toward indexing more "popular" information, so their coverage is typically better for more popular information.

There are a number of possible reasons why the major search engines do not provide comprehensive indices of the Web: the engines may be limited by network bandwidth, disk storage, computational power, the scalability of their indexing and retrieval technology, or a combination of these (despite claims to the contrary [5]). Because Web pages are continually added and modified, a truly comprehensive index would have to index all pages simultaneously, which is not currently possible. Furthermore, many pages have no links pointing to them, making it difficult for the search engines to know that they exist.

We also looked at the percentage of dead links returned by the search engines, which is related to how often the engines update their databases. Intuitively, a trade-off may exist between the comprehensiveness and the freshness of a search engine: a smaller index can be checked for modified documents and updated more rapidly. We found some evidence of such a trade-off: the most comprehensive engine had the largest percentage of dead links, and the least comprehensive engine had the smallest (Table 1 shows the percentage of invalid links for each engine). However, the ranking of the engines by percentage of dead links varies greatly over time, which suggests that the engines may not be very regular in their indexing processes; for example, an engine might suspend the processing of new pages during upgrades.
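A dead-link measurement of this kind can be approximated client-side by attempting to fetch each returned URL. The sketch below is not the instrumentation used in the study; it simply counts a link as dead if the request fails or the server answers with an HTTP error:

```python
import urllib.error
import urllib.request

def is_dead_link(url, timeout=10):
    """Return True if the URL cannot be fetched successfully."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return True

# Hypothetical result URLs for illustration.
results = ["http://www.example.com/", "http://www.example.com/missing"]
dead = sum(is_dead_link(url) for url in results)
print(f"{100 * dead / len(results):.1f}% dead links")
```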
How can this knowledge of the effectiveness of the search engines be used to improve Web search? The coverage investigations indicate that the coverage of the Web engines is much lower than commonly believed, and that the engines tend to index different sets of pages; when searching for less popular information, it can therefore be very useful to combine the results of multiple engines. The freshness investigations indicate that it is difficult to predict ahead of time which engine will be the best to use when looking for recent information, so combining the results of multiple engines can be very useful there as well. There are other ways to compare the search engines besides comprehensiveness and recency, such as how well they rank the relevance of results (discussed in the next section) and the features of the query interface.

RESEARCH IN WEB SEARCH

Research into technology for searching the Web is abundant, which is not surprising considering that the existence of full-text search engines is one of the major differences between the Web and previous means of accessing information. The following sections look at some of the recent research: improved methods for ranking pages that utilize the graph structure of the Web, a metasearch technique that improves the efficiency of Web search by downloading matching pages in order to extract query term context and analyze the pages, and "softbots," which can be used to locate pages that may not be indexed by any of the engines.

Page Relevance — A common complaint against search engines is that they return too many pages, and that many of them have low relevance to the query. This has been used as an argument for not providing comprehensive indices of the Web ("people are already overloaded with too much information"). However, a search engine could be more comprehensive while still returning the same set of pages first. One of the main problems is that the search engines do not rank the relevance of results very well. Research search engines such as Google [6] and LASER [7] promise improved ranking of results. These engines make greater use of HTML structure and of the graph formed by hyperlinks to determine page relevance than do the major Web search engines. For example, Google uses a ranking algorithm called PageRank that iteratively uses information from the number of pages pointing to each page (which is related to the popularity of the pages). Google also uses the text in links to a page as descriptors of the page (the links often contain better descriptions of pages than the pages themselves). Another engine with a novel ranking measure is Direct Hit (http://www.directhit.com), which is typically good for common queries: it ranks the results for a given query according to the number of times previous users have clicked on each page (i.e., more popular pages are ranked higher).
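The iterative principle behind PageRank can be illustrated with a short sketch. This is a minimal illustration of the idea (rank flows along links, so a page pointed to by many highly ranked pages accumulates a high score), not Google's actual implementation; the graph and parameter values are hypothetical:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank-style scores for a small link graph.

    links -- dict mapping each page to the list of pages it links to.
    Each iteration, a page distributes its current score among the
    pages it points to.
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Hypothetical four-page graph for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))
```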
Kleinberg [8] has presented a method for locating two types of useful pages: authorities, which are highly referenced pages, and hubs, which are pages that contain links to many authorities. The underlying principle is that good hub pages point to many good authority pages, and a good authority page is pointed to by many good hub pages. An iterative process can be used to find hubs and authorities [8]. Future search engines may use this method to classify hub and authority pages, and to rank the pages within these classes.
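A sketch of the iterative hub/authority computation follows. It is a bare-bones illustration of Kleinberg's mutual-reinforcement principle with a hypothetical link graph, omitting the root-set construction and other details of the full algorithm:

```python
def hits(links, iterations=50):
    """Compute hub and authority scores by mutual reinforcement: a
    page's authority score is the sum of the hub scores of the pages
    linking to it, and its hub score is the sum of the authority
    scores of the pages it links to. Scores are normalized each pass."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q, targets in links.items() if p in targets)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: "a" and "d" act as hubs, "c" as an authority.
graph = {"a": ["b", "c"], "b": ["c"], "d": ["b", "c"]}
hubs, authorities = hits(graph)
```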
Metasearch — Limitations of the search services have led to the introduction of metasearch engines [9]. A metasearch engine searches the Web by making requests to multiple search engines such as AltaVista or HotBot. The primary advantages of current metasearch engines are the ability to combine the results of multiple search engines and the ability to provide a consistent user interface for searching these engines. The idea of querying and collating results from multiple databases is not new; companies like PLS, Lexis-Nexis, DIALOG, and Verity long ago created systems that integrate the results of multiple heterogeneous databases [9]. Many other Web metasearch services exist, such as the popular and useful MetaCrawler service [9]; similar services include SavvySearch and Infoseek Express.

Metasearch engines can introduce their own deficiencies. For example, they can have difficulty ranking the list of results: if one engine returns many low-relevance documents, those documents may make it harder to find relevant pages in the list. Most of the metasearch engines on the Web also limit the number of results that can be obtained, and typically do not support all of the features of each engine's query language.

The NEC Research Institute has been developing an experimental metasearch engine called Inquirus [10]. Inquirus was motivated by problems with current metasearch engines, as well as by the poor precision, limited coverage, limited availability, limited user interfaces, and out-of-date databases of the major Web search engines. Rather than working with the lists of documents and summaries returned by search engines, as current metasearch engines typically do, Inquirus downloads and analyzes the individual documents. Inquirus improves on existing engines in a number of areas: more useful document summaries incorporating query term context; identification of both pages that no longer exist and pages that no longer contain the query terms; improved detection of duplicate pages; progressive display of results; improved document ranking using proximity information (because Inquirus has the full text of all pages, it avoids the ranking problem of standard metasearch engines); dramatically improved precision for certain queries through specific expressive forms; and quick jump links and highlighting when viewing the full documents.

One of the fundamental features of Inquirus is that it analyzes each document and displays the local context around the query terms. The benefit of displaying local context, rather than an abstract or query-insensitive summary of the document, is that the user may be able to more readily determine whether the document answers his or her specific query (without repeatedly clicking and waiting for pages to download). A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially for Web search, where the database is very large, diverse, and poorly organized. A study by Tombros (1997) supports this: in a user study, users working with query-sensitive summaries had a higher success rate, performed relevance judgments more accurately and rapidly, and had a greatly reduced need to refer to the full text of documents.
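The local-context idea is straightforward to sketch. The following is a minimal illustration; the window size and formatting are arbitrary choices, not Inquirus's actual behavior:

```python
import re

def query_term_context(text, terms, window=40):
    """Return short snippets of text surrounding each occurrence of a
    query term, in the spirit of query-sensitive summaries."""
    pattern = re.compile("|".join(re.escape(term) for term in terms),
                         re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        snippets.append("..." + text[start:end] + "...")
    return snippets

page = ("Simulated annealing is a popular nonlinear optimization "
        "technique in which a cost function is minimized by analogy "
        "with the slow cooling of a physical system.")
for snippet in query_term_context(page, ["simulated annealing"]):
    print(snippet)
```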
One interesting feature of Inquirus is the Specific Expressive Forms (SEF) search technique. The Web is highly redundant, and techniques that trade recall (the fraction of all relevant documents returned) for improved precision (the fraction of returned documents that are relevant) are often useful. The SEF technique transforms a query in the form of a question into specific forms for expressing the answer. For example, the query "What does NASDAQ stand for?" is transformed into the queries "NASDAQ stands for", "NASDAQ is an abbreviation", and "NASDAQ means". Clearly the information may be contained in a form other than these three; however, if the information does exist in one of these forms, finding these phrases is more likely to provide the answer to the query. For many queries the answer might exist on the Web but not in any of the specific forms used; nevertheless, our experiments indicate that the method works well enough to be effective for certain queries.
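Using the article's NASDAQ example, a sketch of this kind of transformation might look as follows. The single question pattern shown is illustrative; a real system would recognize many question forms:

```python
import re

def specific_expressive_forms(query):
    """Transform a question into phrase queries that are likely to
    express the answer, trading recall for precision."""
    match = re.match(r"what does (.+) stand for\?", query.strip(),
                     re.IGNORECASE)
    if match:
        term = match.group(1)
        return [f'"{term} stands for"',
                f'"{term} is an abbreviation"',
                f'"{term} means"']
    return [query]  # no transformation known for this question form

print(specific_expressive_forms("What does NASDAQ stand for?"))
# ['"NASDAQ stands for"', '"NASDAQ is an abbreviation"', '"NASDAQ means"']
```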
Inquirus is surprisingly efficient: it downloads search engine responses and Web pages in parallel, and typically returns the first result faster than the average response time of a search engine.

In summary, metasearch techniques can improve the efficiency of Web search by combining the results of multiple search engines, and by implementing functionality that the underlying engines do not provide (e.g., extracting query term context and filtering dead links). The Inquirus prototype at the NEC Research Institute has shown that downloading and analyzing pages in real time is feasible. Inquirus, like other metasearch engines and various Web tools, relies on the underlying search engines, which provide important and valuable services. Wide-scale use of this or any metasearch engine would require an amicable arrangement with the underlying search engines; such arrangements might include passing through advertisements or micropayment systems.

IMPROVING WEB SEARCH

Users tend to make queries that result in poor precision. About 70 percent of queries to Infoseek contain only one term (Harry Motro, Infoseek CEO, CNBC, May 7, 1998), and about 40 percent of the queries made by employees of the NEC Research Institute to the Inquirus engine contain only one term. In information retrieval there is typically a trade-off between precision and recall. Simple (e.g., single-term) queries can return thousands or millions of documents; ranking the relevance of so many documents is a difficult problem, and the desired documents may not appear near the top of the list. One way to improve the precision of results is to use more query terms, and to tell the search engines that relevant documents must contain certain terms (required terms). Other ways include using phrases or proximity (e.g., searching for specific phrases rather than single terms), using constraints offered by some engines such as date ranges and geographic restrictions, or using refinement features (e.g., AltaVista offers a refine function, and Infoseek allows subsequent searches within the result set of previous searches).
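For illustration, a helper of the following kind could assemble such a refined query using syntax conventions common at the time ('+' for required terms, quotes for phrases); the exact syntax varies by engine, so treat this as a sketch:

```python
def refine_query(terms, phrases=(), required=()):
    """Build a more precise query string: required terms are prefixed
    with '+', and phrases are quoted so they must appear verbatim."""
    parts = [f"+{term}" for term in required]
    parts += [f'"{phrase}"' for phrase in phrases]
    parts += list(terms)
    return " ".join(parts)

print(refine_query(["jaguar"]))                                    # vague: jaguar
print(refine_query([], phrases=["jaguar xj6"], required=["car"]))  # +car "jaguar xj6"
```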
Another alternative is to combine the available search engines with automated online searching. One example is the Internet "softbot" [11], which transforms queries into goals and uses a planning algorithm (with extensive knowledge of the information sources) to generate a sequence of actions that satisfies each goal. AHOY! is a successful softbot that locates homepages for individuals [11]. Shakes et al. performed a study in which they searched for the homepages of 582 researchers: AHOY! located more homepages than MetaCrawler (which in turn located more than HotBot or AltaVista), and also provided greatly improved precision.

More comprehensive and relevant results may also be possible using a search engine that specializes in a particular area; for example, Excite NewsTracker specializes in indexing news sites, and OpenText Pinstripe specializes in indexing business sites. Because there are fewer pages to index, such engines may be more comprehensive within their area, and may also be able to update their indices more regularly. When searching for popular information, directories constructed by hand, such as Yahoo's directory, can be very useful because fewer low-relevance results are returned.

In summary, there are several ways of improving on the major Web search engines, depending on the type of information desired. For harder-to-find information, metasearch and softbots can improve coverage. If the topic being queried is covered by one of the more specialized engines, those engines can be used; they often provide more comprehensive and up-to-date indices within their specialty than the general Web search engines do.
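As a simple illustration of the combination step used by metasearch engines, the sketch below merges the result lists of several engines, removing duplicates after normalizing the URLs (the normalization shown is deliberately crude):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Normalize a URL so that trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def merge_results(*result_lists):
    """Union the result lists of several engines, keeping the order in
    which each unique URL was first seen."""
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged

engine_a = ["http://WWW.Example.com/paper.html", "http://foo.org/a"]
engine_b = ["http://www.example.com/paper.html", "http://bar.net/b"]
print(merge_results(engine_a, engine_b))  # three unique URLs
```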
mentation, and requires authors or institutions to provide cita- Many print journals now provide access to the full text of tion information in a specific format. articles on the Web, and the number of online journals was The NEC Research Institute is working on a digital about 1000 in 1996 [12]. Researchers are increasingly making library of scientific publications that creates a citation index their work available on their homepages or in technical autonomously (using Autonomous Citation Indexing, ACI), report archives. without the requirement of any additional effort on the part IEEE Communications Magazine • January 1999 119
DIGITAL LIBRARIES AND CITATION INDEXING

The Web offers the possibility of providing easy and efficient services for organizing and accessing scientific information. A citation index is one such service. Citation indices [13] index the citations in an article, linking the article with the cited works. Citation indices were originally designed for literature search, allowing a researcher to find subsequent articles that cite a given article. They are also valuable for other purposes, including the evaluation of articles, authors, and so on, and the analysis of research trends. The most popular citation indices of academic research are produced by the ISI. One such index, the Science Citation Index, is intended to be a practical, cost-effective tool for indexing the significant scientific journals. Unfortunately, the ISI databases are expensive and not available to all researchers; much of the expense is due to the manual effort required during indexing.

The rise of the Internet and the Web has led to proposals for online digital libraries that incorporate citation indexing. For example, Cameron proposed a "universal, [Internet-based,] bibliographic and citation database linking every scholarly work ever written" [14]. Such a database would be highly "comprehensive and up-to-date", making it a powerful tool for academic literature research, and for the production of statistics as with traditional citation indices. However, Cameron's proposal presents significant implementation difficulties, and requires authors or institutions to provide citation information in a specific format.

The NEC Research Institute is working on a digital library of scientific publications that creates a citation index autonomously (using Autonomous Citation Indexing, ACI), without requiring any additional effort on the part of the authors or institutions, and without any manual assistance [15]. An ACI system autonomously extracts citations, identifies identical citations that occur in different formats, and identifies the context of citations in the body of articles. As with traditional citation indices like the Science Citation Index, ACI allows literature search using citation links, and the ranking of papers, journals, authors, and so on by the number of citations. Compared to traditional citation indexing systems, ACI has both disadvantages and advantages. The main disadvantage is lower accuracy (which is expected to become less of a disadvantage over time). The advantages are significant: no manual effort is required for indexing, with a corresponding reduction in cost and increase in availability, and literature search can be based on the context of citations: given a particular paper of interest, an ACI system can display the context in which the paper is cited in subsequent publications.
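Grouping variant forms of the same citation is one of the core problems in ACI. The sketch below conveys the flavor of the task with a deliberately crude grouping key; CiteSeer's actual matching is far more careful, and the two variants shown are taken from the article's figures:

```python
import re

def citation_key(citation):
    """Reduce a free-form citation string to a crude grouping key:
    lowercase, drop digits and punctuation, and keep the distinctive
    longer words in sorted order."""
    text = re.sub(r"[^a-z ]", " ", citation.lower())
    words = [w for w in text.split() if len(w) > 3]
    return " ".join(sorted(set(words)))

variants = [
    'S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by '
    'simulated annealing," Science, vol. 220, 1983, pp. 671-680.',
    "Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., (1983) "
    '"Optimization by simulated annealing," Science, vol. 220, pp. 671-680.',
]
keys = {citation_key(v) for v in variants}
print(len(keys))  # 1: both variants map to the same group
```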
The context of citations can be very useful for efficient literature search and evaluation. ACI has the potential for broader coverage of the literature because human indexers are not required, and it can provide more timely feedback and evaluation by indexing items such as conference proceedings and technical reports. Overall, ACI can improve scientific communication, and facilitates an increased rate of scientific dissemination and feedback.

ACI is ideal for operation on the Web: new articles can be automatically located and indexed when they are posted on the Web or announced on mailing lists, and an efficient interface for browsing the articles, citations, and the context of the citations can be created. Part of the benefit of autonomous citation indexing is the ability to format and organize information on demand using a Web interface to the citation index. Figure 1 shows example output from CiteSeer, the NEC Research Institute's prototype autonomous citation indexing digital library system: the results of a search for citations containing the phrase "simulated annealing" in a small test database of the machine learning literature (only a subset of the machine learning literature on the Web). Searching for citations to papers by a given author can also be performed (including secondary authors). The [Context] links show the context of the individual citations; the [Check] links show the individual citations in each group and can be used to check for errors in the citation grouping.

[Figure 1. CiteSeer results for a search for "simulated annealing" in a small test index of the machine learning literature (13,828 documents, 278,202 citations; 1,218 citations found). The most-cited matching work is S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, 1983, pp. 671-680, with 196 citations. An ACI system can group variant forms of citations to the same paper (citations can be written in many different formats), and rank search results by the number of citations.]

Figure 2 shows an example of how an ACI system can extract the context of citations to a given paper and display them for easy browsing. Note that finding and extracting the context of citations to a given paper could previously be done by using traditional citation indices and manually locating and searching the citing papers; the difference is that the automation and Web interface make the task far more efficient, and thus practical, where it may not have been before.
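A minimal sketch of context extraction follows: given the body text of a citing article and the marker assigned to a reference, it returns the sentences containing the marker. A real system must also handle other citation styles, such as "(Kirkpatrick et al., 1983)":

```python
import re

def citation_contexts(body_text, marker):
    """Return the sentences of an article body that contain a given
    citation marker such as '[3]'."""
    sentences = re.split(r"(?<=[.!?])\s+", body_text)
    return [s for s in sentences if marker in s]

body = ("The method of simulated annealing was introduced in [3]. "
        "It employs a sequence of distributions. "
        "We follow the cooling schedule of [3] in our experiments.")
for sentence in citation_contexts(body, "[3]"):
    print(sentence)
```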
[Figure 2. An example of how an autonomous citation indexing system can show the context of citations to a given paper (here, Kirkpatrick, Gelatt, and Vecchi's "Optimization by simulated annealing," Science, vol. 220, 1983): excerpts from citing documents are displayed, and the sentences containing the citations are automatically highlighted.]

Digital libraries incorporating citation indexing can be used to organize the scientific literature, and to help with literature search and evaluation. A "universal citation database" that accurately indexes all literature would be ideal, but is currently impractical because of the limited availability of articles in electronic form and the lack of standardization in citation practices. However, CiteSeer shows that it is possible to organize and index the subset of the literature available on the Web, and to autonomously process freeform citations with reasonable accuracy. As long as a significant portion of publishing occurs through the Web, be it the publicly indexable Web or the subscription-only Web, there is great value in being able to prepare citation indices from the machine-readable material. Citation indices may appear that index both parts of the Web. Access to the full text of articles may be open or by subscription, depending on how the Web and the publication business evolve; citation indices for subscription-only data may be offered by the publisher, or prepared by a third party that has an agreement with the publisher.

THE FUTURE OF WEB SEARCH AND DIGITAL LIBRARIES

What is the future of the Web, Web search, and digital libraries? Improvements in technology will enable new applications. Computational and storage resources will continue to improve, and bandwidth is likely to increase significantly as technology advances and the following positive spiral operates: more people become connected to the Internet as it becomes easier to use and more popular, and as new access mechanisms are introduced (e.g., cable modems and digital subscriber lines); this provides incentive for the infrastructure companies to invest more in the backbone, improving bandwidth; better access, in turn, attracts more people to connect.

Will the fraction of the Web covered by the major search engines increase? Some search engines are focusing on indexing the Web pages that satisfy the majority of searches, as opposed to trying to catalog all of the Web, although some engines still aim to index the Web comprehensively. Improvements in indexing technology and computational resources will allow the creation of larger indices. Nevertheless, it is unlikely to become economically practical for a single search engine to index close to all of the publicly indexable Web in the near future. However, it is predicted that the cost of indexing and storage will decline over time relative to the increase in the size of the indexable Web [6], resulting in favorable scaling properties for centralized text search engines. In the meantime, an increased number of specialized search services may arise that cover specific types of information.
rent techniques for access to both general and scientific infor- The use of more expensive and better algorithms (e.g., as mation on the Web leave room for much improvement. The in Google) will produce improved page rankings. More infor- Web search engines are limited in terms of coverage, recency, mation retrieval techniques aimed at the large, diverse, low how well they rank query results, and the query options they signal-to-noise ratio database of the Web will be developed. support. Access to the growing body of scientific literature on One interesting possibility is the use of machine learning in the publicly indexable Web is limited by the lack of organiza- order to create query transformations similar to those used in tion and because the major search engines do not index the SEF technique discussed earlier. Postscript or PDF documents. We have discussed several Metasearch techniques, which combine the results of mul- fruitful research directions that will improve access to general tiple engines, are likely to continue to be useful when and scientific information, and greatly enhance the utility of searching for hard-to-find information, or when comprehen- the Web: improved ranking methods, metasearch engines, sive results are desired. The major Web search engines are softbots, and autonomous citation indexing. It is not clear how also likely to continue to focus on performing queries as availability will evolve, because this depends on how the Web quickly as possible, and therefore metasearch engines that emerges as a business platform for publishers. Nevertheless, perform additional client-side processing (e.g., query term improved ways to do basic searching, and specialized citation context summaries) may become increasingly popular as searching are likely to evolve and replace present methods, these products become more powerful, address problems and will greatly increase the utility of the Web over what is with data fusion from different sources, and learn to deal available today. better with the constantly evolving search services. Improve- ments in bandwidth should improve the feasibility of ACKNOWLEDGMENTS metasearch techniques. We thank H. Stone and the reviewers for very useful com- Digital libraries incorporating ACI should become more ments and suggestions. widely available, bringing the benefits of citation indexing to groups who cannot afford the commercial services, and REFERENCES improving the dissemination and retrieval of scientific litera- [1] S. Lawrence and C. L. Giles, “Searching the World Wide Web,” Science, ture. vol. 280, no. 5360, 1998, pp. 98–100. IEEE Communications Magazine • January 1999 121
[2] J. Barrie and D. Presti, "The World Wide Web as an instructional tool," Science, vol. 274, 1996, pp. 371–72.
[3] R. Seltzer, E. Ray, and D. Ray, The AltaVista Search Revolution: How to Find Anything on the Internet, McGraw-Hill, 1997.
[4] Inktomi, http://www.inktomi.com/new/press/bellsouth.html, 1997.
[5] S. Steinberg, "Seek and ye shall find (maybe)," Wired, vol. 4, no. 5, 1996.
[6] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Proc. 7th Int'l. WWW Conf., Brisbane, Australia, 1998.
[7] J. Boyan, D. Freitag, and T. Joachims, "A machine learning architecture for optimizing Web search engines," Proc. AAAI Wksp. Internet-Based Info. Sys., 1996.
[8] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Proc. ACM-SIAM Symp. Discrete Algorithms, 1998.
[9] E. Selberg and O. Etzioni, "Multi-service search and comparison using the MetaCrawler," Proc. 1995 WWW Conf., 1995.
[10] S. Lawrence and C. L. Giles, "Context and page analysis for improved Web search," IEEE Internet Comp., vol. 2, no. 4, 1998, pp. 38–46.
[11] O. Etzioni and D. Weld, "A softbot-based interface to the Internet," Commun. ACM, vol. 37, no. 7, 1994, pp. 72–76.
[12] G. Taubes, Science, vol. 271, 1996, p. 764.
[13] E. Garfield, Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, New York: Wiley, 1979.
[14] R. D. Cameron, "A universal citation database as a catalyst for reform in scholarly communication," First Monday, vol. 2, no. 4, 1997.
[15] C. L. Giles, K. Bollacker, and S. Lawrence, "CiteSeer: An automatic citation indexing system," in I. Witten, R. Akscyn, and F. M. Shipman III, Eds., Digital Libraries 98 — The Third ACM Conf. Digital Libraries, Pittsburgh, PA, 1998, pp. 89–98.

BIOGRAPHIES

STEVE LAWRENCE (lawrence@research.nj.nec.com) is a research scientist at the NEC Research Institute, Princeton, New Jersey. His research interests include information retrieval and dissemination, machine learning, artificial intelligence, neural networks, face recognition, speech recognition, time series prediction, and natural language. His awards include an NEC Research Institute excellence award, ATERB and APRA priority scholarships, a QUT university medal and award for excellence, QEC and Telecom Australia Engineering prizes, and three successive prizes in the annual Australian Mathematics Competition. He received a B.Sc. in computing and a B.Eng. in electronic systems from the Queensland University of Technology, Australia, and a Ph.D. from the University of Queensland, Australia.

C. LEE GILES [F] (giles@research.nj.nec.com) is a senior research scientist in computer science at the NEC Research Institute, Princeton, New Jersey. Currently he is an adjunct professor at the Institute for Advanced Computer Studies at the University of Maryland. His research interests are in novel applications of neural and machine learning, agents and AI in the Web, and computing. He is on the editorial boards of IEEE Intelligent Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Neural Networks, the Journal of Computational Intelligence in Finance, Journal of Parallel and Distributed Computing, Neural Networks, Neural Computation, and Applied Optics.