Computer Networks and ISDN Systems 29 (1997) 1225-1235




The quest for correct information on the Web: hyper search engines

Massimo Marchiori (E-mail: max@math.unipd.it)
Department of Pure and Applied Mathematics, University of Padova, Via Belzoni 7, 35131 Padova, Italy




Abstract

    Finding the right information in the World Wide Web is becoming a fundamental problem, since the amount of global information that the WWW contains is growing at an incredible rate. In this paper, we present a novel method to extract from a web object its "hyper" informative content, in contrast with current search engines, which only deal with the "textual" informative content. This method is not only valuable per se, but it is shown to be able to considerably increase the precision of current search engines. Moreover, it integrates smoothly with existing search engine technology, since it can be implemented on top of every search engine, acting as a post-processor, thus automatically transforming a search engine into its corresponding "hyper" version. We also show how, interestingly, the hyper information can be usefully employed to face the search engines persuasion problem. © 1997 Published by Elsevier Science B.V.

Keywords: World Wide Web; Information retrieval; Search engines; Sep; Browsers and tools; Design principles and techniques; Integration of heterogeneous information sources; Search techniques



1. Introduction

    The World Wide Web is growing at phenomenal rates, as witnessed by every recent estimation. This explosion, both of Internet hosts and of people using the web, has made crucial the problem of managing such an enormous amount of information. As market studies clearly indicate, in order to survive in this informative jungle, web users have to almost exclusively resort to search engines (automatic catalogs of the web) and repositories (human-maintained collections of links, usually topic-based). In turn, repositories are now themselves resorting to search engines to keep their databases up-to-date. Thus, the crucial component in information management is given by search engines.

    As all the recent studies on the subject report (see e.g. [8]), the performance of actual search engines is far from being fully satisfactory. While so far technological progress has made it possible to reasonably deal with the size of the web, in terms of the number of objects classified, the big problem is to properly classify objects in response to the users' needs: in other words, to give the user the information s/he needs.

    In this paper, we argue for a solution to this problem, by means of a new measure of the informative content of a web object, namely its "hyper information". Informally, the problem with current search engines is that they look at a web object to evaluate it almost just like a piece of text. Even with extremely sophisticated techniques like those already present in some search engine scoring mechanisms, this approach suffers from an intrinsic weakness: it doesn't take into account the web structure the object is part of.

    The power of the web relies precisely on its capability of redirecting the information flow via hyperlinks, so it should appear rather natural that, in order to evaluate the informative content of a web object, the web structure has to be carefully analyzed.



Instead, so far only very limited forms of analysis of an object in the web space have been used, like "visibility" (the number of links pointing to a web object), and the gain in obtaining more precise information has been very little.

    In this paper, instead, we develop a natural notion of "hyper" information which properly takes into account the web structure an object is part of. Informally, the "overall information" of a web object is not composed only of its static "textual information": another "hyper" information has to be added, which is the measure of the potential information of a web object with respect to the web space. Roughly speaking, it measures how much information one can obtain using that page with a browser, and navigating starting from it.

    Besides its fundamental characteristics and future potential, the hyper information has an immediate practical impact: indeed, it works "on top" of any textual information function, manipulating its "local" scores to obtain the more complete "global" score of a web object. This means that it provides a smooth integration with existing search engines, since it can be used to considerably improve the performance of a search engine simply by post-processing its scores.

    Moreover, we show how the hyper information can nicely deal with the big problem of "search engines persuasion" (tuning pages so as to cheat a search engine, in order to make it give a higher rank), one of the most alarming factors of degradation of the performance of most search engines.

    Finally, we present the results of an extensive testing of the hyper information, used as a post-processor in combination with the major search engines, and show how the hyper information is able to considerably improve the quality of information provided by search engines.

2. World Wide Web

    In general, we consider an (untimed) web structure to be a partial function from URLs (http://www.w3.org/pub/WWW/Addressing/rfc1738.txt) to sequences of bytes. The intuition is that for each URL we can require from the web structure the corresponding object (an HTML (http://www.w3.org/pub/WWW/MarkUp/) page, a text file, etc.). The function has to be partial because for some URLs there is no corresponding object.

    In this paper we consider as web structure the World Wide Web structure WWW. Note that in general the real WWW is a timed structure, since the URL mapping varies with time (i.e. it should be written as WWW_t for each time instant t). However, we will work under the hypothesis that the WWW is locally time consistent, i.e. that there is a non-void time interval I such that the probability that

    WWW_t(url) = seq  and  t <= t' <= t + I  imply  WWW_t'(url) = seq

is extremely high. That is to say, if at a certain time t a URL points to an object seq, then there is a time interval in which this property stays true. Note this doesn't mean that the web structure stays the same, since new web objects can be added.

    We will also assume that all the operations relying on the WWW structure that we will employ in this paper can be performed with extremely high probability within time I: this way, we are acting on a "locally consistent" time interval, and thus it is (with extremely high probability) safe to get rid of the dynamic behavior of the World Wide Web and consider it as an untimed web structure, as we will do.

    Note that these assumptions are not restrictive, since empirical observations (cf. [2]) show that the WWW is indeed locally time consistent, e.g. setting I = 1 day (in fact, this is one of the essential reasons for which the World Wide Web works). This, in turn, also implies that the second hypothesis (algorithms working in at most I time) is not restrictive.

    A web object is a pair (url, seq), made up of a URL url and a sequence of bytes seq = WWW(url).

    In the sequel, we will usually consider the WWW web structure understood in all the situations where web objects are considered.

    As usual in the literature, when no confusion arises we will sometimes talk of a link meaning the pointed web object (for instance, we may talk about the score of a link meaning the score of the web object it points to).
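As an illustration of the definitions above, the following is a minimal Python sketch of a web structure as a partial function from URLs to byte sequences, with a web object as the pair (url, seq). Using an HTTP fetch as the concrete realization of WWW is an assumption made here for illustration only; the paper itself does not prescribe any implementation.

```python
from typing import Optional, Tuple
from urllib.request import urlopen
from urllib.error import URLError

WebObject = Tuple[str, bytes]  # a web object is a pair (url, seq)

def www(url: str) -> Optional[bytes]:
    """The web structure seen as a partial function: URL -> sequence of bytes.
    Returns None where the function is undefined (no corresponding object)."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except (URLError, ValueError):
        return None

def web_object(url: str) -> Optional[WebObject]:
    """Build the web object (url, seq) with seq = WWW(url), if it exists."""
    seq = www(url)
    return (url, seq) if seq is not None else None
```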
3. Hyper vs. non-hyper

    A great problem with search engines' scoring mechanisms is that they tend to score text more than hypertext. Let us explain this better. It is well known that the essential difference between normal texts and hypertext is the relation-flow of information which is possible via hyperlinks. Thus, when evaluating the informative content of a hypertext, this fundamental, dynamic behavior of hypertext should be taken into account. Instead, most search engines tend to simply forget the "hyper" part of a hypertext (the links), by simply scoring its textual components, i.e. providing the textual information (henceforth, we say "information" in place of the more appropriate, but longer, "measure of information").

    Note that ranking in a different way words appearing in the title (http://www.w3.org/pub/WWW/TR/REC-html32.html#title), or between headings (http://www.w3.org/pub/WWW/TR/REC-html32.html#headings), etc., like all contemporary search engines do, is not in contradiction with what was said above, since title, headings and so on are not "hyper", but are simply attributes of the text.

    More formally, the textual information of a web object (url, seq) considers only the textual component seq, not taking into account the hyper information given by url and by the underlying World Wide Web structure.

    A partial exception is considering the so-called visibility (cf. for instance [1]) of a web object. Roughly, the visibility of a web object is the number of other web objects with a link to it. Thus, visibility is in a sense a measure of the importance of a web object in the World Wide Web context. Excite (http://www.excite.com) and WebCrawler (http://www.webcrawler.com) have been the first search engines to provide higher ranks to web objects with high visibility, now followed by Lycos (http://www.lycos.com) and Magellan (http://www.mckinley.com).

    The problem is that visibility says nothing about the informative content of a web object. The misleading assumption is that if a web object has high visibility, then this is a sign of importance and consideration, and so de facto its informative content must be more valuable than that of other web objects that have fewer links pointing to them. This reasoning would be correct if all web objects were known to users and to search engines. But this assumption is clearly false. The fact is that a web object which is not known enough, for instance because of its location, is going to have a very low visibility (when it is not completely neglected by search engines), regardless of its informative content, which may be by far better than that of other web objects.

    In a nutshell, visibility is likely to be a synonym of popularity, which is something completely different from quality, and thus its usage by search engines to give higher scores is a rather poor choice.

3.1. The hyper information

    As said, what is really missing in the evaluation of the score of a web object is its hyper part, that is the dynamic information content which is provided by hyperlinks (henceforth, simply links).

    We call this kind of information hyper information: this information should be added to the textual information of the web object, giving its (overall) information in the World Wide Web. We indicate these three kinds of information as HYPERINFO, TEXTINFO and INFORMATION, respectively. So, for every web object A we have that INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A) (note that in general these information functions depend on a specific query, that is to say they measure the informative content of a web object with respect to a certain query: in the sequel, we will always consider such query to be understood).

    The presence in a web object of links clearly augments the informative content with the information contained in the pointed web objects (although we have to establish to what extent).

    Recursively, links present in the pointed web objects further contribute, and so on. Thus, in principle, the analysis of the informative content of a web object A should involve all the web objects that are reachable from it via hyperlinks (i.e., "navigating" in the World Wide Web).

    This is clearly unfeasible in practice, so, for practical reasons, we have to stop the analysis at a certain depth, just like in programs for chess analysis we have to stop considering the moves tree after a few moves.
So, one should fix a certain upper bound for the depth of the evaluation. The definition of depth is completely natural: given a web object O, the (relative) depth of another web object O' is the minimum number of links that have to be activated ("clicked") in order to reach O' from O. So, saying that the evaluation has max depth k means that we consider only the web objects at depth less or equal than k.

    By fixing a certain depth, we thus select a suitable finite "local neighborhood" of a web object in the World Wide Web. Now we are faced with the problem of establishing the hyper information of a web object A w.r.t. this neighborhood (the hyper information with depth k). We denote this information by HYPERINFO_[k], and the corresponding overall information (with depth k) by INFORMATION_[k]. In most cases, k will be considered understood, and so we will omit the [k] subscript.

    We assume that INFORMATION, TEXTINFO and HYPERINFO are functions from web objects to non-negative real numbers. Their intuitive meaning is that the more information a web object has, the greater is the corresponding number.

    Observe that in order to have a feasible implementation, we need all of these functions to be bounded (i.e., there is a number M such that M >= INFORMATION(A) for every A). So, we assume without loss of generality that TEXTINFO has an upper bound of 1, and that HYPERINFO is bounded.

3.2. Single links

    To start with, consider the simple case where we have at most one link in every web object.

    In the basic case when the depth is one, the most complicated case is when we have one link from the considered web object A to another web object B, i.e.

    [Figure: a web object A with a single link to a web object B]

    A naive approach is just to add the textual information of the pointed web object. This idea, more or less, is tantamount to identifying a web object with the web object obtained by replacing the links with the corresponding pointed web objects.

    This approach is attractive, but not correct, since it raises a number of problems. For instance, suppose that A has almost zero textual information, while B has an extremely high overall information. Using the naive approach, A would have more informative content than B, while it is clear that the user is much more interested in B than in A.

    The problem becomes even more evident when we increase the depth of the analysis: consider a situation where B_k is at depth k from A, i.e. to go from A to B_k one needs k "clicks", and k is big (e.g. k > 10):

    [Figure: a chain of links A -> B_1 -> B_2 -> ... -> B_k]

    Let A, B_1, ..., B_{k-1} have almost zero informative content, and B_k have very high overall information. Then A would have higher overall information than B_k, which is paradoxical since A clearly has a too weak relation with B_k to be more useful than B_k itself.

    The problem essentially is that the textual information pointed to by a link cannot be considered as actual, since it is potential: for the user there is a cost to retrieve the textual information pointed to by a link (click and... wait).

    The solution to these two factors is: the contribution to the hyper information of a web object at depth k is not simply its textual information, but its textual information diminished via a fading factor depending on its depth, i.e. on "how far" the information is for the user (how many clicks s/he has to perform).

    Our choice about the law regulating this fading function is that textual information fades exponentially w.r.t. the depth, i.e. the contribution to the hyper information of A given by an object B at depth k is F^k · TEXTINFO(B), for a suitable fading factor F (0 < F < 1).

    Thus, in the above example, the hyper information of A is not TEXTINFO(B_1) + ... + TEXTINFO(B_k), but F · TEXTINFO(B_1) + F^2 · TEXTINFO(B_2) + ... + F^k · TEXTINFO(B_k), so that the overall information of A is not necessarily greater than that of B_k.

    As an aside, note that when calculating the overall information of a web object A, its textual information can be nicely seen as a special degenerate case of hyper information, since TEXTINFO(A) = F^0 · TEXTINFO(A) (viz., the object is at "zero distance" from itself).
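A minimal sketch of the single-link case: along a chain A -> B_1 -> ... -> B_k, the overall information of A is the sum of the faded textual scores, with TEXTINFO(A) as the degenerate F^0 term. The numeric scores and the fading factor below are hypothetical, chosen only to reproduce the behaviour described in the text (A no longer outranks a distant but highly informative B_k).

```python
def chain_information(textinfo_chain, fading=0.75):
    """Overall information of the first object of a chain A -> B1 -> ... -> Bk.

    textinfo_chain[0] is TEXTINFO(A); textinfo_chain[i] is TEXTINFO(Bi),
    i.e. the textual information of the object at depth i.
    Each textual score is assumed to lie in [0, 1]."""
    return sum(fading ** depth * score
               for depth, score in enumerate(textinfo_chain))

# Hypothetical scores: A has almost no textual information, B4 is very informative.
a_chain = [0.01, 0.05, 0.05, 0.05, 0.9]    # TEXTINFO of A, B1, B2, B3, B4
print(chain_information(a_chain))           # overall information of A (about 0.38)
print(chain_information(a_chain[4:]))       # overall information of B4 alone (0.9)
```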
3.3. More motivations

    There is still one point to clarify. We have introduced an exponentially fading law, and seen how it behaves well with respect to our expectations. But one could ask why the fading function has to be exponential, and not of a different form, for instance polynomial (e.g., multiplying by F/k instead of by F^k). Let us consider the overall information of a web object with depth k: if we have

    [Figure: a web object A with a single link to a web object B]

it is completely reasonable to assume that the overall information of A (with depth k) is given by the textual content of A plus the faded overall information of B (with depth k - 1). It is also reasonable to assume that this fading can be approximated by multiplying by an appropriate fading constant F (0 < F < 1). So, we have the recursive relation

    INFORMATION_[k](A) = TEXTINFO(A) + F · INFORMATION_[k-1](B)

    Since readily for every object A we have INFORMATION_[0](A) = TEXTINFO(A), eliminating recursion we obtain that if

    [Figure: a chain of links A -> B_1 -> B_2 -> ... -> B_k]

then INFORMATION(A) = TEXTINFO(A) + F · (TEXTINFO(B_1) + F · (TEXTINFO(B_2) + F · (... TEXTINFO(B_k)))) = TEXTINFO(A) + F · TEXTINFO(B_1) + F^2 · TEXTINFO(B_2) + ... + F^k · TEXTINFO(B_k), which is just the exponential fading law that we have adopted.

3.4. Multiple links

    Now we turn to the case where there is more than one link in the same web object. For simplicity, we assume the depth is one. So, suppose one has the following situation:

    [Figure: a web object A with links to the web objects B_1, ..., B_n]

    What is the hyper information in this case? The easiest answer, just summing the contribution of every link (i.e. F · TEXTINFO(B_1) + ... + F · TEXTINFO(B_n)), isn't feasible, since we want the hyper information to be bounded.

    This would seem in contradiction with the interpretation of a link as potential information that we have given earlier: if one has many links that can be activated, then one has all of their potential information. However, this paradox is only apparent: the user cannot get all the links at the same time, but has to sequentially select them. In other words, nondeterminism has a cost. So, in the best case the user will select the most informative link, and then the second most informative one, and so on. Suppose for example that the most informative link is B_1, the second one is B_2 and so on (i.e., we have TEXTINFO(B_1) >= TEXTINFO(B_2) >= ... >= TEXTINFO(B_n)). Thus, the hyper information is F · TEXTINFO(B_1) (the user selects the best link) plus F^2 · TEXTINFO(B_2) (the second time, the user selects the second best link) and so on, that is to say

    F · TEXTINFO(B_1) + F^2 · TEXTINFO(B_2) + ... + F^n · TEXTINFO(B_n)

    Nicely, evaluating the score this way gives a bounded function, since for any number of links the sum cannot be greater than F/(1 - F).

    Note that we chose the best sequence of selections, since hyper information is the best "potential" information, so we have to assume the user makes the best choices: we cannot use e.g. a random selection of the links, or even other functions like the average of the contributions of each link, since we cannot impose that every link has to be relevant. For instance, if we did so, accessory links with zero score (e.g. think of the "powered with Netscape" links, http://www.netscape.com/comprod/products/navigator/version-3.0/images/netnow3.gif) would devalue by far the hyper information even in the presence of highly scored links, while those accessory links should simply be ignored (as the above method, consistently, does).
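The multiple-link rule at depth one can be sketched as follows: sort the outgoing links by their textual information (the best "sequence of selections") and fade the i-th best one by F^i, so that with TEXTINFO bounded by 1 the total never exceeds the geometric bound F/(1 - F). The link scores below are hypothetical.

```python
def hyperinfo_depth_one(link_scores, fading=0.75):
    """Hyper information contributed by the links of a single web object.

    The user is assumed to select links in the best possible order, so the
    i-th best link (1-based) is faded by fading**i. Irrelevant links with
    score 0 contribute nothing instead of dragging the value down."""
    ordered = sorted(link_scores, reverse=True)
    return sum(fading ** i * score for i, score in enumerate(ordered, start=1))

scores = [0.8, 0.0, 0.3, 0.5]          # hypothetical TEXTINFO of the pointed objects
print(hyperinfo_depth_one(scores))     # 0.75*0.8 + 0.75**2*0.5 + 0.75**3*0.3 + 0
print(0.75 / (1 - 0.75))               # upper bound F/(1 - F) = 3.0, never exceeded
```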
3.5. The general case

    The general case (multiple links in the same page and arbitrary depth k) is treated according to what we have previously seen. Informally, all the web objects at depth less or equal than k are arranged into the "sequence of selections" such that the corresponding hyper information is the highest one.

    For example, consider the following situation:

    [Figure: a web object A from which the web objects B, C, D and E can be reached]

with F = 0.5, TEXTINFO(B) = 0.4, TEXTINFO(C) = 0.3, TEXTINFO(D) = 0.2, TEXTINFO(E) = 0.6.

    Then via a "sequence of selections" we can go from A to B, to C, to E and then to D, and this is readily the best sequence that maximizes the hyper information, which is 0.5 · TEXTINFO(B) + 0.5^2 · TEXTINFO(C) + 0.5^3 · TEXTINFO(E) + 0.5^4 · TEXTINFO(D) (= 0.3625).

4. Refining the model

    There are a number of subtleties that for clarity of exposition we haven't considered so far.

    The first big problem comes from possible duplications. For instance, consider the case of backwards links: suppose that a user has set up two pages A and B, with a link from A to B, and then a "come back" link from B to A:

    [Figure: two web objects A and B with links A -> B and B -> A]

    This means that we have a kind of recursion, and that the textual information of A is repeatedly added (although with increasingly higher fading) when calculating the hyper information of A (because from A we can navigate to B and then to A and so on...). Another example is given by duplicated links (e.g. two links in a web object pointing to the same web object). The solution to avoid these kinds of problems is not to consider all the "sequences of selections", but only those without repetitions. This is consistent with the intuition that hyper information measures in a sense how far a web object is from a page: if one has already reached an object, s/he has already got its information, and so it makes no sense to get it again.

    Another important issue is that the very definition of link in a web object is far from trivial. A link present in a web object O is said to be active if the web object it points to can be accessed by viewing (url, seq) with an HTML (http://www.w3.org/pub/WWW/MarkUp/) browser (e.g., Netscape Navigator (http://www.netscape.com/comprod/products/navigator/version-3.0/index.html) or Microsoft Internet Explorer (http://www.microsoft.com/ie)). This means, informally, that once we view O with the browser, we can activate the link by clicking on it. The previous definition is rather operational, but it is much more intuitive than a formal technical definition, which could be given by tediously specifying all the possible cases according to the HTML specification (note that a problem complicating a formal analysis is that one cannot assume that seq is composed of legal HTML code, since browsers are error-tolerating).

    Thus, the links mentioned in this paper should be only the active ones.

    Also, there are many different kinds of links, and each of them would require a specific treatment (a small filtering sketch is given after Section 4.1). For instance:
- Local links (links pointing to some point in the same web object, using the #-specifier (http://www.w3.org/pub/WWW/Addressing/rfc1738.txt)) should be readily ignored.
- Frame (http://home.netscape.com/assist/net_sites/frames.html) links should be automatically expanded, i.e. if A has a frame link to B, then this link should be replaced with a proper expansion of B inside A (since a frame link is automatically activated, its pointed web object is just part of the original web object, and the user does not see any link at all).
- Other links that are automatically activated are for instance the image (http://www.w3.org/pub/WWW/TR/REC-html32.html#img) links, i.e. source links of image tags, the background (http://www.w3.org/pub/WWW/TR/REC-html32.html#body) links and so on: they should be treated analogously to frame links. However, since it is usually a very hard problem to recover useful information from images (cf. [6]), in a practical implementation of the hyper information all these links can be ignored (note this doesn't mean that the textual information function has to ignore such links, since it can e.g. take into account the name of the pictures). This also applies to active links pointing to images, movies, sounds etc., which can be syntactically identified by their file extensions .gif, .jpg, .tiff, .avi, .wav and so on.
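Returning to the general case of Section 3.5, one plausible reading of the "sequence of selections" is sketched below: each selection must be a link out of an already-visited object, repetitions are excluded (which also handles the backwards and duplicated links above), and the value of the best sequence is taken. The link graph is hypothetical, since the paper's figure is not reproduced here, but it is chosen so that the best sequence is B, C, E, D and the value matches the 0.3625 of the text; the exhaustive search is adequate only for the small neighborhoods fixed by the depth bound.

```python
def best_sequence_value(links, textinfo, start, fading=0.5, max_selections=4):
    """Hyper information of `start`: the value of the best repetition-free
    "sequence of selections". Each selection must be a link out of an
    already-visited object (the user can navigate back and pick another
    link), and the i-th selected object is faded by fading**i.

    `links` maps a URL to the URLs of its active links; `textinfo` maps a
    URL to its textual information (assumed to lie in [0, 1])."""

    def explore(visited, value, position):
        best = value                        # the user may also stop here
        if position > max_selections:
            return best
        frontier = {t for v in visited for t in links.get(v, ()) if t not in visited}
        for nxt in frontier:
            best = max(best,
                       explore(visited | {nxt},
                               value + fading ** position * textinfo[nxt],
                               position + 1))
        return best

    return explore({start}, 0.0, 1)

# Hypothetical link graph: A -> B, B -> C and D, C -> E,
# with the textual scores of Section 3.5.
links = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": [], "E": []}
scores = {"A": 0.0, "B": 0.4, "C": 0.3, "D": 0.2, "E": 0.6}
print(best_sequence_value(links, scores, "A"))   # about 0.3625, via B, C, E, D
```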
4.1. Search engines persuasion

    A big problem that search engines have to face is the phenomenon of so-called sep (search engines persuasion). Indeed, search engines have become so important in the advertisement market that it has become essential for companies to have their pages listed in top positions of search engines, in order to get a significant web-based promotion. Starting with the already pioneering work of Rhodes [5], this phenomenon is now growing at such a rate as to have provoked serious problems for search engines, and has revolutionized the web design companies, which are now specifically asked not only to design good web sites, but also to make them rank high in search engines. A vast number of new companies was born just to make customer web pages as visible as possible. More and more companies, like Exploit (http://www.exploit.com), Allwilk (http://www.allwilk.com), Northern Webs (http://www.digital-cafe.com/~webmaster/norweb01.htm), Ryley & Associates (http://www.ryley.com), etc., explicitly study ways to make a page rank high in search engines. OpenText (http://www.opentext.com) even arrived at selling "preferred listings", i.e. assuring a particular entry to stay in the top ten for some time, a policy that has provoked some controversy (cf. [9]).

    This has led to a bad performance degradation of search engines, since an increasingly high number of pages is designed to have an artificially high textual content. The phenomenon is so serious that search engines like InfoSeek (http://www.infoseek.com) and Lycos (http://www.lycos.com) have introduced penalties to face the most common of these persuasion techniques, "spamdexing" [4,7,8], i.e. the artificial repetition of relevant keywords. Despite these efforts, the situation is getting worse and worse, since more sophisticated techniques have been developed, which analyze the behavior of search engines and tune the pages accordingly. This has also led to a situation where search engine maintainers tend to assign penalties to pages that rank "too high", and at the same time to provide fewer and fewer details on their information functions just in order to prevent this kind of tuning, thus in many cases penalizing pages of high informative value that were not designed with "persuasion" intentions. For a comprehensive study of the sep phenomenon, we refer to [3].

    The hyper methodology that we have developed is able, to a certain extent, to nicely cope with the sep problem. Indeed, maintainers can keep the details of their TEXTINFO function hidden, and make public the information that they are using a HYPERINFO function.

    The only precaution is to distinguish between two fundamental types of links, assigning different fading factors to each of them. Suppose we have a web object (url, seq). A link contained in seq is called outer if it does not have the same domain (http://www.dns.net/dnsrd/) as url, and inner in the other case. That is to say, inner links of a web object point to web objects in the same site (its "local world", so to say), while outer links point to web objects of other sites (the "outer world").

    Now, inner links are dangerous from the sep point of view, since they are under the direct control of the site maintainer. For instance, a user that wants to artificially increase the hyper information of a web object A could set up a very similar web object B (i.e. such that TEXTINFO(A) ≈ TEXTINFO(B)), and put a link from A to B: this would increase the score of A by roughly F · TEXTINFO(A).

    On the other hand, outer links do not present this problem, since they are out of direct control and manipulation (at least in principle, cf. [2]).

    Thus, one should consequently assign a very low or even null fading factor (F_in) to the inner links, and a reasonably high fading factor (F_out) to the outer links.
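A small sketch of how the link handling of Section 4 and the inner/outer distinction above could be combined: local #-links and links to media files are discarded, the remaining links are classified by comparing hostnames, and each class receives its own fading factor. The hostname comparison and the extension list are simplifying assumptions; frame expansion, subdomains and a more precise notion of "site" are left aside.

```python
from urllib.parse import urljoin, urlparse

MEDIA_EXTENSIONS = (".gif", ".jpg", ".tiff", ".avi", ".wav")

def relevant_links(base_url, hrefs):
    """Drop local #-links and links to images/movies/sounds, resolve the rest."""
    kept = []
    for href in hrefs:
        if href.startswith("#"):
            continue                                   # local link: same web object
        target = urljoin(base_url, href)
        if urlparse(target).path.lower().endswith(MEDIA_EXTENSIONS):
            continue                                   # media link: ignored here
        kept.append(target)
    return kept

def link_fading(base_url, target, f_in=0.0, f_out=0.75):
    """Inner links (same domain as the containing object) get F_in,
    outer links (other sites) get F_out."""
    inner = urlparse(target).hostname == urlparse(base_url).hostname
    return f_in if inner else f_out

base = "http://www.foo.com/index.html"
hrefs = ["#top", "/about.html", "logo.gif", "http://www.bar.org/paper.html"]
for target in relevant_links(base, hrefs):
    print(target, link_fading(base, target))
# /about.html resolves to an inner link (factor 0.0);
# http://www.bar.org/paper.html is an outer link (factor 0.75).
```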
Indeed, in our practical experimentation we saw that setting F_in to zero, i.e. completely omitting the inner link contributions, gave very good results (although the "best" value according to our tests was near 0.1). Setting F_in to zero also gave the advantage of making the implementation of the hyper information considerably faster, since most of the links in web objects are inner. As far as outer links are concerned, we set F_out to 0.75. Again, from our tests it resulted that similar values did not significantly affect the quality of the hyper information.

    We said earlier that the information about the use of the hyper information could be given as a white box to external users (while keeping as a black box the details of the TEXTINFO function). This way, search engines persuasion has the effect of reshaping the web, by considerably improving its connectivity. Indeed, as noticed in [1], at present the inter-connectivity is rather poor, since almost 80% of sites contain no outer link (!), and a relatively small number of web sites is carrying most of the load of hypertext navigation.

    Using hyper information thus forces the sites that want to rank better to improve their connectivity, improving the overall web structure.

5. Testing

    The hyper information has been implemented as a post-processor of the main search engines now available, i.e. remotely querying them, extracting the corresponding scores (i.e., their TEXTINFO function), and calculating the hyper information and therefore the overall information.

    Indeed, as said in the introduction, one of the most appealing features of the hyper methodology is that it can be implemented "on top" of existing scoring functions. Note that, strictly speaking, scoring functions employing visibility, like those of Excite (http://www.excite.com), WebCrawler (http://www.webcrawler.com) and Lycos (http://www.lycos.com), are not pure "textual information". However, this is of little importance: although visibility, as shown, is not a good choice, it provides information which is disjoint from the hyper information, and so we can view such scoring functions as providing purely textual information slightly perturbed (improved?) with some other kind of WWW-based information.

    The search engines for which a post-processor was developed were: Excite (http://www.excite.com), HotBot (http://www.hotbot.com), Lycos (http://www.lycos.com), WebCrawler (http://www.webcrawler.com), and OpenText (http://www.opentext.com). This includes all of the major search engines but AltaVista (http://www.altavista.com) and InfoSeek (http://www.infoseek.com), which unfortunately do not give the user access to the scores, and thus cannot be remotely post-processed. The implemented model included all the accessory refinements seen in Section 4.

    We then conducted some tests in order to see how the values of F_in and F_out affected the quality of the hyper information. The depth and fading parameters can be flexibly employed to tune the hyper information. However, it is important for its practical use that the behaviour of the hyper information is tested when the depth and the fading factors are fixed in advance.

    We randomly selected 25 queries, collected all the corresponding rankings from the aforementioned search engines, and then ran several tests in order to maximize the effectiveness of the hyper information. We arrived at selecting the following global parameters for the hyper information: F_in = 0, F_out = 0.75, and depth one. Although the best values were slightly different (F_in = 0.1, depth two), the differences were almost insignificant. On the other hand, choosing F_in = 0 and depth one had great advantages in terms of execution speed of the hyper information.

    After this setting, we chose another 25 queries, and tested again the quality of the hyper information with these fixed parameters: the results showed that the initial settings also gave extremely good results for this new set of queries.

    While our personal tests clearly indicated a considerable increase of precision for the search engines, there was obviously the need to get external confirmation of the effectiveness of the approach. Also, although the tests indicated that slightly perturbing the fading factors did not substantially affect the hyper information, the quality of the chosen values of the fading factors could somehow have been dependent on our way of choosing queries.
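The post-processing step itself can be sketched as follows: take the top results of a search engine together with their scores (used as the TEXTINFO function), add the hyper contribution of the outer links with F_out = 0.75 and depth one, and re-rank by the resulting overall information. The `outgoing_outer_links` helper and the result format are assumptions for illustration (e.g. the sketches shown earlier could provide them); this is not the interface actually used in the experiments.

```python
def overall_information(url, textinfo, outgoing_outer_links, f_out=0.75):
    """INFORMATION = TEXTINFO + HYPERINFO with depth one and F_in = 0:
    only outer links contribute, each faded according to its rank.
    Pointed objects outside the result list are simplified to score 0."""
    link_scores = sorted((textinfo.get(t, 0.0) for t in outgoing_outer_links(url)),
                         reverse=True)
    hyper = sum(f_out ** i * s for i, s in enumerate(link_scores, start=1))
    return textinfo.get(url, 0.0) + hyper

def hyper_rerank(results, outgoing_outer_links, f_out=0.75, top=10):
    """Re-rank a search engine's results (a list of (url, score) pairs, e.g.
    its top 100) by overall information and return the new top `top` URLs."""
    textinfo = dict(results)
    ranked = sorted(textinfo,
                    key=lambda u: overall_information(u, textinfo,
                                                      outgoing_outer_links, f_out),
                    reverse=True)
    return ranked[:top]
```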
    Therefore, we performed a deeper and more challenging test. We considered a group of thirty persons, and asked each of them to arbitrarily select five different topics to search for in the World Wide Web.

    Then each person had to perform the following test: s/he had to search for relevant information on each chosen topic by inputting suitable queries to our post-processor, and give an evaluation mark to the obtained ranking lists. The evaluation consisted of an integer ranging from 0 (terrible ranking, of no use) to 100 (perfect ranking).

    The post-processor was implemented as a cgi-script: the prompt asks for a query, and returns two ranking lists relative to that query, one per column in the same page (frames (http://home.netscape.com/assist/net_sites/frames.html) are used). In one column there is the original top ten of the search engine. In the other column, there is the top ten of the ranking obtained by: (1) taking the first 100 items of the ranking of the search engine, and (2) post-processing them with the hyper information.

    We made serious efforts to provide each user with the most natural conditions in which to evaluate the rankings. Indeed, to avoid "psychological pressure" we provided each user with a password, so that each user could remotely perform the test on his/her favorite terminal, in any time period (the test process could be frozen via a special button, terminating the session, and resumed whenever wanted), with all the time necessary to evaluate a ranking (e.g. possibly by looking at each link using the browser). Besides the evident beneficial effects on the quality of the data, this also had the side-effect that we didn't have to personally take care of each test, something which could have been extremely time consuming.

    Another extremely important point is that the post-processor was implemented to make the test completely blind. This was achieved in the following way.

    Each time a query was run, the order in which the five search engines had to be applied was randomly chosen. Then, for each of the five search engines the following process occurred:
- The data obtained by the search engine were "filtered" and presented in a standard "neutral" format. The same format was adopted for the post-processed data, so that the user couldn't see any difference between the two rankings but for their links content.
- The columns in which to view the original search engine results and the post-processed results were randomly chosen.
- The users were provided with the only information that we were performing a test on evaluating several different scoring mechanisms, and that the choice of the column in which to view each ranking was random. This also avoided possible "chain" effects, where a user, after having given for a certain number of times higher scores to a fixed column, can be unconsciously led to give it bonuses even if it actually doesn't deserve them.

    Fig. 1 shows an example snapshot of a session, where the selected query is "search engine score": the randomly selected search engine in this case was HotBot (http://www.hotbot.com), and its original ranking (filtered) is in the left column.

    The final results of the test are pictured in Fig. 2. As can be seen, the chart clearly shows (especially in view of the blind evaluation process) that the hyper information is able to considerably improve the evaluation of the informative content.

    The precise data regarding the evaluation test are shown in the next table:

              Excite   HotBot   Lycos   WebCrawler   OpenText   Average
    Normal    80.1     62.2     59.0    54.2         63.4       63.2
    Hyper     85.2     77.3     75.4    68.5         77.1       76.7

    The next table shows the evaluation increment of each search engine w.r.t. its hyper version, and the corresponding standard deviation:

                            Excite   HotBot   Lycos   WebCrawler   OpenText   Average
    Evaluation increment    +5.1     +15.1    +16.4   +14.3        +13.7      +12.9
    Standard deviation      2.2      4.1      3.6     1.6          3.0        2.9

    As can be seen, the small standard deviations are further empirical evidence of the superiority of the hyper search engines over their non-hyper versions.
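The increment row of the second table follows directly from the per-engine averages in the first; a few lines suffice to recompute it. The individual user marks are not available here, so the standard deviations (which were computed from the single evaluations) are not reproduced.

```python
normal = {"Excite": 80.1, "HotBot": 62.2, "Lycos": 59.0,
          "WebCrawler": 54.2, "OpenText": 63.4}
hyper = {"Excite": 85.2, "HotBot": 77.3, "Lycos": 75.4,
         "WebCrawler": 68.5, "OpenText": 77.1}

# Evaluation increment of each hyper search engine over its plain version.
increments = {name: round(hyper[name] - normal[name], 1) for name in normal}
average_increment = round(sum(increments.values()) / len(increments), 1)

print(increments)          # {'Excite': 5.1, 'HotBot': 15.1, 'Lycos': 16.4, ...}
print(average_increment)   # 12.9, matching the "Average" column of the second table
```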
[Fig. 1. Snapshot from a test session.]

[Fig. 2. Evaluation of search engines vs. their hyper versions.]
M. Marchiori/Computer       NetMorks     and ISDN Svstems 29 (1997)           1225-1235                                           1235


6. Conclusion                                                                             M. Marchiori             (mailto:max@math.unipd.it),                 Security       of
                                                                                          World Wide Web search engines, 3rd International                            Confer-
We have presented a new way to considerably improve the quality of search engine evaluations, by considering the "hyper information" of web objects. We have shown that, besides its practical success, the method presents several other interesting features:
It allows the "textual" and the "hyper" components to be examined separately.
It allows current search engines to be improved in a smooth way, since it works "on top" of any existing scoring function.
It can be used to improve the performance of a search engine locally, just by using a post-processor on the client side (as done in this paper); a minimal sketch of such a post-processor is given below.
Analogously, it can be used to improve so-called "meta searchers" like SavvySearch 41, MetaCrawler 42, WebCompass 43, NlightN 44 and so on. These suffer even more from the problem of good scoring, since they have to mix data coming from different search engines, and the result is in most cases not very readable. The method can considerably improve the effectiveness and utility of such meta-engines for the user.
Last but not least, it behaves extremely well with respect to search engines persuasion attacks.
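As a concrete illustration of the last points, the sketch below shows how the overall information can be computed on top of an arbitrary existing scoring function and used to re-rank a result list on the client side. It is only a minimal sketch, written in Python rather than as the CGI post-processor actually used in the experiments, and it assumes the settings reported in the paper (depth one, Fin = 0, i.e. inner links ignored, and Fout = 0.75). The helper names textinfo, links_of, outer_links, hyperinfo, information and rerank are illustrative assumptions, not part of the original implementation.

```python
# Minimal sketch of a client-side "hyper" post-processor (depth one).
# Assumptions: `textinfo(url)` is the textual score that some existing
# search engine assigns to a URL for the current query (bounded by 1),
# and `links_of(url)` returns the active links found in the page at
# that URL.  Both are placeholders, not part of the original system.

from urllib.parse import urlparse


def outer_links(url, links):
    """Keep only links whose domain differs from that of `url` (outer links);
    inner links are ignored, i.e. F_in = 0."""
    domain = urlparse(url).netloc
    return [l for l in links if urlparse(l).netloc != domain]


def hyperinfo(url, links, textinfo, f_out=0.75):
    """Depth-one hyper information: take the textual scores of the outer
    links in decreasing order and fade them exponentially, so the best
    link contributes F_out * score, the second best F_out**2 * score, etc."""
    scores = sorted((textinfo(l) for l in outer_links(url, links)), reverse=True)
    return sum(f_out ** (i + 1) * s for i, s in enumerate(scores))


def information(url, links, textinfo, f_out=0.75):
    """Overall information = textual information + hyper information."""
    return textinfo(url) + hyperinfo(url, links, textinfo, f_out)


def rerank(results, links_of, textinfo, f_out=0.75):
    """Post-process a ranking (a list of URLs) returned by a search engine,
    re-ordering it by overall information instead of textual score alone."""
    return sorted(results,
                  key=lambda u: information(u, links_of(u), textinfo, f_out),
                  reverse=True)
```

The same rerank function would apply unchanged to a merged result list produced by a meta-searcher, since it relies only on the scores and on the outer links of the retrieved pages.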
References                                                                                                                  terests include         World     Wide Web
                                                                                                                            and intranets (information           retrieval.
[1] T. Bray (mailto:tbray@textuality.com), Measuring the Web (http://www5conf.inria.fr/fich_html/papers/P9/Overview.html), 5th International World Wide Web Conference (http://www.w3.org/pub/Conferences/WWW5/), Paris, May 1996.
[2] M. Marchiori (mailto:max@math.unipd.it), The hyper information: theory and practice, Technical Report No. 36, Dept. of Pure & Applied Mathematics, University of Padova, 1996.
[3] M. Marchiori (mailto:max@math.unipd.it), Security of World Wide Web search engines, 3rd International Conference on Reliability, Quality and Safety of Software-Intensive Systems, Athens, Greece, Chapman & Hall (http://www.thomson.com:8866/chaphall/default.html), 1997.
[4] K. Murphy (mailto:kathleen@webweek.com), Cheaters never win (http://www.webweek.com/96May20/undercon/cheaters.html), Web Week (http://www.webweek.com), May 1996.
[5] J. Rhodes (mailto:deadlock@deadlock.com), The Art of Business Web Site Promotion (http://deadlock.com/~deadlock/promote), Deadlock Design (http://deadlock.com/~deadlock), 1996.
[6] S. Sclaroff (mailto:sclaroff@cs.bu.edu), World Wide Web image search engines (http://www.cs.bu.edu/techreports/95-016.www-image-search-engines.ps.z), NSF Workshop on Visual Information Management, Cambridge, MA, 1995.
[7] D. Sullivan (mailto:danny@calafia.com), The Webmaster's Guide to Search Engines and Directories (http://calafia.com/webmasters/), Calafia Consulting (http://calafia.com), 1996.
[8] G. Venditto (mailto:venditto@iw.com), Search engine showdown (http://www.internetworld.com/1996/05/showdown.html), Internet World (http://www.internetworld.com), Vol. 7(5) (http://www.internetworld.com/1996/05/toc.html), May 1996.
[9] N. Wingfield (mailto:nickw@cnet.com), Engine sells results, draws fire (http://www.news.com/News/Item/0,4,1635,00.html), c|net Inc. (http://www.news.com), June 1996.

Massimo Marchiori received the M.S. in Mathematics with Highest Honors, and the Ph.D. in Computer Science, both from the University of Padua. His research interests include the World Wide Web and intranets (information retrieval, search engines, metadata, digital libraries), visualisation, programming languages (constraint programming, visual programming, functional programming, logic programming), genetic algorithms, and rewriting systems. He has published over twenty refereed papers on the above topics in various journals and proceedings of international conferences.




41 http://www.cs.colostate.edu/~dreiling/smartform.html
42 http://www.metacrawler.com
43 http://arachnid.qdeck.com/qdeck/demosoft/lite/
44 http://www.nlightn.com

Marchiori articolo scientifico-1997

  • 1. Computer Networks and ISDN Systems 29 (I 997) 122% 1235 The quest for correct information on the Web: hyper search engines Massimo Marchiori ’ Department of Pure and Applied Mathematics, Universiy of Padova, Ka Beboni 7, 35131 Padova. Ita! Abstract Finding the right information in the World Wide Web is becoming a fundamental problem, since the amount of global information that the WWW containsis growingat an incrediblerate. In this paper,we presenta novel methodto extract from a web object its “hyper” informative content, in contrastwith current searchengines,which only deal with the “textual” informativecontent.This methodis not only valuableper se,but it is shownto be ableto considerablyincrease the precisionof current searchengines, Moreover, it integratessmoothlywith existing search engines technologysinceit can be implemented top of every searchengine,acting asa post-processor, automaticallytransforming search on thus a engineinto its corresponding “hype? version.We also showhow, interestingly,the hyper information can be usefully employed face the search to enginespersuasionproblem. 0 1997 Published by Elsevier Science B.V. Keywords: World Wide Web; Information retrieval; Searchengines;Sep; Browsersand tools; Designprinciplesand techniques; Integrationof heterogeneous informationsources; Searchtechniques 1. Introduction report, the performance of actual search engines is far from being fully satisfactory. While so far the techno- The World Wide Web is growing at phenome- logical progress has made possible to reasonably deal nal rates, as witnessed by every recent estimation. with the size of the web, in terms of number of objects This explosion both of Internet hosts and of peo- classified, the big problem is just to properly classify ple using the web, has made crucial the problem of objects in response to the users’ needs: in other words, managing such enormous amount of information. As to give the user the information s/he needs. market studies clearly indicate, in order to survive In this paper, we argue for a solution to this prob- into this informative jungle, web users have to almost lem, by means of a new measure of the informative exclusively resort on search engines (automatic cata- content of a web object, namely its “hyper informa- logs of the web) and repositories (human-maintained tion”. Informally, the problem with current search en- collections of links usually topics-based). In turn, gines is that they look at a web object to evaluate, repositories are now themselves resorting on search almost just like a piece of text. Even with extremely engines to keep their databases up-to-date. Thus, the sophisticated techniques like those already present in crucial component in the information management is some search engine scoring mechanism, this approach given by search engines. suffers from an intrinsic weakness: it doesn’t take into As all the recent studies on the subject (see e.g. [S]) account the webstructure the object is part of. The power of the web just relies on its capability ’ E-mail: max@math.unipd.it of redirecting the information flow via hyperlinks, so 0169-7552/97/$17.00 0 1997 Published by Elsevier Science B.V. Ail rights reserved. PII SO1 69-7552197100036-6
  • 2. 1226 M. Marchiori / Computer Networks and ISDN Systems 29 ( 1997) 1225-1235 it should appear rather natural that in order to evaluate quences of bytes. The intuition is that for each URL the informative content of a web object, the web struc- we can require from the web structure the corre- ture has to be carefully analyzed. Instead, so far only sponding object (an HTML 3 page, a text file, etc.). very limited forms of analysis of an object in the web The function has to be partial because for some URL space have been used, like “visibility” (the number of there is no corresponding object. links pointing to a web object), and the gain in obtain- In this paper we consider as web structure ing a more precise information has been very little. the World Wide Web structure WWW. Note that In this paper, instead, we develop a natural no- in general the real WWW is a timed structure, tion of “hype? information which properly takes since the URL mapping varies with time (i.e. it into account the web structure an object is part of. should be written as WWW, for each time in- Informally, the “overall information” of a web object stant t). However, we will work under the hy- is not composed only by its static “textual informa- pothesis that the WWW is locally time consis- tion”, but also another “hype? information has to be tent, i.e. that there is a non-void time interval I added, which is the measure of the potential informa- such that the probability that wwW,(ur/) = seq, tion of a web object with respect to the web space. t 2 t’ 2 t + I * WWW,r(url) = seq is extremely Roughly speaking, it measures how much informa- high. That is to say, if at a certain time t an URL tion one can obtain using that page with a browser, points to an object seq, then there is a time interval and navigating starting from it. in which this property stays true. Note this doesn’t Besides its fundamental characteristic, and future mean that the web structure stays the same, since potentials, the hyper information has an immediate new web objects can be added. practical impact: indeed, it works “on top” of any We will also assume that all the operations resort- textual information function, manipulating its “local” ing on the WWW structure that we will employ in this scores to obtain the more complete “global” score of paper, can be performed with extremely high prob- a web object. This means that it provides a smooth ability within time I: this way, we are acting on a integration with existing search engines, since it can “locally consistent” time interval, and thus it is (with be used to considerably improve the performance of extremely high probability) safe to get rid of the dy- a search engine simply by post-processing its scores. namic behavior of the World Wide Web and consider Moreover, we show how the hyper information it as an untimed web structure, as we will do. can nicely deal with the big problem of “search Note that these assumptions are not restrictive, engines persuasion” (tuning pages so to cheat a since empirical observations (cf. [2]) show that search engine, in order to make it give a higher WWW is indeed locally time consistent, e.g. setting rank), one of the most alarming factors of degrade of I = 1 day (in fact, this is one of the essential reasons the performance of most search engines. for which the World Wide Web works). 
This, in turn, Finally, we present the result of an extensive test- also implies that the second hypothesis (algorithms ing on the hyper information, used as post-processor working in at most I time) is not restrictive. in combination with the major search engines, and A web object is a pair (urZ,seq), made up by an show how the hyper information is able to consider- URL url and a sequence of bytes seq = WWW(url). ably improve the quality of information provided by In the sequel, we will usually consider understood search engines. the WWW web structure in all the situations where web objects are considered. As usual in the literature, when no confusion 2. World Wide Web arises we will sometimes talk of a link meaning the pointed web object (for instance, we may talk about In general, we consider an (untimed) web struc- the score of a link meaning the score of the web ture to be a partial function from URLs’ to se- object it points to). ’ http:/Jwww.w3.orgfpubfW?VWfAddressingfrfc1738.txt 3 http:ffwww.w3.orgfpub~~arkUpJ
  • 3. M. Murchiori / Computer Networks and ISDN Systems29 (I 997) 1225-1235 1227 3. Hyper vs. non-hyper tent must be more valuable than ather web objects that have less links pointing to them. This reasoning A great problem with search engines scoring would be correct if all web objects were known to mechanism is that they tend to score text more than users and to search engines. But this assumption is hypertext. Let us explain this better. It is well known clearly false. The fact is that a web object which is that the essential difference between normal texts not known enough, for instance for its location, is and hypertext is the relation-flow of information going to have a very low visibility (when it is not which is possible via hyperlinks. Thus, when evalu- completely neglected by search engines), unregard- ating the informative content of a hypertext it should ing of its informative content which may be by far be kept into account this fundamental, dynamic be- better than other web objects. havior of hypertext. Instead, most of search engines In a nutshell, visibility is likely to be a synony- tend to simply forget the “hype? part of a hypertext mous of popularity, which is something completely (the links), by simply scoring its textual components, different by quality, and thus its usage to give higher i.e. providing the textual information (henceforth, we score by search engines is a rather poor choice. say “information” in place of the more appropriate, but longer, “measure of information”). 3.1. The hyper information Note that ranking in a different way words ap- pearing in the title”, or between headings5, etc., As said, what is really missing in the evaluation like all contemporary search engines do, is not in of the score of a web object is its hyper part, that is contradiction with what said above, since title, head- the dynamic information content which is provided ings and so on are not “hype?‘, but they are simply by hyperlinks (henceforth, simply links). attributes of the text. We call this kind of information hyper in- More formally, the textual information of a web formation: this information should be added to object (url,seq) considers only the textual component the textual information of the web object, giv- seq, not taking into account the hyper information ing its (overall) information in the World Wide given by url and by the underlying World Wide Web Web. We indicate these three kinds of informa- structure. tion as HYPERINFO, TEXTINFO and INFORMATION, A partial exception is considering the so-called respectively. So, for every web object A we visibility (cf. for instance [l]) of a web object. have that INFORMATION(A) = TEXTlNFO(A) + Roughly, the visibility of a web object is the number HYPEIUNFO(A) (note that in general these infor- of other web objects with a link to it. Thus, visibility mation functions depend on a specific query, that is in a sense a measure of the importance of a web is to say they measure the informative content of a object in the World Wide Web context. Excite6 and web object with respect to a certain query: in the WebCrawIer’ have been the first search engines sequel, we will always consider such query to be to provide higher ranks to web objects with high understood). visibility, now followed by Lycos * and Magellan 9. The presence in a web object of links clearly aug- The problem is that visibility says nothing about ments the informative content with the information the informative content of a web object. 
The mis- contained in the pointed web objects (although we leading assumption is that if a web object has a have to establish to what extent). high visibility, then this is a sign of importance and Recursively, links present in the pointed web ob- consideration, and so de facto its informative con- jects further contribute, and so on. Thus, in principle, the analysis of the informative content of a web ob- ject A should involve all the web objects that are ’ http://www.w3.org/pubIWWW/TR/REC-html32.html#title reachable from it via hyperlinks (i.e., “navigating” in ’ http://www.w3.org/pubAVWW/TR/FSC-html32.html#headings the World Wide Web). ’ http:Nwww.excite.com ’ http:Nwww.webcrawler.com This is clearly unfeasible in practice, so, for prac- ’ http://www.lycos.com tical reasons, we have to stop the analysis at a certain 9 http://www.mckinley.com depth, just like in programs for chess analysis we
  • 4. 1228 M. Marchiori/Computer Networks and ISDN Systems 29 (1997) 1225-1235 have to stop considering the moves tree after few that A has almost zero textual information, while moves. So, one should fix a certain upper bound for B has an extremely high overall information. Using the depth of the evaluation. The definition of depth is the naive approach, A would have more informative completely natural: given a web object 0, the (rela- content than B, while it is clear that the user is much tive) depth of another web object 0’ is the minimum more interested in B than in A. number of links that have to be activated (“clicked”) The problem becomes even more evident when in order to reach 0’ from 0. So, saying that the we increase the depth of the analysis: consider a evaluation has max depth k means that we consider situation where Bk is at depth k from A, i.e. to go only the web objects at depth less or equal than k. from A to B one needs k “clicks”, and k is big (e.g. By fixing a certain depth, we thus select a suitable k > 10): finite “local neighborhood” of a web object in the World Wide Web. Now we are faced with the prob- lem of establishing the hyper information of a web object A w.r.t. this neighborhood (the hyper infor- LetA,Bt,... , Bk-I have almost zero infOITXitiVe mation with depth k). We denote this information by content, and Bk have very high overall information. with HYPERINFOtkl, and the corresponding overall Then A would have higher overall information than information (with depth k) with INFORMATIONtkl. Bk, which is paradoxical since A has clearly a too In most of cases, k will be considered understood, weak relation with Bk to be more useful than Bk and so we will omit the [k] subscript. itself. We assume that INFORMATION, TEXTINFO and The problem essentially is that the textual infor- HYPERINFO are functions from web objects to non- mation pointed by a link cannot be considered as negative real numbers. Their intuitive meaning is that actual, since it is potential: for the user there is a the more information a web object has, the greater is cost to retain the textual information pointed by a the corresponding number. link (click and...wait). Observe that in order to have a feasible im- The solution to these two factors is: the contri- plementation, we need that all of these functions bution to the hyper information of a web object at are bounded (i.e., there is a number M such that depth k is not simply its textual information, but M 2 INFORMATION(A), for every A). So, we it is its textual information diminished via a fading assume without loss of generality that TEXTINFO factor depending on its depth, i.e. on “how far” is the has an upper bound of 1, and that HYPERINFO is information for the user (how many clicks s/he has bounded. to perform). Our choice about the law regulating this fading 3.2. Single links function is that textual information fades exponen- tially w.r.t. the depth, i.e. the contribution to the To start with, consider the simple case where we hyper information of A given by an object B at depth have only at most one link in every web object. k is Fk . TEXTINFO( B), for a suitable fading factor F In the basic case when the depth is one, the most (0 < F < 1). complicated case is when we have one link from the Thus, in the above example, the hyper information considered web object A to another web object B, i.e. of A is not TEXTINFO(Bi) + . . . + TEXTINFO(Bk), but F . TEXTINFO( Bt ) + F* . TEXTINFO( &) + . . . + El-43 Fk . 
TEXTINFO(Bk), so that the overall information A naive approach is just to add the textual infor- of A is not necessarily greater than that of Bk. mation of the pointed web object. This idea, more or As an aside, note that when calculating the less, is tantamount to identify a web object with the overall information of a web object A, its tex- web object obtained by replacing the links with the tual information can be nicely seen as a spe- corresponding pointed web objects. cial degenerate case of hyper information, since This approach is attracting, but not correct, since TEXTINFO(A) = F” . TEXTINFO(A) (viz., the ob- it raises a number of problems. For instance, suppose ject is at “zero distance” from itself).
  • 5. h4. Marchiori/Computer Networks and ISDN Systems 29 (1997) 1225-1235 1229 3.3. More motivations link (i.e. F.TEXTINFO(B1)+. . .+F.TEXTlNFO(B,)), isn’t feasible since we want the hyper information to There is still one point to clarify. We have intro- be bounded. duced an exponentially fading law, and seen how it This would seem in contradiction with the inter- behaves well with respect to our expectations. But pretation of a link as potential information that we one could ask why the fading function has to be have given earlier: if one has many links that can be exponential, and not of different form, for instance activated, then one has all of their potential informa- polynomial (e.g., multiplying by F/k instead that by tion. However, this paradox is only apparent: the user Fk). Let us consider the overall information of a web cannot get all the links at the same time, but has to object with depth k: if we have sequentially select them. In other words, nondeter- minism has a cost. So, in the best case the user will 9B- select the most informative link, and then the second more informative one, and so on. Suppose for exam- it is completely reasonable to assume that the over- ple that the more informative link is BI, the second all information of A (with depth k) is given by the textual content of A plus the faded overall informa- one is B2 and so on (i.e., we have TEXTINFO(B1) > TEXTINFO(B2) 2 . 2 TEXTINFO(B,)). Thus, the tion of B (with depth k - 1). It is also reasonable hyper information is F . TEXTINFO(BI) (the user to assume that this fading can be approximated by selects the best link) plus F2 . TEXTINFO(B2) (the multiplying by an appropriate fading constant F second time, the user selects the second best link) (0 -C F < 1). So, we have the recursive relation and so on, that is to say lNFORMATION,kl (A) = F . lNFORMATIONlk- 11(B) F TEXTlNFO(B,) + . . . + F” . TEXTINFO(B,) Since readily for every object A we have Nicely, evaluating the score this way gives a lNFORMATIONLol(A) = TEXTlNFO(A), eliminating bounded function, since for any number of links, the recursion we obtain that if sum cannot be greater than F/(F + 1). Note that we chose the best sequence of selec- tions, since hyper information is the best “potential” then INFORMATION(A) = TEXTlNFO(A) + information, so we have to assume the user does the F . VEXTWO + F . (TEXTINFO(B~) + F . best choices: we cannot use e.g. a random selection (. . .TEXTINFO(Bk)))) = TEXTlNFO(A) + F . of the links, or even other functions like the average TEXTINFO(BI) + F2 . TEXTINFO(B;?) + . . . + Fk . between the contributions of the each link, since we TEXTlNFO(Bk), which is just the exponential fading cannot impose that every link has to be relevant. For law that we have adopted. instance, if we did so, accessory links with zero score (e.g. think of the “powered with Netscape” ‘O-links) 3.4. Multiple links would devalue by far the hyper information even in presence of highly scored links, while those acces- Now we turn to the case where there is more than sory links should simply be ignored (as the above one link in the same web object. For simplicity, we method, consistently, does). assume the depth is one. So, suppose one has the following situation 3.5. The general case ,/. The general case (multiple links in the same / ,/ ?q--~ page and arbitrary depth k) is treated accordingly to /‘./ what previously seen. Informally, all the web objects A fBh-II@ at depth less of equal than k are ordered w.r.t. 
a 0 What is the hyper information in this case? The ” http://www.netscape.com/comprodroducts/navigator/version- easiest answer, just sum the contribution of every 3.0/images/netnow3.gif
  • 6. 1230 M. Marchiori/Computer Networks and ISDN Systems 29 (1997) 1225-1235 “sequence of selections” such that the corresponding Another important issue is that the same defini- hyper information is the highest one. tion of link in a web object is fairly from trivial. A For example, consider the following situation: link present in a web object 0 is said to be active if the web objects it points to can be accessed by viewing (m-&q) with an HTML” browser (e.g., Netscape 12, Navigator l3 or Microsoft l4 Internet Explorer r5). This means, informally, that once we view 0 with the browser, we can activate the link by clicking over it. The previous definition is rather operational, but it is much more intuitive than a with F = OS,TEXTINFO(B) = 0.4,TEXTINFO(C) formal technical definition which can be given by = 0.3,TEXTINFO(D) = 0.2,TEXTINFO(E) = 0.6. tediously specifying all the possible cases according Then via a “sequence of selections” we can go to the HTML specification I6 (note a problem com- from A to B, to C, to E and then to D, and plicating a formal analysis is that one cannot assume this is readily the best sequence that maximizes the that seq is composed by legal HTML code, since hyper information, which is 0.5 . TEXTINFO(B) + browsers are error-tolerating). 0.52~TEXTINFO(C)+0.53.TEXTINFO(E)+0.54. Thus, the links mentioned in the paper should be TEXTINFO(D)(=0.3625). only the active ones. Also, there are many different kinds of links, and each of them would require a specific treatment. For 4. Refining the model instance, Local links (links pointing to some point in the There is a number of subtleties that for clarity of same web object, using the #-specifier 17) should exposition we haven’t considered so far. be readily ignored. The first big problem comes from possible dupli- Frumd8 links should be automatically expanded, cations. For instance, consider the case of backwards i.e. if A has a frame link to B, then this link links: suppose that a user has set up two pages A and should be replaced with a proper expansion of B, with a link in from A to B, and then a “come B inside A (since a frame link is automatically back” link in B from B to A: activated, its pointed web object is just part of the original web object, and the user does not see any link at all). Other links that are automatically activated are for instance the image I9 links, i.e. source links This means that we have a kind of recursion, and of image tags, the background 2o links and so on: that the textual information of A is repeatedly added they should be treated analogously to frame links. (although with increasingly higher fading) when cal- However, since usually it is a very hard problem culating the hyper information of A (this because to recover some useful information from images from A we can navigate to B and then to A and so on...). Another example is given by duplicating ‘I http://www.w3.org/publWWWlMarkUpl links (e.g. two links in a web object pointing to the ” http://www.netscape.com same web object). The solution to avoid these kinds I3 http:l/www.netscape.com/comprod/products/navigator/version- of problems, is not to consider all the “sequence of 3.0/index.html selections”, but only those without repetitions. 
This I4 http://www.microsoft.com I5 http://www.microsoft.com/ie is consistent with the intuition of hyper information ” http:llwww.w3.orglpubiWWWMarkup/ to measure in a sense how much a web object is far I7 http:llwww.w3.orglpub/WWWlAddressinglrfc1738.txt from a page: if one has already reached an object, I8 http://home.netscape.com/assist/net_sites/frames.html s/he already has got its information, and so it makes I9 http:Nwww.w3.orglpubIWWW/TR/RFE-html32.htmHimg no sense to get it again. 2o http:Nwww.w3.org/pub/WWW/TWREC-html32.htmHmody
  • 7. M. Marchiori/Computer Networks and ISDN Svstems 29 (1997) 1225-1235 1231 (cf. [6]), in a practical implementation of the hy- persuasion techniques, “spamdexing” [4,7,8], i.e. the per information all these links can be ignored artificial repetition of relevant keywords. Despite (note this doesn’t mean that the textual informa- these efforts, the situation is getting worst and worst, tion function has to ignore such links, since it since more sophisticated techniques have been devel- can e.g. take into account the name of the pic- oped, which analyze the behavior of search engines tures). This also happens for active links pointing and tune the pages accordingly. This has also led to to images, movies, sounds etc., which can be syn- the situation where search engines maintainers tend tactically identified by their file extensions .gif, to assign penalties to pages that rank “too high”, .jpg, .tiff, .avi, .wav and so on. and at the same time to provide less and less de- tails on their information functions just in order to 4.1. Search engines persuasion prevent this kind of tuning, thus in many cases penal- izing pages of high informative value that were not A big problem that search engines have to face designed with “persuasion” intentions. For a com- is the phenomenon of so-called sep (search engines prehensive study of the sep phenomenon, we refer persuasion). Indeed, search engines have become so to [3]. important in the advertisement market that it has The hyper methodology that we have developed become essential for companies to have their pages is able to a certain extent to nicely cope with the listed in top positions of search engines, in order sep problem. Indeed, maintainers can keep details of to get a significant web-based promotion. Starting their TEXTINFO function hidden, and make public with the already pioneering work of Rhodes [5], the information that they are using a HYPERINFO this phenomenon is now boosting at such a rate to function. have provoked serious problems to search engines, The only precaution is to distinguish between and has revolutioned the web design companies, two fundamental types of links, assigning different which are now specifically asked not only to de- fading factors to each of them. Suppose to have a sign good web sites, but also to make them rank web object (url,seq). A link contained in seqis called high in search engines. A vast number of new com- outer if it has not the same domain” of m-1,and panies was born just to make customer web pages inner in the other case. That is to say, inner links of as visible as possible. More and more companies, a web objects point to web objects in the same site like Exploit 21, Allwilk 22, Northern Webs 23, Ry (its “local world”, so to say), while outer links point ley & Associates 24, etc., explicitly study ways to to web objects of other sites (the “outer world”). rank high a page in search engines. OpenText Now, inner links are dangerous from the sep point arrived even to sell “preferred listings”, i.e. assuring of view, since they are on the direct control of the a particular entry to stay in the top ten for some site maintainer. For instance, a user that wants to time, a policy that has provoked some controversies artificially increase the hyper information of a web (cf. [91>. object A could set up a very similar web object B This has led to a bad performance degradation (i.e. 
such that TEXTINFO(A) % TEXTINFO(B)), and of search engines, since an increasingly high num- put a link from A to B: this would increase the score ber of pages is designed to have an artificially high of A by roughly F . TEXTINFO(A). textual content. The phenomenon is so serious that On the other hand. outer links do not present search engines like InfoSeekz6 and Lycos 27 have this problem since they are out of direct control and introduced penalties to face the most common of this manipulation (at least in principle, cf. [23). Thus, one should consequently assign a very low ” http://www.exploit.com or even null fading factor (Fin) to the inner links, and ” http:Nwww.allwilk.com a reasonably high fading factor (F,,) to the outer 23 http://www.digital-cafe.com/~webmaster/nonvebOl .htm links. Indeed, in our practical experimentations we I4 http://www.ryley.com z http:Nwww.opentext.com saw that setting Fi, to zero, i.e. completely omitting x http:Nwww.infoseek.com ” http://www.lycos.com lx htlp://www.dns.net/dnsrd/
  • 8. 1232 M. Marchiori/Computer Networks and ISDN Systems 29 (1997) 1225-1235 the inner link contributions, gave very good results slightly perturbed (improved?) with some other kind (although the “best” value was according to our of WWW-based information. test was near to 0.1). Setting Fin to zero also gave The search engines for which a post-processor the advantage of making the implementation of the was developed were: Excite 32, HotBot 33, Lycos 34, hyper information quite faster, since most of the links WebCrawler 35, and OpenText 36. This includes all in web objects are inner. As far as outer links are of the major search engines, but for AltaVista 37 and concerned, we set F,,, to 0.75. Again, from our tests InfoSeek 38, which unfortunately do not give the user it resulted that similar values did not significantly access to the scores, and thus cannot be remotely affect the bounty of the hyper information. post-processed. The implemented model included all We said earlier that the information about the use the accessory refinements seen in Section 4. of the hyper information could be given as white We then conducted some tests in order to see how box to the external users (while keeping as black the values of Fin and F,,, affected the quality of the box the details of the TEXTINFO function). This hyper information, The depth and fading parameters way, search engines persuasion has the effect of can be flexibly employed to tune the hyper informa- reshaping the web, by considerably improving its tion. However, it is important for its practical use connectivity. Indeed, as noticed in [I], at present the that the behaviour of the hyper information is tested inter-connectivity is rather poor, since almost 80% of when the depth and the fading factors are fixed in sites contain no outer link (!), and a relatively small advance. number of web sites is carrying most of the load of We randomly selected 25 queries, collected all hypertext navigation. the corresponding rankings from the aforementioned Using hyper information thus forces the sites that search engines, and then run several tests in order to want to rank better to improve their connectivity, maximize the effectiveness of the hyper information. improving the overall web structure. We arrived to select the following global parameters for the hyper information: Fin = 0, F,,, = 0.75, and depth one. Although the best values were slightly 5. Testing different (Fi, = 0.1, depth two), the differences were almost insignificant. On the other hand, choosing The hyper information has been implemented as Fi, = 0 and depth one had great advantages in terms post-processor of the main search engines now avail- of execution speed of the hyper information. able, i.e. remotely querying them, extracting the cor- After this setting, we chose other 25 queries, and responding scores (i.e., their TEXTINFO function), tested again the bounty of the hyper information with and calculating the hyper information and therefore these fixed parameters: the results showed that these the overall information. initial settings also gave extremely good results for Indeed, as said in the introduction, one of the most these new set of queries. appealing features of the hyper methodology is that While our personal tests clearly indicated a con- it can be implemented “on top” of existing scoring siderable increasing of precision for the search en- functions. 
Note that, strictly speaking, scoring func- gines, there was obviously the need to get external tions employing visibility, like those of Excite29, confirmation of the effectiveness of the approach. WebCrawler 3o and Lycos 3’, are not pure “textual Also, although the tests indicated that slightly per- information”. However, this is of little importance: turbing the fading factors did not substantially affect although visibility, as shown, is not a good choice, the hyper information, the bounty of the values of the it provides information which is disjoint to the hy- per information, and so we can view such scoring 32 http://www.excite.com functions like providing purely textual information 33 http://www.hotbot.com 34 http://www.lycos.com 35 http://www.webcrawler.com *’ http://www.excite.com 36 http://www.opentext.com 3o http://www.webcrawler.com 37 http://www.altavista.com 31 http://www.lycos.com 38 http:l/www.infoseek.com
fading factors could somehow have been dependent on our way of choosing queries.

Therefore, we performed a deeper and more challenging test. We considered a group of thirty persons, and asked each of them to arbitrarily select five different topics to search on in the World Wide Web. Each person then had to perform the following test: s/he had to search for relevant information on each chosen topic by inputting suitable queries to our post-processor, and to give an evaluation mark to the obtained ranking lists. The evaluation consisted of an integer ranging from 0 (terrible ranking, of no use) to 100 (perfect ranking).

The post-processor was implemented as a cgi-script: the prompt asks for a query, and returns two ranking lists relative to that query, one per column in the same page (frames are used; cf. http://home.netscape.com/assist/net_sites/frames.html). In one column there is the original top ten of the search engine. In the other column there is the top ten of the ranking obtained by: (1) taking the first 100 items of the ranking of the search engine, and (2) post-processing them with the hyper information.

We made serious efforts to provide each user with the most natural conditions in which to evaluate the rankings. Indeed, to avoid "psychological pressure" we provided each user with a password, so that each user could remotely perform the test on his/her favorite terminal, in any time period (the test process could be frozen via a special button, terminating the session, and resumed whenever wanted), with all the time necessary to evaluate a ranking (e.g. possibly by looking at each link using the browser). Besides the evident beneficial effects on the quality of the data, this also had the side-effect that we did not have to personally take care of each test, something which could have been extremely time consuming.

Another extremely important point is that the post-processor was implemented so as to make the test completely blind. This was achieved in the following way. Each time a query was run, the order in which the five search engines had to be applied was randomly chosen. Then, for each of the five search engines the following process occurred:
• The data obtained by the search engine were "filtered" and presented in a standard "neutral" format. The same format was adopted for the post-processed data, so that the user could not see any difference between the two rankings but for their links content.
• The columns in which to view the original search engine results and the post-processed results were randomly chosen.
• The only information provided to the users was that we were performing a test to evaluate several different scoring mechanisms, and that our choice of the column in which to view each ranking was random. This also avoided possible "chain" effects, where a user, after having given higher scores to a fixed column a certain number of times, can be unconsciously led to give it bonuses even if it actually does not deserve them.

Fig. 1 shows an example snapshot of a session, where the selected query is "search engine score": the randomly selected search engine in this case was HotBot (http://www.hotbot.com), and its original ranking (filtered) is in the left column.

The final results of the test are pictured in Fig. 2. As can be seen, the chart clearly shows (especially in view of the blind evaluation process) that the hyper information is able to considerably improve the evaluation of the informative content. The precise data regarding the evaluation test are shown in the next table:

                         Excite   HotBot   Lycos   WebCrawler   OpenText   Average
  Normal                   80.1     62.2    59.0         54.2       63.4      63.8
  Hyper                    85.2     77.3    75.4         68.5       77.1      76.7

The next table shows, for each search engine, the evaluation increment of its hyper version w.r.t. the normal version, together with the corresponding standard deviation:

                         Excite   HotBot   Lycos   WebCrawler   OpenText   Average
  Evaluation increment     +5.1    +15.1   +16.4        +14.3      +13.7     +12.9
  Standard deviation        2.2      4.1     3.6          1.6        3.0       2.9

As can be seen, the small standard deviations are further empirical evidence of the superiority of the hyper search engines over their non-hyper versions.
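To make the set-up above concrete, here is a minimal sketch, in Python, of the blind post-processing protocol. It is not the actual cgi-script used for the tests; engine_top and hyper_score are hypothetical stand-ins for querying a search engine and for computing the hyper information of a web object.

    import random

    # Hypothetical stand-ins (not from the paper): engine_top(engine, query, n)
    # returns the engine's top-n result URLs for a query; hyper_score(url, query)
    # returns the hyper information score of the pointed web object.
    def engine_top(engine, query, n):
        raise NotImplementedError

    def hyper_score(url, query):
        raise NotImplementedError

    ENGINES = ["Excite", "HotBot", "Lycos", "WebCrawler", "OpenText"]

    def blind_trial(query):
        """For one query, build the anonymized ranking pairs shown to an evaluator."""
        engines = ENGINES[:]
        random.shuffle(engines)                 # engines are applied in random order
        trials = []
        for engine in engines:
            original = engine_top(engine, query, 100)   # step (1): first 100 items
            # step (2): re-rank those items by their hyper information
            reranked = sorted(original, key=lambda url: hyper_score(url, query),
                              reverse=True)
            columns = [original[:10], reranked[:10]]
            random.shuffle(columns)             # which column shows which ranking is random
            trials.append((engine, columns))    # both columns use the same neutral format
        return trials

Randomizing both the order of the engines and the column assignment is what prevents an evaluator from telling the post-processed ranking apart from the original one.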
Fig. 1. Snapshot from a test session.

Fig. 2. Evaluation of search engines vs. their hyper versions.
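For reference, the figures summarized in the tables above (and plotted in Fig. 2) can be recomputed from the raw evaluation marks with a few lines of code. The sketch below assumes that the reported standard deviation is that of the per-test increments; the marks it contains are made-up placeholders, since the individual evaluations are not reported here.

    from statistics import mean, stdev

    # Made-up placeholder marks (the real per-user data are not reported here):
    # for each engine, a list of (normal, hyper) evaluation pairs, one per test.
    marks = {
        "Excite":     [(82, 88), (78, 83), (80, 85)],
        "HotBot":     [(60, 75), (64, 80), (62, 77)],
        "Lycos":      [(58, 74), (60, 77), (59, 75)],
        "WebCrawler": [(55, 70), (53, 67), (54, 68)],
        "OpenText":   [(62, 76), (65, 79), (63, 76)],
    }

    for engine, pairs in marks.items():
        increments = [hyper - normal for normal, hyper in pairs]
        print(f"{engine:12s} normal={mean(n for n, _ in pairs):5.1f} "
              f"hyper={mean(h for _, h in pairs):5.1f} "
              f"increment={mean(increments):+5.1f} stdev={stdev(increments):4.1f}")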
6. Conclusion

We have presented a new way to considerably improve the quality of search engine evaluations by considering the "hyper information" of web objects. We have shown that, besides its practical success, the method presents several other interesting features:
• It allows one to focus separately on the "textual" and "hyper" components.
• It allows one to improve current search engines in a smooth way, since it works "on top" of any existing scoring function.
• It can be used for locally improving the performance of search engines, just by using a local post-processor on the client side (as done in this paper).
• Analogously, it can be used to improve so-called "meta searchers", like SavvySearch (http://www.cs.colostate.edu/~dreiling/smartform.html), MetaCrawler (http://www.metacrawler.com), WebCompass (http://arachnid.qdeck.com/qdeck/demosoft/lite/), NlightN (http://www.nlightn.com) and so on. These suffer even more from the problem of good scoring, since they have to mix data coming from different search engines, and the result is in most cases not very readable. The method can considerably improve the effectiveness and utility for the user of such meta-engines (a minimal sketch of this use is given after this list).
• Last but not least, it behaves extremely well with respect to search engines persuasion attacks.
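As an illustration of the meta-search point in the list above, a sketch of such a post-processor is given below; engine_top and hyper_score are the same hypothetical stand-ins used in the earlier sketch, and this is only an illustration of the idea, not an implementation used in the paper.

    def engine_top(engine, query, n):
        # Hypothetical stand-in (as before) for querying one search engine.
        raise NotImplementedError

    def hyper_score(url, query):
        # Hypothetical stand-in for the hyper information measure of a web object.
        raise NotImplementedError

    def hyper_meta_search(query, engines, per_engine=100, top=10):
        """Merge several engines' result lists and rank the merged list by hyper information."""
        merged = set()
        for engine in engines:
            merged.update(engine_top(engine, query, per_engine))
        # Ranking the merged list with a single measure sidesteps the problem of
        # reconciling the heterogeneous scores returned by the individual engines.
        return sorted(merged, key=lambda url: hyper_score(url, query), reverse=True)[:top]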
Massimo Marchiori received the M.S. in Mathematics with Highest Honors, and the Ph.D. in Computer Science, both from the University of Padua. His research interests include the World Wide Web and intranets (information retrieval, search engines, metadata, digital libraries), visualisation, programming languages (constraint programming, visual programming, functional programming, logic programming), genetic algorithms, and rewriting systems. He has published over twenty refereed papers on the above topics in various journals and proceedings of international conferences.