Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload

Krishna P. Gummadi, Richard J. Dunn, Stefan Saroiu, Steven D. Gribble, Henry M. Levy, and John Zahorjan

Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
{gummadi,rdunn,tzoompy,gribble,levy,zahorjan}@cs.washington.edu



ABSTRACT

Peer-to-peer (P2P) file sharing accounts for an astonishing volume of current Internet traffic. This paper probes deeply into modern P2P file sharing systems and the forces that drive them. By doing so, we seek to increase our understanding of P2P file sharing workloads and their implications for future multimedia workloads. Our research uses a three-tiered approach. First, we analyze a 200-day trace of over 20 terabytes of Kazaa P2P traffic collected at the University of Washington. Second, we develop a model of multimedia workloads that lets us isolate, vary, and explore the impact of key system parameters. Our model, which we parameterize with statistics from our trace, lets us confirm various hypotheses about file-sharing behavior observed in the trace. Third, we explore the potential impact of locality-awareness in Kazaa.

Our results reveal dramatic differences between P2P file sharing and Web traffic. For example, we show how the immutability of Kazaa's multimedia objects leads clients to fetch objects at most once; in contrast, a World-Wide Web client may fetch a popular page (e.g., CNN or Google) thousands of times. Moreover, we demonstrate that: (1) this "fetch-at-most-once" behavior causes the Kazaa popularity distribution to deviate substantially from Zipf curves we see for the Web, and (2) this deviation has significant implications for the performance of multimedia file-sharing systems. Unlike the Web, whose workload is driven by document change, we demonstrate that clients' fetch-at-most-once behavior, the creation of new objects, and the addition of new clients to the system are the primary forces that drive multimedia workloads such as Kazaa. We also show that there is substantial untapped locality in the Kazaa workload. Finally, we quantify the potential bandwidth savings that locality-aware P2P file-sharing architectures would achieve.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Modeling Techniques; C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed applications

General Terms

Measurement, performance, design

Keywords

Peer-to-peer, multimedia workloads, Zipf's law, modeling, measurement

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.

1. INTRODUCTION

A decade after its birth, the Internet continues its rapid and often surprising evolution. Recent studies have shown a dramatic shift of Internet traffic away from HTML text pages and images and towards multimedia file sharing. For example, a March 2000 study at the University of Wisconsin found that the bandwidth consumed by Napster had edged ahead of HTTP bandwidth [28]. Only two years later, a University of Washington study showed that peer-to-peer file sharing dominates the campus network, consuming 43% of all bandwidth compared to only 14% for WWW traffic [29]. When comparing these bandwidth consumption statistics to those in a 1999 study of a similar campus workload [37], the portion of network bytes ascribed to audio and video increased by 300% and 400%, respectively, over a 3-year period. Without question, multimedia file sharing has become a dominant factor in today's Internet. In all likelihood, it will dominate the Internet of the future, barring a chilling effect from legal assaults.

Two traits of today's file-sharing systems distinguish them from Web-based content distribution. First, current file-sharing systems use a "peer-to-peer" design: peers voluntarily provide resources as well as consume them. Because of this, the system must dynamically adapt to maintain service continuity as individual peers come and go. Second, file sharing is being used predominantly to distribute multimedia files, and as a result, file-sharing workloads differ substantially from Web workloads [29]. Multimedia files are large (several megabytes or gigabytes, compared to several kilobytes for typical Web pages), they are immutable, and there are currently far fewer distinct multimedia files than
Web pages. Video-on-demand (VoD) systems also distribute multimedia files; we contrast our Kazaa measurements and analysis with related work on VoD systems in Section 3.

This paper presents an in-depth analysis of a modern P2P multimedia file-sharing workload, considering the "peer-to-peer" and the "multimedia" aspects of the workload independently. Our goals are:

1. to understand the fundamental properties of multimedia file-sharing systems, independent of the design of the delivery system,

2. to explore the forces driving P2P file-sharing workloads so we can anticipate the potential impacts of future change, and

3. to demonstrate that significant opportunity exists to optimize performance in current file-sharing systems by exploiting untapped locality in the workload.

To meet these goals, we employ several approaches. First, we analyze a 200-day trace of Kazaa [19] traffic that we collected at the University of Washington between May and December of 2002. We traced over 20 terabytes of traffic during that period, from which we distill several key lessons about the Kazaa workload. Second, we derive a model of this multimedia traffic based on our analysis. The model helps to explain the root causes of many of the trends shown by Kazaa and to predict how the trends may change as the workload evolves. Third, we use trace-driven simulation to quantify the significant potential that exists to improve the performance of multimedia file-sharing in our environment.

Our analysis reveals that the Kazaa workload is driven by considerably different forces than the Web. Kazaa objects are immutable, and as a result, the vast majority of its objects are fetched at most once per client; in contrast, Web pages (e.g., Google or CNN) can be fetched thousands of times per client. Our measurements show that the popularity distribution of Kazaa objects deviates substantially from the Zipf curves we commonly see for the Web, and our model confirms that the "fetch-at-most-once" behavior of Kazaa clients is the cause. Our model demonstrates another consequence of object immutability: unlike the Web, whose workload is driven by document change, the primary forces in Kazaa are the creation of new objects and the addition of new users. Without these forces, fetch-at-most-once behavior would drive the system to stagnation.

The structure of this paper follows the multi-tiered approach cited above. Section 2 describes our trace methodology and presents our trace-based analysis of Kazaa. In Section 3, we analyze the popularity distribution of Kazaa requests and the forces that shape it. Section 4 uses our observations and analysis to develop an analytical model; we then use the model to explore the processes that drive Kazaa's behavior in greater depth. Section 5 considers the performance potential of bandwidth-saving techniques suggested by our modeling and analysis. We describe research related to our study in Section 6, while Section 7 summarizes our results and presents conclusions.

2. THE MEASURED PROPERTIES OF P2P FILE-SHARING WORKLOADS

This section uses our trace data to identify key properties of the Kazaa multimedia file-sharing system. Recent studies have described high-level characteristics of P2P workloads [6, 29, 30, 32]. Our goals in this section are to: (1) dig beneath these high-level studies to uncover the processes that drive the workloads, and (2) demonstrate ways in which these processes fundamentally differ from those of the Web.

2.1 Trace Methodology

The data in this section are based on a 200-day trace of Kazaa peer-to-peer file-sharing traffic collected at the University of Washington between May 28th and December 17th, 2002. The University of Washington (UW) is a large campus with over 60,000 faculty, students, and staff. Table 1 describes the trace, which saw over 20 terabytes of incoming data resulting from 1.6 million requests. Our trace was long enough for us to observe seasonal traffic variations, including the end of spring quarter in June, the summer months, and the start and end of the fall quarter. We also observed the impact of bandwidth rate-limiting instituted by the university's networking organization midway through the trace in an attempt to control the cost of Kazaa traffic.¹

¹ The imposed rate limits bounded upload traffic out of the university's dormitory population and had little effect on download traffic (to the dorms or to the university as a whole), which is the focus of our research.

    trace length                     203 days, 5 hours, 6 minutes
    # of requests                    1,640,912
    # of transactions                98,997,622
    # of unsuccessful transactions   65,505,165 (66.2%)
    average transaction size         252KB (all transactions)
                                     752KB (successful transactions only)
    # of users                       24,578
    # of unique objects              633,106 (totaling 8.85TB)
    bytes transferred                22.72TB
    content demanded                 43.87TB

Table 1: Kazaa trace summary statistics, 5/28/02 to 12/17/02. A transaction refers to a single Kazaa HTTP transfer; a request refers to the set of transactions a client issues while downloading an object, potentially in pieces from many servers. Clients are identified by Kazaa username. We only present statistics for downloads made by university-internal clients for data on university-external peers.

We collected our trace using hardware and software installed at the network border between the university and the Internet. Our hardware consists of a 2.0 GHz Pentium-III workstation that monitored traffic flowing between the university and the Internet. Our workstation had sufficient CPU and network capacity to ensure that no packets were dropped, even during peak load. An adjacent workstation acted as a one-terabyte file store for archiving trace data. Our software used a kernel packet filter [24] to deliver TCP packets to a user-level process, which identified HTTP requests within the TCP flows. Throughout our trace, the packet filter reported packet drop rates of less than 0.0001%. We made all sensitive information anonymous – including IP addresses, URLs, usernames, and object names – before compressing and storing the trace. Overall, our tracing and analysis software consists of over 30,000 lines of code.
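As an illustration of the anonymization step only (not the actual tracing code), the sketch below replaces the sensitive fields of one logged transaction with keyed hashes before the record is archived; the field names, the key, and the HMAC-SHA1 scheme are assumptions made for the example.

    import hmac
    import hashlib

    SECRET_KEY = b"per-trace secret"   # assumption: a key used only during the anonymization pass

    def anonymize(value: str) -> str:
        # Replace a sensitive field with a keyed hash: equal inputs map to equal
        # outputs (so per-user and per-object statistics still work), but the
        # original value is not recoverable without the key.
        return hmac.new(SECRET_KEY, value.encode(), hashlib.sha1).hexdigest()[:16]

    def anonymize_transaction(txn: dict) -> dict:
        # Anonymize the sensitive fields of one logged HTTP transaction record.
        out = dict(txn)
        for field in ("client_ip", "server_ip", "username", "url", "object_name"):
            if field in out:
                out[field] = anonymize(str(out[field]))
        return out

    # Example: one hypothetical transaction record.
    print(anonymize_transaction({
        "client_ip": "10.0.0.1", "username": "alice",
        "object_name": "song.mp3", "bytes": 3500000, "status": 200,
    }))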
Our hardware monitored all incoming and outgoing traffic. However, the data presented in this paper (including Table 1) are for one direction only: requests made by university-internal peers to download data stored on university-external peers. This unidirectional trace captures all requests issued by a stable, complete user population over a period of time. Kazaa control traffic, which consists primarily of queries and their responses, is encrypted and was not captured as part of our trace.

Throughout this paper, the term "user" refers to a person and "client" refers to the application instance running on behalf of a user. We assume there is largely a one-to-one correspondence between users and specific application instances in our environment (although this may not always be true); therefore, we draw conclusions about users based on observations of clients in our trace. Note, however, that client-side caches may absorb some requests from users, meaning that the client request rate, which we observe in our trace, may be lower than the true user request rate, which we cannot directly observe.

Kazaa clients supply Kazaa-specific "usernames" as an HTTP header in each transaction. We use these usernames (rather than IP addresses) to distinguish between different users in our trace. Unfortunately, an unofficial version of Kazaa, called "KazaaLite,"² became popular during our tracing period and is compiled with a predefined username embedded in the application itself. We special-case requests from KazaaLite, falling back to IP addresses to distinguish between KazaaLite users. Although DHCP is used in portions of our campus, and identifying users by IP address is known to have issues when DHCP is present [6], only 5.7% of transactions in our trace were from KazaaLite clients. Furthermore, KazaaLite clients did not appear within the first 59 days of our 203-day trace.

² Many more unofficial versions of Kazaa that use generic usernames have appeared since our trace period finished; precisely distinguishing between peer-to-peer users will become very difficult, given that neither IP addresses nor application-specific usernames are unique.

Kazaa file-transfer traffic consists of unencrypted HTTP transfers; all transfers include Kazaa-specific HTTP headers (e.g., "X-Kazaa-IP"). These headers make it simple to distinguish between Kazaa activity and other HTTP activity. They also provide enough information for us to identify precisely which object is being transferred in a given transaction. When a client attempts to download an object, that object may be downloaded in pieces (often called "chunks") from several sources over a long period of time. We define a "transaction" to be a single HTTP transfer between a client and a server, and a "request" to be the set of transactions a client participates in to download an entire object. A failed transaction occurs when a client successfully contacts a remote peer, but that remote peer does not return any data, instead returning an HTTP 500 error code.

A single request may span many minutes, hours, or even days, since the Kazaa client software will continue to attempt to download the object long after the user has asked for it. Occasionally, a client may download only a subset of the entire object (either because the user gives up or because the object becomes permanently inaccessible in the middle of a request). We call this a "partial request."

The Kazaa application has an auto-update feature, meaning that a running instance of Kazaa will periodically check for updated versions of itself. If found, it downloads the new executable over the Kazaa network. We chose to filter out these auto-update transactions from our logs, as they are not representative of multimedia requests from users. Such filtering removed 0.4% of total transactions (0.3% of bytes) from our trace.

Figure 1: Users are patient. A CDF of object transfer times in Kazaa for small (<10 MB) and large (100 MB+) objects. The X-axis is on a log-scale (download latency, from 5 minutes to 150 days).

2.2 User Characteristics

Our first slice through the trace data focuses on the properties of Kazaa users. Previous studies [29, 2] have shown that peer-to-peer users in general are "greedy" (i.e., most users consume data but provide little in return) and have poor availability [30]. We confirm some of these characteristics, but we also explore others, such as user activity.

2.2.1 Kazaa Users Are Patient

As any Web-based enterprise knows, users are very sensitive to page-fetch latency. Beyond a certain small threshold (measured in seconds), they will abandon a site and move to another, possibly competing, site. For this reason, many online businesses engage services such as Keynote [20] to tell them quickly if their servers are not sufficiently responsive. In the world of the Web, users expect instant gratification and are unforgiving if they do not receive it.

In this context, the behavior of Kazaa users is surprising. Figure 1 shows the distribution of transfer times in Kazaa; transfer time is defined as the difference between the start time of the first transaction and the end time of the last transaction of a given user request. We filtered out partial requests (i.e., we only counted transfers for which the user eventually obtained the entire object). To deal with edge effects, we ignored requests for which at least one transaction occurred during the first month of the trace; note that this will tend to result in an underestimate of user patience. We present our results in terms of requests for "small" objects (less than 10MB, typically audio files) and requests for "large" objects (more than 100MB, typically video files). As we will show in Section 2.3.1, this is a natural and representative way to decompose the overall workload.
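A minimal sketch of this measurement, assuming the trace has already been reduced to per-transaction tuples (client, object, start, end, bytes) and a table of full object sizes; the completeness test used to discard partial requests is a simplification for illustration.

    from collections import defaultdict

    MB = 2 ** 20

    def request_transfer_times(transactions, object_sizes):
        # Group transactions into requests: one request per (client, object) pair.
        # Each transaction is a (client, obj, start, end, nbytes) tuple; times are
        # seconds since the start of the trace.
        requests = defaultdict(list)
        for client, obj, start, end, nbytes in transactions:
            requests[(client, obj)].append((start, end, nbytes))

        small, large = [], []
        for (client, obj), txns in requests.items():
            delivered = sum(nbytes for _, _, nbytes in txns)
            if delivered < object_sizes[obj]:
                continue                      # partial request: filter it out
            # Transfer time = end of last transaction - start of first transaction.
            duration = max(end for _, end, _ in txns) - min(start for start, _, _ in txns)
            if object_sizes[obj] < 10 * MB:
                small.append(duration)
            elif object_sizes[obj] > 100 * MB:
                large.append(duration)
        return sorted(small), sorted(large)   # sorted durations are ready for a CDF plot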
The results show incredible patience on the part of Kazaa users. Even for small objects, 30% of the requests take over an hour, and 10% take nearly a day. For large requests, less than 10% complete in an hour, 50% take more than a day, and nearly 20% of users are willing to wait a week for their downloads to complete! From this graph, it is clear that the dynamics of multimedia downloads differ totally from Web usage. The Web is an interactive system where users want an immediate response.
Kazaa is a batch-mode delivery system, where downloads are made in the background and the content is examined later. Users do not wait for their content to arrive. They go about their business and eventually return to review the content after it has been received.

Figure 2: Bytes requested by the population and attrition as a function of age. "Older" clients request a smaller fraction of bytes than newer clients. There are fewer old clients than young clients, but attrition occurs at a more gradual rate than the slowdown in bytes requested. (TB requested and population size, plotted per week of client age.)

Figure 3: Older clients have slower request rates. On average, clients use the system equally often as they age (having approximately a 50% chance of using the system any given week), but they request less data per session as they age. Note that the point corresponding to new clients (week 1) is artificially high, since by definition every new client requests an object immediately. (Data requested per live client, in GB, and probability that a client makes >0 requests, plotted per week of client age.)

2.2.2 Users Slow Down As They Age

An interesting question we explored is how user interest in Kazaa varies over time. Do users become hungrier for content as they gain experience with Kazaa? Are their request rates relatively constant? Do they lose interest over time? The answers to such questions significantly affect the growth of the system.

To understand user behavior over time, we first calculated the average number of bytes consumed by clients as a function of age. The methodology for this measurement is complex: our trace has finite length, so we must avoid end effects that would overcount short-lived or undercount long-lived users. We compensate, first, by counting transferred bytes only from clients whose "births" we could observe. Because there are no detectable birth events in our trace, we used the heuristic of treating the first observed download from a client as a birth event if at least a full month had elapsed in our trace before seeing that first download. To compensate at the end of the trace, we counted bytes only from clients born prior to the last 11 weeks of the trace. Because of this "end threshold," we could draw definitive conclusions about clients' behavior only during the first 11 weeks of their lifetimes.
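The bookkeeping behind this heuristic can be sketched as follows, assuming each download has been reduced to a (client, time, bytes) tuple with times in seconds from the start of the trace; treating a "month" as 30 days is an assumption of the example.

    from collections import defaultdict

    WEEK = 7 * 24 * 3600
    MONTH = 30 * 24 * 3600          # assumption: a "month" is 30 days

    def bytes_by_age(downloads, trace_start, trace_end, weeks=11):
        # downloads: iterable of (client, time, nbytes) tuples, one per download.
        first_seen = {}
        per_client = defaultdict(list)
        for client, t, nbytes in downloads:
            per_client[client].append((t, nbytes))
            if client not in first_seen or t < first_seen[client]:
                first_seen[client] = t

        demand = [0] * weeks
        for client, birth in first_seen.items():
            # Count a client only if its "birth" is observable (a full month of
            # trace preceded its first download) and it was born before the
            # end threshold (the last `weeks` weeks of the trace).
            if birth - trace_start < MONTH or trace_end - birth < weeks * WEEK:
                continue
            for t, nbytes in per_client[client]:
                age_week = int((t - birth) // WEEK)
                if age_week < weeks:
                    demand[age_week] += nbytes
        return demand    # demand[k] = bytes requested by clients during week k+1 of life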
Figure 2 shows the total number of bytes requested by the population as a function of its age. From this graph, we can see that older clients consume fewer bytes than newer clients. There are two reasons for this effect: (1) attrition reduces the number of older clients, since clients may "die" (i.e., leave the system forever) over time, and (2) some clients may continue to issue requests but do so at a slower rate as they age. We explore each of these in turn.

Attrition. To understand attrition in the system, we analyzed the number of clients that remain alive as a function of age (also shown in Figure 2). Population size declines at a more gradual rate than bytes requested, at least over the first 11 weeks of clients' lifetimes. Attrition therefore only partially explains why older clients demand less in aggregate from the system. To fully explain this phenomenon, clients must also slow down their request rates as they age.

Slowing down over time. Older clients may have slower request rates for two reasons: (1) they may use the system less often, or (2) they may ask for less when they use the system. Figure 3 shows that clients are equally likely to use the system regardless of age: on average, clients have about a 50% chance of making a request on any given week. Older clients slow down because they ask for less each time they use the system, not because they use the system less often.

In summary, new clients generate most of the load in Kazaa, and older clients consume fewer bytes as they age. In part, this is because of attrition: clients leave the system permanently as they grow older. Also, older clients tend to interact with the system at a constant rate, but ask for less during each interaction.

2.2.3 Client Activity

Quantifying the availability of clients in a peer-to-peer system is notoriously difficult [6]: no one metric can accurately capture availability in this environment, since any individual client might exist only for a fraction of the traced time period. Given our passive tracing methodology, we faced an additional methodological problem: we can detect that users are participating in the system only when their clients transfer data (either by downloading or uploading files). If a client is on-line but not active, we cannot observe it. Because of this, we report statistics about client activity only, which is a lower bound on availability.³

³ P2P software is often designed to make it difficult to close the program once it starts, "fooling" users into making their clients more available than intended. Accordingly, we suggest that client activity is a more universally comparable and stable indicator of "availability" than other metrics.

We use two specific metrics to quantify the amount of client transfer activity: (1) activity fraction, which measures the fraction of time a client is transferring content over the client's lifetime or over the duration of the entire trace, and (2) average session length, in which a session is defined as an unbroken period of time during which a client has one or more active transactions. Average session length measures the typical duration of the periods during which a client is receiving or transmitting data. Our measurements indicate that the distributions of average session length and activity fraction over the measured population are heavy-tailed.
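A sketch of how these two metrics follow from a client's transaction intervals; the interval representation and the per-client invocation are assumptions for illustration.

    def sessions(intervals):
        # Merge a client's transaction intervals [(start, end), ...] into sessions:
        # maximal unbroken periods with at least one active transaction.
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        return merged

    def activity_stats(intervals, trace_length):
        # Average session length and activity fraction for one client; assumes the
        # client has at least one transaction in the trace.
        sess = sessions(intervals)
        active = sum(end - start for start, end in sess)
        lifetime = sess[-1][1] - sess[0][0]
        return {
            "avg_session_length": active / len(sess),
            "activity_over_lifetime": active / lifetime if lifetime > 0 else 1.0,
            "activity_over_trace": active / trace_length,
        }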
                                          all clients     clients with >1 transfer
    activity fraction    median                66.08%           5.54%
    over lifetime        90th percentile         100%          94.41%
    activity fraction    median                 0.01%           0.20%
    over trace           90th percentile        2.29%           4.15%
    total activity       median             0.69 hours     10.09 hours
    over trace           90th percentile  111.93 hours    202.65 hours
    average              median              2.40 mins       2.40 mins
    session length       90th percentile    28.25 mins      28.33 mins

Table 2: Client session lengths and activity fractions. Summary statistics over two different subsets of our client population: all clients, and only those clients that have transferred more than one file. We show activity fraction summary statistics relative to both the clients' lifetimes and to the entire trace.

Table 2 summarizes these metrics. Average session lengths are typically small: the median average session length is only 2.4 minutes, while the 90th percentile is 28.25 minutes. Most clients have high activity fractions relative to their lifetimes. However, this is a potentially misleading statistic, since many clients have very short lifetimes, leaving the system permanently after requesting only one file. The activity fraction for clients with greater than one request is a more revealing statistic. The median client in this category was active only 5.54% of its lifetime, or 0.20% of the entire trace. Highly active clients in this category are active for most of their lifetime, but even these clients are typically active for only a short fraction of the trace: the 90th percentile client has an activity fraction of only 4.15% over the trace.

At first glance, our user patience statistics may appear at odds with our average session length statistics. The median small object takes about 17 minutes to download fully, but the median average session length is only 2.41 minutes. This apparent paradox is reconciled by several facts: (1) many transactions fail (Table 1), producing many very short sessions, (2) periods of no activity may occur during a request if the client cannot find an available server with the object, and (3) some successful transactions are short, either because servers close connections prematurely or because the "chunk" size requested is small.

Figure 4: Bandwidth consumed vs. object size. (a) CDFs of the total bandwidth consumed and the number of requests generated, as a function of object size, and (b) bandwidth consumed and requests generated as a function of object size, grouped into three regions (<=10MB, 10-100MB, >100MB).

2.3 Object Characteristics

We now turn our attention to the characteristics of requested objects. Kazaa objects consist of a mixture of media types, including audio, video, images, and executables [29]. Our trace captured 1,640,912 requests to 633,106 different objects, resulting in 98,997,622 transactions. In the remainder of this section, we discuss the most prominent properties of the requested objects.

2.3.1 Kazaa Is Not One Workload

While we tend to think of a system like Kazaa as generating a single workload, Kazaa is better described as a blend of workloads that each have different properties. Figure 4(a) shows the percentage of total bandwidth consumed in our trace as a function of object size. The graph shows three prominent regions: "small" objects that are less than 10MB in size, medium-sized objects from 10 to a few hundred MB in size, and large objects that are close to 1GB in size. Relative to the Web, our "small" objects are three orders of magnitude larger than an average Web object, while our large objects are nearly five orders of magnitude larger. This was shown in [29] as well.

Figure 4(a) also graphs the percentage of requests observed in the trace as a function of object size, demonstrating that most requests are for the "small" objects. Taking a closer look, Figure 4(b) breaks down the number of requests and bytes transferred into three regions: 10MB or less, 10MB to 100MB, and over 100MB. We chose the value 10MB as it corresponds to an obvious plateau in Figure 4(a) and serves as an upper bound on the size of most digitized audio clips. We chose the value 100MB as it is a reasonable lower bound on the size of most digitized full-length movies transferred. While the majority of requests (91%) are for objects smaller than 10MB (primarily audio clips), the majority of bytes transferred (65%) are due to the largest objects (video files). Thus, if our concern is bandwidth consumption, we need to focus on the small number of requests to those large objects. However, if our concern is to improve the overall user experience, then we must focus on the majority of requests for small files, despite their relatively small bandwidth demand.

The stark differences between the large and small object workloads lead us to analyze their behaviors independently for much of the evaluation that follows.
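A small sketch of this decomposition, assuming each request has been reduced to an (object size, bytes transferred) pair; the region boundaries follow the 10MB and 100MB cutoffs above.

    MB = 2 ** 20

    def size_breakdown(requests):
        # requests: iterable of (object_size, bytes_transferred) pairs, one per request.
        bins = {"<=10MB": [0, 0], "10-100MB": [0, 0], ">100MB": [0, 0]}
        for size, nbytes in requests:
            key = "<=10MB" if size <= 10 * MB else ("10-100MB" if size <= 100 * MB else ">100MB")
            bins[key][0] += 1          # request count
            bins[key][1] += nbytes     # bytes transferred
        total_reqs = sum(c for c, _ in bins.values()) or 1
        total_bytes = sum(b for _, b in bins.values()) or 1
        # Return each region's share of requests and of bytes transferred.
        return {region: (count / total_reqs, nbytes / total_bytes)
                for region, (count, nbytes) in bins.items()}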
                                          small objects              large objects
                                        (primarily audio)          (primarily video)
                                        top 10     top 100        top 10     top 100
    overlap in the most popular
    objects between first and last      0 of 10    5 of 100       1 of 10    44 of 100
    30 days of trace
    # of newly popular objects
    that are recently born              6 of 10    73 of 95       2 of 9     47 of 56

Table 3: Object popularity dynamics in Kazaa. There is significant turnover in the set of most popular objects between the first 30 days and the last 30 days of the trace. The newly popular objects (those in the set of most popular objects over the last 30 days but not in the set over the first 30 days) tend to be recently born.

2.3.2 Kazaa Object Dynamics

A simple but crucial difference between multimedia and Web workloads is that multimedia objects are immutable, while Web pages are not. Though obvious, this fact and its implications have not been discussed in the research literature. A video clip of "Bambi Meets Godzilla" will be the same clip tomorrow or the next day: it never changes. On the other hand, the Web page CNN.com may change every hour, or upon every access if the page is personalized for clients. Web workloads are thus strongly driven by dynamic content creation. It has been shown that the rate of document change is a key factor in Internet behavior and has enormous implications for caching, performance, and content delivery in general [11, 37]. We now show how immutability affects object dynamics.

Kazaa clients fetch objects at most once. Because objects are immutable and take non-trivial time to download, we believe that users typically download a Kazaa object only once. Our traces confirm that 94% of the time, a Kazaa client requests a given object at most once; 99% of the time, a Kazaa client requests a given object at most twice. In comparison, based on a Web trace we gathered during the first nine days of our Kazaa trace period, a Web client requests a given object at most once only 57% of the time.
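Once the trace is reduced to one (client, object) pair per request, this repeat-request statistic is a short computation; the input format below is an assumption for illustration.

    from collections import Counter

    def repeat_request_profile(request_log):
        # request_log: iterable of (client, object) pairs, one entry per request.
        counts = Counter(request_log)
        pairs = len(counts) or 1
        at_most_once = sum(1 for c in counts.values() if c == 1) / pairs
        at_most_twice = sum(1 for c in counts.values() if c <= 2) / pairs
        return at_most_once, at_most_twice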
The popularity of Kazaa objects is often short-lived. Object immutability also has an impact on object popularity dynamics. The set of most popular pages remains relatively stable for the Web, and these pages account for a significant fraction of overall accesses [26]. In contrast, many of the most popular audio/video objects are routinely replaced by newly released objects, often in only a few weeks.

To illustrate this change in popular Kazaa objects, we compared the first 30 days of the trace to the last 30 days of the trace. For each of these 30-day (month-long) segments, we identified the top-10 most popular and the top-100 most popular objects (Table 3). For small objects, there was no overlap between the top-10 most popular objects: the most popular small objects had changed completely in the space of only six months. For large objects, there was only one object in common in the top-10 across these segments. The top-100 objects show only 5 small objects in common across the segments, and 44 objects in common across the large objects. Popularity is fleeting in Kazaa, and popular audio files tend to lose popularity faster than popular video files.
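A sketch of this comparison, assuming the first and last 30-day segments have each been reduced to a list of requested object identifiers (one entry per request).

    from collections import Counter

    def popularity_turnover(first_month, last_month, n=10):
        # first_month, last_month: lists of requested object IDs, one per request.
        top_first = [obj for obj, _ in Counter(first_month).most_common(n)]
        top_last = [obj for obj, _ in Counter(last_month).most_common(n)]
        overlap = set(top_first) & set(top_last)
        newly_popular = [obj for obj in top_last if obj not in top_first]
        # "Recently born" proxy: newly popular objects with no requests at all
        # during the first month of the trace.
        seen_early = set(first_month)
        recently_born = [obj for obj in newly_popular if obj not in seen_early]
        return overlap, newly_popular, recently_born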
The most popular Kazaa objects tend to be recently born objects. Given that there is significant turnover in popularity within Kazaa, we wanted to understand whether objects that become popular are old objects that have grown in popularity over time or recently born objects that enjoy sudden popularity. Using the same month-long segments as before, we calculated the fraction of objects that were newly popular (i.e., in the top-10 or top-100 in the last month of the trace but not the first month of the trace) but did not receive any requests at all during the first month of the trace (i.e., they were likely "born" after the first month of the trace). Table 3 shows the results: newly popular objects tend to be recently born in Kazaa, although this is more true for audio objects than video objects.

Most requests are for old objects. The previous experiments confirmed that the most popular objects tend to decay in popularity, and that the newly popular objects that replace them tend to be newly born. A related, but different, question is whether most requests go to old or new objects. We categorize an object as "old" if at least a month has passed since we observed the first request for that object. We categorize it as "new" if it has been less than a month since it was first requested. Note that we can be sure that an object is old, but we can never be sure that an object is new, since we may have missed requests for the object before our trace began. To deal with edge effects, we do not include the first month of requests in our statistics, but we do use them to help distinguish between old and new objects in subsequent months.

Using this methodology, 72% of requests for large objects go to old objects, while 28% go to new objects. For small objects, 52% of requests go to old objects, and 48% go to new objects. This shows that a substantial fraction of requests are for old objects. Large objects requested tend to be older than small objects, reinforcing our assertion that Kazaa is really a mixture of workloads: the pace of life is slower for large objects than for small objects.

From the above discussion, it is clear that the forces driving the Kazaa workload differ in many ways from those driving the Web. Web users may download the same pages many times; Kazaa users tend to download objects at most once. The arrival of new objects plays an important role in P2P file-sharing systems, while changes to existing pages are a more important dynamic in the Web. We discuss implications of these differences in Sections 3 and 4.

2.3.3 Kazaa Is Not Zipf

Much has been written about the Zipf-like qualities of the WWW [7]. In fact, researchers commonly quote the Zipf parameter of the popularity distributions seen in their traces [14, 27], in part to demonstrate that their results are "correct." This Zipf property of Web access patterns is thought to be a basic fact of nature: a small number of objects are extremely popular, but there is a long tail of unpopular requests. Zipf's law states that the popularity of the ith-most popular object is proportional to i^(-α), where α is the "Zipf coefficient" or "Zipf parameter." Zipf distributions look linear when plotted on a log-log scale.
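For concreteness, one common way to estimate the Zipf parameter is a least-squares line fit of log(request count) against log(rank); the sketch below uses that approach on a list of per-object request counts (a hypothetical input, and not necessarily the fitting procedure used for Figure 5).

    import math

    def zipf_alpha(request_counts):
        # Estimate alpha in popularity ~ rank^(-alpha) by fitting a line to the
        # log-log rank/count data (assumes at least two objects with requests).
        counts = sorted((c for c in request_counts if c > 0), reverse=True)
        xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
        ys = [math.log(c) for c in counts]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
                 sum((x - mean_x) ** 2 for x in xs))
        return -slope    # the line's slope is -alpha on a log-log plot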
Figure 5: Kazaa is not Zipf. The popularity distribution for large objects is much flatter than Zipf would predict, with the most popular object being requested 100x less than expected. Similarly shaped distributions exist for small objects and the aggregate Kazaa workload. For comparison, we show the popularity distribution of Web requests during a subset of the Kazaa trace period: the Web is well described by Zipf. (Log-log axes: number of requests vs. object rank, with curves for 100MB+ Kazaa objects and WWW objects.)

Figure 5 shows the Kazaa object popularity distribution on a log-log scale for large (>100MB) objects, along with the best-fit Zipf curve; a qualitatively similar curve exists for small (<10MB) objects. This figure also shows the popularity distribution of Web objects, drawn from our Web trace. Unlike the WWW, the Kazaa object request distribution as observed in our trace does not seem to follow a Zipf curve. The strongest difference from Zipf appears among the most popular objects, where the curves are noticeably flatter than the Zipf lines. The most popular multimedia objects are significantly less popular than Zipf would predict. For example, our trace shows that the most popular large object has only 272 requests, and only 47 objects have 100 requests or more. Popularity distributions of Web traffic sometimes have slightly flattened "heads," but the objects in the flattened region account for only a small percentage of total requests and bytes. The flattened portion of the Kazaa popularity curve consists of the top 1000 objects, and these objects account for more than 50% of bandwidth consumed.

2.4 Summary

Peer-to-peer file-sharing workloads are driven by considerably different processes than the Web. Kazaa users are much more patient than Web users, waiting hours, days, and, in some cases, weeks for their objects to arrive. Even so, the median session length is only 2.4 minutes, because: (1) many transactions fail, (2) many transactions are short,

servers and the Web. Following this, we present a model of multimedia workloads in Section 4, and we use this model to explore implications of non-Zipf behavior.

3. ZIPF'S LAW AND MULTIMEDIA WORKLOADS

Previous studies of multimedia workloads examined object popularity and found conflicting results, noting both Zipf and non-Zipf behavior [4, 7, 8, 12, 33]. This section examines previous work in the context of our own, with the goal of explaining the similarities, differences, and causes of the behavior observed both by us and others. We begin by presenting a hypothesis that explains the non-Zipf behavior in Kazaa. Next, we discuss previous studies that have observed or modeled non-Zipf workloads and contrast our hypothesis with previous explanations. Finally, we attempt to show the generality of our claim by revealing non-Zipf behavior in previously studied workloads. Section 4 then supports our hypothesis through the use of a generative workload model whose output closely matches our observations.

3.1 Why Kazaa Is Not Zipf

The previous section highlighted a crucial difference between the Web and the Kazaa P2P file-sharing system: Kazaa objects are immutable, while Web objects change, sometimes frequently. As we observed in Section 2.3.2, the immutability of Kazaa objects causes two important effects, one affecting user behavior and the other object dynamics. First, Web clients often fetch the same Web page many times ("fetch-repeatedly"). In contrast, Kazaa clients rarely request the same object twice ("fetch-at-most-once"). Second, the Web's primary object dynamic is updates to existing pages. In contrast, the primary object dynamic in the Kazaa workload is the arrival of entirely new objects.

These object and user characteristics can be used as two axes for classifying systems: immutable objects vs. mutable objects, and fetch-repeatedly clients vs. fetch-at-most-once clients. In systems with large client-side caches, these characteristics are intimately linked: fetch-repeatedly behavior exists when there are object updates, such as in the Web, and fetch-at-most-once user behavior exists when the only object dynamic is new arrivals, such as in the Kazaa system. We believe that these differences in
and (3) clients occasionally have periods of no activity while                              object and user dynamics explain why we observed Zipf re-
satisfying a request. As Kazaa clients age, they demand less                                quest distributions in our measured Web workload, but not
from the system. Part of this is due to attrition (clients per-                             in Kazaa.
manently leaving the system), but part is also due to older                                    As an initial comparison of fetch-at-most-once and fetch-
clients requesting fewer bytes per session.                                                 repeatedly behavior, we simulated the request distribution
   The Kazaa file-sharing workload consists of a mixture of                                  generated by two hypothetical 1,000-user populations: one
large, immutable audio and video objects, but the measured                                  in which users fetch objects repeatedly, and one in which
popularity distribution of these objects is not Zipf. The                                   they fetch objects at most once. In both cases, we simulated
popularity of the most requested objects is limited, since                                  the same initial Zipf distribution (with Zipf parameter α =
clients typically fetch objects at most once, unlike the Web.                               1.0 over 40,000 objects). However, in the fetch-at-most-once
Nonetheless, there is substantial non-uniformity in Kazaa                                   case, we prevented users from making subsequent requests
object popularity, which suggests that caching is a poten-                                  for the same object. (Section 4 more fully explains the model
tially effective bandwidth savings tool. In Kazaa, object                                    behind this simulation.)
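The simulation just described is easy to reproduce. The sketch below is our illustration, not the simulator used in the paper: both hypothetical populations draw from the same Zipf(1) distribution over 40,000 objects, and the only difference is that fetch-at-most-once users re-draw whenever they hit an object they have already fetched, which is equivalent to renormalizing the distribution over their unfetched objects.

import random
from bisect import bisect
from itertools import accumulate
from collections import Counter

N_OBJECTS = 40_000       # object population, matching the base value used above
N_USERS = 1_000          # hypothetical user population
REQS_PER_USER = 1_000    # average requests per user, as in the Figure 6 experiment

# Zipf(1) popularity: probability of rank i is proportional to 1/i (ranks are 1-based).
cum_weights = list(accumulate(1.0 / i for i in range(1, N_OBJECTS + 1)))

def zipf_draw():
    # One draw from Zipf(1); returns a 0-based object rank.
    return bisect(cum_weights, random.random() * cum_weights[-1])

def simulate(fetch_at_most_once):
    # Generate N_USERS * REQS_PER_USER requests and count requests per object.
    counts = Counter()
    for _ in range(N_USERS):
        seen, made = set(), 0
        while made < REQS_PER_USER:
            obj = zipf_draw()
            if fetch_at_most_once and obj in seen:
                continue              # re-draw: this user never fetches the same object twice
            seen.add(obj)
            counts[obj] += 1
            made += 1
    return counts

for name, flag in (("fetch-repeatedly", False), ("fetch-at-most-once", True)):
    top5 = [c for _, c in simulate(flag).most_common(5)]
    print(f"{name}: requests for the five most-requested objects = {top5}")

Plotting the two count vectors against object rank on log-log axes should reproduce the qualitative shapes in Figure 6: a straight Zipf line for fetch-repeatedly, and a flattened head for fetch-at-most-once.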
Figure 6 shows the results when users have made an average of 1000 requests each. While fetch-repeatedly behavior recreates the underlying Zipf distribution, fetch-at-most-once behavior shows markedly different results: the most popular objects are requested much less, while objects down the tail show elevated numbers of requests.
[Chart: simulated # of requests vs. object rank on log-log axes, with curves for fetch-repeatedly and fetch-at-most-once populations.]
Figure 6: Simulated object popularity for fetch-repeatedly and fetch-at-most-once behavior. When users fetch an individual object at most once, the resulting request distribution has a significantly flattened head, approximating the Kazaa popularity distribution.

This behavior strikingly resembles the measurement results shown in Figure 5, confirming our intuition that fetch-at-most-once behavior is a key contributor to non-Zipf popularity.

3.2 Non-Zipf Workloads in Other Systems

Non-Zipf popularity distributions have been observed in many previous studies [1, 3, 8, 33]. We now compare our results to two categories of previous studies: those that analyzed Web workloads and those that analyzed video (or other multimedia) workloads.

3.2.1 Kazaa vs. Web Workloads

While most Web studies conclude that Web popularity distributions follow Zipf’s law, some studies have observed non-Zipf behavior, specifically in workloads that emerge as miss traffic from a proxy cache [4, 7, 12]. Web proxy caches, by design, absorb requests to popular documents; consequently, the most popular documents in a workload will appear only once (as cold misses) in cache miss traces. Moderately popular documents will occasionally be absorbed by the proxy cache, but they will also appear in the miss traffic as capacity misses. The least popular documents will always miss in the proxy cache, assuming the cache has finite size and cannot hold the entire object population. As a result, a Zipf workload, when fed through a shared proxy cache, will result in a non-Zipf popularity distribution with a flattened head [12], visually similar to the non-Zipf distributions we observed in Kazaa. Breslau et al. [7] derive equations that relate the Zipf parameter of a workload fed to a proxy cache and the hit rate that the proxy cache will experience.

Although proxy cache miss distributions may visually resemble our measured Kazaa popularity distributions, we believe the two workloads diverge from Zipf for different reasons. In Kazaa, users’ requests do not flow through a shared proxy cache: the workload we observed is the aggregate of a large number of individual fetch-at-most-once streams, with an absence of shared proxy caches. The only caches that currently exist in Kazaa are client-side caches. Note that there is no practical difference between true fetch-at-most-once user behavior and fetch-repeatedly user behavior filtered through infinite client-side caches, which also results in fetch-at-most-once client behavior.

The most popular Web object will show up only one time in a proxy cache miss trace. The most popular object will show up many times (potentially as many times as there are users) in a fetch-at-most-once workload such as Kazaa, even with individual client caches. Both result in non-Zipf distributions, but for different reasons.
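The head-flattening effect of a shared proxy cache can be seen in a few lines of simulation. The sketch below is ours (the paper provides no such code), with purely illustrative sizes: it feeds a Zipf(1) request stream through an LRU cache and examines the miss trace, in which the most popular objects appear only as cold misses while moderately popular objects dominate.

import random
from bisect import bisect
from itertools import accumulate
from collections import Counter, OrderedDict

N_OBJECTS, N_REQUESTS, CACHE_SIZE = 10_000, 500_000, 1_000   # illustrative sizes only

cum = list(accumulate(1.0 / i for i in range(1, N_OBJECTS + 1)))

def zipf_draw():
    # One Zipf(1) draw; object ids equal 0-based popularity ranks.
    return bisect(cum, random.random() * cum[-1])

cache = OrderedDict()     # shared proxy cache with LRU replacement
misses = Counter()        # per-object request counts as seen in the miss trace

for _ in range(N_REQUESTS):
    obj = zipf_draw()
    if obj in cache:
        cache.move_to_end(obj)            # hit: nothing reaches the miss trace
    else:
        misses[obj] += 1                  # cold or capacity miss
        cache[obj] = True
        if len(cache) > CACHE_SIZE:
            cache.popitem(last=False)     # evict the least-recently-used object

print("misses for the five most popular objects:", [misses[r] for r in range(5)])
print("largest per-object miss counts:", [c for _, c in misses.most_common(5)])

The first line should print counts near 1 (cold misses only), while the second shows the head of the miss-trace distribution shifted toward moderately popular objects: the flattened head described above.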
3.2.2 Kazaa vs. Other Multimedia Workloads

Tang et al. [33] show that streaming media server workloads exhibit flattened popularity distributions similar to those we observed. They attribute this flattened distribution to the long duration over which they gathered their trace. Their hypothesis is that with a longer trace, more files with similar popularity can be observed, in effect creating “groups” of files with similar properties. Instead of Zipf’s law being appropriate for describing the popularity of individual files, they show that Zipf’s law better describes the popularity of these groups, and they provide a mathematical transformation inspired by these groups to convert observed non-Zipf distributions into a Zipf-like distribution.

We believe that fetch-at-most-once behavior is the likely cause of the non-Zipf popularity in their workload. A short trace can cause a non-Zipf workload to appear Zipf-like, as not enough requests have been gathered within the trace for objects with similar popularity to be observed [8]. However, the fact that an adequate sample size will reveal a non-Zipf workload does not explain why this true workload is non-Zipf in the first place. The length of a trace by itself does not provide a satisfactory explanation of the forces driving popularity in a system. Fetch-at-most-once behavior partially provides this explanation. Additionally, in a system that experiences object births (for example, new video objects becoming available), a short trace may miss important birth events. A longer trace may reveal objects spread out over time that have equivalent short-term popularity, similar to the “group” effect Tang et al. proposed.

In another study, Almeida et al. [3] observe a flattened Zipf-like popularity distribution in an educational media server. They propose that this distribution is well described by the concatenation of two true Zipf distributions. Although the data appears to be well modeled by two Zipf distributions, their study does not provide an explanation of what causes this effect.

Another often studied multimedia workload is that of video-on-demand (VoD) servers. VoD researchers frequently use Zipf distributions to model the popularity of video documents in their systems [10, 18, 31]. Surprisingly, many of these researchers appear to base their Zipf assumption on results published from a single data set: one week of rental data from a video store [35].

Our results suggest that their workload (i.e., requests for immutable video objects) should reveal similar non-Zipf behavior stemming from fetch-at-most-once client behavior. To understand the apparent discrepancy between their assumptions and our results, we manually extracted the video rental data set from Figure 2 of [10]. In Figure 7(a), we show this data as it appeared in that paper, plotted on a linear scale with a Zipf curve fit. Figure 7(b) shows the same data plotted on a log-log scale with the same Zipf curve fit. While the linear scale plot seems to suggest the data may be well described by Zipf’s law, the log-log plot reveals that even this data set appears to show the flattened head that is characteristic of fetch-at-most-once systems. We also gathered recent box-office movie ticket sales data [34]. Figures 7(c) and (d) show that this data, too, is consistent with our fetch-at-most-once popularity distribution.
[Charts: four panels plotting rental frequency vs. movie index (linear and log-log scales) and box office sales ($millions) vs. movie index (linear and log-log scales).]
Figure 7: Video rental and box office sales popularity. (a) The popularity distribution from a 1992 video rental data set used to justify Zipf’s law in many video-on-demand papers, along with a Zipf curve fit with α = 0.9, and (b) the same data set and curve fit plotted on a log-log scale. Contrary to the assumption of many papers, video rental data does not appear to follow Zipf’s law. (c) The distribution of 2002 U.S. box office ticket sales on a linear scale, along with a Zipf fit with α = 2.28, and (d) on a log-log scale. This data set also appears to be non-Zipf.
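The paper does not describe how the α values in Figure 7 were obtained; one common way to produce such a Zipf curve fit, shown below purely as an illustration and not as the authors’ procedure, is an ordinary least-squares fit to the rank/count data in log-log space, where the Zipf exponent is the negated slope.

import math

def fit_zipf_alpha(counts):
    # Fit count(rank) ~ C * rank**(-alpha) by least squares in log-log space.
    # `counts` holds positive per-item totals sorted in decreasing order
    # (e.g., rentals per title or box-office gross per movie).
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope                     # Zipf exponent alpha

# Sanity check on synthetic data: a perfect Zipf(1) ranking recovers alpha = 1.0.
print(round(fit_zipf_alpha([1000 / r for r in range(1, 201)]), 3))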


The VoD community has recently begun to recognize the importance of object births in video-on-demand systems [16]. Because new video titles are created (often with high popularity), and because the popularity of existing video titles tends to diminish over time, VoD workload models are starting to incorporate schemes for changing popularity distributions over time. This is consistent with our earlier observations that both client and object birth rates are important factors in multimedia workloads.

In the next section, we present a model of fetch-at-most-once systems. We use this model to demonstrate the implications of the resulting non-Zipf behavior and explore the importance of client and object birth rates on system performance. We believe that this model is relevant to both P2P file-sharing and video-on-demand systems.

4. A MODEL OF P2P FILE-SHARING WORKLOADS

In the previous section, we hypothesized that the non-Zipf behavior of Kazaa, and of multimedia workloads in general, is best explained by the fetch-at-most-once behavior of clients. This section presents a model of P2P file-sharing systems that enables us to explore this hypothesis further. The model, used to generate Figure 6 in the previous section, captures key parameters of the P2P file-sharing workload, such as request rates, number of clients, number of objects, and changes to the set of clients and objects. From these parameters, it produces a stream of client requests for objects that we then analyze. By varying model parameters, we explore the underlying processes that lead to certain observable workload characteristics and how changes to the parameters affect the system and its performance.

The remainder of this section describes our basic model of a P2P file-sharing system. This model shows the impact that fetch-at-most-once behavior has on overall object popularity. We use caching as a lens to observe how the addition of new objects and new clients affects performance in a fetch-at-most-once file-sharing system. Finally, we validate the model by comparing the popularity distributions of a model-generated workload with that extracted from our trace data.

  Symbol   Meaning                                                                    Base value
  C        # of clients                                                               1,000
  O        # of objects                                                               40,000
  λR       per-user request rate                                                      2 objects/day
  α        Zipf parameter driving object popularity                                   1.0
  P(x)     probability that a user requests an object of popularity rank x            based on Zipf(1) (see text)
  A(x)     probability that a newly arrived object is inserted at popularity rank x   Zipf(1)
  M        cache size, measured as fraction of all objects                            varies
  λO       object arrival rate                                                        varies
  λC       client arrival rate                                                        varies

Table 4: Model structure and notation. These parameter settings reflect the values seen in our trace for large (>100MB) objects.

4.1 Model Description

Table 4 summarizes the parameters used in our model. We chose parameter values that reflect our trace measurements for large objects, since this component dominates bandwidth consumption. One parameter value that differs from the measured trace is the number of clients.
In order to run a substantially larger number of experiments, we model a system with 1,000 clients rather than the roughly 7,000 large-object clients in our trace. We verified that the predictions of our model were not affected by this difference. To simplify our model, we also assumed that all objects in the system were of equal size.

Our model captures key aspects of our P2P file-sharing workload, in particular, the differences between file-sharing and Web workloads. In a Web workload, clients select objects from a Zipf distribution, P(x), in an independent and identically distributed fashion. In contrast, the object selection process in a file-sharing system depends on three factors: (1) the Zipf distribution, P(x), (2) the way in which new objects are inserted into that distribution, A(x), and (3) the clients’ fetch-at-most-once behavior.

Our model generates requests as follows. On average, a client requests two objects per day, choosing which object to fetch from a Zipf probability distribution with parameter 1.0 (“Zipf(1)”).[4] We hypothesize that the underlying popularity of objects in a fetch-at-most-once file-sharing system is still driven by Zipf’s law, even though the observed workload becomes non-Zipf because of fetch-at-most-once clients. In our model, subsequent requests from the same client obey distributions obtained by removing already fetched objects from the candidate object set and re-scaling so the total probability is 1.0. Given two previously unrequested objects, the ratio of the probabilities that the client will request these objects next is identical to their ratio in the original Zipf distribution. For fetch-repeatedly systems, each request is made according to the original Zipf distribution.

When modeling fetch-at-most-once systems, λO > 0 is the object arrival rate. When an object is born in a fetch-at-most-once system, its popularity rank is determined by selecting randomly from the Zipf(1) distribution. Pre-existing objects of equal or lesser popularity are “pushed down” one Zipf position, and the resulting distribution is re-normalized so the total probability is again 1.0. In fetch-repeatedly systems, we set the object arrival rate to 0. Objects may be updated, but for simplicity we ignore the second-order effect of completely new objects on request behavior.

[4] Attempting to best-fit a Zipf curve to our measured non-Zipf distribution resulted in a Zipf parameter of 0.98.
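The request-generation and object-birth rules just described can be captured in a compact sketch. The class below is our reconstruction, not the authors’ simulator: object ids are kept in a list ordered by popularity rank, requests draw a rank from Zipf(1) (with rejection implementing the remove-and-rescale rule for fetch-at-most-once clients), and an object birth inserts a new id at a Zipf(1)-chosen rank, pushing lower-ranked objects down.

import random
from bisect import bisect
from itertools import accumulate

class FileSharingModel:
    # Sketch of the Section 4 model (our reconstruction): Zipf(1) popularity over a
    # ranked object list, fetch-at-most-once clients, and Zipf(1)-ranked object births.

    def __init__(self, n_clients=1_000, n_objects=40_000):
        self.ranked = list(range(n_objects))        # ranked[i] = object id at popularity rank i + 1
        self.next_id = n_objects
        self.fetched = [set() for _ in range(n_clients)]
        self._rebuild_weights()

    def _rebuild_weights(self):
        # Cumulative Zipf(1) weights over the current object population.
        self.cum = list(accumulate(1.0 / i for i in range(1, len(self.ranked) + 1)))

    def _zipf_rank(self):
        # One Zipf(1) draw; returns a 0-based popularity rank.
        return bisect(self.cum, random.random() * self.cum[-1])

    def request(self, client, at_most_once=True):
        # Return the object id fetched by `client` for its next request.
        while True:
            obj = self.ranked[self._zipf_rank()]
            if not at_most_once or obj not in self.fetched[client]:
                self.fetched[client].add(obj)
                return obj
            # otherwise re-draw, i.e., renormalize over the client's unfetched objects

    def object_birth(self):
        # A newly born object takes a Zipf(1)-selected rank; existing objects of equal
        # or lesser popularity are pushed down one position and weights are re-normalized.
        self.ranked.insert(self._zipf_rank(), self.next_id)
        self.next_id += 1
        self._rebuild_weights()

Driving this generator at λR = 2 requests per client per day, with object and client arrivals layered on top, should be enough to reproduce the qualitative shape of the fetch-at-most-once curve in Figure 6.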
While our trace shows all requested objects, it cannot observe the total object population, since many available objects were never accessed. However, total object population is a key parameter of our model, as it influences the amount of overlap that will likely occur in requests from different clients. We therefore estimated a base value for total object population by back-inference: how many large objects are most likely to have existed in total, given that we saw about 18,000 distinct large objects requested in the trace? We find that a total population of about 40,000 large-media objects is consistent with the trace data; therefore, we use this number as the base value. This number is also comparable to statistics that describe commercial movie releases: the Internet Movie Database reports between 50,000 and 60,000 movie releases world-wide over the past 20 years [34].

To quantify file-sharing effectiveness, we use the hit rate that the aggregate workload experiences against a 100% available shared cache with LRU replacement, whose size we vary in each experiment. Selected experiments using optimal replacement showed no qualitative differences from LRU results, and quantitative differences varied by only a few percent. For Web (fetch-repeatedly) scenarios, we make the optimistic assumption that all objects are cachable and are never updated.
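This caching-as-a-lens methodology is straightforward to mimic. The sketch below is ours and reuses the FileSharingModel class from the sketch above; it runs the static case analyzed next (no object or client arrivals) against a shared LRU cache sized at 10% of the object population, and reports the hit rate per 100-day window.

from collections import OrderedDict

class LRUCache:
    # 100%-available shared cache with LRU replacement, sized in whole objects.
    def __init__(self, capacity):
        self.capacity, self.store = capacity, OrderedDict()
        self.hits = self.lookups = 0

    def access(self, obj):
        self.lookups += 1
        if obj in self.store:
            self.hits += 1
            self.store.move_to_end(obj)
        else:
            self.store[obj] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)

# Static case: 1,000 fetch-at-most-once clients, 2 requests per day each, no arrivals.
# FileSharingModel is the class defined in the earlier sketch.
model, cache = FileSharingModel(), LRUCache(capacity=4_000)   # 4,000 = 10% of 40,000 objects
for day in range(1, 1001):
    for client in range(1_000):
        for _ in range(2):
            cache.access(model.request(client))
    if day % 100 == 0:
        print(f"day {day}: hit rate {cache.hits / cache.lookups:.3f}")
        cache.hits = cache.lookups = 0          # restart the measurement window

After the warm-up window, the reported hit rate should fall steadily as clients exhaust the popular objects, mirroring the fetch-at-most-once curves in Figure 8; replacing model.request(client) with a plain Zipf(1) draw gives the stable fetch-repeatedly baseline.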
[Chart: hit rate vs. days for fetch-repeatedly and fetch-at-most-once request streams, with shared cache sizes of 1000, 2000, and 4000 objects.]
Figure 8: File-sharing effectiveness diminishes with client age. After an initial cache warm-up period, the request stream generated by fetch-at-most-once client behavior experiences rapidly decreasing cache performance, while the stream generated by fetch-repeatedly clients remains stable. This effect is shown for various cache sizes (1000, 2000, and 4000 objects).

4.2 File-Sharing Effectiveness Diminishes with Client Age

Imagine an organization experiencing current demand for external bandwidth due to fetches of P2P file-sharing objects. How should this organization expect bandwidth demand to change over time, given a shared proxy cache? We address this question using the “static” case in which no new clients or objects arrive. This static analysis allows us to focus on one factor at a time (fetch-at-most-once behavior, in this case). We relax these assumptions in later subsections to show the impact of other factors, such as new object and client arrivals.

Figure 8 shows hit rate against time for various shared cache sizes, assuming that at time zero no clients have fetched any objects. After a brief cache warm-up period, hit rate decreases as clients age, even for a cache that can hold 10% of all file-sharing objects. This is because fetch-at-most-once clients consume the most popular objects early. Later requests are selected with nearly uniform likelihood from an increasingly large number of objects. That is, the system evolves towards one in which there is no locality, and objects are chosen at random from a very large space.

This behavior suggests that if client request rates remain constant over time, the external bandwidth load they present increases, since more of their requests are directed to objects that are only available externally. Conversely, if we hope to have stable bandwidth demand over time, clients must behave in a way that reduces the intensity of their requests in a manner consistent with the shape of the hit-rate decreases shown in Figure 8.

The decrease in hit-rate over time is a strong property of fetch-at-most-once behavior. The underlying popularity distribution need not be heavy tailed for it to occur. We have performed experiments using initial object popularity distributions that have higher locality (i.e., Zipf parameters larger than 1.0). In a fetch-repeatedly context, the larger skew of these distributions towards the most popular objects makes file sharing easier and hit rates rise. For fetch-at-most-once systems, hit rates start out much higher, but as clients age
Measurement and Modeling Reveal Forces Driving P2P File Sharing
Measurement and Modeling Reveal Forces Driving P2P File Sharing
Measurement and Modeling Reveal Forces Driving P2P File Sharing
Measurement and Modeling Reveal Forces Driving P2P File Sharing
Measurement and Modeling Reveal Forces Driving P2P File Sharing
Measurement and Modeling Reveal Forces Driving P2P File Sharing

Más contenido relacionado

La actualidad más candente

COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA AND CHORD DHTS
COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA  AND CHORD DHTS COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA  AND CHORD DHTS
COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA AND CHORD DHTS ijp2p
 
CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...
CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...
CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...ijmpict
 
The Semantic Web
The Semantic WebThe Semantic Web
The Semantic WebAnil Mishra
 
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicSecurity-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicNana Kwame(Emeritus) Gyamfi
 
the darknet and the future of content distribution
the darknet and the future of content distributionthe darknet and the future of content distribution
the darknet and the future of content distributionmustafa sarac
 
Bit Torrent Protocol Report
Bit Torrent Protocol ReportBit Torrent Protocol Report
Bit Torrent Protocol Reportgkmv
 

La actualidad más candente (9)

COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA AND CHORD DHTS
COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA  AND CHORD DHTS COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA  AND CHORD DHTS
COMPARATIVE STUDY OF CAN, PASTRY, KADEMLIA AND CHORD DHTS
 
CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...
CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...
CHALLENGES IN SIGNIFICANT ADOPTION OF ACTIVE QUEUE MANAGEMENT IN THE PHILIPPI...
 
Network Science: Theory, Modeling and Applications
Network Science: Theory, Modeling and ApplicationsNetwork Science: Theory, Modeling and Applications
Network Science: Theory, Modeling and Applications
 
The Semantic Web
The Semantic WebThe Semantic Web
The Semantic Web
 
Pptx present
Pptx presentPptx present
Pptx present
 
Darknet5
Darknet5Darknet5
Darknet5
 
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicSecurity-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
 
the darknet and the future of content distribution
the darknet and the future of content distributionthe darknet and the future of content distribution
the darknet and the future of content distribution
 
Bit Torrent Protocol Report
Bit Torrent Protocol ReportBit Torrent Protocol Report
Bit Torrent Protocol Report
 

Similar a Measurement and Modeling Reveal Forces Driving P2P File Sharing

2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...
2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...
2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...IEEEFINALSEMSTUDENTSPROJECTS
 
IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...
IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...
IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...IEEEMEMTECHSTUDENTPROJECTS
 
FUTURE OF PEER-TO-PEER TECHNOLOGY WITH THE RISE OF CLOUD COMPUTING
FUTURE OF PEER-TO-PEER TECHNOLOGY WITH  THE RISE OF CLOUD COMPUTINGFUTURE OF PEER-TO-PEER TECHNOLOGY WITH  THE RISE OF CLOUD COMPUTING
FUTURE OF PEER-TO-PEER TECHNOLOGY WITH THE RISE OF CLOUD COMPUTINGijp2p
 
Effective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To NetworksEffective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To NetworksIRJET Journal
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)irjes
 
F233842
F233842F233842
F233842irjes
 
Analyse the performance of mobile peer to Peer network using ant colony optim...
Analyse the performance of mobile peer to Peer network using ant colony optim...Analyse the performance of mobile peer to Peer network using ant colony optim...
Analyse the performance of mobile peer to Peer network using ant colony optim...IJCI JOURNAL
 
A Brief Note On Peer And Peer ( P2P ) Applications Have No...
A Brief Note On Peer And Peer ( P2P ) Applications Have No...A Brief Note On Peer And Peer ( P2P ) Applications Have No...
A Brief Note On Peer And Peer ( P2P ) Applications Have No...Brenda Thomas
 
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?Gabriele Bozzi
 
CloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom ItaliaCloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom ItaliaGabriele Bozzi
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用lantianlcdx
 
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...ijp2p
 
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...ijp2p
 
On client’s interactive behaviour to design peer selection policies for bitto...
On client’s interactive behaviour to design peer selection policies for bitto...On client’s interactive behaviour to design peer selection policies for bitto...
On client’s interactive behaviour to design peer selection policies for bitto...IJCNCJournal
 
International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...
International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...
International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...ijp2p
 
Congestion control for_p2_p_live_streaming
Congestion control for_p2_p_live_streamingCongestion control for_p2_p_live_streaming
Congestion control for_p2_p_live_streamingijp2p
 
CONGESTION CONTROL FOR P2P LIVE STREAMING
CONGESTION CONTROL FOR P2P LIVE STREAMINGCONGESTION CONTROL FOR P2P LIVE STREAMING
CONGESTION CONTROL FOR P2P LIVE STREAMINGijp2p
 
Ijp2 p
Ijp2 pIjp2 p
Ijp2 pijp2p
 

Similar a Measurement and Modeling Reveal Forces Driving P2P File Sharing (20)

2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...
2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...
2014 IEEE DOTNET NETWORKING PROJECT A proximity aware interest-clustered p2p ...
 
IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...
IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...
IEEE 2014 DOTNET NETWORKING PROJECTS A proximity aware interest-clustered p2p...
 
FUTURE OF PEER-TO-PEER TECHNOLOGY WITH THE RISE OF CLOUD COMPUTING
FUTURE OF PEER-TO-PEER TECHNOLOGY WITH  THE RISE OF CLOUD COMPUTINGFUTURE OF PEER-TO-PEER TECHNOLOGY WITH  THE RISE OF CLOUD COMPUTING
FUTURE OF PEER-TO-PEER TECHNOLOGY WITH THE RISE OF CLOUD COMPUTING
 
Effective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To NetworksEffective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To Networks
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)
 
F233842
F233842F233842
F233842
 
Analyse the performance of mobile peer to Peer network using ant colony optim...
Analyse the performance of mobile peer to Peer network using ant colony optim...Analyse the performance of mobile peer to Peer network using ant colony optim...
Analyse the performance of mobile peer to Peer network using ant colony optim...
 
A Brief Note On Peer And Peer ( P2P ) Applications Have No...
A Brief Note On Peer And Peer ( P2P ) Applications Have No...A Brief Note On Peer And Peer ( P2P ) Applications Have No...
A Brief Note On Peer And Peer ( P2P ) Applications Have No...
 
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
Cloud Camp Milan 2K9 Telecom Italia: Where P2P?
 
CloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom ItaliaCloudCamp Milan 2009: Telecom Italia
CloudCamp Milan 2009: Telecom Italia
 
presentation_SB_v01
presentation_SB_v01presentation_SB_v01
presentation_SB_v01
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用
 
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
 
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
ANALYSIS STUDY ON CACHING AND REPLICA PLACEMENT ALGORITHM FOR CONTENT DISTRIB...
 
On client’s interactive behaviour to design peer selection policies for bitto...
On client’s interactive behaviour to design peer selection policies for bitto...On client’s interactive behaviour to design peer selection policies for bitto...
On client’s interactive behaviour to design peer selection policies for bitto...
 
International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...
International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...
International Journal of Peer to Peer Networks (IJP2P) Vol.6, No.2, August 20...
 
Congestion control for_p2_p_live_streaming
Congestion control for_p2_p_live_streamingCongestion control for_p2_p_live_streaming
Congestion control for_p2_p_live_streaming
 
CONGESTION CONTROL FOR P2P LIVE STREAMING
CONGESTION CONTROL FOR P2P LIVE STREAMINGCONGESTION CONTROL FOR P2P LIVE STREAMING
CONGESTION CONTROL FOR P2P LIVE STREAMING
 
Ijp2 p
Ijp2 pIjp2 p
Ijp2 p
 

Measurement and Modeling Reveal Forces Driving P2P File Sharing

  • 1. Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload Krishna P. Gummadi, Richard J. Dunn, Stefan Saroiu, Steven D. Gribble, Henry M. Levy, and John Zahorjan Department of Computer Science and Engineering University of Washington Seattle, WA 98195 {gummadi,rdunn,tzoompy,gribble,levy,zahorjan}@cs.washington.edu ABSTRACT Categories and Subject Descriptors Peer-to-peer (P2P) file sharing accounts for an astonishing C.4 [Performance of Systems]: Modeling Techniques; volume of current Internet traffic. This paper probes deeply C.2.4 [Computer-Communication Networks]: Distrib- into modern P2P file sharing systems and the forces that uted Systems—Distributed applications drive them. By doing so, we seek to increase our under- standing of P2P file sharing workloads and their implica- General Terms tions for future multimedia workloads. Our research uses a three-tiered approach. First, we analyze a 200-day trace Measurement, performance, design of over 20 terabytes of Kazaa P2P traffic collected at the University of Washington. Second, we develop a model of Keywords multimedia workloads that lets us isolate, vary, and explore Peer-to-peer, multimedia workloads, Zipf’s law, modeling, the impact of key system parameters. Our model, which we measurement parameterize with statistics from our trace, lets us confirm various hypotheses about file-sharing behavior observed in the trace. Third, we explore the potential impact of locality- 1. INTRODUCTION awareness in Kazaa. A decade after its birth, the Internet continues its rapid Our results reveal dramatic differences between P2P file and often surprising evolution. Recent studies have shown sharing and Web traffic. For example, we show how the a dramatic shift of Internet traffic away from HTML text immutability of Kazaa’s multimedia objects leads clients pages and images and towards multimedia file sharing. For to fetch objects at most once; in contrast, a World-Wide example, a March 2000 study at the University of Wisconsin Web client may fetch a popular page (e.g., CNN or Google) found that the bandwidth consumed by Napster had edged thousands of times. Moreover, we demonstrate that: (1) ahead of HTTP bandwidth [28]. Only two years later, a Uni- this “fetch-at-most-once” behavior causes the Kazaa popu- versity of Washington study showed that peer-to-peer file larity distribution to deviate substantially from Zipf curves sharing dominates the campus network, consuming 43% of we see for the Web, and (2) this deviation has significant all bandwidth compared to only 14% for WWW traffic [29]. implications for the performance of multimedia file-sharing When comparing these bandwidth consumption statistics systems. Unlike the Web, whose workload is driven by doc- to those in a 1999 study of a similar campus workload [37], ument change, we demonstrate that clients’ fetch-at-most- the portion of network bytes ascribed to audio and video once behavior, the creation of new objects, and the addition increased by 300% and 400%, respectively, over a 3-year pe- of new clients to the system are the primary forces that drive riod. Without question, multimedia file sharing has become multimedia workloads such as Kazaa. We also show that a dominant factor in today’s Internet. In all likelihood, it there is substantial untapped locality in the Kazaa workload. will dominate the Internet of the future, barring a chilling Finally, we quantify the potential bandwidth savings that effect from legal assaults. 
locality-aware P2P file-sharing architectures would achieve. Two traits of today’s file-sharing systems distinguish them from Web-based content distribution. First, current file- sharing systems use a “peer-to-peer” design: peers volun- tarily provide resources as well as consume them. Because of this, the system must dynamically adapt to maintain ser- Permission to make digital or hard copies of all or part of this work for vice continuity as individual peers come and go. Second, personal or classroom use is granted without fee provided that copies are file-sharing is being used predominantly to distribute mul- not made or distributed for profit or commercial advantage and that copies timedia files, and as a result, file-sharing workloads differ bear this notice and the full citation on the first page. To copy otherwise, to substantially from Web workloads [29]. Multimedia files are republish, to post on servers or to redistribute to lists, requires prior specific large (several megabytes or gigabytes, compared to several permission and/or a fee. SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA. kilobytes for typical Web pages), they are immutable, and Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00. there are currently far fewer distinct multimedia files than
  • 2. Web pages. Video-on-demand (VoD) systems also distribute trace length 203 days, 5 hours, 6 minutes multimedia files; we contrast our Kazaa measurements and # of requests 1,640,912 # of transactions 98,997,622 analysis with related work on VoD systems in Section 3. # of unsuccessful transactions 65,505,165 (66.2%) This paper presents an in-depth analysis of a modern P2P 252KB (all transactions) average transaction size multimedia file-sharing workload, considering the “peer-to- 752KB (successful transactions only) peer” and the “multimedia” aspects of the workload inde- # of users 24,578 pendently. Our goals are: # of unique objects 633,106 (totaling 8.85TB) bytes transferred 22.72TB 1. to understand the fundamental properties of multime- content demanded 43.87TB dia file-sharing systems, independent of the design of the delivery system, Table 1: Kazaa trace summary statistics, 5/28/02 to 12/17/02. A transaction refers to a single Kazaa HTTP 2. to explore the forces driving P2P file-sharing work- transfer; a request refers to the set of transactions a loads so we can anticipate the potential impacts of client issues while downloading an object, potentially future change, and in pieces from many servers. Clients are identified by Kazaa username. We only present statistics for down- 3. to demonstrate that significant opportunity exists to loads made by university-internal clients for data on optimize performance in current file-sharing systems university-external peers. by exploiting untapped locality in the workload. To meet these goals, we employ several approaches. First, we analyze a 200-day trace of Kazaa [19] traffic that we ies have described high-level characteristics of P2P work- collected at the University of Washington between May and loads [6, 29, 30, 32]. Our goals in this section are to: (1) December of 2002. We traced over 20 terabytes of traffic dig beneath these high-level studies to uncover the processes during that period, from which we distill several key lessons that drive the workloads, and (2) demonstrate ways in which about the Kazaa workload. Second, we derive a model of these processes fundamentally differ from those of the Web. this multimedia traffic based on our analysis. The model 2.1 Trace Methodology helps to explain the root causes of many of the trends shown by Kazaa and to predict how the trends may change as the The data in this section are based on a 200-day trace workload evolves. Third, we use trace-driven simulation to of Kazaa peer-to-peer file-sharing traffic collected at the quantify the significant potential that exists to improve the University of Washington between May 28th and December performance of multimedia file-sharing in our environment. 17th, 2002. The University of Washington (UW) is a large Our analysis reveals that the Kazaa workload is driven campus with over 60,000 faculty, students, and staff. Table 1 by considerably different forces than the Web. Kazaa ob- describes the trace, which saw over 20 terabytes of incoming jects are immutable, and as a result, the vast majority of data resulting from 1.6 million requests. Our trace was long its objects are fetched at most once per client; in contrast, enough for us to observe seasonal traffic variations, includ- Web pages (e.g., Google or CNN) can be fetched thousands ing the end of spring quarter in June, the summer months, of times per client. Our measurements show that the pop- and the start and end of the fall quarter. 
We also observed ularity distribution of Kazaa objects deviates substantially the impact of bandwidth rate-limiting instituted by the uni- from the Zipf curves we commonly see for the Web, and versity’s networking organization midway through the trace our model confirms that the “fetch-at-most-once” behavior in an attempt to control the cost of Kazaa traffic.1 of Kazaa clients is the cause. Our model demonstrates an- We collected our trace using hardware and software in- other consequence of object immutability: unlike the Web, stalled at the network border between the university and whose workload is driven by document change, the primary the Internet. Our hardware consists of a 2.0 GHz Pentium- forces in Kazaa are the creation of new objects and the addi- III workstation that monitored traffic flowing between the tion of new users. Without these forces, fetch-at-most-once university and the Internet. Our workstation had sufficient behavior would drive the system to stagnation. CPU and network capacity to ensure that no packets were The structure of this paper follows the multi-tiered ap- dropped, even during peak load. An adjacent workstation proach cited above. Section 2 describes our trace method- acted as a one-terabyte file store for archiving trace data. ology and presents our trace-based analysis of Kazaa. In Our software used a kernel packet filter [24] to deliver TCP Section 3, we analyze the popularity distribution of Kazaa packets to a user-level process, which identified HTTP re- requests and the forces that shape it. Section 4 uses our quests within the TCP flows. Throughout our trace, the observations and analysis to develop an analytical model; packet filter reported packet drop rates of less than 0.0001%. we then use the model to explore the processes that drive We made all sensitive information anonymous – including Kazaa’s behavior in greater depth. Section 5 considers the IP addresses, URLs, usernames, and object names – before performance potential of bandwidth-saving techniques sug- compressing and storing the trace. Overall, our tracing and gested by our modeling and analysis. We describe research analysis software consists of over 30,000 lines of code. related to our study in Section 6, while Section 7 summarizes Our hardware monitored all incoming and outgoing traf- our results and presents conclusions. fic. However, the data presented in this paper (including Table 1) are for one direction only: requests made by uni- versity-internal peers to download data stored on university- 2. THE MEASURED PROPERTIES OF P2P 1 FILE-SHARING WORKLOADS The imposed rate limits bounded upload traffic out of the university’s dormitory population and had little effect on This section uses our trace data to identify key properties download traffic (to the dorms or to the university as a of the Kazaa multimedia file-sharing system. Recent stud- whole), which is the focus of our research.
Our hardware monitored all incoming and outgoing traffic. However, the data presented in this paper (including Table 1) are for one direction only: requests made by university-internal peers to download data stored on university-external peers. This unidirectional trace captures all requests issued by a stable, complete user population over a period of time. Kazaa control traffic, which consists primarily of queries and their responses, is encrypted and was not captured as part of our trace.

Throughout this paper, the term "user" refers to a person and "client" refers to the application instance running on behalf of a user. We assume there is largely a one-to-one correspondence between users and specific application instances in our environment (although this may not always be true); therefore, we draw conclusions about users based on observations of clients in our trace. Note, however, that client-side caches may absorb some requests from users, meaning that the client request rate, which we observe in our trace, may be lower than the true user request rate, which we cannot directly observe.

Kazaa clients supply Kazaa-specific "usernames" as an HTTP header in each transaction. We use these usernames (rather than IP addresses) to distinguish between different users in our trace. Unfortunately, an unofficial version of Kazaa, called "KazaaLite," became popular during our tracing period and is compiled with a predefined username embedded in the application itself. We "special-case" requests from KazaaLite, resorting to distinguishing between KazaaLite users by their IP addresses. Although DHCP is used in portions of our campus, and identifying users by IP address is known to have issues when DHCP is present [6], only 5.7% of transactions in our trace were from KazaaLite clients. Furthermore, KazaaLite clients did not appear within the first 59 days of our 203-day trace. (Many more unofficial versions of Kazaa that use generic usernames have appeared since our trace period finished; precisely distinguishing between peer-to-peer users will become very difficult, given that neither IP addresses nor application-specific usernames are unique.)

Kazaa file-transfer traffic consists of unencrypted HTTP transfers; all transfers include Kazaa-specific HTTP headers (e.g., "X-Kazaa-IP"). These headers make it simple to distinguish between Kazaa activity and other HTTP activity. They also provide enough information for us to identify precisely which object is being transferred in a given transaction. When a client attempts to download an object, that object may be downloaded in pieces (often called "chunks") from several sources over a long period of time. We define a "transaction" to be a single HTTP transfer between a client and a server, and a "request" to be the set of transactions a client participates in to download an entire object. A failed transaction occurs when a client successfully contacts a remote peer, but that remote peer does not return any data, instead returning an HTTP 500 error code.

A single request may span many minutes, hours, or even days, since the Kazaa client software will continue to attempt to download the object long after the user has asked for it. Occasionally, a client may download only a subset of the entire object (either because the user gives up or because the object becomes permanently inaccessible in the middle of a request). We call this a "partial request."

The Kazaa application has an auto-update feature, meaning that a running instance of Kazaa will periodically check for updated versions of itself. If found, it downloads the new executable over the Kazaa network. We chose to filter out these auto-update transactions from our logs, as they are not representative of multimedia requests from users. Such filtering removed 0.4% of total transactions (0.3% of bytes) from our trace.

2.2 User Characteristics

Our first slice through the trace data focuses on the properties of Kazaa users. Previous studies [29, 2] have shown that peer-to-peer users in general are "greedy" (i.e., most users consume data but provide little in return) and have poor availability [30]. We confirm some of these characteristics, but we also explore others, such as user activity.

2.2.1 Kazaa Users Are Patient

As any Web-based enterprise knows, users are very sensitive to page-fetch latency. Beyond a certain small threshold (measured in seconds), they will abandon a site and move to another, possibly competing, site. For this reason, many online businesses engage services such as Keynote [20] to tell them quickly if their servers are not sufficiently responsive. In the world of the Web, users expect instant gratification and are unforgiving if they do not receive it.

In this context, the behavior of Kazaa users is surprising. Figure 1 shows the distribution of transfer times in Kazaa; transfer time is defined as the difference between the start time of the first transaction and the end time of the last transaction of a given user request. We filtered out partial requests (i.e., we only counted transfers for which the user eventually obtained the entire object). To deal with edge effects, we ignored requests for which at least one transaction occurred during the first month of the trace; note that this will tend to result in an underestimate of user patience. We present our results in terms of requests for "small" objects (less than 10 MB, typically audio files) and requests for "large" objects (more than 100 MB, typically video files). As we will show in Section 2.3.1, this is a natural and representative way to decompose the overall workload.

[Figure 1: Users are patient. A CDF of object transfer times in Kazaa for small (<10 MB) and large (100 MB+) objects; the X-axis (download latency, ranging from minutes to 150 days) is on a log scale.]

The results show incredible patience on the part of Kazaa users. Even for small objects, 30% of the requests take over an hour, and 10% take nearly a day. For large requests, less than 10% complete in an hour, 50% take more than a day, and nearly 20% of users are willing to wait a week for their downloads to complete! From this graph, it is clear that the dynamics of multimedia downloads differ totally from Web usage. The Web is an interactive system where users want an immediate response. Kazaa is a batch-mode delivery system, where downloads are made in the background and the content is examined later. Users do not wait for their content to arrive. They go about their business and eventually return to review the content after it has been received.
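The transaction-to-request aggregation and the transfer-time metric above can be expressed directly. The following Python sketch assumes a simplified per-transaction record (user, object, start, end, bytes), an illustrative format rather than the actual trace schema:

```python
from collections import defaultdict

def request_transfer_times(transactions, object_sizes):
    """Group transactions into requests and compute each request's transfer time.

    A request is the set of transactions one user issues for one object; its
    transfer time is the end of its last transaction minus the start of its
    first transaction. Requests that never accumulate the full object are
    treated as partial requests and dropped, mirroring the filtering above.
    """
    per_request = defaultdict(list)
    for t in transactions:
        per_request[(t["user"], t["object"])].append(t)

    times = {}
    for (user, obj), txns in per_request.items():
        delivered = sum(t["bytes"] for t in txns)
        if delivered < object_sizes[obj]:   # partial request: the user never obtained the whole object
            continue
        times[(user, obj)] = max(t["end"] for t in txns) - min(t["start"] for t in txns)
    return times
```

Splitting the resulting transfer times by object size (under 10 MB versus over 100 MB) yields the two CDFs of Figure 1.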
2.2.2 Users Slow Down As They Age

An interesting question we explored is how user interest in Kazaa varies over time. Do users become hungrier for content as they gain experience with Kazaa? Are their request rates relatively constant? Do they lose interest over time? The answer to such questions significantly affects the growth of the system.

To understand user behavior over time, we first calculated the average number of bytes consumed by clients as a function of age. The methodology for this measurement is complex: our trace has finite length, so we must avoid end effects that would overcount short-lived or undercount long-lived users. We compensate, first, by counting transferred bytes only from clients whose "births" we could observe. Because there are no detectable birth events in our trace, we used the heuristic of treating the first observed download from a client as a birth event if at least a full month had elapsed in our trace before seeing that first download. To compensate at the end of the trace, we counted bytes only from clients born prior to the last 11 weeks of the trace. Because of this "end threshold," we could draw definitive conclusions about clients' behavior only during the first 11 weeks of their lifetimes.

[Figure 2: Bytes requested by the population and attrition as a function of age (TB requested and population size versus week of age, weeks 1 through 11). "Older" clients request a smaller fraction of bytes than newer clients. There are fewer old clients than young clients, but attrition occurs at a more gradual rate than the slowdown in bytes requested.]

[Figure 3: Older clients have slower request rates (probability that a client makes >0 requests, and data requested per live client in GB, versus week of age). On average, clients use the system equally often as they age (having approximately a 50% chance of using the system any given week), but they request less data per session as they age. Note that the point corresponding to new clients (week 1) is artificially high, since by definition every new client requests an object immediately.]

Figure 2 shows the total number of bytes requested by the population as a function of its age. From this graph, we can see that older clients consume fewer bytes than newer clients. There are two reasons for this effect: (1) attrition reduces the number of older clients, since clients may "die" (i.e., leave the system forever) over time, and (2) some clients may continue to issue requests but do so at a slower rate as they age. We explore each of these in turn.

Attrition. To understand attrition in the system, we analyzed the number of clients that remain alive as a function of age (also shown in Figure 2). Population size declines at a more gradual rate than bytes requested, at least over the first 11 weeks of clients' lifetimes. Attrition therefore only partially explains why older clients demand less in aggregate from the system. To fully explain this phenomenon, clients must also slow down their request rates as they age.

Slowing down over time. Older clients may have slower request rates for two reasons: (1) they may use the system less often, or (2) they may ask for less when they use the system. Figure 3 shows that clients are equally likely to use the system regardless of age: on average, clients have about a 50% chance of making a request on any given week. Older clients slow down because they ask for less each time they use the system, not because they use the system less often.

In summary, new clients generate most of the load in Kazaa, and older clients consume fewer bytes as they age. In part, this is because of attrition: clients leave the system permanently as they grow older. Also, older clients tend to interact with the system at a constant rate, but ask for less during each interaction.

2.2.3 Client Activity

Quantifying the availability of clients in a peer-to-peer system is notoriously difficult [6]: no one metric can accurately capture availability in this environment, since any individual client might exist only for a fraction of the traced time period. Given our passive tracing methodology, we faced an additional methodological problem: we can detect that users are participating in the system only when their clients transfer data (either by downloading or uploading files). If a client is on-line but not active, we could not observe them. Because of this, we report statistics about client activity only, which is a lower bound on availability. (P2P software is often designed to make it difficult to close the program once it starts, "fooling" users into making their clients more available than intended. Accordingly, we suggest that client activity is a more universally comparable and stable indicator of "availability" than other metrics.)

We use two specific metrics to quantify the amount of client transfer activity: (1) activity fraction, which measures the fraction of time a client is transferring content over the client's lifetime or over the duration of the entire trace, and (2) average session length, in which a session is defined as an unbroken period of time during which a client has one or more active transactions. Average session length measures the typical duration of the periods during which a client is receiving or transmitting data. Our measurements indicate that the distributions of average session length and activity fraction over the measured population are heavy-tailed.
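Both metrics reduce to merging a client's transaction intervals into sessions. The sketch below shows one way to compute them; the interval list is an assumed input format, not the trace's actual representation:

```python
def sessions_and_activity(intervals, observation_time):
    """Compute a client's average session length and activity fraction.

    `intervals` is a list of (start, end) times of the client's individual
    transactions; `observation_time` is the period we normalize against
    (the client's lifetime, or the whole trace). A session is an unbroken
    period with at least one active transaction, so overlapping or touching
    transaction intervals are merged.
    """
    if not intervals:
        return 0.0, 0.0
    intervals = sorted(intervals)
    sessions = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= sessions[-1][1]:                   # still inside the current session
            sessions[-1][1] = max(sessions[-1][1], end)
        else:                                          # a gap with no activity starts a new session
            sessions.append([start, end])

    lengths = [end - start for start, end in sessions]
    average_session_length = sum(lengths) / len(lengths)
    activity_fraction = sum(lengths) / observation_time
    return average_session_length, activity_fraction
```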
                                            all clients     clients with >1 transfer
    activity fraction    median                  66.08%          5.54%
    over lifetime        90th percentile        100.00%         94.41%
    activity fraction    median                   0.01%          0.20%
    over trace           90th percentile          2.29%          4.15%
    total activity       median               0.69 hours     10.09 hours
    over trace           90th percentile    111.93 hours    202.65 hours
    average              median                2.40 mins      2.40 mins
    session length       90th percentile      28.25 mins     28.33 mins

Table 2: Client session lengths and activity fractions. Summary statistics over two different subsets of our client population: all clients, and only those clients that have transferred more than one file. We show activity fraction summary statistics relative to both the clients' lifetimes and to the entire trace.

Table 2 summarizes these metrics. Average session lengths are typically small: the median average session length is only 2.4 minutes, while the 90th percentile is 28.25 minutes. Most clients have high activity fractions relative to their lifetimes. However, this is a potentially misleading statistic, since many clients have very short lifetimes, leaving the system permanently after requesting only one file. The activity fraction for clients with greater than one request is a more revealing statistic. The median client in this category was active only 5.54% of its lifetime, or 0.20% of the entire trace. Highly active clients in this category are active for most of their lifetime, but even these clients are typically active for only a short fraction of the trace: the 90th percentile client has an activity fraction of only 4.15% over the trace.

At first glance, our user patience statistics may appear at odds with our average session length statistics. The median small object takes about 17 minutes to download fully, but the median average session length is only 2.41 minutes. This apparent paradox is reconciled by several facts: (1) many transactions fail (Table 1), producing many very short sessions, (2) periods of no activity may occur during a request if the client cannot find an available server with the object, and (3) some successful transactions are short, either because servers close connections prematurely or the "chunk" size requested is small.

2.3 Object Characteristics

We now turn our attention to the characteristics of requested objects. Kazaa objects consist of a mixture of media types, including audio, video, images, and executables [29]. Our trace captured 1,640,912 requests to 633,106 different objects, resulting in 98,997,622 transactions. In the remainder of this section, we discuss the most prominent properties of the requested objects.

2.3.1 Kazaa Is Not One Workload

While we tend to think of a system like Kazaa as generating a single workload, Kazaa is better described as a blend of workloads that each have different properties. Figure 4(a) shows the percentage of total bandwidth consumed in our trace as a function of object size. The graph shows three prominent regions: "small" objects that are less than 10 MB in size, medium-sized objects from 10 to a few hundred MB in size, and large objects that are close to 1 GB in size. Relative to the Web, our "small" objects are three orders of magnitude larger than an average Web object, while our large objects are nearly five orders of magnitude larger. This was shown in [29] as well.

[Figure 4: Bandwidth consumed vs. object size. (a) CDFs of the total bandwidth consumed and the number of requests generated, as a function of object size (MB), and (b) bandwidth consumed and requests generated as a function of object size, grouped into three regions (<=10 MB, 10-100 MB, >100 MB).]

Figure 4(a) also graphs the percentage of requests observed in the trace as a function of object size, demonstrating that most requests are for the "small" objects. Taking a closer look, Figure 4(b) breaks down the number of requests and bytes transferred into three regions: 10 MB or less, 10 MB to 100 MB, and over 100 MB. We chose the value 10 MB as this corresponds to an obvious plateau in Figure 4(a) and it serves as an upper bound on the size of most digitized audio clips. We chose the value 100 MB as it is a reasonable lower bound on the size of most digitized full-length movies transferred. While the majority of requests (91%) are for objects smaller than 10 MB (primarily audio clips), the majority of bytes transferred (65%) are due to the largest objects (video files). Thus, if our concern is bandwidth consumption, we need to focus on the small number of requests to those large objects. However, if our concern is to improve the overall user experience, then we must focus on the majority of requests for small files, despite their relatively small bandwidth demand.

The stark differences between the large and small object workloads lead us to analyze their behaviors independently for much of the evaluation that follows.
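This decomposition is easy to reproduce from a per-request summary of the trace. The sketch below uses the same 10 MB and 100 MB thresholds; the (size, bytes transferred) pair format is an illustrative assumption:

```python
def size_region(size_mb):
    """Bucket an object by size: <=10 MB (mostly audio), 10-100 MB, >100 MB (mostly video)."""
    if size_mb <= 10:
        return "<=10MB"
    if size_mb <= 100:
        return "10-100MB"
    return ">100MB"

def breakdown_by_region(requests):
    """Fraction of requests and of bytes transferred in each size region.

    `requests` is an iterable of (object_size_mb, bytes_transferred) pairs.
    """
    regions = ["<=10MB", "10-100MB", ">100MB"]
    request_counts = dict.fromkeys(regions, 0)
    byte_totals = dict.fromkeys(regions, 0)
    for size_mb, nbytes in requests:
        region = size_region(size_mb)
        request_counts[region] += 1
        byte_totals[region] += nbytes

    total_requests = sum(request_counts.values())
    total_bytes = sum(byte_totals.values())
    return ({r: request_counts[r] / total_requests for r in regions},
            {r: byte_totals[r] / total_bytes for r in regions})
```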
2.3.2 Kazaa Object Dynamics

A simple but crucial difference between multimedia and Web workloads is that multimedia objects are immutable, while Web pages are not. Though obvious, this fact and its implications have not been discussed in the research literature. A video clip of "Bambi Meets Godzilla" will be the same clip tomorrow or the next day: it never changes. On the other hand, the Web page CNN.com may change every hour, or upon every access if the page is personalized for clients. Web workloads are thus strongly driven by dynamic content creation. It has been shown that the rate of document change is a key factor in Internet behavior and has enormous implications for caching, performance, and content delivery in general [11, 37]. We now show how immutability affects object dynamics.

Kazaa clients fetch objects at most once. Because objects are immutable and take non-trivial time to download, we believe that users typically download a Kazaa object only once. Our traces confirm that 94% of the time, a Kazaa client requests a given object at most once; 99% of the time, a Kazaa client requests a given object at most twice. In comparison, based on a Web trace we gathered during the first nine days of our Kazaa trace period, a Web client requests a given object at most once only 57% of the time.

The popularity of Kazaa objects is often short-lived. Object immutability also has an impact on object popularity dynamics. The set of most popular pages remains relatively stable for the Web and these pages account for a significant fraction of overall accesses [26]. In contrast, many of the most popular audio/video objects are routinely replaced by newly released objects, often in only a few weeks.

To illustrate this change in popular Kazaa objects, we compared the first 30 days of the trace to the last 30 days of the trace. For each of these 30-day (month-long) segments, we identified the top-10 most popular and the top-100 most popular objects (Table 3). For small objects, there was no overlap between the top-10 most popular objects: the most popular small objects had changed completely in the space of only six months. For large objects, there was only one object in common in the top-10 across these segments. The top-100 objects show only 5 small objects in common across the segments, and 44 objects in common across the large objects. Popularity is fleeting in Kazaa, and popular audio files tend to lose popularity faster than popular video files.

                                            small objects             large objects
                                            (primarily audio)         (primarily video)
                                            top 10      top 100       top 10     top 100
    overlap in the most popular objects
    between first and last 30 days
    of trace                                0 of 10     5 of 100      1 of 10    44 of 100
    # of newly popular objects that
    are recently born                       6 of 10     73 of 95      2 of 9     47 of 56

Table 3: Object popularity dynamics in Kazaa. There is significant turnover in the set of most popular objects between the first 30 days and the last 30 days of the trace. The newly popular objects (those in the set of most popular objects over the last 30 days but not in the set over the first 30 days) tend to be recently born.

The most popular Kazaa objects tend to be recently born objects. Given that there is significant turnover in popularity within Kazaa, we wanted to understand whether objects that become popular are old objects that have grown in popularity over time or recently born objects that enjoy sudden popularity. Using the same month-long segments as before, we calculated the fraction of objects that were newly popular (i.e., in the top-10 or top-100 in the last month of the trace but not the first month of the trace), but did not receive any requests at all during the first month of the trace (i.e., they were likely "born" after the first month of the trace). Table 3 shows the results: newly popular objects tend to be recently born in Kazaa, although this is more true for audio objects than video objects.

Most requests are for old objects. The previous experiments confirmed that the most popular objects tend to decay in popularity, and that the newly popular objects that replace them tend to be newly born. A related, but different, question is whether most requests go to old or new objects. We categorize an object as "old" if at least a month has passed since we observed the first request for that object. We categorize it as "new" if it has been less than a month since it was first requested. Note that we can be sure that an object is old, but we can never be sure that an object is new, since we may have missed requests for the object before our trace began. To deal with edge effects, we do not include the first month of requests in our statistics, but we do use them to help distinguish between old and new objects in subsequent months.

Using this methodology, 72% of requests for large objects go to old objects, while 28% go to new objects. For small objects, 52% of requests go to old objects, and 48% go to new objects. This shows that a substantial fraction of requests are for the old objects. Large objects requested tend to be older than small objects, reinforcing our assertion that Kazaa is really a mixture of workloads: the pace of life is slower for large objects than for small objects.

From the above discussion, it is clear that the forces driving the Kazaa workload differ in many ways from those driving the Web. Web users may download the same pages many times; Kazaa users tend to download objects at most once. The arrival of new objects plays an important role in P2P file-sharing systems, while changes to existing pages are a more important dynamic in the Web. We discuss implications of these differences in Sections 3 and 4.
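The fetch-at-most-once fraction and the popularity-turnover numbers in Table 3 both come down to counting over the request log. The following sketch assumes a simplified list of (user, object, timestamp) records per request, a hypothetical format rather than the actual trace schema:

```python
from collections import Counter

def at_most_once_fraction(requests):
    """Fraction of (client, object) pairs that were requested no more than once."""
    per_pair = Counter((user, obj) for user, obj, _ in requests)
    return sum(1 for n in per_pair.values() if n == 1) / len(per_pair)

def top_n_overlap(requests, n, first_window, last_window):
    """Number of objects among the n most popular in both time windows.

    Each window is a (start, end) timestamp pair, e.g. the first and last
    30 days of the trace.
    """
    def top_n(window):
        start, end = window
        counts = Counter(obj for _, obj, t in requests if start <= t < end)
        return {obj for obj, _ in counts.most_common(n)}

    return len(top_n(first_window) & top_n(last_window))
```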
2.3.3 Kazaa Is Not Zipf

Much has been written about the Zipf-like qualities of the WWW [7]. In fact, researchers commonly quote the Zipf parameter of the popularity distributions seen in their traces [14, 27], in part to demonstrate that their results are "correct." This Zipf property of Web access patterns is thought to be a basic fact of nature: a small number of objects are extremely popular, but there is a long tail of unpopular requests. Zipf's law states that the popularity of the ith-most popular object is proportional to i^(-α), where α is the "Zipf coefficient" or "Zipf parameter." Zipf distributions look linear when plotted on a log-log scale.

Figure 5 shows the Kazaa object popularity distribution on a log-log scale for large (>100 MB) objects, along with the best-fit Zipf curve; a qualitatively similar curve exists for small (<10 MB) objects. This figure also shows the popularity distribution of Web objects, drawn from our Web trace.

[Figure 5: Kazaa is not Zipf (# of requests versus object rank, log-log scale, for WWW objects and 100 MB+ Kazaa objects). The popularity distribution for large objects is much flatter than Zipf would predict, with the most popular object being requested 100x less than expected. Similarly shaped distributions exist for small objects and the aggregate Kazaa workload. For comparison, we show the popularity distribution of Web requests during a subset of the Kazaa trace period: the Web is well described by Zipf.]

Unlike the WWW, the Kazaa object request distribution as observed in our trace does not seem to follow a Zipf curve. The strongest difference from Zipf appears among the most popular objects, where the curves are noticeably flatter than the Zipf lines. The most popular multimedia objects are significantly less popular than Zipf would predict. For example, our trace shows that the most popular large object has only 272 requests, and only 47 objects have 100 requests or more. Popularity distributions of Web traffic sometimes have slightly flattened "heads," but the objects in the flattened region account for only a small percentage of total requests and bytes. The flattened portion of the Kazaa popularity curve consists of the top 1000 objects, and these objects account for more than 50% of bandwidth consumed.

2.4 Summary

Peer-to-peer file-sharing workloads are driven by considerably different processes than the Web. Kazaa users are much more patient than Web users, waiting hours, days, and, in some cases, weeks for their objects to arrive. Even so, the median session length is only 2.4 minutes, because: (1) many transactions fail, (2) many transactions are short, and (3) clients occasionally have periods of no activity while satisfying a request. As Kazaa clients age, they demand less from the system. Part of this is due to attrition (clients permanently leaving the system), but part is also due to older clients requesting fewer bytes per session.

The Kazaa file-sharing workload consists of a mixture of large, immutable audio and video objects, but the measured popularity distribution of these objects is not Zipf. The popularity of the most requested objects is limited, since clients typically fetch objects at most once, unlike the Web. Nonetheless, there is substantial non-uniformity in Kazaa object popularity, which suggests that caching is a potentially effective bandwidth savings tool. In Kazaa, object arrivals play an important role, while in the Web, updates to existing pages are a more important dynamic.

In the next section, we explore what we believe are the reasons why Kazaa's workload does not appear to follow Zipf's law. We also compare Kazaa's non-Zipf behavior to non-Zipf behavior in other systems, including video-on-demand servers and the Web. Following this, we present a model of multimedia workloads in Section 4, and we use this model to explore implications of non-Zipf behavior.

3. ZIPF'S LAW AND MULTIMEDIA WORKLOADS

Previous studies of multimedia workloads examined object popularity and found conflicting results, noting both Zipf and non-Zipf behavior [4, 7, 8, 12, 33]. This section examines previous work in the context of our own, with the goal of explaining the similarities, differences, and causes of the behavior observed both by us and others. We begin by presenting a hypothesis that explains the non-Zipf behavior in Kazaa. Next, we discuss previous studies that have observed or modeled non-Zipf workloads and contrast our hypothesis with previous explanations. Finally, we attempt to show the generality of our claim by revealing non-Zipf behavior in previously studied workloads. Section 4 then supports our hypothesis through the use of a generative workload model whose output closely matches our observations.

3.1 Why Kazaa Is Not Zipf

The previous section highlighted a crucial difference between the Web and the Kazaa P2P file-sharing system: Kazaa objects are immutable, while Web objects change, sometimes frequently. As we observed in Section 2.3.2, the immutability of Kazaa objects causes two important effects, one affecting user behavior and the other object dynamics. First, Web clients often fetch the same Web page many times ("fetch-repeatedly"). In contrast, Kazaa clients rarely request the same object twice ("fetch-at-most-once"). Second, the Web's primary object dynamic is updates to existing pages. In contrast, the primary object dynamic in the Kazaa workload is the arrival of entirely new objects.

These object and user characteristics can be used as two axes for classifying systems: immutable objects vs. mutable objects, and fetch-repeatedly clients vs. fetch-at-most-once clients. In systems with large client-side caches, these characteristics are intimately linked: fetch-repeatedly behavior exists when there are object updates, such as in the Web, and fetch-at-most-once user behavior exists when the only object dynamic is new arrivals, such as in the Kazaa system. We believe that these differences in object and user dynamics explain why we observed Zipf request distributions in our measured Web workload, but not in Kazaa.

As an initial comparison of fetch-at-most-once and fetch-repeatedly behavior, we simulated the request distribution generated by two hypothetical 1,000-user populations: one in which users fetch objects repeatedly, and one in which they fetch objects at most once. In both cases, we simulated the same initial Zipf distribution (with Zipf parameter α = 1.0 over 40,000 objects). However, in the fetch-at-most-once case, we prevented users from making subsequent requests for the same object. (Section 4 more fully explains the model behind this simulation.)
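A minimal version of this comparison fits in a few lines of code. The sketch below is not the simulator used to generate Figure 6, but it follows the same setup (1,000 users, 40,000 objects, Zipf parameter 1.0); the fetch-at-most-once case is weighted sampling without replacement, implemented here with the exponential-keys technique:

```python
import heapq
import random
from collections import Counter

def zipf_weights(num_objects, alpha=1.0):
    """Zipf popularity: the weight of the object at rank i is proportional to i**-alpha."""
    return [1.0 / (i ** alpha) for i in range(1, num_objects + 1)]

def simulate(num_users=1000, num_objects=40000, requests_per_user=1000,
             fetch_at_most_once=True, seed=0):
    """Return per-object request counts for a synthetic user population.

    Fetch-repeatedly users draw every request independently from the Zipf
    distribution; fetch-at-most-once users draw without replacement, so no
    user requests the same object twice.
    """
    rng = random.Random(seed)
    weights = zipf_weights(num_objects)
    counts = Counter()
    for _ in range(num_users):
        if fetch_at_most_once:
            # Exponential-keys trick: the k smallest values of Exp(1)/weight form a
            # weighted sample of size k drawn without replacement.
            keyed = ((rng.expovariate(1.0) / w, obj) for obj, w in enumerate(weights))
            picks = [obj for _, obj in heapq.nsmallest(requests_per_user, keyed)]
        else:
            picks = rng.choices(range(num_objects), weights=weights, k=requests_per_user)
        counts.update(picks)
    return counts

# Sorting the counts in descending order and plotting them against rank on a
# log-log scale shows the flattened head for the fetch-at-most-once case, as in
# Figure 6 (smaller populations show the same shape and run faster).
```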
[Figure 6: Simulated object popularity for fetch-repeatedly and fetch-at-most-once behavior (# of requests versus object rank, log-log scale). When users fetch an individual object at most once, the resulting request distribution has a significantly flattened head, approximating the Kazaa popularity distribution.]

Figure 6 shows the results when users have made an average of 1000 requests each. While fetch-repeatedly behavior recreates the underlying Zipf distribution, fetch-at-most-once behavior shows markedly different results: the most popular objects are requested much less, while objects down the tail show elevated numbers of requests. This behavior strikingly resembles the measurement results shown in Figure 5, confirming our intuition that fetch-at-most-once behavior is a key contributor to non-Zipf popularity.

3.2 Non-Zipf Workloads in Other Systems

Non-Zipf popularity distributions have been observed in many previous studies [1, 3, 8, 33]. We now compare our results to two categories of previous studies: those that analyzed Web workloads and those that analyzed video (or other multimedia) workloads.

3.2.1 Kazaa vs. Web Workloads

While most Web studies conclude that Web popularity distributions follow Zipf's law, some studies have observed non-Zipf behavior, specifically in workloads that emerge as miss traffic from a proxy cache [4, 7, 12]. Web proxy caches, by design, absorb requests to popular documents; consequently, the most popular documents in a workload will appear only once (as cold misses) in cache miss traces. Moderately popular documents will occasionally be absorbed by the proxy cache, but they will also appear in the miss traffic as capacity misses. The least popular documents will always miss in the proxy cache, assuming the cache has finite size and cannot hold the entire object population. As a result, a Zipf workload, when fed through a shared proxy cache, will result in a non-Zipf popularity distribution with a flattened head [12], visually similar to the non-Zipf distributions we observed in Kazaa. Breslau et al. [7] derive equations that relate the Zipf parameter of a workload fed to a proxy cache and the hit rate that the proxy cache will experience.

Although proxy cache miss distributions may visually resemble our measured Kazaa popularity distributions, we believe the two workloads diverge from Zipf for different reasons. In Kazaa, users' requests do not flow through a shared proxy cache: the workload we observed is the aggregate of a large number of individual fetch-at-most-once streams, with an absence of shared proxy caches. The only caches that currently exist in Kazaa are client-side caches. Note that there is no practical difference between true fetch-at-most-once user behavior, and fetch-repeatedly user behavior filtered through infinite client-side caches resulting in fetch-at-most-once client behavior.

The most popular Web object will show up only one time in a proxy cache miss trace. The most popular object will show up many times (potentially as many times as there are users) in a fetch-at-most-once workload such as Kazaa, even with individual client caches. Both result in non-Zipf distributions, but for different reasons.

3.2.2 Kazaa vs. Other Multimedia Workloads

Tang et al. [33] show that streaming media server workloads exhibit flattened popularity distributions similar to those we observed. They attribute this flattened distribution to the long duration over which they gathered their trace. Their hypothesis is that with a longer trace, more files with similar popularity can be observed, in effect creating "groups" of files with similar properties. Instead of Zipf's law being appropriate for describing the popularity of individual files, they show that Zipf's law better describes the popularity of these groups, and they provide a mathematical transformation inspired by these groups to convert observed non-Zipf distributions into a Zipf-like distribution.

We believe that fetch-at-most-once behavior is the likely cause of the non-Zipf popularity in their workload. A short trace can cause a non-Zipf workload to appear Zipf-like, as not enough requests have been gathered within the trace for objects with similar popularity to be observed [8]. However, the fact that an adequate sample size will reveal a non-Zipf workload does not explain why this true workload is non-Zipf in the first place. The length of a trace by itself does not provide a satisfactory explanation of the forces driving popularity in a system. Fetch-at-most-once behavior partially provides this explanation. Additionally, in a system that experiences object births (for example, new video objects becoming available), a short trace may miss important birth events. A longer trace may reveal objects spread out over time that have equivalent short-term popularity, similar to the "group" effect Tang et al. proposed.

In another study, Almeida et al. [3] observe a flattened Zipf-like popularity distribution in an educational media server. They propose that this distribution is well described by the concatenation of two true Zipf distributions. Although the data appears to be well modeled by two Zipf distributions, their study does not provide an explanation of what causes this effect.

Another often studied multimedia workload is that of video-on-demand (VoD) servers. VoD researchers frequently use Zipf distributions to model the popularity of video documents in their systems [10, 18, 31]. Surprisingly, many of these researchers appear to base their Zipf assumption on results published from a single data set: one week of rental data from a video store [35].

Our results suggest that their workload (i.e., requests for immutable video objects) should reveal similar non-Zipf behavior stemming from fetch-at-most-once client behavior. To understand the apparent discrepancy between their assumptions and our results, we manually extracted the video rental data set from Figure 2 of [10]. In Figure 7(a), we show this data as it appeared in that paper, plotted on a linear scale with a Zipf curve fit. Figure 7(b) shows the same data plotted on a log-log scale with the same Zipf curve fit. While the linear scale plot seems to suggest the data may be well described by Zipf's law, the log-log plot reveals that even this data set appears to show the flattened head that is characteristic of fetch-at-most-once systems. We also gathered recent box-office movie ticket sales data [34]. Figures 7(c) and (d) show that this data, too, is consistent with our fetch-at-most-once popularity distribution.
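Fitting and checking a Zipf curve, as done for Figures 5 and 7, is straightforward: fit a line to log(count) versus log(rank) and look for systematic deviation at the head. A least-squares fit in log-log space is one common choice, sketched below (not necessarily the exact fitting procedure used for the figures):

```python
import math

def fit_zipf(counts):
    """Fit log(count) = log(c) - alpha * log(rank) by least squares.

    `counts` is a list of positive per-object request (or rental, or ticket-sale)
    counts; they are sorted into rank order here. Returns (alpha, c). A true Zipf
    workload lies on a straight line in log-log space, so a flattened head shows
    up as the top-ranked points falling well below the fitted line.
    """
    ys = [math.log(c) for c in sorted(counts, reverse=True)]
    xs = [math.log(rank) for rank in range(1, len(ys) + 1)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
             sum((x - mean_x) ** 2 for x in xs))
    alpha = -slope
    c = math.exp(mean_y - slope * mean_x)
    return alpha, c

# Comparing the observed count of the top-ranked object with the fitted
# prediction (c * 1**-alpha) exposes the flattened head: for the Kazaa data the
# observed value falls far below the prediction, as noted for Figure 5.
```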
[Figure 7: Video rental and box office sales popularity (rental frequency and box office sales in $millions versus movie index). (a) The popularity distribution from a 1992 video rental data set used to justify Zipf's law in many video-on-demand papers, along with a Zipf curve fit with α = 0.9, and (b) the same data set and curve fit plotted on a log-log scale. Contrary to the assumption of many papers, video rental data does not appear to follow Zipf's law. (c) The distribution of 2002 U.S. box office ticket sales on a linear scale, along with a Zipf fit with α = 2.28, and (d) on a log-log scale. This data set also appears to be non-Zipf.]

The VoD community has recently begun to recognize the importance of object births in video-on-demand systems [16]. Because new video titles are created (often with high popularity), and because the popularity of existing video titles tends to diminish over time, VoD workload models are starting to incorporate schemes for changing popularity distributions over time. This is consistent with our earlier observations that both client and object birth rates are important factors in multimedia workloads.

In the next section, we present a model of fetch-at-most-once systems. We use this model to demonstrate the implications of the resulting non-Zipf behavior and explore the importance of client and object birth rates on system performance. We believe that this model is relevant to both P2P file-sharing and video-on-demand systems.

4. A MODEL OF P2P FILE-SHARING WORKLOADS

In the previous section, we hypothesized that the non-Zipf behavior of Kazaa, and of multimedia workloads in general, is best explained by the fetch-at-most-once behavior of clients. This section presents a model of P2P file-sharing systems that enables us to explore this hypothesis further. The model, used to generate Figure 6 in the previous section, captures key parameters of the P2P file-sharing workload, such as request rates, number of clients, number of objects, and changes to the set of clients and objects. From these parameters, it produces a stream of client requests for objects that we then analyze. By varying model parameters, we explore the underlying processes that lead to certain observable workload characteristics and how changes to the parameters affect the system and its performance.

The remainder of this section describes our basic model of a P2P file-sharing system. This model shows the impact that fetch-at-most-once behavior has on overall object popularity. We use caching as a lens to observe how the addition of new objects and new clients affects performance in a fetch-at-most-once file-sharing system. Finally, we validate the model by comparing the popularity distributions of a model-generated workload with that extracted from our trace data.

4.1 Model Description

    Symbol   Meaning                                                    Base value
    C        # of clients                                               1,000
    O        # of objects                                               40,000
    λR       per-user request rate                                      2 objects/day
    α        Zipf parameter driving object popularity                   1.0
    P(x)     probability that a user requests an object of
             popularity rank x                                          based on Zipf(1) (see text)
    A(x)     probability that a newly arrived object is inserted
             at popularity rank x                                       Zipf(1)
    M        cache size, measured as fraction of all objects            varies
    λO       object arrival rate                                        varies
    λC       client arrival rate                                        varies

Table 4: Model structure and notation. These parameter settings reflect the values seen in our trace for large (>100 MB) objects.

Table 4 summarizes the parameters used in our model. We chose parameter values that reflect our trace measurements for large objects, since this component dominates bandwidth consumption. One parameter value that differs from the measured trace is the number of clients. In order to run a substantially larger number of experiments, we model a system with 1,000 clients rather than the roughly 7,000 large-object clients in our trace. We verified that the predictions of our model were not affected by this difference. To simplify our model, we also assumed that all objects in the system were of equal size.

Our model captures key aspects of our P2P file-sharing workload, in particular, the differences between file-sharing and Web workloads. In a Web workload, clients select objects from a Zipf distribution, P(x), in an independent and identically distributed fashion. In contrast, the object selection process in a file-sharing system depends on three factors: (1) the Zipf distribution, P(x), (2) the way in which new objects are inserted into that distribution, A(x), and (3) the clients' fetch-at-most-once behavior.

Our model generates requests as follows. On average, a client requests two objects per day, choosing which object to fetch from a Zipf probability distribution with parameter 1.0 ("Zipf(1)"). (Attempting to best-fit a Zipf curve to our measured non-Zipf distribution resulted in a Zipf parameter of 0.98.) We hypothesize that the underlying popularity of objects in a fetch-at-most-once file-sharing system is still driven by Zipf's law, even though the observed workload becomes non-Zipf because of fetch-at-most-once clients. In our model, subsequent requests from the same client obey distributions obtained by removing already fetched objects from the candidate object set and re-scaling so the total probability is 1.0. Given two previously unrequested objects, the ratio of the probabilities that the client will request these objects next is identical to their ratio in the original Zipf distribution. For fetch-repeatedly systems, each request is made according to the original Zipf distribution.
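This selection rule translates directly into code: exclude the client's already-fetched objects and renormalize the remaining Zipf(1) weights. The function below is an illustrative sketch of that step (parameter values as in Table 4), not the authors' implementation:

```python
import random

def next_request(weights, already_fetched, rng=random):
    """Choose a client's next object under the fetch-at-most-once rule.

    `weights[i]` is the Zipf(1) weight of the object at (0-based) rank i, i.e.
    1/(i+1). Objects in `already_fetched` are excluded; `random.choices`
    normalizes the remaining weights, which is exactly the re-scaling described
    above, so the probability ratio between any two unfetched objects matches
    the original Zipf distribution.
    """
    candidates = [obj for obj in range(len(weights)) if obj not in already_fetched]
    candidate_weights = [weights[obj] for obj in candidates]
    choice = rng.choices(candidates, weights=candidate_weights, k=1)[0]
    already_fetched.add(choice)
    return choice

# Example: one client issuing three requests over a 40,000-object population.
weights = [1.0 / rank for rank in range(1, 40001)]
fetched = set()
history = [next_request(weights, fetched) for _ in range(3)]
```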
When modeling fetch-at-most-once systems, λO > 0 is the object arrival rate. When an object is born in a fetch-at-most-once system, its popularity rank is determined by selecting randomly from the Zipf(1) distribution. Pre-existing objects of equal or lesser popularity are "pushed down" one Zipf position, and the resulting distribution is re-normalized so the total probability is again 1.0. In fetch-repeatedly systems, we set the object arrival rate to 0. Objects may be updated, but for simplicity we ignore the second-order effect of completely new objects on request behavior.

While our trace shows all requested objects, it cannot observe the total object population, since many available objects were never accessed. However, total object population is a key parameter of our model, as it influences the amount of overlap that will likely occur in requests from different clients. We therefore estimated a base value for total object population by back-inference: how many large objects are most likely to have existed in total, given that we saw about 18,000 distinct large objects requested in the trace? We find that a total population of about 40,000 large-media objects is consistent with the trace data; therefore, we use this number as the base value. This number is also comparable to statistics that describe commercial movie releases: the Internet Movie Database reports between 50,000 and 60,000 movie releases world-wide over the past 20 years [34].

To quantify file-sharing effectiveness, we use the hit rate that the aggregate workload experiences against a 100% available shared cache with LRU replacement, whose size we vary in each experiment. Selected experiments using optimal replacement showed no qualitative differences from LRU results, and quantitative differences varied by only a few percent. For Web (fetch-repeatedly) scenarios, we make the optimistic assumption that all objects are cachable and are never updated.

4.2 File-Sharing Effectiveness Diminishes with Client Age

Imagine an organization experiencing current demand for external bandwidth due to fetches of P2P file-sharing objects. How should this organization expect bandwidth demand to change over time, given a shared proxy cache? We address this question using the "static" case in which no new clients or objects arrive. This static analysis allows us to focus on one factor at a time (fetch-at-most-once behavior, in this case). We relax these assumptions in later subsections to show the impact of other factors, such as new object and client arrivals.
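This static experiment can be reproduced with a small simulation: clients issue λR = 2 requests per day under the fetch-at-most-once rule sketched earlier, every request passes through one always-available shared LRU cache of equal-sized objects, and the hit rate is recorded per simulated day. The code below is an illustrative sketch under those assumptions (constants drawn from Table 4 and Figure 8), not the simulator used in the paper:

```python
import bisect
import random
from collections import OrderedDict
from itertools import accumulate

def lru_hit_rates(num_clients=1000, num_objects=40000, cache_size=4000,
                  days=1000, requests_per_day=2, seed=0):
    """Daily hit rate of a shared LRU cache fed by fetch-at-most-once clients."""
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, num_objects + 1)]   # Zipf(1)
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    histories = [set() for _ in range(num_clients)]                # per-client fetched objects
    cache = OrderedDict()                                          # LRU order: oldest first
    daily_hit_rates = []

    def draw(history):
        # Rejection sampling: redraw until we hit an object this client has never fetched.
        while True:
            obj = bisect.bisect_left(cumulative, rng.random() * total)
            if obj not in history:
                history.add(obj)
                return obj

    for _ in range(days):
        hits = requests = 0
        for history in histories:
            for _ in range(requests_per_day):
                obj = draw(history)
                requests += 1
                if obj in cache:
                    hits += 1
                    cache.move_to_end(obj)                          # refresh LRU position
                else:
                    cache[obj] = None
                    if len(cache) > cache_size:
                        cache.popitem(last=False)                   # evict least recently used
        daily_hit_rates.append(hits / requests)
    return daily_hit_rates

# Removing the per-client history check (so clients may re-request objects)
# gives the fetch-repeatedly baseline, whose hit rate stays flat after warm-up.
```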
[Figure 8: File-sharing effectiveness diminishes with client age (hit rate versus days, for fetch-repeatedly and fetch-at-most-once request streams). After an initial cache warm-up period, the request stream generated by fetch-at-most-once client behavior experiences rapidly decreasing cache performance, while the stream generated by fetch-repeatedly clients remains stable. This effect is shown for various cache sizes (1000, 2000, and 4000 objects).]

Figure 8 shows hit rate against time for various shared cache sizes, assuming that at time zero no clients have fetched any objects. After a brief cache warm-up period, hit rate decreases as clients age, even for a cache that can hold 10% of all file-sharing objects. This is because fetch-at-most-once clients consume the most popular objects early. Later requests are selected with nearly uniform likelihood from an increasingly large number of objects. That is, the system evolves towards one in which there is no locality, and objects are chosen at random from a very large space.

This behavior suggests that if client request rates remain constant over time, the external bandwidth load they present increases, since more of their requests are directed to objects that are only available externally. Conversely, if we hope to have stable bandwidth demand over time, clients must behave in a way that reduces the intensity of their requests in a manner consistent with the shape of the hit-rate decreases shown in Figure 8.

The decrease in hit-rate over time is a strong property of fetch-at-most-once behavior. The underlying popularity distribution need not be heavy-tailed for it to occur. We have performed experiments using initial object popularity distributions that have higher locality (i.e., Zipf parameters larger than 1.0). In a fetch-repeatedly context, the larger skew of these distributions towards the most popular objects makes file sharing easier and hit rates rise. For fetch-at-most-once systems, hit rates start out much higher, but as clients age