SlideShare una empresa de Scribd logo
1 de 15
Descargar para leer sin conexión
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                    VOL. 23,     NO. 1,    JANUARY 2011                                          95




              Exploring Application-Level Semantics
                      for Data Compression
             Hsiao-Ping Tsai, Member, IEEE, De-Nian Yang, and Ming-Syan Chen, Fellow, IEEE

       Abstract—Natural phenomena show that many creatures form large social groups and move in regular patterns. However, previous
       works focus on finding the movement patterns of each single object or all objects. In this paper, we first propose an efficient distributed
       mining algorithm to jointly identify a group of moving objects and discover their movement patterns in wireless sensor networks.
       Afterward, we propose a compression algorithm, called 2P2D, which exploits the obtained group movement patterns to reduce the
       amount of delivered data. The compression algorithm includes a sequence merge and an entropy reduction phases. In the sequence
       merge phase, we propose a Merge algorithm to merge and compress the location data of a group of moving objects. In the entropy
       reduction phase, we formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that obtains the optimal
       solution. Moreover, we devise three replacement rules and derive the maximum compression ratio. The experimental results show that
       the proposed compression algorithm leverages the group movement patterns to reduce the amount of delivered data effectively and
       efficiently.

       Index Terms—Data compression, distributed clustering, object tracking.

                                                                                 Ç




                                                                                          t.c om
1    INTRODUCTION



                                                                                             om
R                                                                                       po t.c
     ECENT  advances in location-acquisition technologies,                           zebra, whales, and birds, form large social groups when
     such as global positioning systems (GPSs) and wireless
                                                                                      gs po
                                                                                     migrating to find food, or for breeding or wintering. These
                                                                                    lo s
                                                                                  .b og

sensor networks (WSNs), have fostered many novel                                     characteristics indicate that the trajectory data of multiple
                                                                                ts .bl



applications like object tracking, environmental monitoring,                         objects may be correlated for biological applications. More-
                                                                              ec ts
                                                                            oj c




and location-dependent service. These applications gener-                            over, some research domains, such as the study of animals’
                                                                          pr oje




ate a large amount of location data, and thus, lead to                               social behavior and wildlife migration [7], [8], are more
                                                                        re r
                                                                      lo rep




transmission and storage challenges, especially in resource-                         concerned with the movement patterns of groups of
                                                                    xp lo




                                                                                     animals, not individuals; hence, tracking each object is
                                                                  ee xp




constrained environments like WSNs. To reduce the data
                                                               .ie ee




volume, various algorithms have been proposed for data                               unnecessary in this case. This raises a new challenge of
                                                              w e




                                                                                     finding moving animals belonging to the same group and
                                                             w .i




compression and data aggregation [1], [2], [3], [4], [5], [6].
                                                            w w




                                                                                     identifying their aggregated group movement patterns.
                                                         :// w




However, the above works do not address application-level
                                                      tp //w




semantics, such as the group relationships and movement                              Therefore, under the assumption that objects with similar
                                                    ht ttp:




patterns, in the location data.                                                      movement patterns are regarded as a group, we define the
                                                     h




   In object tracking applications, many natural phenomena                           moving object clustering problem as given the movement
show that objects often exhibit some degree of regularity in                         trajectories of objects, partitioning the objects into non-
their movements. For example, the famous annual wild-                                overlapped groups such that the number of groups is
ebeest migration demonstrates that the movements of                                  minimized. Then, group movement pattern discovery is to
                                                                                     find the most representative movement patterns regarding
creatures are temporally and spatially correlated. Biologists
                                                                                     each group of objects, which are further utilized to
also have found that many creatures, such as elephants,
                                                                                     compress location data.
                                                                                        Discovering the group movement patterns is more
. H.-P. Tsai is with the Department of Electrical Engineering (EE), National         difficult than finding the patterns of a single object or all
  Chung Hsing University, No. 250, Kuo Kuang Road, Taichung 402,                     objects, because we need to jointly identify a group of objects
  Taiwan, ROC. E-mail: hptsai@nchu.edu.tw.
. D.-N. Yang is with the Institute of Information Science (IIS) and the
                                                                                     and discover their aggregated group movement patterns.
  Research Center of Information Technology Innovation (CITI), Academia              The constrained resource of WSNs should also be consid-
  Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan,             ered in approaching the moving object clustering problem.
  ROC. E-mail: dnyang@iis.sinica.edu.tw.                                             However, few of existing approaches consider these issues
. M.-S. Chen is with the Research Center of Information Technology
  Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2,                simultaneously. On the one hand, the temporal-and-spatial
  Nankang, Taipei 11529, Taiwan, and the Department of Electrical                    correlations in the movements of moving objects are
  Engineering (EE), Department of Computer Science and Information                   modeled as sequential patterns in data mining to discover
  Engineer (CSIE), and Graduate Institute of Communication Engineering
  (GICE), National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei          the frequent movement patterns [9], [10], [11], [12]. How-
  10617, Taiwan, ROC. E-mail: mschen@cc.ee.ntu.edu.tw.                               ever, sequential patterns 1) consider the characteristics of all
Manuscript received 22 Sept. 2008; revised 27 Feb. 2009; accepted 29 July            objects, 2) lack information about a frequent pattern’s
2009; published online 4 Feb. 2010.                                                  significance regarding individual trajectories, and 3) carry
Recommended for acceptance by M. Ester.                                              no time information between consecutive items, which make
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-2008-09-0497.                them unsuitable for location prediction and similarity
Digital Object Identifier no. 10.1109/TKDE.2010.30.                                  comparison. On the other hand, previous works, such as
                                               1041-4347/11/$26.00 ß 2011 IEEE       Published by the IEEE Computer Society
96                                         IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,     VOL. 23,   NO. 1,   JANUARY 2011


[13], [14], [15], measure the similarity among these entire        of the HIR problem based on Shannon’s theorem [17] and
trajectory sequences to group moving objects. Since objects        guarantees the reduction of entropy, which is convention-
may be close together in some types of terrain, such as            ally viewed as an optimization bound of compression
gorges, and widely distributed in less rugged areas, their         performance. As a result, our approach reduces the amount
group relationships are distinct in some areas and vague in        of delivered data and, by extension, the energy consumption
others. Thus, approaches that perform clustering among             in WSNs.
entire trajectories may not be able to identify the local group       Our contributions are threefold:
relationships. In addition, most of the above works are
                                                                       .   Different from previous works, we formulate a
centralized algorithms [9], [10], [11], [12], [13], [14], [15],
                                                                           moving object clustering problem that jointly iden-
which need to collect all data to a server before processing.
                                                                           tifies a group of objects and discovers their move-
Thus, unnecessary and redundant data may be delivered,
                                                                           ment patterns. The application-level semantics are
leading to much more power consumption because data
                                                                           useful for various applications, such as data storage
transmission needs more power than data processing in
                                                                           and transmission, task scheduling, and network
WSNs [5]. In [16], we have proposed a clustering algorithm
                                                                           construction.
to find the group relationships for query and data aggrega-
                                                                      . To approach the moving object clustering problem,
tion efficiency. The differences of [16] and this work are as
                                                                           we propose an efficient distributed mining algo-
follows: First, since the clustering algorithm itself is a
                                                                           rithm to minimize the number of groups such that
centralized algorithm, in this work, we further consider
                                                                           members in each of the discovered groups are highly
systematically combining multiple local clustering results                 related by their movement patterns.
into a consensus to improve the clustering quality and for            . We propose a novel compression algorithm to
use in the update-based tracking network. Second, when a                   compress the location data of a group of moving
delay is tolerant in the tracking application, a new data                  objects with or without loss of information. We
management approach is required to offer transmission                      formulate the HIR problem to minimize the entropy



                                                                                 t.c om
                                                                                    om
efficiency, which also motivates this study. We thus define                of location data and explore the Shannon’s theorem

                                                                               po t.c
the problem of compressing the location data of a group of                   gs po
                                                                           to solve the HIR problem. We also prove that the
                                                                           lo s
moving objects as the group data compression problem.                      proposed compression algorithm obtains the opti-
                                                                         .b og

   Therefore, in this paper, we first introduce our distrib-
                                                                       ts .bl


                                                                           mal solution of the HIR problem efficiently.
                                                                     ec ts




uted mining algorithm to approach the moving object
                                                                      The remainder of the paper is organized as follows: In
                                                                   oj c
                                                                 pr oje




clustering problem and discover group movement patterns.
                                                                   Section 2, we review related works. In Section 3, we provide
                                                               re r




Then, based on the discovered group movement patterns,
                                                             lo rep




                                                                   an overview of the network, location, and movement
we propose a novel compression algorithm to tackle the
                                                           xp lo




                                                                   models and formulate our problem. In Section 4, we
                                                         ee xp




group data compression problem. Our distributed mining             describe the distributed mining algorithm. In Section 5,
                                                      .ie ee




algorithm comprises a Group Movement Pattern Mining
                                                     w e




                                                                   we formulate the compression problems and propose our
                                                    w .i
                                                   w w




(GMPMine) and a Cluster Ensembling (CE) algorithms. It             compression algorithm. Section 6 details our experimental
                                                :// w




avoids transmitting unnecessary and redundant data by
                                             tp //w




                                                                   results. Finally, we summarize our conclusions in Section 7.
                                           ht ttp:




transmitting only the local grouping results to a base station
                                            h




(the sink), instead of all of the moving objects’ location data.
Specifically, the GMPMine algorithm discovers the local            2       RELATED WORK
group movement patterns by using a novel similarity                2.1 Movement Pattern Mining
measure, while the CE algorithm combines the local                 Agrawal and Srikant [18] first defined the sequential
grouping results to remove inconsistency and improve the           pattern mining problem and proposed an Apriori-like
grouping quality by using the information theory.                  algorithm to find the frequent sequential patterns. Han et
   Different from previous compression techniques that             al. consider the pattern projection method in mining
remove redundancy of data according to the regularity              sequential patterns and proposed FreeSpan [19], which is
within the data, we devise a novel two-phase and                   an FP-growth-based algorithm. Yang and Hu [9] developed
2D algorithm, called 2P2D, which utilizes the discovered           a new match measure for imprecise trajectory data and
group movement patterns shared by the transmitting node            proposed TrajPattern to mine sequential patterns. Many
and the receiving node to compress data. In addition to            variations derived from sequential patterns are used in
remove redundancy of data according to the correlations            various applications, e.g., Chen et al. [20] discover path
within the data of each single object, the 2P2D algorithm          traversal patterns in a Web environment, while Peng and
further leverages the correlations of multiple objects and         Chen [21] mine user moving patterns incrementally in a
their movement patterns to enhance the compressibility.            mobile computing system. However, sequential patterns
Specifically, the 2P2D algorithm comprises a sequence              and its variations like [20], [21] do not provide sufficient
merge and an entropy reduction phases. In the sequence             information for location prediction or clustering. First, they
merge phase, we propose a Merge algorithm to merge and             carry no time information between consecutive items, so
compress the location data of a group of objects. In the           they cannot provide accurate information for location
entropy reduction phase, we formulate a Hit Item Replace-          prediction when time is concerned. Second, they consider
ment (HIR) problem to minimize the entropy of the merged           the characteristics of all objects, which make the meaningful
data and propose a Replace algorithm to obtain the optimal         movement characteristics of individual objects or a group of
solution. The Replace algorithm finds the optimal solution         moving objects inconspicuous and ignored. Third, because
TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION                                                                 97


a sequential pattern lacks information about its significance
regarding to each individual trajectory, they are not fully
representative to individual trajectories. To discover sig-
nificant patterns for location prediction, Morzy mines
frequent trajectories whose consecutive items are also
adjacent in the original trajectory data [10], [22]. Meanwhile,
Giannotti et al. [11] extract T-patterns from spatiotemporal
data sets to provide concise descriptions of frequent
movements, and Tseng and Lin [12] proposed the TMP-
Mine algorithm for discovering the temporal movement
patterns. However, the above Apriori-like or FP-growth-
based algorithms still focus on discovering frequent              Fig. 1. (a) The hierarchical- and cluster-based network structure and the
patterns of all objects and may suffer from computing             data flow of an update-based tracking network. (b) A flat view of a two-
                                                                  layer network structure with 16 clusters.
efficiency or memory problems, which make them unsui-
table for use in resource-constrained environments.
                                                                  memory constraint of a sensor device. Summarization of the
2.2 Clustering                                                    original data by regression or linear modeling has been
Recently, clustering based on objects’ movement behavior          proposed for trajectory data compression [3], [6]. However,
has attracted more attention. Wang et al. [14] transform the      the above works do not address application-level semantics
location sequences into a transaction-like data on users and      in data, such as the correlations of a group of moving
based on which to obtain a valid group, but the proposed          objects, which we exploit to enhance the compressibility.
AGP and VG growth are still Apriori-like or FP-growth-
based algorithms that suffer from high computing cost and         3    PRELIMINARIES



                                                                                 t.c om
memory demand. Nanni and Pedreschi [15] proposed a




                                                                                    om
                                                                  3.1 Network and Location Models

                                                                               po t.c
density-based clustering algorithm, which makes use of an
optimal time interval and the average euclidean distance                     gs po
                                                                  Many researchers believe that a hierarchical architecture
                                                                           lo s
                                                                  provides better coverage and scalability, and also extends
                                                                         .b og

between each point of two trajectories, to approach the
                                                                       ts .bl


trajectory clustering problem. However, the above works           the network lifetime of WSNs [25], [26]. In a hierarchical
                                                                     ec ts




                                                                  WSN, such as that proposed in [27], the energy, computing,
                                                                   oj c




discover global group relationships based on the proportion
                                                                 pr oje




of the time a group of users stay close together to the whole     and storage capacity of sensor nodes are heterogeneous. A
                                                               re r
                                                             lo rep




time duration or the average euclidean distance of the entire     high-end sophisticated sensor node, such as Intel Stargate
                                                           xp lo




trajectories. Thus, they may not be able to reveal the local      [28], is assigned as a cluster head (CH) to perform high
                                                         ee xp




                                                                  complexity tasks; while a resource-constrained sensor node,
                                                      .ie ee




group relationships, which are required for many applica-
                                                                  such as Mica2 mote [29], performs the sensing and low
                                                     w e




tions. In addition, though computing the average euclidean
                                                    w .i
                                                   w w




distance of two geometric trajectories is simple and useful,      complexity tasks. In this work, we adopt a hierarchical
                                                :// w
                                             tp //w




the geometric coordinates are expensive and not always            network structure with K layers, as shown in Fig. 1a, where
                                           ht ttp:




available. Approaches, such as EDR, LCSS, and DTW, are            sensor nodes are clustered in each level and collaboratively
                                            h




widely used to compute the similarity of symbolic trajectory      gather or relay remote information to a base station called a
sequences [13], but the above dynamic programming                 sink. A sensor cluster is a mesh network of n  n sensor
approaches suffer from scalability problem [23]. To provide       nodes handled by a CH and communicate with each other
scalability, approximation or summarization techniques are        by using multihop routing [30]. We assume that each node
used to represent original data. Guralnik and Karypis [23]        in a sensor cluster has a locally unique ID and denote the
project each sequence into a vector space of sequential           sensor IDs by an alphabet Æ. Fig. 1b shows an example of a
patterns and use a vector-based K-means algorithm to cluster      two-layer tracking network, where each sensor cluster
objects. However, the importance of a sequential pattern          contains 16 nodes identified by Æ ¼ fa, b; . . . ; pg.
regarding individual sequences can be very different, which          In this work, an object is defined as a target, such as an
is not considered in this work. To cluster sequences, Yang        animal or a bird, that is recognizable and trackable by the
and Wang proposed CLUSEQ [24], which iteratively identi-          tracking network. To represent the location of an object,
fies a sequence to a learned model, yet the generated clusters    geometric models and symbolic models are widely used
may overlap which differentiates their problem from ours.         [31]. A geometric location denotes precise two-dimension or
                                                                  three-dimension coordinates; while a symbolic location
2.3 Data Compression                                              represents an area, such as the sensing area of a sensor node
Data compression can reduce the storage and energy                or a cluster of sensor nodes, defined by the application. Since
consumption for resource-constrained applications. In [1],        the accurate geometric location is not easy to obtain and
distributed source (Slepian-Wolf) coding uses joint entropy       techniques like the Received Signal Strength (RSS) [32] can
to encode two nodes’ data individually without sharing any        simply estimate an object’s location based on the ID of the
data between them; however, it requires prior knowledge of        sensor node with the strongest signal, we employ a symbolic
cross correlations of sources. Other works, such as [2], [4],     model and describe the location of an object by using the ID
combine data compression with routing by exploiting cross         of a nearby sensor node.
correlations between sensor nodes to reduce the data size.           Object tracking is defined as a task of detecting a moving
In [5], a tailed LZW has been proposed to address the             object’s location and reporting the location data to the sink
98                                               IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                VOL. 23,   NO. 1,   JANUARY 2011




Fig. 2. An example of an object’s moving trajectory, the obtained location
sequence, and the generated PST T .

                                                                             Fig. 3. The predict_next algorithm.
periodically at a time interval. Hence, an observation on an
object is defined by the obtained location data. We assume                   algorithm1 learns from a location sequence and generates a
that sensor nodes wake up periodically to detect objects.                    PST whose height is limited by a specified parameter Lmax .
When a sensor node wakes up on its duty cycle and detects                    Each node of the tree represents a significant movement
an object of interest, it transmits the location data of the                 pattern s whose occurrence probability is above a specified
object to its CH. Here, the location data include a times                    minimal threshold Pmin . It also carries the conditional
tamp, the ID of an object, and its location. We also assume                  empirical probabilities P ðjsÞ for each  in Æ that we use
that the targeted applications are delay-tolerant. Thus,                     in location prediction. Fig. 2 shows an example of a location
instead of forwarding the data upward immediately, the



                                                                                        t.c om
                                                                             sequence and the corresponding PST. nodef is one of the



                                                                                           om
CH compresses the data accumulated for a batch period and

                                                                                      po t.c
                                                                             children of the root node and represents a significant
sends it to the CH of the upper layer. The process is repeated                      gs po
                                                                             movement pattern with }f} whose occurrence probability
                                                                                  lo s
until the sink receives the location data. Consequently, the
                                                                                .b og

                                                                             is above 0.01. The conditional empirical probabilities are
                                                                              ts .bl


trajectory of a moving object is thus modeled as a series of                 P ð0 b0 j}f}Þ ¼ 0:33, P ð0 e0 j}f}Þ ¼ 0:67, and P ðj}f}Þ ¼ 0 for the
                                                                            ec ts




observations and expressed by a location sequence, i.e., a
                                                                          oj c




                                                                             other  in Æ.
                                                                        pr oje




sequence of sensor IDs visited by the object. We denote a                        PST is frequently used in predicting the occurrence
                                                                      re r
                                                                    lo rep




location sequence by S ¼ s0 s1 . . . sLÀ1 , where each item si is            probability of a given sequence, which provides us
                                                                  xp lo




a symbol in Æ and L is the sequence length. An example of                    important information in similarity comparison. The occur-
                                                                ee xp




an object’s trajectory and the obtained location sequence is
                                                             .ie ee




                                                                             rence probability of a sequence s regarding to a PST T ,
shown in the left of Fig. 2. The tracking network tracks
                                                            w e




                                                                             denoted by P T ðsÞ, is the prediction of the occurrence
                                                           w .i
                                                          w w




moving objects for a period and generates a location
                                                       :// w




                                                                             probability of s based on T . For example, the occurrence
                                                    tp //w




sequence data set, based on which we discover the group                      probability P T ð}nokjfb}Þ is computed as follows:
                                                  ht ttp:




relationships of the moving objects.
                                                   h




                                                                             P T ð}nokjfb}Þ ¼ P T ð}n}Þ Â P T ð0 o0 j}n}Þ Â P T ð0 k0 j}no}Þ
3.2    Variable Length Markov Model (VMM) and
       Probabilistic Suffix Tree (PST)                                                           Â P T ð0 j0 j}nok}Þ Â P T ð0 f 0 j}nokj}Þ
If the movements of an object are regular, the object’s next                                     Â P T ð0 b0 j}nokjf}Þ
location can be predicted based on its preceding locations.
                                                                                              ¼ P T ð}n}Þ Â P T ð0 o0 j}n}Þ Â P T ð0 k0 j}o}Þ
We model the regularity by using the Variable Length
Markov Model (VMM). Under the VMM, an object’s move-                                             Â P T ð0 j0 j}k}Þ Â P T ð0f 0 j}j}Þ Â P T ð0 b0 j}okjf}Þ
ment is expressed by a conditional probability distribution                                   ¼ 0:05 Â 1 Â 1 Â 1 Â 1 Â 0:3 ¼ 0:0165:
over Æ. Let s denote a pattern which is a subsequence of a
location sequence S and  denote a symbol in Æ. The                             PST is also useful and efficient in predicting the next
conditional probability P ðjsÞ is the occurrence probability                item of a sequence. For a given sequence s and a PST T , our
that  will follow s in S. Since the length of s is floating, the            predict_next algorithm as shown in Fig. 3 outputs the most
VMM provides flexibility to adapt to the variable length of                  probable next item, denoted by predict nextðT ; sÞ. We
movement patterns. Note that when a pattern s occurs more                    demonstrate its efficiency by an example as follow: Given
frequently, it carries more information about the movements                  s ¼ }nokjf} and T shown in Fig. 2, the predict_next
of the object and is thus more desirable for the purpose of                  algorithm traverses the tree to the deepest node nodeokjf
prediction. To find the informative patterns, we first define a              along the path including noderoot , nodef , nodejf , nodekjf , and
pattern as a significant movement pattern if its occurrence                  nodeokjf . Finally, symbol 0 e0 , which has the highest condi-
probability is above a minimal threshold.                                    tional empirical probability in nodeokjf , is returned, i.e.,
    To learn the significant movement patterns, we adapt                     predict nextðT ; }nokjf}Þ ¼ 0 e0 . The algorithm’s computa-
Probabilistic Suffix Tree (PST) [33] for it has the lowest                   tional overhead is limited by the height of a PST so that it
storage requirement among many VMM implementations                           is suitable for sensor nodes.
[34]. PST’s low complexity, i.e., OðnÞ in both time and space
                                                                                1. The PST building algorithm is given in Appendix A, which can
[35], also makes it more attractive especially for streaming or              be found on the Computer Society Digital Library at http://doi.
resource-constrained environments [36]. The PST building                     ieeecomputersociety.org/10.1109/TKDE.2010.30.
TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION                                                                                                                     99


3.3 Problem Description                                                           the similarity of two data streams only on the changed
We formulate the problem of this paper as exploring the                           mature nodes of emission trees [36], instead of all nodes.
group movement patterns to compress the location                                     To combine multiple local grouping results into a
sequences of a group of moving objects for transmission                           consensus, the CE algorithm utilizes the Jaccard similarity
efficiency. Consider a set of moving objects O ¼                                  coefficient to measure the similarity between a pair of
fo1 ; o2 ; . . . ; on g and their associated location sequence data               objects, and normalized mutual information (NMI) to
set S ¼ fS1 ; S2 ; . . . ; Sn g.                                                  derive the final ensembling result. It trades off the grouping
                                                                                  quality against the computation cost by adjusting a partition
Definition 1. Two objects are similar to each other if their
                                                                                  parameter. In contrast to approaches that perform cluster-
  movement patterns are similar. Given the similarity measure
                                                                                  ing among the entire trajectories, the distributed algorithm
  function simp 2 and a minimal threshold simmin , oi and oj are
                                                                                  discovers the group relationships in a distributed manner
  similar if their similarity score simp ðoi ; oj Þ is above simmin , i.e.,
                                                                                  on sensor nodes. As a result, we can discover group
  simp ðoi ; oj Þ ! simmin . The set of objects that are similar to oi is
                                                                                  movement patterns to compress the location data in the
  denoted by soðoi Þ ¼ foj j8oj 2 O; simp ðoi ; oj Þ ! simmin g.
                                                                                  areas where objects have explicit group relationships.
Definition 2. A set of objects is recognized as a group if they are               Besides, the distributed design provides flexibility to take
  highly similar to one another. Let g denote a set of objects. g is a            partial local grouping results into ensembling when the
  group if every object in g is similar to at least a threshold of                group relationships of moving objects in a specified
  objects in g, i.e., 8oi 2 g, jsoðogijÞgj ! 
, where 
 is with default
                                   j                                              subregion are interested. Also, it is especially suitable for
  value 1 .3
        2                                                                         heterogeneous tracking configurations, which helps reduce
                                                                                  the tracking cost, e.g., instead of waking up all sensors at
   We formally define the moving object clustering problem                        the same frequency, a fine-grained tracking interval is
as follows: Given a set of moving objects O together with                         specified for partial terrain in the migration season to
their associated location sequence data set S and a minimal                       reduce the energy consumption. Rather than deploying the




                                                                                            t.c om
similarity threshold simmin , the moving object clustering




                                                                                               om
                                                                                  sensors in the same density, they are only highly concen-

                                                                                          po t.c
problem is to partition O into nonoverlapped groups,                              trated in areas of interest to reduce deployment costs.
denoted by G ¼ fg1 ; g2 ; . . . ; gi g, such that the number of                         gs po
                                                                                      lo s
                                                                                    .b og

groups is minimized, i.e., jGj is minimal. Thereafter, with                       4.1   The Group Movement Pattern Mining
                                                                                  ts .bl


the discovered group information and the obtained group                                 (GMPMine) Algorithm
                                                                                ec ts
                                                                              oj c




movement patterns, the group data compression problem is                          To provide better discrimination accuracy, we propose a
                                                                            pr oje




to compress the location sequences of a group of moving                           new similarity measure simp to compare the similarity of
                                                                          re r
                                                                        lo rep




objects for transmission efficiency. Specifically, we formu-                      two objects. For each of their significant movement patterns,
                                                                      xp lo




late the group data compression problem as a merge                                the new similarity measure considers not merely two
                                                                    ee xp
                                                                 .ie ee




problem and an HIR problem. The merge problem is to                               probability distributions but also two weight factors, i.e.,
                                                                w e




combine multiple location sequences to reduce the overall                         the significance of the pattern regarding to each PST. The
                                                               w .i
                                                              w w




sequence length, while the HIR problem targets to minimize                        similarity score simp of oi and oj based on their respective
                                                           :// w
                                                        tp //w




the entropy of a sequence such that the amount of data is                         PSTs, Ti and Tj , is defined as follows:
                                                      ht ttp:




reduced with or without loss of information.
                                                                                                                   P qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
                                                                                                                                   P
                                                       h




                                                                                                                                                    Ti                    Tj            2
                                                                                                                        s2S e           2Æ ðP ðsÞ À P ðsÞÞ
                                                                                    simp ðoi ; oj Þ ¼ À log                                                pffiffiffi                           ; ð1Þ
4    MINING OF GROUP MOVEMENT PATTERNS                                                                                                     2Lmax þ 2
To tackle the moving object clustering problem, we propose                                 e
                                                                                  where S denotes the union of significant patterns (node
a distributed mining algorithm, which comprises the                               strings) on the two trees. The similarity score simp includes
GMPMine and CE algorithms. First, the GMPMine algo-                               the distance associated with a pattern s, defined as
rithm uses a PST to generate an object’s significant move-                                    qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
                                                                                                 X
ment patterns and computes the similarity of two objects by                          dðsÞ ¼                  ðP Ti ðsÞ À P Tj ðsÞÞ2
                                                                                                      2Æ
using simp to derive the local grouping results. The merits                                   qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
                                                                                                 X
of simp include its accuracy and efficiency: First, simp                                  ¼           2Æ
                                                                                                             ðP Ti ðsÞ Â P Ti ðjsÞ À P Tj ðsÞ Â P Tj ðjsÞÞ2 ;
considers the significances of each movement pattern
regarding to individual objects so that it achieves better                        where dðsÞ is the euclidean distance associated with a
accuracy in similarity comparison. For a PST can be used to                       pattern s over Ti and Tj .
predict a pattern’s occurrence probability, which is viewed                           For a pattern s 2 T , P T ðsÞ is a significant value because
as the significance of the pattern regarding the PST, simp                        the occurrence probability of s is higher than the minimal
thus includes movement patterns’ predicted occurrence                             support Pmin . If oi and oj share the pattern s, we have s 2 Ti
probabilities to provide fine-grained similarity comparison.                      and s 2 Tj , respectively, such that P Ti ðsÞ and P Tj ðsÞ are non-
Second, simp can offer seamless and efficient comparison                          negligible and meaningful in the similarity comparison.
for the applications with evolving and evolutionary                               Because the conditional empirical probabilities are also parts
similarity relationships. This is because simp can compare                        of a pattern, we consider the conditional empirical prob-
                                                                                  abilities P T ðjsÞ when calculating the distance between two
   2. simp is to be defined in Section 4.1.                                                                                       e
                                                                                  PSTs. Therefore, we sum dðsÞ for all s 2 S as the distance
   3. In this work, we set the threshold to 1 , as suggested in [37], and leave
                                            2
the similarity threshold as the major control parameter because training the      between two PSTs. Note that the distance betweenffiffiffitwo PSTs
                                                                                                                                        p
similarity threshold of two objects is easier than that of a group of objects.    is normalized by its maximal value, i.e., 2Lmax þ 2. We take
100                                         IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,          VOL. 23,   NO. 1,   JANUARY 2011


the negative log of the distance between two PSTs as the
similarity score such that a larger value of the similarity
score implies a stronger similar relationship, and vice versa.
With the definition of similarity score, two objects are similar
to each other if their score is above a specified similarity
threshold.
    The GMPMine algorithm includes four steps. First, we             Fig. 4. Design of the two-phase and 2D compression algorithm.
extract the movement patterns from the location sequences
by learning a PST for each object. Second, our algorithm             ½0; 1Š such that a finer-grained configuration of D achieves a
constructs an undirected, unweighted similarity graph                better grouping quality but in a higher computation cost.
where similar objects share an edge between each other.              Therefore, for a set of thresholds D, we rewrite our objective
                                                                                                     P
We model the density of group relationship by the                    function as G0 ¼ argmaxG;2D K NMIðGi ; G Þ.
                                                                                                       i¼1
connectivity of a subgraph, which is also defined as the                 The algorithm includes three steps. First, we utilize
minimal cut of the subgraph. When the ratio of the                   Jaccard Similarity Coefficient [40] as the measure of the
connectivity to the size of the subgraph is higher than a            similarity for each pair of objects. Second, for each  2 D,
threshold, the objects corresponding to the subgraph are             we construct a graph where two objects share an edge if
identified as a group. Since the optimization of the graph           their Jaccard Similarity Coefficient is above . Our algo-
partition problem is intractable in general, we bisect the           rithm partitions the objects to generate a partitioning result
similarity graph in the following way. We leverage the HCS           G . Third, we select the ensembling result G0 . Because of
cluster algorithm [37] to partition the graph and derive the         space limitations, we only demonstrate the CE mining
location group information. Finally, we select a group PST           algorithm with an example in Appendix C, which can be
Tg for each group in order to conserve the memory space by           found on the Computer Society Digital Library at http://
                                                   P                 doi.ieeecomputersociety.org/10.1109/TKDE.2010.30. In the
using the formula expressed as T g ¼ argmaxT 2T s2S P T ðsÞ,



                                                                                   t.c om
                                                                     next section, we propose our compression algorithm that




                                                                                      om
where S denotes sequences of a group of objects and T
                                                                     leverages the obtained group movement patterns.

                                                                                 po t.c
denotes their PSTs. Due to the page limit, we give an
illustrative example of the GMPMine algorithm in Appen-
                                                                               gs po
                                                                             lo s
                                                                           .b og

dix B, which can be found on the Computer Society Digital            5   DESIGN OF A COMPRESSION ALGORITHM WITH
                                                                         ts .bl



Library at http://doi.ieeecomputersociety.org/10.1109/                   GROUP MOVEMENT PATTERNS
                                                                       ec ts
                                                                     oj c
                                                                   pr oje




TKDE.2010.30.                                                        A WSN is composed of a large number of miniature sensor
                                                                 re r
                                                               lo rep




4.2 The Cluster Ensembling (CE) Algorithm                            nodes that are deployed in a remote area for various
                                                             xp lo




                                                                     applications, such as environmental monitoring or wildlife
In the previous section, each CH collects location data and
                                                           ee xp




                                                                     tracking. These sensor nodes are usually battery-powered
                                                        .ie ee




generates local group results with the proposed GMPMine
                                                                     and recharging a large number of them is difficult. There-
                                                       w e




algorithm. Since the objects may pass through only partial
                                                      w .i




                                                                     fore, energy conservation is paramount among all design
                                                     w w




sensor clusters and have different movement patterns in
                                                  :// w




                                                                     issues in WSNs [41], [42]. Because the target objects are
                                               tp //w




different clusters, the local grouping results may be
                                                                     moving, conserving energy in WSNs for tracking moving
                                             ht ttp:




inconsistent. For example, if objects in a sensor cluster walk
                                                                     objects is more difficult than in WSNs that monitor immobile
                                              h




close together across a canyon, it is reasonable to consider
                                                                     phenomena, such as humidity or vibrations. Hence, pre-
them a group. In contrast, objects scattered in grassland
                                                                     vious works, such as [43], [44], [45], [46], [47], [48], especially
may not be identified as a group. Furthermore, in the case
                                                                     consider movement characteristics of moving objects in their
where a group of objects moves across the margin of a                designs to track objects efficiently. On the other hand, since
sensor cluster, it is difficult to find their group relationships.   transmission of data is one of the most energy expensive
Therefore, we propose the CE algorithm to combine                    tasks in WSNs, data compression is utilized to reduce the
multiple local grouping results. The algorithm solves the            amount of delivered data [1], [2], [3], [4], [5], [6], [49], [50],
inconsistency problem and improves the grouping quality.             [51], [52]. Nevertheless, few of the above works have
    The ensembling problem involves finding the partition of         addressed the application-level semantics, i.e., the tempor-
all moving objects O that contains the most information about        al-and-spatial correlations of a group of moving objects.
the local grouping results. We utilize NMI [38], [39] to                Therefore, to reduce the amount of delivered data, we
evaluate the grouping quality. Let C denote the ensemble             propose the 2P2D algorithm which leverages the group
of the local grouping results, represented as C ¼ fG0 ;              movement patterns derived in Section 4 to compress the
                                                                     location sequences of moving objects elaborately. As shown
G1 ; . . . ; GK g, where K denotes the ensemble size. Our goal
                                                                     in Fig. 4, the algorithm includes the sequence merge phase
is to discover the ensembling result G0 that contains the most       and the entropy reduction phase to compress location
                                                P
information about C, i.e., G0 ¼ argmax e K NMIðGi ; GÞ,
                                                  i¼1                sequences vertically and horizontally. In the sequence
                                            G2G
            e
where G denotes all possible ensembling results.                     merge phase, we propose the Merge algorithm to compress
                                            e
    However, enumerating every G 2 G in order to find the            the location sequences of a group of moving objects. Since
                                 0
optimal ensembling result G is impractical, especially in the        objects with similar movement patterns are identified as a
resource-constrained environments. To overcome this diffi-           group, their location sequences are similar. The Merge
culty, the CE algorithm trades off the grouping quality              algorithm avoids redundant sending of their locations, and
against the computation cost by adjusting the partition              thus, reduces the overall sequence length. It combines the
parameter D, i.e., a set of thresholds with values in the range      sequences of a group of moving objects by 1) trimming
TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION                                                          101


multiple identical symbols at the same time interval into a
single symbol or 2) choosing a qualified symbol to represent
them when a tolerance of loss of accuracy is specified by the
application. Therefore, the algorithm trims and prunes
more items when the group size is larger and the group
relationships are more distinct. Besides, in the case that only
the location center of a group of objects is of interest, our
approach can find the aggregated value in the phase,
instead of transmitting all location sequences back to the        Fig. 5. An example of the Merge algorithm. (a) Three sequences with
                                                                  high similarity. (b) The merged sequence S 00 .
sink for postprocessing.
   In the entropy reduction phase, we propose the Replace
                                                                  S-column into a single symbol. Specifically, given a group
algorithm that utilizes the group movement patterns as the
                                                                  of n sequences, the items of an S-column are replaced by a
prediction model to further compress the merged sequence.
                                                                  single symbol, whereas the items of a D-column are
The Replace algorithm guarantees the reduction of a
                                                                  wrapped up between two 0 =0 symbols. Finally, our algo-
sequence’s entropy, and consequently, improves compres-
                                                                  rithm generates a merged sequence containing the same
sibility without loss of information. Specifically, we define a
                                                                  information of the original sequences. In decompressing
new problem of minimizing the entropy of a sequence as
                                                                  from the merged sequence, while symbol 0 =0 is encountered,
the HIR problem. To reduce the entropy of a location
                                                                  the items after it are output until the next 0 =0 symbol.
sequence, we explore Shannon’s theorem [17] to derive
                                                                  Otherwise, for each item, we repeat it n times to generate
three replacement rules, based on which the Replace
                                                                  the original sequences. Fig. 5b shows the merged sequence
algorithm reduces the entropy efficiently. Also, we prove
                                                                  S 00 whose length is decreased from 60 items to 48 items such
that the Replace algorithm obtains the optimal solution of
                                                                  that 12 items are conserved. The example pointed out that
the HIR problem as Theorem 1. In addition, since the objects



                                                                                 t.c om
                                                                  our approach can reduce the amount of data without loss of




                                                                                    om
may enter and leave a sensor cluster multiple times during

                                                                               po t.c
a batch period and a group of objects may enter and leave a       information. Moreover, when there are more S-columns,
                                                                             gs po
                                                                  our approach can bring more benefit.
cluster at slightly different times, we discuss the segmenta-
                                                                           lo s
                                                                         .b og

                                                                       When a little loss in accuracy is tolerant, representing
tion and alignment problems in Section 5.3. Table 1
                                                                       ts .bl


                                                                  items in a D-column by an qualified symbol to generate
summaries the notations.
                                                                     ec ts




                                                                  more S-columns can improve the compressibility. We
                                                                   oj c
                                                                 pr oje




5.1 Sequence Merge Phase                                          regulate the accuracy by an error bound, defined as the
                                                               re r
                                                             lo rep




In the application of tracking wild animals, multiple             maximal hop count between the real and reported locations
                                                           xp lo




moving objects may have group relationships and share             of an object. For example, replacing items 2 and 6 of S2 by 0 g0
                                                         ee xp




                                                                  and 0 p0 , respectively, creates two more S-columns, and thus,
                                                      .ie ee




similar trajectories. In this case, transmitting their location
                                                     w e




data separately leads to redundancy. Therefore, in this           results in a shorter merged sequence with 42 items. To
                                                    w .i
                                                   w w




section, we concentrate on the problem of compressing             select a representative symbol for a D-column, we includes
                                                :// w
                                             tp //w




multiple similar sequences of a group of moving objects.          a selection criterion to minimize the average deviation
                                           ht ttp:




   Consider an illustrative example in Fig. 5a, where three       between the real locations and reported locations for a
                                            h




location sequences S0 , S1 , and S2 represent the trajectories    group of objects at each time interval as follows:
of a group of three moving objects. Items with the same
index belong to a column, and a column containing                 Selection criterion. The maximal distance between the reported
identical symbols is called an S-column; otherwise, the             location and the real location is below a specified error
bound
column is called a D-column. Since sending the items in an          eb, i.e., when the ith column is a D-column,
Sj ½iŠ À
eb
S-column individually causes redundancy, our basic idea             must hold for 0 j  n, where n is the number of sequences.
of compressing multiple sequences is to trim the items in an


                                                            TABLE 1
                                                  Description of the Notations
102                                          IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,             VOL. 23,    NO. 1,   JANUARY 2011




                                                                       Fig. 7. An example of the Merge algorithm: the merged sequence with
                                                                       eb ¼ 1.

                                                                                jfs js ¼ ;0 jLgj
                                                                       as pi ¼ j j L    i
                                                                                                 , and L ¼ jS j. According to Shannon’s
                                                                       source coding theorem, the optimal code length for symbol
                                                                       i is log2 pi , which is also called the information content of i .
                                                                       Information entropy (or Shannon’s entropy) is thereby
                                                                       defined as the overall sum of the information content over Æ,
                                                                                                                     X
                                                                               eðSÞ ¼ eðp0 ; p1 ; . . . ; pjÆjÀ1 Þ ¼   Àpi  log2 pi : ð2Þ
                                                                                                                0 ijÆj

                                                                       Shannon’s entropy represents the optimal average code
                                                                       length in data compression, where the length of a symbol’s
Fig. 6. The Merge algorithm.                                           codeword is proportional to its information content [17].
                                                                            A property of Shannon’s entropy is that the entropy is
   Therefore, to compress the location sequences for a                 the maximum, while all probabilities are of the same value.
group of moving objects, we propose the Merge algorithm                Thus, in the case without considering the regularity in the
shown in Fig. 6. The input of the algorithm contains a group           movements of objects, the occurrence probabilities of
of sequences fSi j0 i  ng and an error bound eb, while
                                                                       symbols are assumed to be uniform such that the entropy
the output is a merged sequence that represents the group



                                                                                    t.c om
                                                                       of a location sequence as well as the compression ratio is




                                                                                       om
of sequences. Specifically, the Merge algorithm processes

                                                                                  po t.c
the sequences in a columnwise way. For each column, the                fixed and dependent on the size of a sensor cluster, e.g., for
                                                                                gs po
                                                                       a location sequence with length D collected in a sensor
algorithm first checks whether it is an S-column. For an S-
                                                                              lo s
                                                                            .b og

                                                                                                                                     1
column, it retains the value of the items as Lines 4-5.                cluster of 16 nodes, the entropy of the sequence is eð16 ,
                                                                          ts .bl


                                                                        1           1
Otherwise, while an error bound eb  0 is specified, a                 16 ; . . . ; 16Þ ¼ 4. Consequently, 4D bits are needed to repre-
                                                                        ec ts
                                                                      oj c




representative symbol is selected according to the selection           sent the location sequence. Nevertheless, since the move-
                                                                    pr oje




criterion as Line 7. If a qualified symbol exists to represent         ments of a moving object are of some regularity, the
                                                                  re r
                                                                lo rep




the column, the algorithm outputs it as Lines 15-18.                   occurrence probabilities of symbols are probably skewed
                                                              xp lo




Otherwise, the items in the column are retained and                    and the entropy is lower. For example, if two probabilities
                                                            ee xp
                                                         .ie ee




                                                                                       1      3
wrapped by a pair of “/” as Lines 9-13. The process repeats            become 32 and 32 , the entropy is reduced to 3.97. Thus,
                                                        w e




until all columns are examined. Afterward, the merged                  instead of encoding the sequence using 4 bits per symbol,
                                                       w .i
                                                      w w




sequence S 00 is generated.
                                                   :// w




                                                                       only 3:97D bits are required for the same information. This
                                                tp //w




   Following the example shown in Fig. 5a, with a specified            example points out that when a moving object exhibits
                                              ht ttp:




error bound eb as 1, the Merge algorithm generates the                 some degree of regularity in their movements, the skewness
                                               h




solution, as shown in Fig. 7; the merged sequence contains             of these probabilities lowers the entropy. Seeing that data
only 20 items, i.e., 40 items are curtailed.                           with lower entropy require fewer bits to represent the same
5.2 Entropy Reduction Phase                                            information, reducing the entropy thereby benefits for data
                                                                       compression and, by extension, storage and transmission.
In the entropy reduction phase, we propose the Replace
                                                                            Motivated by the above observation, we design the
algorithm to minimize the entropy of the merged sequence
                                                                       Replace algorithm to reduce the entropy of a location
obtained in the sequence merge phase. Since data with
                                                                       sequence. Our algorithm imposes the hit symbols on the
lower entropy require fewer bits for storage and transmis-
                                                                       location sequence to increase the skewness. Specifically, the
sion [17], we replace some items to reduce the entropy
                                                                       algorithm uses the group movement patterns built in both
without loss of information. The object movement patterns
                                                                       the transmitter (CH) and the receiver (sink) as the
discovered by our distributed mining algorithm enable us
to find the replaceable items and facilitate the selection of          prediction model to decide whether an item of a sequence
items in our compression algorithm. In this section, we first          is predictable. A CH replaces the predictable items each
introduce and define the HIR problem, and then, explore                with a hit symbol to reduce the location sequence’s entropy
the properties of Shannon’s entropy to solve the HIR                   when compressing it. After receiving the compressed
problem. We extend the concentration property for entropy              sequence, the sink node decompresses it and substitutes
reduction and discuss the benefits of replacing multiple               every hit symbol with the original symbol by the identical
symbols simultaneously. We derive three replacement rules              prediction model, and no information loss occurs.
for the HIR problem and prove that the entropy of the                       Here, an item si of a sequence S is predictable item if
obtained solution is minimized.                                        predict nextðT ; S½0::i À 1ŠÞ is the same value as si . A symbol
                                                                       is a predictable symbol once an item of the symbol is
5.2.1 Three Derived Replacement Rules                                  predictable. For ease of explanation, we use a taglst4 to
Let P ¼ fp0 , p1 ; . . . ; pjÆjÀ1 g denote the probability distribu-
                                                                          4. A taglst associated with a sequence S is a sequence of 0s and 1s
tion of the alphabet Æ corresponding to a location sequence            obtained as follows: For those predictable items in S, the corresponding
S, where pi is the occurrence probability of i in Æ, defined          items in taglst are set as 1. Otherwise, their values in taglst are 0.
TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION                                                                                     103




Fig. 8. An example of the simple method. The first row represents the index, and the second row is the location sequence. The third row is the taglst
which represents the prediction results. The last row is the result of the simple approach. Note that items 1, 2, 7, 9, 11, 12, 15, and 22 are predictable
items such that the set of predictable symbols is s ¼ fk, a, j, f, og. In the example, items 2, 9, and 10 are items of 0 o0 such that the number of items of
                                                   ^
symbol 0 o0 is nð0 o0 Þ ¼ 3, whereas items 2 and 9 are the predictable items of 0 o0 , and the number of predictable items of symbol 0 o0 is nhit ð0 o0 Þ ¼ 2.




Fig. 9. Problems of the simple approach.

express whether each item of S is predictable in the                            fifth, called the strong concentration property, is derived
following sections. Consider the illustrative example shown                     and proved in the paper.5
in Fig. 8, the probability distribution of Æ corresponding to



                                                                                          t.c om
                                                                                Property 1 (Expansibility). Adding a probability with a value




                                                                                             om
S is P ¼ f0:04; 0:16; 0:04; 0; 0; 0:08; 0:04; 0; 0:04; 0:12; 0:24; 0;

                                                                                        po t.c
                                                                                  of zero does not change the entropy, i.e., eðp0 , p1 ; . . . ; pjÆjÀ1 ,
0; 0:12; 0:12; 0g, and items 1, 2, 7, 9, 11, 12, 15, and 22 are                       gs po
                                                                                  pjÆj Þ is identical to eðp0 , p1 ; . . . ; pjÆjÀ1 Þ when pjÆj is zero.
                                                                                    lo s
predictable items. To reduce the entropy of the sequence, a
                                                                                  .b og
                                                                                ts .bl


simple approach is to replace each of the predictable items
                                                                                   According to Property 1, we add a new symbol 0 :0 to Æ
                                                                              ec ts




with the hit symbol and obtain an intermediate sequence
                                                                            oj c




                                                                                without affecting the entropy and denote its probability as
                                                                          pr oje




S 0 ¼00n::kfbb:n:o::ij:bbnkkk:gc00 . Compared with the original
                                                                                p16 such that P is fp0 , p1 ; . . . ; p15 , p16 ¼ 0g.
                                                                        re r




sequence S with entropy 3.053, the entropy of S 0 is reduced
                                                                      lo rep




to 2.854. Encoding S and S 0 by the Huffman coding                              Property 2 (Symmetry). Any permutation of the probability
                                                                    xp lo
                                                                  ee xp




technique, the lengths of the output bit streams are 77 and                       values does not change to the entropy. For example,
                                                               .ie ee




73 bits, respectively, i.e., 5 bits are conserved by the simple                   eð0:1; 0:4; 0:5Þ is identical to eð0:4; 0:5; 0:1Þ.
                                                              w e
                                                             w .i
                                                            w w




approach.                                                                       Property 3 (Accumulation). Moving all the value from one
                                                         :// w




    However, the above simple approach does not always
                                                      tp //w




                                                                                  probability to another such that the former can be thought
                                                    ht ttp:




minimize the entropy. Consider the example shown in Fig. 9a,                      of as being eliminated decreases the entropy, i.e.,
an intermediate sequence with items 1 and 19 unreplaced has
                                                     h




                                                                                  eðp0 ; p1 ; . . . ; 0; . . . ; pi þ pj ; . . . ; pjÆjÀ1 Þ is equal or less than
lower entropy than that generated by the simple approach.                         eðp0 ,p1 ; . . . ; pi ; . . . ; pj ; . . . ; pjÆjÀ1 Þ.
For the example shown in Fig. 9b, the simple approach even
increases the entropy from 2.883 to 2.963.                                          With Properties 2 and 3, if all the items of symbol  are
    We define the above problem as the HIR problem and                          predictable, i.e., nðÞ ¼ nhit ðÞ, replacing all the items of 
formulate it as follows:                                                        by 0 :0 will not affect the entropy. If there are multiple
Definition 3 (HIR problem). Given a sequence S ¼ fsi jsi 2                      symbols having nðÞ ¼ nhit ðÞ, replacing all the items of
  Æ; 0 i  Lg and a taglst, an intermediate sequence is a                       these symbols can reduce the entropy. Thus, we derive the
  generation of S, denoted by S 0 ¼ fs0i j0 i  Lg, where s0i is                first replacement rule—the accumulation rule: Replace all
  equal to si if taglst½iŠ ¼ 0. Otherwise, s0i is equal to si or 0 :0 .         items of symbol  in s, where nðÞ ¼ nhit ðÞ.
                                                                                                        ^
  The HIR problem is to find the intermediate sequence S 0 such
                                                                                Property 4 (Concentration). For two probabilities, moving a
  that the entropy of S 0 is minimal for all possible intermediate
                                                                                  value from the lower probability to the higher probability
  sequences.
                                                                                  decreases the entropy, i.e., if pi pj , for 0  Áp pi ,
                                                                                  eðp0 ,p1 ; . . . ; pi À Áp; . . . ; pj þ Áp; . . . ; pjÆjÀ1 Þ is less than
   A brute-force method to the HIR problem is to enumer-
ate all possible intermediate sequences to find the optimal                       eðp0 ,p1 ; . . . ; pi ; . . . ; pj ; . . . ; pjÆjÀ1 Þ.
solution. However, this brute-force approach is not scalable,                   Property 5 (Strong concentration). For two probabilities,
especially when the number of the predictable items is                            moving a value that is larger than the difference of the two
large. Therefore, to solve the HIR problem, we explore                            probabilities from the higher probability to the lower one
properties of Shannon’s entropy to derive three replace-                          decreases the entropy, i.e., if pi  pj , for pi À pj  Áp pi ,
ment rules that our Replace algorithm leverages to obtain
the optimal solution. Here, we list five most relevant                             5. Due to the page limit, the proofs of the strong concentration rule and
                                                                                Lemmas 1-4 are given in Appendices C and D, respectively, which can be
properties to explain the replacement rules. The first four                     found on the Computer Society Digital Library at http://doi.
properties can be obtained from [17], [53], [54], while the                     ieeecomputersociety.org/10.1109/TKDE.2010.30.
104                                                          IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,                VOL. 23,   NO. 1,   JANUARY 2011


      eðp0 ,p1 ; . . . ; pi À Áp; . . . ; pj þ Áp; . . . ; pjÆjÀ1 Þ is less than      as Lemma 4. We thereby derive the third replacement
      eðp0 ,p1 ; . . . ; pi ; . . . ; pj ; . . . ; pjÆjÀ1 Þ.                          rule—the multiple symbol rule: Replace all of the
                                                                                      predictable items of every symbol in s0 if gainð^0 Þ  0.
                                                                                                                           ^          s
   According to Properties 4 and 5, we conclude that if the                           Lemma 3. If exist 0 an pin , n ¼ 1; . . . ; k, such that
difference of two probabilities increases, the entropy                                                                                                 P
                                                                                        fi1 i2 ...ik j ða1 ; a2 ; . . . ; ak Þ is minimal, we have pj þ k ai !
                                                                                                                                                        i¼1
decreases. When the conditions conform to Properties 4
                                                                                        pin À an , n ¼ 1; . . . ; k.
and 5, we further prove that the entropy is monotonically
reduced as Áp increases such that replacing all predictable                           Lemma 4. If exist xmin1 , xmin2 ; . . . , and xmink such that pj þ
                                                                                        Pk
items of a qualified symbol reduces the entropy mostly as                                    i¼1 xmini  pin À xminn for n ¼ 1; 2; . . . ; k,fi1 i2 ...ik j ðx1 ; x2 ;
Lemmas 1 and 2.                                                                         . . . ; xk Þ is monotonically decreasing for xminn xn pin ,
Definition 4 (Concentration function). For a probability                                n ¼ 1; . . . ; k.
  distribution P ¼ fp0 , p1 ; . . . ; pjÆjÀ1 g, we define the concentra-
  tion function of k probabilities pi1 , pi2 ; . . . ; pik , and pj as                5.2.2 The Replace Algorithm
                                                                                     Based on the observations described in the previous section,
     fi1 i2 ...ik j ðx1 ; x2 ; . . . ; xk Þ ¼ e p0 ; . . . ; pi1 À x1 ; . . . ; pi2   we propose the Replace algorithm that leverages the three
                                                                                      replacement rules to obtain the optimal solution for the HIR
                                         À x2 ; . . . ; pik À xk ; . . . ; pj         problem. Our algorithm examines the predictable symbols
                                         Xk                                          on their statistics, which include the number of items and
                                       þ    xi ; . . . ; pj P jÀ1 :                   the number of predictable items of each predictable symbol.
                                          i¼1                                         The algorithm first replaces the qualified symbols according
                                                                                      to the accumulation rule. Afterward, since the concentration
                                                                                      rule and the multiple symbol rule are related to nð0 :0 Þ, which
Lemma 1. If pi pj , for any x such that 0 x pi , the                                  is increased after every replacement, the algorithm itera-




                                                                                                   t.c om
  concentration function of pi and pj is monotonic decreasing.                        tively replaces the qualified symbols according to the two



                                                                                                      om
                                                                                                 po t.c
Lemma 2. If pi  pj , for any x such that pi À pj x pi , the                          rules until all qualified symbols are replaced. The algorithm
  concentration function fij ðxÞ of pi and pj is monotonic
                                                                                               gs po
                                                                                      thereby replaces qualified symbols and reduces the entropy
                                                                                             lo s
                                                                                           .b og

  decreasing.                                                                         toward the optimum gradually. Compared with the brute-
                                                                                         ts .bl


                                                                                      force method that enumerates all possible intermediate
                                                                                       ec ts
                                                                                     oj c




   Accordingly, we derive the second replacement rule—the                             sequences for the optimum in exponential complexity, the
                                                                                   pr oje




                                                                                      Replace algorithm that leverages the derived rules to obtain
                                                                                 re r




concentration rule: Replace all predictable items of symbol
                                                                               lo rep




 in s, where nðÞ nð0 :0 Þ or nhit ðÞ  nðÞ À nð0 :0 Þ.
      ^                                                                               the optimal solution in OðLÞ time6 is more scalable and
                                                                             xp lo




                                                                                      efficient. We prove that the Replace algorithm guarantees to
                                                                           ee xp




   As an extension of the above properties, we also explore
                                                                        .ie ee




the entropy variation, while predictable items of multiple                            reduce the entropy monotonically and obtains the optimal
                                                                       w e




symbols are replaced simultaneously. Let s0 denote a subset                           solution of the HIR problem as Theorem 1. Next, we detail
                                                                      w .i




                                                    ^
                                                                     w w
                                                                  :// w




of s, where j^0 j ! 2. We prove that if replacing predictable
   ^            s                                                                     the replace algorithm and demonstrate the algorithm by an
                                                               tp //w




symbols in s0 can minimize the entropy corresponding to
               ^                                                                      illustrative example.
                                                             ht ttp:




symbols in s0 , the number of hit symbols after the
                 ^                                                                    Theorem 1.7 The Replace algorithm obtains the optimal solution
                                                              h




replacement must be larger than the number of the                                       of the HIR problem.
predictable symbols after the replacement as Lemma 3. To
investigate whether the converse statement exists, we                                    Fig. 10 shows the Replace algorithm. The input includes
conduct an experiment in a brute-force way. However, the                              a location sequence S and a predictor Tg , while the output,
experimental results show that even under the condition
               P                                                                      denoted by S 0 , is a sequence in which qualified items are
that nð0 :0 Þ þ 80 2^0 nhit ð0 Þ ! nðÞ À nhit ðÞ for every  in s0 ,
                     s                                              ^                 replaced by 0 :0 . Initially, Lines 3-9 of the algorithm find the
replacing predictable items of the symbols in s0 does not ^                           set of predictable symbols together their statistics. Then, it
guarantee the reduction of the entropy. Therefore, we                                 exams the statistics of the predictable symbols according to
compare the difference of the entropy before and after                                the three replacement rules as follows: First, according to
replacing symbols in s0 as ^                                                          the accumulation rule, it replaces qualified symbols in one
                         X                    p                                      scan of the predictable symbols as Lines 10-14. Next, the
               ^
        gainðS 0 Þ ¼ À           p  log2          À Áp                             algorithm iteratively exams for the concentration and the
                         82^s 0           p À Áp
                                                                                      multiple symbol rules by two loops. The first loop from
                         Â log2 ðp À Áp Þ                                           Line 16 to Line 22 is for the concentration, whereas the
                                                         
                                               p0 :0                                  second loop from Line 25 to Line 36 is for the multiple
                         À p0 :0 Â log2 0 0 P                                         symbol rule. In our design, since finding a combination of
                                         p : þ 82^0 ÁpÀ
                                                     s
                                                                          !           predictable symbols to make gainð^0 Þ  0 hold is more
                                                                                                                                 s
                             X                                X
                         þ           Áp  log2 p þ   0 :0            ÁpÀ :          costly, the algorithm is prone to replace symbols with the
                             82^0
                                s                             82^0
                                                                 s
                                                                                         6. The complexity analysis is given in Appendix E, which can be found
In addition, we also prove that once replacing partial                                on the Computer Society Digital Library at http://doi.ieeecomputersociety.
predictable items of symbols in s0 reduces entropy, repla-
                                ^                                                     org/10.1109/TKDE.2010.30.
                                                                                         7. The proof of Theorem 1 is given in Appendix F, which can be found on
cing all predictable items of these symbols reduces the                               the Computer Society Digital Library at http://doi.ieeecomputersociety.
entropy mostly since the entropy decreases monotonically                              org/10.1109/TKDE.2010.30.
Exploring application level semantics

Más contenido relacionado

Más de ingenioustech

Impact of le arrivals and departures on buffer
Impact of  le arrivals and departures on bufferImpact of  le arrivals and departures on buffer
Impact of le arrivals and departures on bufferingenioustech
 
Exploiting dynamic resource allocation for
Exploiting dynamic resource allocation forExploiting dynamic resource allocation for
Exploiting dynamic resource allocation foringenioustech
 
Efficient computation of range aggregates
Efficient computation of range aggregatesEfficient computation of range aggregates
Efficient computation of range aggregatesingenioustech
 
Dynamic measurement aware
Dynamic measurement awareDynamic measurement aware
Dynamic measurement awareingenioustech
 
Design and evaluation of a proxy cache for
Design and evaluation of a proxy cache forDesign and evaluation of a proxy cache for
Design and evaluation of a proxy cache foringenioustech
 
Throughput optimization in
Throughput optimization inThroughput optimization in
Throughput optimization iningenioustech
 
Phish market protocol
Phish market protocolPhish market protocol
Phish market protocolingenioustech
 
Peering equilibrium multi path routing
Peering equilibrium multi path routingPeering equilibrium multi path routing
Peering equilibrium multi path routingingenioustech
 
Online social network
Online social networkOnline social network
Online social networkingenioustech
 
On the quality of service of crash recovery
On the quality of service of crash recoveryOn the quality of service of crash recovery
On the quality of service of crash recoveryingenioustech
 
Bayesian classifiers programmed in sql
Bayesian classifiers programmed in sqlBayesian classifiers programmed in sql
Bayesian classifiers programmed in sqlingenioustech
 
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]ingenioustech
 
Applied research of e learning
Applied research of e learningApplied research of e learning
Applied research of e learningingenioustech
 
Active reranking for web image search
Active reranking for web image searchActive reranking for web image search
Active reranking for web image searchingenioustech
 
A dynamic performance-based_flow_control
A dynamic performance-based_flow_controlA dynamic performance-based_flow_control
A dynamic performance-based_flow_controlingenioustech
 

Más de ingenioustech (20)

Impact of le arrivals and departures on buffer
Impact of  le arrivals and departures on bufferImpact of  le arrivals and departures on buffer
Impact of le arrivals and departures on buffer
 
Exploiting dynamic resource allocation for
Exploiting dynamic resource allocation forExploiting dynamic resource allocation for
Exploiting dynamic resource allocation for
 
Efficient computation of range aggregates
Efficient computation of range aggregatesEfficient computation of range aggregates
Efficient computation of range aggregates
 
Dynamic measurement aware
Dynamic measurement awareDynamic measurement aware
Dynamic measurement aware
 
Design and evaluation of a proxy cache for
Design and evaluation of a proxy cache forDesign and evaluation of a proxy cache for
Design and evaluation of a proxy cache for
 
Throughput optimization in
Throughput optimization inThroughput optimization in
Throughput optimization in
 
Tcp
TcpTcp
Tcp
 
Privacy preserving
Privacy preservingPrivacy preserving
Privacy preserving
 
Phish market protocol
Phish market protocolPhish market protocol
Phish market protocol
 
Peering equilibrium multi path routing
Peering equilibrium multi path routingPeering equilibrium multi path routing
Peering equilibrium multi path routing
 
Peace
PeacePeace
Peace
 
Online social network
Online social networkOnline social network
Online social network
 
On the quality of service of crash recovery
On the quality of service of crash recoveryOn the quality of service of crash recovery
On the quality of service of crash recovery
 
Layered approach
Layered approachLayered approach
Layered approach
 
Intrution detection
Intrution detectionIntrution detection
Intrution detection
 
Bayesian classifiers programmed in sql
Bayesian classifiers programmed in sqlBayesian classifiers programmed in sql
Bayesian classifiers programmed in sql
 
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
 
Applied research of e learning
Applied research of e learningApplied research of e learning
Applied research of e learning
 
Active reranking for web image search
Active reranking for web image searchActive reranking for web image search
Active reranking for web image search
 
A dynamic performance-based_flow_control
A dynamic performance-based_flow_controlA dynamic performance-based_flow_control
A dynamic performance-based_flow_control
 

Último

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 

Último (20)

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 

Exploring application level semantics

  • 1. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011 95 Exploring Application-Level Semantics for Data Compression Hsiao-Ping Tsai, Member, IEEE, De-Nian Yang, and Ming-Syan Chen, Fellow, IEEE Abstract—Natural phenomena show that many creatures form large social groups and move in regular patterns. However, previous works focus on finding the movement patterns of each single object or all objects. In this paper, we first propose an efficient distributed mining algorithm to jointly identify a group of moving objects and discover their movement patterns in wireless sensor networks. Afterward, we propose a compression algorithm, called 2P2D, which exploits the obtained group movement patterns to reduce the amount of delivered data. The compression algorithm includes a sequence merge and an entropy reduction phases. In the sequence merge phase, we propose a Merge algorithm to merge and compress the location data of a group of moving objects. In the entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that obtains the optimal solution. Moreover, we devise three replacement rules and derive the maximum compression ratio. The experimental results show that the proposed compression algorithm leverages the group movement patterns to reduce the amount of delivered data effectively and efficiently. Index Terms—Data compression, distributed clustering, object tracking. Ç t.c om 1 INTRODUCTION om R po t.c ECENT advances in location-acquisition technologies, zebra, whales, and birds, form large social groups when such as global positioning systems (GPSs) and wireless gs po migrating to find food, or for breeding or wintering. These lo s .b og sensor networks (WSNs), have fostered many novel characteristics indicate that the trajectory data of multiple ts .bl applications like object tracking, environmental monitoring, objects may be correlated for biological applications. More- ec ts oj c and location-dependent service. These applications gener- over, some research domains, such as the study of animals’ pr oje ate a large amount of location data, and thus, lead to social behavior and wildlife migration [7], [8], are more re r lo rep transmission and storage challenges, especially in resource- concerned with the movement patterns of groups of xp lo animals, not individuals; hence, tracking each object is ee xp constrained environments like WSNs. To reduce the data .ie ee volume, various algorithms have been proposed for data unnecessary in this case. This raises a new challenge of w e finding moving animals belonging to the same group and w .i compression and data aggregation [1], [2], [3], [4], [5], [6]. w w identifying their aggregated group movement patterns. :// w However, the above works do not address application-level tp //w semantics, such as the group relationships and movement Therefore, under the assumption that objects with similar ht ttp: patterns, in the location data. movement patterns are regarded as a group, we define the h In object tracking applications, many natural phenomena moving object clustering problem as given the movement show that objects often exhibit some degree of regularity in trajectories of objects, partitioning the objects into non- their movements. For example, the famous annual wild- overlapped groups such that the number of groups is ebeest migration demonstrates that the movements of minimized. Then, group movement pattern discovery is to find the most representative movement patterns regarding creatures are temporally and spatially correlated. Biologists each group of objects, which are further utilized to also have found that many creatures, such as elephants, compress location data. Discovering the group movement patterns is more . H.-P. Tsai is with the Department of Electrical Engineering (EE), National difficult than finding the patterns of a single object or all Chung Hsing University, No. 250, Kuo Kuang Road, Taichung 402, objects, because we need to jointly identify a group of objects Taiwan, ROC. E-mail: hptsai@nchu.edu.tw. . D.-N. Yang is with the Institute of Information Science (IIS) and the and discover their aggregated group movement patterns. Research Center of Information Technology Innovation (CITI), Academia The constrained resource of WSNs should also be consid- Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan, ered in approaching the moving object clustering problem. ROC. E-mail: dnyang@iis.sinica.edu.tw. However, few of existing approaches consider these issues . M.-S. Chen is with the Research Center of Information Technology Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2, simultaneously. On the one hand, the temporal-and-spatial Nankang, Taipei 11529, Taiwan, and the Department of Electrical correlations in the movements of moving objects are Engineering (EE), Department of Computer Science and Information modeled as sequential patterns in data mining to discover Engineer (CSIE), and Graduate Institute of Communication Engineering (GICE), National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei the frequent movement patterns [9], [10], [11], [12]. How- 10617, Taiwan, ROC. E-mail: mschen@cc.ee.ntu.edu.tw. ever, sequential patterns 1) consider the characteristics of all Manuscript received 22 Sept. 2008; revised 27 Feb. 2009; accepted 29 July objects, 2) lack information about a frequent pattern’s 2009; published online 4 Feb. 2010. significance regarding individual trajectories, and 3) carry Recommended for acceptance by M. Ester. no time information between consecutive items, which make For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2008-09-0497. them unsuitable for location prediction and similarity Digital Object Identifier no. 10.1109/TKDE.2010.30. comparison. On the other hand, previous works, such as 1041-4347/11/$26.00 ß 2011 IEEE Published by the IEEE Computer Society
  • 2. 96 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011 [13], [14], [15], measure the similarity among these entire of the HIR problem based on Shannon’s theorem [17] and trajectory sequences to group moving objects. Since objects guarantees the reduction of entropy, which is convention- may be close together in some types of terrain, such as ally viewed as an optimization bound of compression gorges, and widely distributed in less rugged areas, their performance. As a result, our approach reduces the amount group relationships are distinct in some areas and vague in of delivered data and, by extension, the energy consumption others. Thus, approaches that perform clustering among in WSNs. entire trajectories may not be able to identify the local group Our contributions are threefold: relationships. In addition, most of the above works are . Different from previous works, we formulate a centralized algorithms [9], [10], [11], [12], [13], [14], [15], moving object clustering problem that jointly iden- which need to collect all data to a server before processing. tifies a group of objects and discovers their move- Thus, unnecessary and redundant data may be delivered, ment patterns. The application-level semantics are leading to much more power consumption because data useful for various applications, such as data storage transmission needs more power than data processing in and transmission, task scheduling, and network WSNs [5]. In [16], we have proposed a clustering algorithm construction. to find the group relationships for query and data aggrega- . To approach the moving object clustering problem, tion efficiency. The differences of [16] and this work are as we propose an efficient distributed mining algo- follows: First, since the clustering algorithm itself is a rithm to minimize the number of groups such that centralized algorithm, in this work, we further consider members in each of the discovered groups are highly systematically combining multiple local clustering results related by their movement patterns. into a consensus to improve the clustering quality and for . We propose a novel compression algorithm to use in the update-based tracking network. Second, when a compress the location data of a group of moving delay is tolerant in the tracking application, a new data objects with or without loss of information. We management approach is required to offer transmission formulate the HIR problem to minimize the entropy t.c om om efficiency, which also motivates this study. We thus define of location data and explore the Shannon’s theorem po t.c the problem of compressing the location data of a group of gs po to solve the HIR problem. We also prove that the lo s moving objects as the group data compression problem. proposed compression algorithm obtains the opti- .b og Therefore, in this paper, we first introduce our distrib- ts .bl mal solution of the HIR problem efficiently. ec ts uted mining algorithm to approach the moving object The remainder of the paper is organized as follows: In oj c pr oje clustering problem and discover group movement patterns. Section 2, we review related works. In Section 3, we provide re r Then, based on the discovered group movement patterns, lo rep an overview of the network, location, and movement we propose a novel compression algorithm to tackle the xp lo models and formulate our problem. In Section 4, we ee xp group data compression problem. Our distributed mining describe the distributed mining algorithm. In Section 5, .ie ee algorithm comprises a Group Movement Pattern Mining w e we formulate the compression problems and propose our w .i w w (GMPMine) and a Cluster Ensembling (CE) algorithms. It compression algorithm. Section 6 details our experimental :// w avoids transmitting unnecessary and redundant data by tp //w results. Finally, we summarize our conclusions in Section 7. ht ttp: transmitting only the local grouping results to a base station h (the sink), instead of all of the moving objects’ location data. Specifically, the GMPMine algorithm discovers the local 2 RELATED WORK group movement patterns by using a novel similarity 2.1 Movement Pattern Mining measure, while the CE algorithm combines the local Agrawal and Srikant [18] first defined the sequential grouping results to remove inconsistency and improve the pattern mining problem and proposed an Apriori-like grouping quality by using the information theory. algorithm to find the frequent sequential patterns. Han et Different from previous compression techniques that al. consider the pattern projection method in mining remove redundancy of data according to the regularity sequential patterns and proposed FreeSpan [19], which is within the data, we devise a novel two-phase and an FP-growth-based algorithm. Yang and Hu [9] developed 2D algorithm, called 2P2D, which utilizes the discovered a new match measure for imprecise trajectory data and group movement patterns shared by the transmitting node proposed TrajPattern to mine sequential patterns. Many and the receiving node to compress data. In addition to variations derived from sequential patterns are used in remove redundancy of data according to the correlations various applications, e.g., Chen et al. [20] discover path within the data of each single object, the 2P2D algorithm traversal patterns in a Web environment, while Peng and further leverages the correlations of multiple objects and Chen [21] mine user moving patterns incrementally in a their movement patterns to enhance the compressibility. mobile computing system. However, sequential patterns Specifically, the 2P2D algorithm comprises a sequence and its variations like [20], [21] do not provide sufficient merge and an entropy reduction phases. In the sequence information for location prediction or clustering. First, they merge phase, we propose a Merge algorithm to merge and carry no time information between consecutive items, so compress the location data of a group of objects. In the they cannot provide accurate information for location entropy reduction phase, we formulate a Hit Item Replace- prediction when time is concerned. Second, they consider ment (HIR) problem to minimize the entropy of the merged the characteristics of all objects, which make the meaningful data and propose a Replace algorithm to obtain the optimal movement characteristics of individual objects or a group of solution. The Replace algorithm finds the optimal solution moving objects inconspicuous and ignored. Third, because
  • 3. TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION 97 a sequential pattern lacks information about its significance regarding to each individual trajectory, they are not fully representative to individual trajectories. To discover sig- nificant patterns for location prediction, Morzy mines frequent trajectories whose consecutive items are also adjacent in the original trajectory data [10], [22]. Meanwhile, Giannotti et al. [11] extract T-patterns from spatiotemporal data sets to provide concise descriptions of frequent movements, and Tseng and Lin [12] proposed the TMP- Mine algorithm for discovering the temporal movement patterns. However, the above Apriori-like or FP-growth- based algorithms still focus on discovering frequent Fig. 1. (a) The hierarchical- and cluster-based network structure and the patterns of all objects and may suffer from computing data flow of an update-based tracking network. (b) A flat view of a two- layer network structure with 16 clusters. efficiency or memory problems, which make them unsui- table for use in resource-constrained environments. memory constraint of a sensor device. Summarization of the 2.2 Clustering original data by regression or linear modeling has been Recently, clustering based on objects’ movement behavior proposed for trajectory data compression [3], [6]. However, has attracted more attention. Wang et al. [14] transform the the above works do not address application-level semantics location sequences into a transaction-like data on users and in data, such as the correlations of a group of moving based on which to obtain a valid group, but the proposed objects, which we exploit to enhance the compressibility. AGP and VG growth are still Apriori-like or FP-growth- based algorithms that suffer from high computing cost and 3 PRELIMINARIES t.c om memory demand. Nanni and Pedreschi [15] proposed a om 3.1 Network and Location Models po t.c density-based clustering algorithm, which makes use of an optimal time interval and the average euclidean distance gs po Many researchers believe that a hierarchical architecture lo s provides better coverage and scalability, and also extends .b og between each point of two trajectories, to approach the ts .bl trajectory clustering problem. However, the above works the network lifetime of WSNs [25], [26]. In a hierarchical ec ts WSN, such as that proposed in [27], the energy, computing, oj c discover global group relationships based on the proportion pr oje of the time a group of users stay close together to the whole and storage capacity of sensor nodes are heterogeneous. A re r lo rep time duration or the average euclidean distance of the entire high-end sophisticated sensor node, such as Intel Stargate xp lo trajectories. Thus, they may not be able to reveal the local [28], is assigned as a cluster head (CH) to perform high ee xp complexity tasks; while a resource-constrained sensor node, .ie ee group relationships, which are required for many applica- such as Mica2 mote [29], performs the sensing and low w e tions. In addition, though computing the average euclidean w .i w w distance of two geometric trajectories is simple and useful, complexity tasks. In this work, we adopt a hierarchical :// w tp //w the geometric coordinates are expensive and not always network structure with K layers, as shown in Fig. 1a, where ht ttp: available. Approaches, such as EDR, LCSS, and DTW, are sensor nodes are clustered in each level and collaboratively h widely used to compute the similarity of symbolic trajectory gather or relay remote information to a base station called a sequences [13], but the above dynamic programming sink. A sensor cluster is a mesh network of n  n sensor approaches suffer from scalability problem [23]. To provide nodes handled by a CH and communicate with each other scalability, approximation or summarization techniques are by using multihop routing [30]. We assume that each node used to represent original data. Guralnik and Karypis [23] in a sensor cluster has a locally unique ID and denote the project each sequence into a vector space of sequential sensor IDs by an alphabet Æ. Fig. 1b shows an example of a patterns and use a vector-based K-means algorithm to cluster two-layer tracking network, where each sensor cluster objects. However, the importance of a sequential pattern contains 16 nodes identified by Æ ¼ fa, b; . . . ; pg. regarding individual sequences can be very different, which In this work, an object is defined as a target, such as an is not considered in this work. To cluster sequences, Yang animal or a bird, that is recognizable and trackable by the and Wang proposed CLUSEQ [24], which iteratively identi- tracking network. To represent the location of an object, fies a sequence to a learned model, yet the generated clusters geometric models and symbolic models are widely used may overlap which differentiates their problem from ours. [31]. A geometric location denotes precise two-dimension or three-dimension coordinates; while a symbolic location 2.3 Data Compression represents an area, such as the sensing area of a sensor node Data compression can reduce the storage and energy or a cluster of sensor nodes, defined by the application. Since consumption for resource-constrained applications. In [1], the accurate geometric location is not easy to obtain and distributed source (Slepian-Wolf) coding uses joint entropy techniques like the Received Signal Strength (RSS) [32] can to encode two nodes’ data individually without sharing any simply estimate an object’s location based on the ID of the data between them; however, it requires prior knowledge of sensor node with the strongest signal, we employ a symbolic cross correlations of sources. Other works, such as [2], [4], model and describe the location of an object by using the ID combine data compression with routing by exploiting cross of a nearby sensor node. correlations between sensor nodes to reduce the data size. Object tracking is defined as a task of detecting a moving In [5], a tailed LZW has been proposed to address the object’s location and reporting the location data to the sink
  • 4. 98 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011 Fig. 2. An example of an object’s moving trajectory, the obtained location sequence, and the generated PST T . Fig. 3. The predict_next algorithm. periodically at a time interval. Hence, an observation on an object is defined by the obtained location data. We assume algorithm1 learns from a location sequence and generates a that sensor nodes wake up periodically to detect objects. PST whose height is limited by a specified parameter Lmax . When a sensor node wakes up on its duty cycle and detects Each node of the tree represents a significant movement an object of interest, it transmits the location data of the pattern s whose occurrence probability is above a specified object to its CH. Here, the location data include a times minimal threshold Pmin . It also carries the conditional tamp, the ID of an object, and its location. We also assume empirical probabilities P ðjsÞ for each in Æ that we use that the targeted applications are delay-tolerant. Thus, in location prediction. Fig. 2 shows an example of a location instead of forwarding the data upward immediately, the t.c om sequence and the corresponding PST. nodef is one of the om CH compresses the data accumulated for a batch period and po t.c children of the root node and represents a significant sends it to the CH of the upper layer. The process is repeated gs po movement pattern with }f} whose occurrence probability lo s until the sink receives the location data. Consequently, the .b og is above 0.01. The conditional empirical probabilities are ts .bl trajectory of a moving object is thus modeled as a series of P ð0 b0 j}f}Þ ¼ 0:33, P ð0 e0 j}f}Þ ¼ 0:67, and P ðj}f}Þ ¼ 0 for the ec ts observations and expressed by a location sequence, i.e., a oj c other in Æ. pr oje sequence of sensor IDs visited by the object. We denote a PST is frequently used in predicting the occurrence re r lo rep location sequence by S ¼ s0 s1 . . . sLÀ1 , where each item si is probability of a given sequence, which provides us xp lo a symbol in Æ and L is the sequence length. An example of important information in similarity comparison. The occur- ee xp an object’s trajectory and the obtained location sequence is .ie ee rence probability of a sequence s regarding to a PST T , shown in the left of Fig. 2. The tracking network tracks w e denoted by P T ðsÞ, is the prediction of the occurrence w .i w w moving objects for a period and generates a location :// w probability of s based on T . For example, the occurrence tp //w sequence data set, based on which we discover the group probability P T ð}nokjfb}Þ is computed as follows: ht ttp: relationships of the moving objects. h P T ð}nokjfb}Þ ¼ P T ð}n}Þ Â P T ð0 o0 j}n}Þ Â P T ð0 k0 j}no}Þ 3.2 Variable Length Markov Model (VMM) and Probabilistic Suffix Tree (PST)  P T ð0 j0 j}nok}Þ Â P T ð0 f 0 j}nokj}Þ If the movements of an object are regular, the object’s next  P T ð0 b0 j}nokjf}Þ location can be predicted based on its preceding locations. ¼ P T ð}n}Þ Â P T ð0 o0 j}n}Þ Â P T ð0 k0 j}o}Þ We model the regularity by using the Variable Length Markov Model (VMM). Under the VMM, an object’s move-  P T ð0 j0 j}k}Þ Â P T ð0f 0 j}j}Þ Â P T ð0 b0 j}okjf}Þ ment is expressed by a conditional probability distribution ¼ 0:05  1  1  1  1  0:3 ¼ 0:0165: over Æ. Let s denote a pattern which is a subsequence of a location sequence S and denote a symbol in Æ. The PST is also useful and efficient in predicting the next conditional probability P ðjsÞ is the occurrence probability item of a sequence. For a given sequence s and a PST T , our that will follow s in S. Since the length of s is floating, the predict_next algorithm as shown in Fig. 3 outputs the most VMM provides flexibility to adapt to the variable length of probable next item, denoted by predict nextðT ; sÞ. We movement patterns. Note that when a pattern s occurs more demonstrate its efficiency by an example as follow: Given frequently, it carries more information about the movements s ¼ }nokjf} and T shown in Fig. 2, the predict_next of the object and is thus more desirable for the purpose of algorithm traverses the tree to the deepest node nodeokjf prediction. To find the informative patterns, we first define a along the path including noderoot , nodef , nodejf , nodekjf , and pattern as a significant movement pattern if its occurrence nodeokjf . Finally, symbol 0 e0 , which has the highest condi- probability is above a minimal threshold. tional empirical probability in nodeokjf , is returned, i.e., To learn the significant movement patterns, we adapt predict nextðT ; }nokjf}Þ ¼ 0 e0 . The algorithm’s computa- Probabilistic Suffix Tree (PST) [33] for it has the lowest tional overhead is limited by the height of a PST so that it storage requirement among many VMM implementations is suitable for sensor nodes. [34]. PST’s low complexity, i.e., OðnÞ in both time and space 1. The PST building algorithm is given in Appendix A, which can [35], also makes it more attractive especially for streaming or be found on the Computer Society Digital Library at http://doi. resource-constrained environments [36]. The PST building ieeecomputersociety.org/10.1109/TKDE.2010.30.
  • 5. TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION 99 3.3 Problem Description the similarity of two data streams only on the changed We formulate the problem of this paper as exploring the mature nodes of emission trees [36], instead of all nodes. group movement patterns to compress the location To combine multiple local grouping results into a sequences of a group of moving objects for transmission consensus, the CE algorithm utilizes the Jaccard similarity efficiency. Consider a set of moving objects O ¼ coefficient to measure the similarity between a pair of fo1 ; o2 ; . . . ; on g and their associated location sequence data objects, and normalized mutual information (NMI) to set S ¼ fS1 ; S2 ; . . . ; Sn g. derive the final ensembling result. It trades off the grouping quality against the computation cost by adjusting a partition Definition 1. Two objects are similar to each other if their parameter. In contrast to approaches that perform cluster- movement patterns are similar. Given the similarity measure ing among the entire trajectories, the distributed algorithm function simp 2 and a minimal threshold simmin , oi and oj are discovers the group relationships in a distributed manner similar if their similarity score simp ðoi ; oj Þ is above simmin , i.e., on sensor nodes. As a result, we can discover group simp ðoi ; oj Þ ! simmin . The set of objects that are similar to oi is movement patterns to compress the location data in the denoted by soðoi Þ ¼ foj j8oj 2 O; simp ðoi ; oj Þ ! simmin g. areas where objects have explicit group relationships. Definition 2. A set of objects is recognized as a group if they are Besides, the distributed design provides flexibility to take highly similar to one another. Let g denote a set of objects. g is a partial local grouping results into ensembling when the group if every object in g is similar to at least a threshold of group relationships of moving objects in a specified objects in g, i.e., 8oi 2 g, jsoðogijÞgj ! , where is with default j subregion are interested. Also, it is especially suitable for value 1 .3 2 heterogeneous tracking configurations, which helps reduce the tracking cost, e.g., instead of waking up all sensors at We formally define the moving object clustering problem the same frequency, a fine-grained tracking interval is as follows: Given a set of moving objects O together with specified for partial terrain in the migration season to their associated location sequence data set S and a minimal reduce the energy consumption. Rather than deploying the t.c om similarity threshold simmin , the moving object clustering om sensors in the same density, they are only highly concen- po t.c problem is to partition O into nonoverlapped groups, trated in areas of interest to reduce deployment costs. denoted by G ¼ fg1 ; g2 ; . . . ; gi g, such that the number of gs po lo s .b og groups is minimized, i.e., jGj is minimal. Thereafter, with 4.1 The Group Movement Pattern Mining ts .bl the discovered group information and the obtained group (GMPMine) Algorithm ec ts oj c movement patterns, the group data compression problem is To provide better discrimination accuracy, we propose a pr oje to compress the location sequences of a group of moving new similarity measure simp to compare the similarity of re r lo rep objects for transmission efficiency. Specifically, we formu- two objects. For each of their significant movement patterns, xp lo late the group data compression problem as a merge the new similarity measure considers not merely two ee xp .ie ee problem and an HIR problem. The merge problem is to probability distributions but also two weight factors, i.e., w e combine multiple location sequences to reduce the overall the significance of the pattern regarding to each PST. The w .i w w sequence length, while the HIR problem targets to minimize similarity score simp of oi and oj based on their respective :// w tp //w the entropy of a sequence such that the amount of data is PSTs, Ti and Tj , is defined as follows: ht ttp: reduced with or without loss of information. P qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P h Ti Tj 2 s2S e 2Æ ðP ðsÞ À P ðsÞÞ simp ðoi ; oj Þ ¼ À log pffiffiffi ; ð1Þ 4 MINING OF GROUP MOVEMENT PATTERNS 2Lmax þ 2 To tackle the moving object clustering problem, we propose e where S denotes the union of significant patterns (node a distributed mining algorithm, which comprises the strings) on the two trees. The similarity score simp includes GMPMine and CE algorithms. First, the GMPMine algo- the distance associated with a pattern s, defined as rithm uses a PST to generate an object’s significant move- qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X ment patterns and computes the similarity of two objects by dðsÞ ¼ ðP Ti ðsÞ À P Tj ðsÞÞ2 2Æ using simp to derive the local grouping results. The merits qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X of simp include its accuracy and efficiency: First, simp ¼ 2Æ ðP Ti ðsÞ Â P Ti ðjsÞ À P Tj ðsÞ Â P Tj ðjsÞÞ2 ; considers the significances of each movement pattern regarding to individual objects so that it achieves better where dðsÞ is the euclidean distance associated with a accuracy in similarity comparison. For a PST can be used to pattern s over Ti and Tj . predict a pattern’s occurrence probability, which is viewed For a pattern s 2 T , P T ðsÞ is a significant value because as the significance of the pattern regarding the PST, simp the occurrence probability of s is higher than the minimal thus includes movement patterns’ predicted occurrence support Pmin . If oi and oj share the pattern s, we have s 2 Ti probabilities to provide fine-grained similarity comparison. and s 2 Tj , respectively, such that P Ti ðsÞ and P Tj ðsÞ are non- Second, simp can offer seamless and efficient comparison negligible and meaningful in the similarity comparison. for the applications with evolving and evolutionary Because the conditional empirical probabilities are also parts similarity relationships. This is because simp can compare of a pattern, we consider the conditional empirical prob- abilities P T ðjsÞ when calculating the distance between two 2. simp is to be defined in Section 4.1. e PSTs. Therefore, we sum dðsÞ for all s 2 S as the distance 3. In this work, we set the threshold to 1 , as suggested in [37], and leave 2 the similarity threshold as the major control parameter because training the between two PSTs. Note that the distance betweenffiffiffitwo PSTs p similarity threshold of two objects is easier than that of a group of objects. is normalized by its maximal value, i.e., 2Lmax þ 2. We take
  • 6. 100 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011 the negative log of the distance between two PSTs as the similarity score such that a larger value of the similarity score implies a stronger similar relationship, and vice versa. With the definition of similarity score, two objects are similar to each other if their score is above a specified similarity threshold. The GMPMine algorithm includes four steps. First, we Fig. 4. Design of the two-phase and 2D compression algorithm. extract the movement patterns from the location sequences by learning a PST for each object. Second, our algorithm ½0; 1Š such that a finer-grained configuration of D achieves a constructs an undirected, unweighted similarity graph better grouping quality but in a higher computation cost. where similar objects share an edge between each other. Therefore, for a set of thresholds D, we rewrite our objective P We model the density of group relationship by the function as G0 ¼ argmaxG;2D K NMIðGi ; G Þ. i¼1 connectivity of a subgraph, which is also defined as the The algorithm includes three steps. First, we utilize minimal cut of the subgraph. When the ratio of the Jaccard Similarity Coefficient [40] as the measure of the connectivity to the size of the subgraph is higher than a similarity for each pair of objects. Second, for each 2 D, threshold, the objects corresponding to the subgraph are we construct a graph where two objects share an edge if identified as a group. Since the optimization of the graph their Jaccard Similarity Coefficient is above . Our algo- partition problem is intractable in general, we bisect the rithm partitions the objects to generate a partitioning result similarity graph in the following way. We leverage the HCS G . Third, we select the ensembling result G0 . Because of cluster algorithm [37] to partition the graph and derive the space limitations, we only demonstrate the CE mining location group information. Finally, we select a group PST algorithm with an example in Appendix C, which can be Tg for each group in order to conserve the memory space by found on the Computer Society Digital Library at http:// P doi.ieeecomputersociety.org/10.1109/TKDE.2010.30. In the using the formula expressed as T g ¼ argmaxT 2T s2S P T ðsÞ, t.c om next section, we propose our compression algorithm that om where S denotes sequences of a group of objects and T leverages the obtained group movement patterns. po t.c denotes their PSTs. Due to the page limit, we give an illustrative example of the GMPMine algorithm in Appen- gs po lo s .b og dix B, which can be found on the Computer Society Digital 5 DESIGN OF A COMPRESSION ALGORITHM WITH ts .bl Library at http://doi.ieeecomputersociety.org/10.1109/ GROUP MOVEMENT PATTERNS ec ts oj c pr oje TKDE.2010.30. A WSN is composed of a large number of miniature sensor re r lo rep 4.2 The Cluster Ensembling (CE) Algorithm nodes that are deployed in a remote area for various xp lo applications, such as environmental monitoring or wildlife In the previous section, each CH collects location data and ee xp tracking. These sensor nodes are usually battery-powered .ie ee generates local group results with the proposed GMPMine and recharging a large number of them is difficult. There- w e algorithm. Since the objects may pass through only partial w .i fore, energy conservation is paramount among all design w w sensor clusters and have different movement patterns in :// w issues in WSNs [41], [42]. Because the target objects are tp //w different clusters, the local grouping results may be moving, conserving energy in WSNs for tracking moving ht ttp: inconsistent. For example, if objects in a sensor cluster walk objects is more difficult than in WSNs that monitor immobile h close together across a canyon, it is reasonable to consider phenomena, such as humidity or vibrations. Hence, pre- them a group. In contrast, objects scattered in grassland vious works, such as [43], [44], [45], [46], [47], [48], especially may not be identified as a group. Furthermore, in the case consider movement characteristics of moving objects in their where a group of objects moves across the margin of a designs to track objects efficiently. On the other hand, since sensor cluster, it is difficult to find their group relationships. transmission of data is one of the most energy expensive Therefore, we propose the CE algorithm to combine tasks in WSNs, data compression is utilized to reduce the multiple local grouping results. The algorithm solves the amount of delivered data [1], [2], [3], [4], [5], [6], [49], [50], inconsistency problem and improves the grouping quality. [51], [52]. Nevertheless, few of the above works have The ensembling problem involves finding the partition of addressed the application-level semantics, i.e., the tempor- all moving objects O that contains the most information about al-and-spatial correlations of a group of moving objects. the local grouping results. We utilize NMI [38], [39] to Therefore, to reduce the amount of delivered data, we evaluate the grouping quality. Let C denote the ensemble propose the 2P2D algorithm which leverages the group of the local grouping results, represented as C ¼ fG0 ; movement patterns derived in Section 4 to compress the location sequences of moving objects elaborately. As shown G1 ; . . . ; GK g, where K denotes the ensemble size. Our goal in Fig. 4, the algorithm includes the sequence merge phase is to discover the ensembling result G0 that contains the most and the entropy reduction phase to compress location P information about C, i.e., G0 ¼ argmax e K NMIðGi ; GÞ, i¼1 sequences vertically and horizontally. In the sequence G2G e where G denotes all possible ensembling results. merge phase, we propose the Merge algorithm to compress e However, enumerating every G 2 G in order to find the the location sequences of a group of moving objects. Since 0 optimal ensembling result G is impractical, especially in the objects with similar movement patterns are identified as a resource-constrained environments. To overcome this diffi- group, their location sequences are similar. The Merge culty, the CE algorithm trades off the grouping quality algorithm avoids redundant sending of their locations, and against the computation cost by adjusting the partition thus, reduces the overall sequence length. It combines the parameter D, i.e., a set of thresholds with values in the range sequences of a group of moving objects by 1) trimming
  • 7. TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION 101 multiple identical symbols at the same time interval into a single symbol or 2) choosing a qualified symbol to represent them when a tolerance of loss of accuracy is specified by the application. Therefore, the algorithm trims and prunes more items when the group size is larger and the group relationships are more distinct. Besides, in the case that only the location center of a group of objects is of interest, our approach can find the aggregated value in the phase, instead of transmitting all location sequences back to the Fig. 5. An example of the Merge algorithm. (a) Three sequences with high similarity. (b) The merged sequence S 00 . sink for postprocessing. In the entropy reduction phase, we propose the Replace S-column into a single symbol. Specifically, given a group algorithm that utilizes the group movement patterns as the of n sequences, the items of an S-column are replaced by a prediction model to further compress the merged sequence. single symbol, whereas the items of a D-column are The Replace algorithm guarantees the reduction of a wrapped up between two 0 =0 symbols. Finally, our algo- sequence’s entropy, and consequently, improves compres- rithm generates a merged sequence containing the same sibility without loss of information. Specifically, we define a information of the original sequences. In decompressing new problem of minimizing the entropy of a sequence as from the merged sequence, while symbol 0 =0 is encountered, the HIR problem. To reduce the entropy of a location the items after it are output until the next 0 =0 symbol. sequence, we explore Shannon’s theorem [17] to derive Otherwise, for each item, we repeat it n times to generate three replacement rules, based on which the Replace the original sequences. Fig. 5b shows the merged sequence algorithm reduces the entropy efficiently. Also, we prove S 00 whose length is decreased from 60 items to 48 items such that the Replace algorithm obtains the optimal solution of that 12 items are conserved. The example pointed out that the HIR problem as Theorem 1. In addition, since the objects t.c om our approach can reduce the amount of data without loss of om may enter and leave a sensor cluster multiple times during po t.c a batch period and a group of objects may enter and leave a information. Moreover, when there are more S-columns, gs po our approach can bring more benefit. cluster at slightly different times, we discuss the segmenta- lo s .b og When a little loss in accuracy is tolerant, representing tion and alignment problems in Section 5.3. Table 1 ts .bl items in a D-column by an qualified symbol to generate summaries the notations. ec ts more S-columns can improve the compressibility. We oj c pr oje 5.1 Sequence Merge Phase regulate the accuracy by an error bound, defined as the re r lo rep In the application of tracking wild animals, multiple maximal hop count between the real and reported locations xp lo moving objects may have group relationships and share of an object. For example, replacing items 2 and 6 of S2 by 0 g0 ee xp and 0 p0 , respectively, creates two more S-columns, and thus, .ie ee similar trajectories. In this case, transmitting their location w e data separately leads to redundancy. Therefore, in this results in a shorter merged sequence with 42 items. To w .i w w section, we concentrate on the problem of compressing select a representative symbol for a D-column, we includes :// w tp //w multiple similar sequences of a group of moving objects. a selection criterion to minimize the average deviation ht ttp: Consider an illustrative example in Fig. 5a, where three between the real locations and reported locations for a h location sequences S0 , S1 , and S2 represent the trajectories group of objects at each time interval as follows: of a group of three moving objects. Items with the same index belong to a column, and a column containing Selection criterion. The maximal distance between the reported identical symbols is called an S-column; otherwise, the location and the real location is below a specified error
  • 8.
  • 9. bound column is called a D-column. Since sending the items in an eb, i.e., when the ith column is a D-column,
  • 11. eb S-column individually causes redundancy, our basic idea must hold for 0 j n, where n is the number of sequences. of compressing multiple sequences is to trim the items in an TABLE 1 Description of the Notations
  • 12. 102 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011 Fig. 7. An example of the Merge algorithm: the merged sequence with eb ¼ 1. jfs js ¼ ;0 jLgj as pi ¼ j j L i , and L ¼ jS j. According to Shannon’s source coding theorem, the optimal code length for symbol i is log2 pi , which is also called the information content of i . Information entropy (or Shannon’s entropy) is thereby defined as the overall sum of the information content over Æ, X eðSÞ ¼ eðp0 ; p1 ; . . . ; pjÆjÀ1 Þ ¼ Àpi  log2 pi : ð2Þ 0 ijÆj Shannon’s entropy represents the optimal average code length in data compression, where the length of a symbol’s Fig. 6. The Merge algorithm. codeword is proportional to its information content [17]. A property of Shannon’s entropy is that the entropy is Therefore, to compress the location sequences for a the maximum, while all probabilities are of the same value. group of moving objects, we propose the Merge algorithm Thus, in the case without considering the regularity in the shown in Fig. 6. The input of the algorithm contains a group movements of objects, the occurrence probabilities of of sequences fSi j0 i ng and an error bound eb, while symbols are assumed to be uniform such that the entropy the output is a merged sequence that represents the group t.c om of a location sequence as well as the compression ratio is om of sequences. Specifically, the Merge algorithm processes po t.c the sequences in a columnwise way. For each column, the fixed and dependent on the size of a sensor cluster, e.g., for gs po a location sequence with length D collected in a sensor algorithm first checks whether it is an S-column. For an S- lo s .b og 1 column, it retains the value of the items as Lines 4-5. cluster of 16 nodes, the entropy of the sequence is eð16 , ts .bl 1 1 Otherwise, while an error bound eb 0 is specified, a 16 ; . . . ; 16Þ ¼ 4. Consequently, 4D bits are needed to repre- ec ts oj c representative symbol is selected according to the selection sent the location sequence. Nevertheless, since the move- pr oje criterion as Line 7. If a qualified symbol exists to represent ments of a moving object are of some regularity, the re r lo rep the column, the algorithm outputs it as Lines 15-18. occurrence probabilities of symbols are probably skewed xp lo Otherwise, the items in the column are retained and and the entropy is lower. For example, if two probabilities ee xp .ie ee 1 3 wrapped by a pair of “/” as Lines 9-13. The process repeats become 32 and 32 , the entropy is reduced to 3.97. Thus, w e until all columns are examined. Afterward, the merged instead of encoding the sequence using 4 bits per symbol, w .i w w sequence S 00 is generated. :// w only 3:97D bits are required for the same information. This tp //w Following the example shown in Fig. 5a, with a specified example points out that when a moving object exhibits ht ttp: error bound eb as 1, the Merge algorithm generates the some degree of regularity in their movements, the skewness h solution, as shown in Fig. 7; the merged sequence contains of these probabilities lowers the entropy. Seeing that data only 20 items, i.e., 40 items are curtailed. with lower entropy require fewer bits to represent the same 5.2 Entropy Reduction Phase information, reducing the entropy thereby benefits for data compression and, by extension, storage and transmission. In the entropy reduction phase, we propose the Replace Motivated by the above observation, we design the algorithm to minimize the entropy of the merged sequence Replace algorithm to reduce the entropy of a location obtained in the sequence merge phase. Since data with sequence. Our algorithm imposes the hit symbols on the lower entropy require fewer bits for storage and transmis- location sequence to increase the skewness. Specifically, the sion [17], we replace some items to reduce the entropy algorithm uses the group movement patterns built in both without loss of information. The object movement patterns the transmitter (CH) and the receiver (sink) as the discovered by our distributed mining algorithm enable us to find the replaceable items and facilitate the selection of prediction model to decide whether an item of a sequence items in our compression algorithm. In this section, we first is predictable. A CH replaces the predictable items each introduce and define the HIR problem, and then, explore with a hit symbol to reduce the location sequence’s entropy the properties of Shannon’s entropy to solve the HIR when compressing it. After receiving the compressed problem. We extend the concentration property for entropy sequence, the sink node decompresses it and substitutes reduction and discuss the benefits of replacing multiple every hit symbol with the original symbol by the identical symbols simultaneously. We derive three replacement rules prediction model, and no information loss occurs. for the HIR problem and prove that the entropy of the Here, an item si of a sequence S is predictable item if obtained solution is minimized. predict nextðT ; S½0::i À 1ŠÞ is the same value as si . A symbol is a predictable symbol once an item of the symbol is 5.2.1 Three Derived Replacement Rules predictable. For ease of explanation, we use a taglst4 to Let P ¼ fp0 , p1 ; . . . ; pjÆjÀ1 g denote the probability distribu- 4. A taglst associated with a sequence S is a sequence of 0s and 1s tion of the alphabet Æ corresponding to a location sequence obtained as follows: For those predictable items in S, the corresponding S, where pi is the occurrence probability of i in Æ, defined items in taglst are set as 1. Otherwise, their values in taglst are 0.
  • 13. TSAI ET AL.: EXPLORING APPLICATION-LEVEL SEMANTICS FOR DATA COMPRESSION 103 Fig. 8. An example of the simple method. The first row represents the index, and the second row is the location sequence. The third row is the taglst which represents the prediction results. The last row is the result of the simple approach. Note that items 1, 2, 7, 9, 11, 12, 15, and 22 are predictable items such that the set of predictable symbols is s ¼ fk, a, j, f, og. In the example, items 2, 9, and 10 are items of 0 o0 such that the number of items of ^ symbol 0 o0 is nð0 o0 Þ ¼ 3, whereas items 2 and 9 are the predictable items of 0 o0 , and the number of predictable items of symbol 0 o0 is nhit ð0 o0 Þ ¼ 2. Fig. 9. Problems of the simple approach. express whether each item of S is predictable in the fifth, called the strong concentration property, is derived following sections. Consider the illustrative example shown and proved in the paper.5 in Fig. 8, the probability distribution of Æ corresponding to t.c om Property 1 (Expansibility). Adding a probability with a value om S is P ¼ f0:04; 0:16; 0:04; 0; 0; 0:08; 0:04; 0; 0:04; 0:12; 0:24; 0; po t.c of zero does not change the entropy, i.e., eðp0 , p1 ; . . . ; pjÆjÀ1 , 0; 0:12; 0:12; 0g, and items 1, 2, 7, 9, 11, 12, 15, and 22 are gs po pjÆj Þ is identical to eðp0 , p1 ; . . . ; pjÆjÀ1 Þ when pjÆj is zero. lo s predictable items. To reduce the entropy of the sequence, a .b og ts .bl simple approach is to replace each of the predictable items According to Property 1, we add a new symbol 0 :0 to Æ ec ts with the hit symbol and obtain an intermediate sequence oj c without affecting the entropy and denote its probability as pr oje S 0 ¼00n::kfbb:n:o::ij:bbnkkk:gc00 . Compared with the original p16 such that P is fp0 , p1 ; . . . ; p15 , p16 ¼ 0g. re r sequence S with entropy 3.053, the entropy of S 0 is reduced lo rep to 2.854. Encoding S and S 0 by the Huffman coding Property 2 (Symmetry). Any permutation of the probability xp lo ee xp technique, the lengths of the output bit streams are 77 and values does not change to the entropy. For example, .ie ee 73 bits, respectively, i.e., 5 bits are conserved by the simple eð0:1; 0:4; 0:5Þ is identical to eð0:4; 0:5; 0:1Þ. w e w .i w w approach. Property 3 (Accumulation). Moving all the value from one :// w However, the above simple approach does not always tp //w probability to another such that the former can be thought ht ttp: minimize the entropy. Consider the example shown in Fig. 9a, of as being eliminated decreases the entropy, i.e., an intermediate sequence with items 1 and 19 unreplaced has h eðp0 ; p1 ; . . . ; 0; . . . ; pi þ pj ; . . . ; pjÆjÀ1 Þ is equal or less than lower entropy than that generated by the simple approach. eðp0 ,p1 ; . . . ; pi ; . . . ; pj ; . . . ; pjÆjÀ1 Þ. For the example shown in Fig. 9b, the simple approach even increases the entropy from 2.883 to 2.963. With Properties 2 and 3, if all the items of symbol are We define the above problem as the HIR problem and predictable, i.e., nðÞ ¼ nhit ðÞ, replacing all the items of formulate it as follows: by 0 :0 will not affect the entropy. If there are multiple Definition 3 (HIR problem). Given a sequence S ¼ fsi jsi 2 symbols having nðÞ ¼ nhit ðÞ, replacing all the items of Æ; 0 i Lg and a taglst, an intermediate sequence is a these symbols can reduce the entropy. Thus, we derive the generation of S, denoted by S 0 ¼ fs0i j0 i Lg, where s0i is first replacement rule—the accumulation rule: Replace all equal to si if taglst½iŠ ¼ 0. Otherwise, s0i is equal to si or 0 :0 . items of symbol in s, where nðÞ ¼ nhit ðÞ. ^ The HIR problem is to find the intermediate sequence S 0 such Property 4 (Concentration). For two probabilities, moving a that the entropy of S 0 is minimal for all possible intermediate value from the lower probability to the higher probability sequences. decreases the entropy, i.e., if pi pj , for 0 Áp pi , eðp0 ,p1 ; . . . ; pi À Áp; . . . ; pj þ Áp; . . . ; pjÆjÀ1 Þ is less than A brute-force method to the HIR problem is to enumer- ate all possible intermediate sequences to find the optimal eðp0 ,p1 ; . . . ; pi ; . . . ; pj ; . . . ; pjÆjÀ1 Þ. solution. However, this brute-force approach is not scalable, Property 5 (Strong concentration). For two probabilities, especially when the number of the predictable items is moving a value that is larger than the difference of the two large. Therefore, to solve the HIR problem, we explore probabilities from the higher probability to the lower one properties of Shannon’s entropy to derive three replace- decreases the entropy, i.e., if pi pj , for pi À pj Áp pi , ment rules that our Replace algorithm leverages to obtain the optimal solution. Here, we list five most relevant 5. Due to the page limit, the proofs of the strong concentration rule and Lemmas 1-4 are given in Appendices C and D, respectively, which can be properties to explain the replacement rules. The first four found on the Computer Society Digital Library at http://doi. properties can be obtained from [17], [53], [54], while the ieeecomputersociety.org/10.1109/TKDE.2010.30.
  • 14. 104 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011 eðp0 ,p1 ; . . . ; pi À Áp; . . . ; pj þ Áp; . . . ; pjÆjÀ1 Þ is less than as Lemma 4. We thereby derive the third replacement eðp0 ,p1 ; . . . ; pi ; . . . ; pj ; . . . ; pjÆjÀ1 Þ. rule—the multiple symbol rule: Replace all of the predictable items of every symbol in s0 if gainð^0 Þ 0. ^ s According to Properties 4 and 5, we conclude that if the Lemma 3. If exist 0 an pin , n ¼ 1; . . . ; k, such that difference of two probabilities increases, the entropy P fi1 i2 ...ik j ða1 ; a2 ; . . . ; ak Þ is minimal, we have pj þ k ai ! i¼1 decreases. When the conditions conform to Properties 4 pin À an , n ¼ 1; . . . ; k. and 5, we further prove that the entropy is monotonically reduced as Áp increases such that replacing all predictable Lemma 4. If exist xmin1 , xmin2 ; . . . , and xmink such that pj þ Pk items of a qualified symbol reduces the entropy mostly as i¼1 xmini pin À xminn for n ¼ 1; 2; . . . ; k,fi1 i2 ...ik j ðx1 ; x2 ; Lemmas 1 and 2. . . . ; xk Þ is monotonically decreasing for xminn xn pin , Definition 4 (Concentration function). For a probability n ¼ 1; . . . ; k. distribution P ¼ fp0 , p1 ; . . . ; pjÆjÀ1 g, we define the concentra- tion function of k probabilities pi1 , pi2 ; . . . ; pik , and pj as 5.2.2 The Replace Algorithm Based on the observations described in the previous section, fi1 i2 ...ik j ðx1 ; x2 ; . . . ; xk Þ ¼ e p0 ; . . . ; pi1 À x1 ; . . . ; pi2 we propose the Replace algorithm that leverages the three replacement rules to obtain the optimal solution for the HIR À x2 ; . . . ; pik À xk ; . . . ; pj problem. Our algorithm examines the predictable symbols Xk on their statistics, which include the number of items and þ xi ; . . . ; pj P jÀ1 : the number of predictable items of each predictable symbol. i¼1 The algorithm first replaces the qualified symbols according to the accumulation rule. Afterward, since the concentration rule and the multiple symbol rule are related to nð0 :0 Þ, which Lemma 1. If pi pj , for any x such that 0 x pi , the is increased after every replacement, the algorithm itera- t.c om concentration function of pi and pj is monotonic decreasing. tively replaces the qualified symbols according to the two om po t.c Lemma 2. If pi pj , for any x such that pi À pj x pi , the rules until all qualified symbols are replaced. The algorithm concentration function fij ðxÞ of pi and pj is monotonic gs po thereby replaces qualified symbols and reduces the entropy lo s .b og decreasing. toward the optimum gradually. Compared with the brute- ts .bl force method that enumerates all possible intermediate ec ts oj c Accordingly, we derive the second replacement rule—the sequences for the optimum in exponential complexity, the pr oje Replace algorithm that leverages the derived rules to obtain re r concentration rule: Replace all predictable items of symbol lo rep in s, where nðÞ nð0 :0 Þ or nhit ðÞ nðÞ À nð0 :0 Þ. ^ the optimal solution in OðLÞ time6 is more scalable and xp lo efficient. We prove that the Replace algorithm guarantees to ee xp As an extension of the above properties, we also explore .ie ee the entropy variation, while predictable items of multiple reduce the entropy monotonically and obtains the optimal w e symbols are replaced simultaneously. Let s0 denote a subset solution of the HIR problem as Theorem 1. Next, we detail w .i ^ w w :// w of s, where j^0 j ! 2. We prove that if replacing predictable ^ s the replace algorithm and demonstrate the algorithm by an tp //w symbols in s0 can minimize the entropy corresponding to ^ illustrative example. ht ttp: symbols in s0 , the number of hit symbols after the ^ Theorem 1.7 The Replace algorithm obtains the optimal solution h replacement must be larger than the number of the of the HIR problem. predictable symbols after the replacement as Lemma 3. To investigate whether the converse statement exists, we Fig. 10 shows the Replace algorithm. The input includes conduct an experiment in a brute-force way. However, the a location sequence S and a predictor Tg , while the output, experimental results show that even under the condition P denoted by S 0 , is a sequence in which qualified items are that nð0 :0 Þ þ 80 2^0 nhit ð0 Þ ! nðÞ À nhit ðÞ for every in s0 , s ^ replaced by 0 :0 . Initially, Lines 3-9 of the algorithm find the replacing predictable items of the symbols in s0 does not ^ set of predictable symbols together their statistics. Then, it guarantee the reduction of the entropy. Therefore, we exams the statistics of the predictable symbols according to compare the difference of the entropy before and after the three replacement rules as follows: First, according to replacing symbols in s0 as ^ the accumulation rule, it replaces qualified symbols in one X p scan of the predictable symbols as Lines 10-14. Next, the ^ gainðS 0 Þ ¼ À p  log2 À Áp algorithm iteratively exams for the concentration and the 82^s 0 p À Áp multiple symbol rules by two loops. The first loop from  log2 ðp À Áp Þ Line 16 to Line 22 is for the concentration, whereas the p0 :0 second loop from Line 25 to Line 36 is for the multiple À p0 :0  log2 0 0 P symbol rule. In our design, since finding a combination of p : þ 82^0 ÁpÀ s ! predictable symbols to make gainð^0 Þ 0 hold is more s X X þ Áp  log2 p þ 0 :0 ÁpÀ : costly, the algorithm is prone to replace symbols with the 82^0 s 82^0 s 6. The complexity analysis is given in Appendix E, which can be found In addition, we also prove that once replacing partial on the Computer Society Digital Library at http://doi.ieeecomputersociety. predictable items of symbols in s0 reduces entropy, repla- ^ org/10.1109/TKDE.2010.30. 7. The proof of Theorem 1 is given in Appendix F, which can be found on cing all predictable items of these symbols reduces the the Computer Society Digital Library at http://doi.ieeecomputersociety. entropy mostly since the entropy decreases monotonically org/10.1109/TKDE.2010.30.