IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011
Exploring Application-Level Semantics
for Data Compression
Hsiao-Ping Tsai, Member, IEEE, De-Nian Yang, and Ming-Syan Chen, Fellow, IEEE
Abstract—Natural phenomena show that many creatures form large social groups and move in regular patterns. However, previous
works focus on finding the movement patterns of each single object or all objects. In this paper, we first propose an efficient distributed
mining algorithm to jointly identify a group of moving objects and discover their movement patterns in wireless sensor networks.
Afterward, we propose a compression algorithm, called 2P2D, which exploits the obtained group movement patterns to reduce the
amount of delivered data. The compression algorithm includes a sequence merge phase and an entropy reduction phase. In the sequence
merge phase, we propose a Merge algorithm to merge and compress the location data of a group of moving objects. In the entropy
reduction phase, we formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that obtains the optimal
solution. Moreover, we devise three replacement rules and derive the maximum compression ratio. The experimental results show that
the proposed compression algorithm leverages the group movement patterns to reduce the amount of delivered data effectively and
efficiently.
Index Terms—Data compression, distributed clustering, object tracking.
1 INTRODUCTION
Recent advances in location-acquisition technologies, such as global positioning systems (GPSs) and wireless sensor networks (WSNs), have fostered many novel applications like object tracking, environmental monitoring, and location-dependent service. These applications generate a large amount of location data and thus lead to transmission and storage challenges, especially in resource-constrained environments like WSNs. To reduce the data volume, various algorithms have been proposed for data compression and data aggregation [1], [2], [3], [4], [5], [6]. However, the above works do not address application-level semantics, such as the group relationships and movement patterns, in the location data.

In object tracking applications, many natural phenomena show that objects often exhibit some degree of regularity in their movements. For example, the famous annual wildebeest migration demonstrates that the movements of creatures are temporally and spatially correlated. Biologists also have found that many creatures, such as elephants, zebra, whales, and birds, form large social groups when migrating to find food, or for breeding or wintering. These characteristics indicate that the trajectory data of multiple objects may be correlated for biological applications. Moreover, some research domains, such as the study of animals' social behavior and wildlife migration [7], [8], are more concerned with the movement patterns of groups of animals, not individuals; hence, tracking each object is unnecessary in this case. This raises a new challenge of finding moving animals belonging to the same group and identifying their aggregated group movement patterns.

Therefore, under the assumption that objects with similar movement patterns are regarded as a group, we define the moving object clustering problem as follows: given the movement trajectories of objects, partition the objects into nonoverlapped groups such that the number of groups is minimized. Then, group movement pattern discovery is to find the most representative movement patterns regarding each group of objects, which are further utilized to compress location data.

Discovering the group movement patterns is more difficult than finding the patterns of a single object or all objects, because we need to jointly identify a group of objects and discover their aggregated group movement patterns. The constrained resources of WSNs should also be considered in approaching the moving object clustering problem. However, few existing approaches consider these issues simultaneously. On the one hand, the temporal-and-spatial correlations in the movements of moving objects are modeled as sequential patterns in data mining to discover the frequent movement patterns [9], [10], [11], [12]. However, sequential patterns 1) consider the characteristics of all objects, 2) lack information about a frequent pattern's significance regarding individual trajectories, and 3) carry no time information between consecutive items, which makes them unsuitable for location prediction and similarity comparison. On the other hand, previous works, such as [13], [14], [15], measure the similarity among the entire trajectory sequences to group moving objects.

H.-P. Tsai is with the Department of Electrical Engineering (EE), National Chung Hsing University, No. 250, Kuo Kuang Road, Taichung 402, Taiwan, ROC. E-mail: hptsai@nchu.edu.tw.
D.-N. Yang is with the Institute of Information Science (IIS) and the Research Center of Information Technology Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan, ROC. E-mail: dnyang@iis.sinica.edu.tw.
M.-S. Chen is with the Research Center of Information Technology Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan, and the Department of Electrical Engineering (EE), Department of Computer Science and Information Engineering (CSIE), and Graduate Institute of Communication Engineering (GICE), National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan, ROC. E-mail: mschen@cc.ee.ntu.edu.tw.
Manuscript received 22 Sept. 2008; revised 27 Feb. 2009; accepted 29 July 2009; published online 4 Feb. 2010.
Recommended for acceptance by M. Ester.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2008-09-0497.
Digital Object Identifier no. 10.1109/TKDE.2010.30.
1041-4347/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.
Since objects may be close together in some types of terrain, such as gorges, and widely distributed in less rugged areas, their group relationships are distinct in some areas and vague in others. Thus, approaches that perform clustering among entire trajectories may not be able to identify the local group relationships. In addition, most of the above works are centralized algorithms [9], [10], [11], [12], [13], [14], [15], which need to collect all data to a server before processing. Thus, unnecessary and redundant data may be delivered, leading to much more power consumption, because data transmission needs more power than data processing in WSNs [5]. In [16], we have proposed a clustering algorithm to find the group relationships for query and data aggregation efficiency. The differences between [16] and this work are as follows: First, since the clustering algorithm itself is a centralized algorithm, in this work, we further consider systematically combining multiple local clustering results into a consensus to improve the clustering quality and for use in the update-based tracking network. Second, when a delay is tolerable in the tracking application, a new data management approach is required to offer transmission efficiency, which also motivates this study. We thus define the problem of compressing the location data of a group of moving objects as the group data compression problem.

Therefore, in this paper, we first introduce our distributed mining algorithm to approach the moving object clustering problem and discover group movement patterns. Then, based on the discovered group movement patterns, we propose a novel compression algorithm to tackle the group data compression problem. Our distributed mining algorithm comprises a Group Movement Pattern Mining (GMPMine) algorithm and a Cluster Ensembling (CE) algorithm. It avoids transmitting unnecessary and redundant data by transmitting only the local grouping results to a base station (the sink), instead of all of the moving objects' location data. Specifically, the GMPMine algorithm discovers the local group movement patterns by using a novel similarity measure, while the CE algorithm combines the local grouping results to remove inconsistency and improve the grouping quality by using information theory.

Different from previous compression techniques that remove redundancy of data according to the regularity within the data, we devise a novel two-phase and 2D algorithm, called 2P2D, which utilizes the discovered group movement patterns shared by the transmitting node and the receiving node to compress data. In addition to removing redundancy of data according to the correlations within the data of each single object, the 2P2D algorithm further leverages the correlations of multiple objects and their movement patterns to enhance the compressibility. Specifically, the 2P2D algorithm comprises a sequence merge phase and an entropy reduction phase. In the sequence merge phase, we propose a Merge algorithm to merge and compress the location data of a group of objects. In the entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem to minimize the entropy of the merged data and propose a Replace algorithm to obtain the optimal solution. The Replace algorithm finds the optimal solution of the HIR problem based on Shannon's theorem [17] and guarantees the reduction of entropy, which is conventionally viewed as an optimization bound of compression performance. As a result, our approach reduces the amount of delivered data and, by extension, the energy consumption in WSNs.

Our contributions are threefold:

- Different from previous works, we formulate a moving object clustering problem that jointly identifies a group of objects and discovers their movement patterns. The application-level semantics are useful for various applications, such as data storage and transmission, task scheduling, and network construction.
- To approach the moving object clustering problem, we propose an efficient distributed mining algorithm to minimize the number of groups such that members in each of the discovered groups are highly related by their movement patterns.
- We propose a novel compression algorithm to compress the location data of a group of moving objects with or without loss of information. We formulate the HIR problem to minimize the entropy of location data and explore Shannon's theorem to solve the HIR problem. We also prove that the proposed compression algorithm obtains the optimal solution of the HIR problem efficiently.

The remainder of the paper is organized as follows: In Section 2, we review related works. In Section 3, we provide an overview of the network, location, and movement models and formulate our problem. In Section 4, we describe the distributed mining algorithm. In Section 5, we formulate the compression problems and propose our compression algorithm. Section 6 details our experimental results. Finally, we summarize our conclusions in Section 7.

2 RELATED WORK

2.1 Movement Pattern Mining
Agrawal and Srikant [18] first defined the sequential pattern mining problem and proposed an Apriori-like algorithm to find the frequent sequential patterns. Han et al. consider the pattern projection method in mining sequential patterns and proposed FreeSpan [19], which is an FP-growth-based algorithm. Yang and Hu [9] developed a new match measure for imprecise trajectory data and proposed TrajPattern to mine sequential patterns. Many variations derived from sequential patterns are used in various applications; e.g., Chen et al. [20] discover path traversal patterns in a Web environment, while Peng and Chen [21] mine user moving patterns incrementally in a mobile computing system. However, sequential patterns and their variations like [20], [21] do not provide sufficient information for location prediction or clustering. First, they carry no time information between consecutive items, so they cannot provide accurate information for location prediction when time is concerned. Second, they consider the characteristics of all objects, which makes the meaningful movement characteristics of individual objects or a group of moving objects inconspicuous and ignored. Third, because a sequential pattern lacks information about its significance regarding each individual trajectory, they are not fully representative of individual trajectories. To discover significant patterns for location prediction, Morzy mines frequent trajectories whose consecutive items are also adjacent in the original trajectory data [10], [22]. Meanwhile, Giannotti et al. [11] extract T-patterns from spatiotemporal data sets to provide concise descriptions of frequent movements, and Tseng and Lin [12] proposed the TMP-Mine algorithm for discovering the temporal movement patterns. However, the above Apriori-like or FP-growth-based algorithms still focus on discovering frequent patterns of all objects and may suffer from computing efficiency or memory problems, which make them unsuitable for use in resource-constrained environments.

Fig. 1. (a) The hierarchical- and cluster-based network structure and the data flow of an update-based tracking network. (b) A flat view of a two-layer network structure with 16 clusters.
2.2 Clustering
Recently, clustering based on objects' movement behavior has attracted more attention. Wang et al. [14] transform the location sequences of users into transaction-like data, based on which valid groups are obtained, but the proposed AGP and VG-growth are still Apriori-like or FP-growth-based algorithms that suffer from high computing cost and memory demand. Nanni and Pedreschi [15] proposed a density-based clustering algorithm, which makes use of an optimal time interval and the average euclidean distance between each point of two trajectories, to approach the trajectory clustering problem. However, the above works discover global group relationships based on the proportion of the time a group of users stay close together to the whole time duration, or on the average euclidean distance of the entire trajectories. Thus, they may not be able to reveal the local group relationships, which are required for many applications. In addition, though computing the average euclidean distance of two geometric trajectories is simple and useful, the geometric coordinates are expensive and not always available. Approaches such as EDR, LCSS, and DTW are widely used to compute the similarity of symbolic trajectory sequences [13], but these dynamic programming approaches suffer from a scalability problem [23]. To provide scalability, approximation or summarization techniques are used to represent the original data. Guralnik and Karypis [23] project each sequence into a vector space of sequential patterns and use a vector-based K-means algorithm to cluster objects. However, the importance of a sequential pattern regarding individual sequences can be very different, which is not considered in that work. To cluster sequences, Yang and Wang proposed CLUSEQ [24], which iteratively identifies a sequence with a learned model, yet the generated clusters may overlap, which differentiates their problem from ours.

2.3 Data Compression
Data compression can reduce the storage and energy consumption for resource-constrained applications. In [1], distributed source (Slepian-Wolf) coding uses joint entropy to encode two nodes' data individually without sharing any data between them; however, it requires prior knowledge of the cross correlations of sources. Other works, such as [2], [4], combine data compression with routing by exploiting cross correlations between sensor nodes to reduce the data size. In [5], a tailed LZW has been proposed to address the memory constraint of a sensor device. Summarization of the original data by regression or linear modeling has been proposed for trajectory data compression [3], [6]. However, the above works do not address application-level semantics in data, such as the correlations of a group of moving objects, which we exploit to enhance the compressibility.

3 PRELIMINARIES

3.1 Network and Location Models
Many researchers believe that a hierarchical architecture provides better coverage and scalability, and also extends the network lifetime of WSNs [25], [26]. In a hierarchical WSN, such as that proposed in [27], the energy, computing, and storage capacities of sensor nodes are heterogeneous. A high-end sophisticated sensor node, such as the Intel Stargate [28], is assigned as a cluster head (CH) to perform high-complexity tasks, while a resource-constrained sensor node, such as the Mica2 mote [29], performs the sensing and low-complexity tasks. In this work, we adopt a hierarchical network structure with K layers, as shown in Fig. 1a, where sensor nodes are clustered in each level and collaboratively gather or relay remote information to a base station called a sink. A sensor cluster is a mesh network of n × n sensor nodes handled by a CH, and the nodes communicate with each other by using multihop routing [30]. We assume that each node in a sensor cluster has a locally unique ID and denote the sensor IDs by an alphabet Σ. Fig. 1b shows an example of a two-layer tracking network, where each sensor cluster contains 16 nodes identified by Σ = {a, b, ..., p}.

In this work, an object is defined as a target, such as an animal or a bird, that is recognizable and trackable by the tracking network. To represent the location of an object, geometric models and symbolic models are widely used [31]. A geometric location denotes precise two-dimension or three-dimension coordinates, while a symbolic location represents an area, such as the sensing area of a sensor node or a cluster of sensor nodes, defined by the application. Since the accurate geometric location is not easy to obtain, and techniques like the Received Signal Strength (RSS) [32] can simply estimate an object's location based on the ID of the sensor node with the strongest signal, we employ a symbolic model and describe the location of an object by using the ID of a nearby sensor node.

Object tracking is defined as a task of detecting a moving object's location and reporting the location data to the sink periodically at a time interval.
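As a small illustration of the symbolic model described above, the following Python sketch (with hypothetical field names, not from the paper) stores each report as a (timestamp, object ID, sensor ID) record and assembles per-object location sequences over the sensor-ID alphabet:

```python
from collections import defaultdict
from typing import NamedTuple

# One report forwarded toward the cluster head: a timestamp, the
# detected object's ID, and its symbolic location (a sensor ID).
class Observation(NamedTuple):
    timestamp: int
    object_id: str
    location: str  # a symbol from the sensor-ID alphabet, e.g. 'a'..'p'

def location_sequences(observations):
    """Group observations by object and order them by time, yielding
    each object's location sequence S = s0 s1 ... s(L-1)."""
    by_object = defaultdict(list)
    for ob in observations:
        by_object[ob.object_id].append(ob)
    return {
        oid: "".join(ob.location
                     for ob in sorted(obs, key=lambda o: o.timestamp))
        for oid, obs in by_object.items()
    }

obs = [Observation(2, "zebra-1", "o"), Observation(1, "zebra-1", "n"),
       Observation(3, "zebra-1", "k"), Observation(1, "zebra-2", "b")]
print(location_sequences(obs))  # {'zebra-1': 'nok', 'zebra-2': 'b'}
```

In this encoding, a batch of such records accumulated at a CH is exactly the data that the later compression phases operate on.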
Fig. 2. An example of an object's moving trajectory, the obtained location sequence, and the generated PST T.

Hence, an observation on an object is defined by the obtained location data. We assume that sensor nodes wake up periodically to detect objects. When a sensor node wakes up on its duty cycle and detects an object of interest, it transmits the location data of the object to its CH. Here, the location data include a timestamp, the ID of an object, and its location. We also assume that the targeted applications are delay-tolerant. Thus, instead of forwarding the data upward immediately, the CH compresses the data accumulated for a batch period and sends it to the CH of the upper layer. The process is repeated until the sink receives the location data. Consequently, the trajectory of a moving object is modeled as a series of observations and expressed by a location sequence, i.e., a sequence of sensor IDs visited by the object. We denote a location sequence by S = s_0 s_1 ... s_{L-1}, where each item s_i is a symbol in Σ and L is the sequence length. An example of an object's trajectory and the obtained location sequence is shown in the left of Fig. 2. The tracking network tracks moving objects for a period and generates a location sequence data set, based on which we discover the group relationships of the moving objects.

3.2 Variable Length Markov Model (VMM) and Probabilistic Suffix Tree (PST)
If the movements of an object are regular, the object's next location can be predicted based on its preceding locations. We model the regularity by using the Variable Length Markov Model (VMM). Under the VMM, an object's movement is expressed by a conditional probability distribution over Σ. Let s denote a pattern, which is a subsequence of a location sequence S, and let σ denote a symbol in Σ. The conditional probability P(σ|s) is the occurrence probability that σ will follow s in S. Since the length of s is floating, the VMM provides flexibility to adapt to the variable length of movement patterns. Note that when a pattern s occurs more frequently, it carries more information about the movements of the object and is thus more desirable for the purpose of prediction. To find the informative patterns, we first define a pattern as a significant movement pattern if its occurrence probability is above a minimal threshold.

To learn the significant movement patterns, we adapt the Probabilistic Suffix Tree (PST) [33], for it has the lowest storage requirement among many VMM implementations [34]. The PST's low complexity, i.e., O(n) in both time and space [35], also makes it more attractive, especially for streaming or resource-constrained environments [36]. The PST building algorithm¹ learns from a location sequence and generates a PST whose height is limited by a specified parameter L_max. Each node of the tree represents a significant movement pattern s whose occurrence probability is above a specified minimal threshold P_min. It also carries the conditional empirical probabilities P(σ|s) for each σ in Σ, which we use in location prediction. Fig. 2 shows an example of a location sequence and the corresponding PST. node_f is one of the children of the root node and represents a significant movement pattern "f" whose occurrence probability is above 0.01. The conditional empirical probabilities are P('b'|"f") = 0.33, P('e'|"f") = 0.67, and P(σ|"f") = 0 for the other σ in Σ.

A PST is frequently used in predicting the occurrence probability of a given sequence, which provides important information for similarity comparison. The occurrence probability of a sequence s regarding a PST T, denoted by P^T(s), is the prediction of the occurrence probability of s based on T, where each symbol is conditioned on the longest suffix of its preceding symbols that appears as a node of T. For example, the occurrence probability P^T("nokjfb") is computed as follows:

P^T("nokjfb") = P^T("n") × P^T('o'|"n") × P^T('k'|"no") × P^T('j'|"nok") × P^T('f'|"nokj") × P^T('b'|"nokjf")
             = P^T("n") × P^T('o'|"n") × P^T('k'|"o") × P^T('j'|"k") × P^T('f'|"j") × P^T('b'|"okjf")
             = 0.05 × 1 × 1 × 1 × 1 × 0.33 = 0.0165.

Fig. 3. The predict_next algorithm.

A PST is also useful and efficient in predicting the next item of a sequence. For a given sequence s and a PST T, our predict_next algorithm, as shown in Fig. 3, outputs the most probable next item, denoted by predict_next(T, s). We demonstrate its efficiency by an example as follows: Given s = "nokjf" and the T shown in Fig. 2, the predict_next algorithm traverses the tree to the deepest node node_okjf along the path including node_root, node_f, node_jf, node_kjf, and node_okjf. Finally, symbol 'e', which has the highest conditional empirical probability in node_okjf, is returned, i.e., predict_next(T, "nokjf") = 'e'. The algorithm's computational overhead is limited by the height of the PST, so it is suitable for sensor nodes.

1. The PST building algorithm is given in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
3.3 Problem Description
We formulate the problem of this paper as exploring the group movement patterns to compress the location sequences of a group of moving objects for transmission efficiency. Consider a set of moving objects O = {o_1, o_2, ..., o_n} and their associated location sequence data set S = {S_1, S_2, ..., S_n}.

Definition 1. Two objects are similar to each other if their movement patterns are similar. Given the similarity measure function sim_p² and a minimal threshold sim_min, o_i and o_j are similar if their similarity score sim_p(o_i, o_j) is above sim_min, i.e., sim_p(o_i, o_j) ≥ sim_min. The set of objects that are similar to o_i is denoted by so(o_i) = {o_j | ∀o_j ∈ O, sim_p(o_i, o_j) ≥ sim_min}.

Definition 2. A set of objects is recognized as a group if they are highly similar to one another. Let g denote a set of objects. g is a group if every object in g is similar to at least a threshold γ of the objects in g, i.e., ∀o_i ∈ g, |so(o_i) ∩ g| / |g| ≥ γ, where γ has default value 1/2.³

We formally define the moving object clustering problem as follows: Given a set of moving objects O together with their associated location sequence data set S and a minimal similarity threshold sim_min, the moving object clustering problem is to partition O into nonoverlapped groups, denoted by G = {g_1, g_2, ..., g_i}, such that the number of groups is minimized, i.e., |G| is minimal. Thereafter, with the discovered group information and the obtained group movement patterns, the group data compression problem is to compress the location sequences of a group of moving objects for transmission efficiency. Specifically, we formulate the group data compression problem as a merge problem and an HIR problem. The merge problem is to combine multiple location sequences to reduce the overall sequence length, while the HIR problem aims to minimize the entropy of a sequence such that the amount of data is reduced with or without loss of information.

4 MINING OF GROUP MOVEMENT PATTERNS
To tackle the moving object clustering problem, we propose a distributed mining algorithm, which comprises the GMPMine and CE algorithms. First, the GMPMine algorithm uses a PST to generate an object's significant movement patterns and computes the similarity of two objects by using sim_p to derive the local grouping results. The merits of sim_p include its accuracy and efficiency: First, sim_p considers the significance of each movement pattern regarding individual objects, so it achieves better accuracy in similarity comparison. Since a PST can be used to predict a pattern's occurrence probability, which is viewed as the significance of the pattern regarding the PST, sim_p includes the movement patterns' predicted occurrence probabilities to provide fine-grained similarity comparison. Second, sim_p can offer seamless and efficient comparison for applications with evolving and evolutionary similarity relationships. This is because sim_p can compare the similarity of two data streams only on the changed mature nodes of emission trees [36], instead of all nodes.

To combine multiple local grouping results into a consensus, the CE algorithm utilizes the Jaccard similarity coefficient to measure the similarity between a pair of objects, and normalized mutual information (NMI) to derive the final ensembling result. It trades off the grouping quality against the computation cost by adjusting a partition parameter. In contrast to approaches that perform clustering among the entire trajectories, the distributed algorithm discovers the group relationships in a distributed manner on sensor nodes. As a result, we can discover group movement patterns to compress the location data in the areas where objects have explicit group relationships. Besides, the distributed design provides the flexibility to take only partial local grouping results into ensembling when the group relationships of moving objects in a specified subregion are of interest. Also, it is especially suitable for heterogeneous tracking configurations, which helps reduce the tracking cost; e.g., instead of waking up all sensors at the same frequency, a fine-grained tracking interval is specified for partial terrain in the migration season to reduce the energy consumption. Rather than deploying the sensors at the same density, they are highly concentrated only in areas of interest to reduce deployment costs.

4.1 The Group Movement Pattern Mining (GMPMine) Algorithm
To provide better discrimination accuracy, we propose a new similarity measure sim_p to compare the similarity of two objects. For each of their significant movement patterns, the new similarity measure considers not merely the two probability distributions but also two weight factors, i.e., the significance of the pattern regarding each PST. The similarity score sim_p of o_i and o_j based on their respective PSTs, T_i and T_j, is defined as follows:

  sim_p(o_i, o_j) = −log ( ( Σ_{s ∈ S̃} √( Σ_{σ ∈ Σ} (P^{T_i}(sσ) − P^{T_j}(sσ))² ) ) / (2 L_max + 2) ),    (1)

where S̃ denotes the union of the significant patterns (node strings) of the two trees. The similarity score sim_p includes the distance associated with a pattern s, defined as

  d(s) = √( Σ_{σ ∈ Σ} (P^{T_i}(sσ) − P^{T_j}(sσ))² )
       = √( Σ_{σ ∈ Σ} (P^{T_i}(s) × P^{T_i}(σ|s) − P^{T_j}(s) × P^{T_j}(σ|s))² ),

where d(s) is the euclidean distance associated with a pattern s over T_i and T_j.

For a pattern s ∈ T, P^T(s) is a significant value because the occurrence probability of s is higher than the minimal support P_min. If o_i and o_j share the pattern s, we have s ∈ T_i and s ∈ T_j, respectively, such that P^{T_i}(s) and P^{T_j}(s) are non-negligible and meaningful in the similarity comparison. Because the conditional empirical probabilities are also parts of a pattern, we consider the conditional empirical probabilities P^T(σ|s) when calculating the distance between two PSTs. Therefore, we sum d(s) for all s ∈ S̃ as the distance between two PSTs. Note that the distance between two PSTs is normalized by its maximal value, i.e., 2 L_max + 2. We take the negative log of the normalized distance between two PSTs as the similarity score, such that a larger value of the similarity score implies a stronger similar relationship, and vice versa. With this definition of the similarity score, two objects are similar to each other if their score is above a specified similarity threshold.

2. sim_p is defined in Section 4.1.
3. In this work, we set the threshold γ to 1/2, as suggested in [37], and leave the similarity threshold as the major control parameter, because training the similarity threshold of two objects is easier than that of a group of objects.
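The pattern distance d(s) and the similarity score sim_p described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the tree encoding (a dict from node string to its occurrence probability and next-symbol distribution), the toy values, and the normalization constant 2·L_max + 2 are ours, the last taken from the equation as reconstructed from the damaged source:

```python
import math

# Each PST is a map from a node string s to a pair
# (P(s), {sigma: P(sigma | s)}): the occurrence probability and the
# conditional empirical probabilities (a hypothetical encoding).
L_MAX = 3

T_i = {
    "f":  (0.10, {"b": 0.3, "e": 0.7}),
    "jf": (0.05, {"b": 0.2, "e": 0.8}),
}
T_j = {
    "f":  (0.10, {"b": 0.4, "e": 0.6}),
    "kf": (0.04, {"e": 1.0}),
}

def pattern_distance(s, Ti, Tj, alphabet):
    """d(s): euclidean distance between the weighted next-symbol
    distributions of pattern s in the two trees."""
    pi, ci = Ti.get(s, (0.0, {}))
    pj, cj = Tj.get(s, (0.0, {}))
    return math.sqrt(sum((pi * ci.get(a, 0.0) - pj * cj.get(a, 0.0)) ** 2
                         for a in alphabet))

def sim_p(Ti, Tj, alphabet, l_max=L_MAX):
    """Negative log of the normalized sum of d(s) over the union of
    node strings of the two trees."""
    patterns = set(Ti) | set(Tj)
    total = sum(pattern_distance(s, Ti, Tj, alphabet) for s in patterns)
    return -math.log(total / (2 * l_max + 2))

alphabet = set("bef")
print(round(sim_p(T_i, T_j, alphabet), 3))
```

Note that the score is symmetric in the two trees, and that identical trees yield a zero distance, for which the negative log diverges; an implementation would clamp that case to a maximal score.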
6. 100 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011
the negative log of the distance between two PSTs as the
similarity score such that a larger value of the similarity
score implies a stronger similar relationship, and vice versa.
With the definition of similarity score, two objects are similar
to each other if their score is above a specified similarity
threshold.
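To make the similarity computation concrete, the score above can be sketched in Python. The PST representation here is a deliberate simplification (each significant pattern s maps to a pair (P(s), {σ: P(σ|s)})); the function names and data layout are illustrative assumptions, not the paper's implementation.

```python
import math

def pattern_distance(s, pst_i, pst_j, alphabet):
    """Euclidean distance d(s) associated with pattern s over two PSTs.

    A PST is modeled as a dict: pattern -> (P(s), {symbol: P(symbol | s)}).
    A pattern absent from a PST contributes probability zero.
    """
    p_i, cond_i = pst_i.get(s, (0.0, {}))
    p_j, cond_j = pst_j.get(s, (0.0, {}))
    return math.sqrt(sum(
        (p_i * cond_i.get(sym, 0.0) - p_j * cond_j.get(sym, 0.0)) ** 2
        for sym in alphabet))

def similarity_score(pst_i, pst_j, alphabet, l_max):
    """Negative log of the normalized distance summed over all patterns."""
    patterns = set(pst_i) | set(pst_j)
    dist = sum(pattern_distance(s, pst_i, pst_j, alphabet) for s in patterns)
    dist /= math.sqrt(2 * l_max + 2)     # normalize by the maximal value
    return -math.log(dist) if dist > 0 else float("inf")
```

Two objects whose PSTs assign similar probabilities to shared patterns receive a larger score, matching the thresholding rule above.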
The GMPMine algorithm includes four steps. First, we extract the
movement patterns from the location sequences by learning a PST for
each object. Second, our algorithm constructs an undirected,
unweighted similarity graph in which similar objects share an edge.
We model the density of a group relationship by the connectivity of
a subgraph, defined as the minimal cut of the subgraph. When the
ratio of the connectivity to the size of the subgraph is higher
than a threshold, the objects corresponding to the subgraph are
identified as a group. Third, since the optimization of the graph
partition problem is intractable in general, we bisect the
similarity graph by leveraging the HCS clustering algorithm [37] to
partition the graph and derive the location group information.
Finally, we select a group PST T_g for each group in order to
conserve memory space, using the formula

T_g = \arg\max_{T \in \mathcal{T}} \sum_{s \in S} P^T(s),

where S denotes the sequences of a group of objects and \mathcal{T}
denotes their PSTs. Due to the page limit, we give an illustrative
example of the GMPMine algorithm in Appendix B, which can be found
on the Computer Society Digital Library at
http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.

4.2 The Cluster Ensembling (CE) Algorithm

In the previous section, each CH collects location data and
generates local grouping results with the proposed GMPMine
algorithm. Since the objects may pass through only some of the
sensor clusters and may exhibit different movement patterns in
different clusters, the local grouping results may be inconsistent.
For example, if objects in a sensor cluster walk close together
across a canyon, it is reasonable to consider them a group. In
contrast, objects scattered over grassland may not be identified as
a group. Furthermore, when a group of objects moves across the
margin of a sensor cluster, it is difficult to find their group
relationships. Therefore, we propose the CE algorithm to combine
multiple local grouping results. The algorithm solves the
inconsistency problem and improves the grouping quality.

The ensembling problem involves finding the partition of all moving
objects O that contains the most information about the local
grouping results. We utilize NMI [38], [39] to evaluate the
grouping quality. Let C denote the ensemble of the local grouping
results, represented as C = {G_0, G_1, ..., G_K}, where K denotes
the ensemble size. Our goal is to discover the ensembling result G'
that contains the most information about C, i.e.,

G' = \arg\max_{G \in \tilde{G}} \sum_{i=1}^{K} NMI(G_i, G),

where \tilde{G} denotes all possible ensembling results. However,
enumerating every G ∈ \tilde{G} to find the optimal ensembling
result G' is impractical, especially in resource-constrained
environments. To overcome this difficulty, the CE algorithm trades
off the grouping quality against the computation cost by adjusting
the partition parameter Λ, i.e., a set of thresholds with values in
the range [0, 1], such that a finer-grained configuration of Λ
achieves better grouping quality at a higher computation cost.
Therefore, for a set of thresholds Λ, we rewrite our objective
function as

G' = \arg\max_{G_\lambda, \lambda \in \Lambda} \sum_{i=1}^{K} NMI(G_i, G_\lambda).

The algorithm includes three steps. First, we utilize the Jaccard
similarity coefficient [40] as the similarity measure for each pair
of objects. Second, for each λ ∈ Λ, we construct a graph in which
two objects share an edge if their Jaccard similarity coefficient
is above λ; our algorithm partitions the objects to generate a
partitioning result G_λ. Third, we select the ensembling result G'.
Because of space limitations, we only demonstrate the CE algorithm
with an example in Appendix C, which can be found on the Computer
Society Digital Library at
http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30. In the
next section, we propose our compression algorithm that leverages
the obtained group movement patterns.

5 DESIGN OF A COMPRESSION ALGORITHM WITH GROUP MOVEMENT PATTERNS

Fig. 4. Design of the two-phase and 2D compression algorithm.

A WSN is composed of a large number of miniature sensor nodes
deployed in a remote area for various applications, such as
environmental monitoring or wildlife tracking. These sensor nodes
are usually battery-powered, and recharging a large number of them
is difficult. Therefore, energy conservation is paramount among all
design issues in WSNs [41], [42]. Because the target objects are
moving, conserving energy in WSNs that track moving objects is more
difficult than in WSNs that monitor immobile phenomena, such as
humidity or vibrations. Hence, previous works, such as [43], [44],
[45], [46], [47], [48], especially consider the movement
characteristics of moving objects in their designs to track objects
efficiently. On the other hand, since data transmission is one of
the most energy-expensive tasks in WSNs, data compression is
utilized to reduce the amount of delivered data [1], [2], [3], [4],
[5], [6], [49], [50], [51], [52]. Nevertheless, few of the above
works have addressed application-level semantics, i.e., the
temporal-and-spatial correlations of a group of moving objects.
Therefore, to reduce the amount of delivered data, we propose the
2P2D algorithm, which leverages the group movement patterns derived
in Section 4 to compress the location sequences of moving objects.
As shown in Fig. 4, the algorithm includes the sequence merge phase
and the entropy reduction phase, which compress location sequences
vertically and horizontally. In the sequence merge phase, we
propose the Merge algorithm to compress the location sequences of a
group of moving objects. Since objects with similar movement
patterns are identified as a group, their location sequences are
similar. The Merge algorithm avoids redundant sending of their
locations and thus reduces the overall sequence length. It combines
the sequences of a group of moving objects by 1) trimming
multiple identical symbols at the same time interval into a single
symbol or 2) choosing a qualified symbol to represent them when a
tolerance of loss of accuracy is specified by the application.
Therefore, the algorithm trims and prunes more items when the group
size is larger and the group relationships are more distinct.
Besides, in the case that only the location center of a group of
objects is of interest, our approach can find the aggregated value
in this phase, instead of transmitting all location sequences back
to the sink for postprocessing.

Fig. 5. An example of the Merge algorithm. (a) Three sequences with
high similarity. (b) The merged sequence S''.
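Returning to the CE algorithm of Section 4.2, its NMI-based selection of an ensembling result can be sketched as follows. Partitions are assumed to be given as lists of cluster labels, and the normalization of NMI by \sqrt{H_a H_b} is one common convention that may differ in detail from the definition used in [38], [39].

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions (label lists)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Mutual information over the nonzero joint counts.
    mi = sum((nij / n) * math.log((n * nij) / (ca[a] * cb[b]))
             for (a, b), nij in cab.items())
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    denom = math.sqrt(ha * hb)
    return mi / denom if denom > 0 else 1.0

def ensemble(local_results, candidates):
    """Pick the candidate partition G' maximizing sum_i NMI(G_i, G)."""
    return max(candidates,
               key=lambda g: sum(nmi(g, gi) for gi in local_results))
```

Here `candidates` plays the role of the partitioning results G_λ generated for the thresholds in Λ, so the exhaustive enumeration over all possible partitions is avoided.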
In the entropy reduction phase, we propose the Replace algorithm,
which utilizes the group movement patterns as the prediction model
to further compress the merged sequence. The Replace algorithm
guarantees the reduction of a sequence's entropy and consequently
improves compressibility without loss of information. Specifically,
we define a new problem of minimizing the entropy of a sequence as
the HIR problem. To reduce the entropy of a location sequence, we
explore Shannon's theorem [17] to derive three replacement rules,
based on which the Replace algorithm reduces the entropy
efficiently. Also, we prove that the Replace algorithm obtains the
optimal solution of the HIR problem in Theorem 1. In addition,
since the objects may enter and leave a sensor cluster multiple
times during a batch period, and a group of objects may enter and
leave a cluster at slightly different times, we discuss the
segmentation and alignment problems in Section 5.3. Table 1
summarizes the notations.

5.1 Sequence Merge Phase

In the application of tracking wild animals, multiple moving
objects may have group relationships and share similar
trajectories. In this case, transmitting their location data
separately leads to redundancy. Therefore, in this section, we
concentrate on the problem of compressing multiple similar
sequences of a group of moving objects. Consider the illustrative
example in Fig. 5a, where three location sequences S_0, S_1, and
S_2 represent the trajectories of a group of three moving objects.
Items with the same index belong to a column; a column containing
identical symbols is called an S-column, and otherwise it is called
a D-column. Since sending the items of an S-column individually
causes redundancy, our basic idea for compressing multiple
sequences is to trim the items of an S-column into a single symbol.
Specifically, given a group of n sequences, the items of an
S-column are replaced by a single symbol, whereas the items of a
D-column are wrapped between two '/' symbols. Finally, our
algorithm generates a merged sequence containing the same
information as the original sequences. When decompressing the
merged sequence, whenever the symbol '/' is encountered, the items
after it are output until the next '/' symbol; otherwise, each item
is repeated n times to regenerate the original sequences. Fig. 5b
shows the merged sequence S'', whose length is decreased from
60 items to 48 items, i.e., 12 items are conserved. The example
points out that our approach can reduce the amount of data without
loss of information. Moreover, when there are more S-columns, our
approach brings more benefit.

When a small loss of accuracy is tolerable, representing the items
of a D-column by a qualified symbol to generate more S-columns can
improve the compressibility. We regulate the accuracy by an error
bound, defined as the maximal hop count between the real and
reported locations of an object. For example, replacing items 2
and 6 of S_2 by 'g' and 'p', respectively, creates two more
S-columns and thus results in a shorter merged sequence with
42 items. To select a representative symbol for a D-column, we
include a selection criterion that minimizes the average deviation
between the real and reported locations for a group of objects at
each time interval, as follows:

Selection criterion. The maximal distance between the reported
location and the real location is below a specified error bound eb,
i.e., when the ith column is a D-column, the distance between the
representative symbol and the jth item of the column must be at
most eb for 0 ≤ j < n, where n is the number of sequences.

TABLE 1
Description of the Notations
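The columnwise merge of Section 5.1 can be sketched as follows. The hop-count metric `hop` and the restriction of candidate representatives to symbols already present in the column are illustrative assumptions; the paper's Fig. 6 pseudocode is the authoritative version.

```python
def merge(seqs, eb=0, hop=lambda a, b: 0 if a == b else None):
    """Columnwise merge of equal-length location sequences (a sketch).

    An S-column (all items identical) is trimmed to a single symbol.
    Otherwise, if eb > 0 and some candidate symbol lies within eb hops of
    every item, the candidate with the smallest total deviation represents
    the column; failing that, the items are kept verbatim and wrapped
    between two '/' symbols.  `hop` returns the hop count between two
    locations, or None if unknown (an assumed interface).
    """
    merged = []
    for col in zip(*seqs):
        if len(set(col)) == 1:                  # S-column: trim to one symbol
            merged.append(col[0])
            continue
        best = None
        if eb > 0:                              # selection criterion
            for cand in sorted(set(col)):
                dists = [hop(cand, item) for item in col]
                if all(d is not None and d <= eb for d in dists):
                    total = sum(dists)          # minimize average deviation
                    if best is None or total < best[0]:
                        best = (total, cand)
        if best is not None:
            merged.append(best[1])
        else:
            merged.extend(['/', *col, '/'])     # D-column: wrap with '/'
    return merged
```

With eb = 0 every D-column is wrapped verbatim; with a positive error bound, columns whose items all lie near a common symbol collapse into additional S-columns, shortening the merged sequence as in the Fig. 7 example.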
Fig. 7. An example of the Merge algorithm: the merged sequence with
eb = 1.
jfs js ¼ ;0 jLgj
as pi ¼ j j L i
, and L ¼ jS j. According to Shannon’s
source coding theorem, the optimal code length for symbol
i is log2 pi , which is also called the information content of i .
Information entropy (or Shannon’s entropy) is thereby
defined as the overall sum of the information content over Æ,
X
eðSÞ ¼ eðp0 ; p1 ; . . . ; pjÆjÀ1 Þ ¼ Àpi  log2 pi : ð2Þ
0 ijÆj
Shannon’s entropy represents the optimal average code
length in data compression, where the length of a symbol’s
Fig. 6. The Merge algorithm. codeword is proportional to its information content [17].
A property of Shannon’s entropy is that the entropy is
Therefore, to compress the location sequences for a the maximum, while all probabilities are of the same value.
group of moving objects, we propose the Merge algorithm Thus, in the case without considering the regularity in the
shown in Fig. 6. The input of the algorithm contains a group movements of objects, the occurrence probabilities of
of sequences fSi j0 i ng and an error bound eb, while
symbols are assumed to be uniform such that the entropy
the output is a merged sequence that represents the group
t.c om
of a location sequence as well as the compression ratio is
om
of sequences. Specifically, the Merge algorithm processes
po t.c
the sequences in a columnwise way. For each column, the fixed and dependent on the size of a sensor cluster, e.g., for
gs po
a location sequence with length D collected in a sensor
algorithm first checks whether it is an S-column. For an S-
lo s
.b og
1
column, it retains the value of the items as Lines 4-5. cluster of 16 nodes, the entropy of the sequence is eð16 ,
ts .bl
1 1
Otherwise, while an error bound eb 0 is specified, a 16 ; . . . ; 16Þ ¼ 4. Consequently, 4D bits are needed to repre-
ec ts
oj c
representative symbol is selected according to the selection sent the location sequence. Nevertheless, since the move-
pr oje
criterion as Line 7. If a qualified symbol exists to represent ments of a moving object are of some regularity, the
re r
lo rep
the column, the algorithm outputs it as Lines 15-18. occurrence probabilities of symbols are probably skewed
xp lo
Otherwise, the items in the column are retained and and the entropy is lower. For example, if two probabilities
ee xp
.ie ee
1 3
wrapped by a pair of “/” as Lines 9-13. The process repeats become 32 and 32 , the entropy is reduced to 3.97. Thus,
w e
until all columns are examined. Afterward, the merged instead of encoding the sequence using 4 bits per symbol,
w .i
w w
sequence S 00 is generated.
:// w
only 3:97D bits are required for the same information. This
tp //w
Following the example shown in Fig. 5a, with a specified example points out that when a moving object exhibits
ht ttp:
error bound eb as 1, the Merge algorithm generates the some degree of regularity in their movements, the skewness
h
solution, as shown in Fig. 7; the merged sequence contains of these probabilities lowers the entropy. Seeing that data
only 20 items, i.e., 40 items are curtailed. with lower entropy require fewer bits to represent the same
5.2 Entropy Reduction Phase information, reducing the entropy thereby benefits for data
compression and, by extension, storage and transmission.
In the entropy reduction phase, we propose the Replace
Motivated by the above observation, we design the
algorithm to minimize the entropy of the merged sequence
Replace algorithm to reduce the entropy of a location
obtained in the sequence merge phase. Since data with
sequence. Our algorithm imposes the hit symbols on the
lower entropy require fewer bits for storage and transmis-
location sequence to increase the skewness. Specifically, the
sion [17], we replace some items to reduce the entropy
algorithm uses the group movement patterns built in both
without loss of information. The object movement patterns
the transmitter (CH) and the receiver (sink) as the
discovered by our distributed mining algorithm enable us
to find the replaceable items and facilitate the selection of prediction model to decide whether an item of a sequence
items in our compression algorithm. In this section, we first is predictable. A CH replaces the predictable items each
introduce and define the HIR problem, and then, explore with a hit symbol to reduce the location sequence’s entropy
the properties of Shannon’s entropy to solve the HIR when compressing it. After receiving the compressed
problem. We extend the concentration property for entropy sequence, the sink node decompresses it and substitutes
reduction and discuss the benefits of replacing multiple every hit symbol with the original symbol by the identical
symbols simultaneously. We derive three replacement rules prediction model, and no information loss occurs.
for the HIR problem and prove that the entropy of the Here, an item si of a sequence S is predictable item if
obtained solution is minimized. predict nextðT ; S½0::i À 1ŠÞ is the same value as si . A symbol
is a predictable symbol once an item of the symbol is
5.2.1 Three Derived Replacement Rules predictable. For ease of explanation, we use a taglst4 to
Let P ¼ fp0 , p1 ; . . . ; pjÆjÀ1 g denote the probability distribu-
4. A taglst associated with a sequence S is a sequence of 0s and 1s
tion of the alphabet Æ corresponding to a location sequence obtained as follows: For those predictable items in S, the corresponding
S, where pi is the occurrence probability of i in Æ, defined items in taglst are set as 1. Otherwise, their values in taglst are 0.
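The entropy computation in (2) and the 16-node example can be checked directly; the exact value behind the paper's rounded 3.97 is approximately 3.976.

```python
import math

def entropy(probs):
    """Shannon entropy (2): the optimal average code length in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform probabilities over a 16-node cluster: 4 bits per location symbol.
uniform = [1 / 16] * 16
# Regular movement skews two of the probabilities to 1/32 and 3/32.
skewed = [1 / 32, 3 / 32] + [1 / 16] * 14
```

For a sequence of length D, encoding then takes 4D bits in the uniform case but only about 3.97D bits in the skewed case, which is exactly the saving the entropy reduction phase tries to amplify.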
Fig. 8. An example of the simple method. The first row represents
the index, and the second row is the location sequence. The third
row is the taglst, which represents the prediction results. The
last row is the result of the simple approach. Note that items 1,
2, 7, 9, 11, 12, 15, and 22 are predictable items, such that the
set of predictable symbols is ŝ = {k, a, j, f, o}. In the example,
items 2, 9, and 10 are items of 'o', such that the number of items
of symbol 'o' is n('o') = 3, whereas items 2 and 9 are the
predictable items of 'o', and the number of predictable items of
symbol 'o' is n_hit('o') = 2.

Fig. 9. Problems of the simple approach.
express whether each item of S is predictable in the following
sections. Consider the illustrative example shown in Fig. 8: the
probability distribution of Σ corresponding to S is
P = {0.04, 0.16, 0.04, 0, 0, 0.08, 0.04, 0, 0.04, 0.12, 0.24, 0,
0, 0.12, 0.12, 0}, and items 1, 2, 7, 9, 11, 12, 15, and 22 are
predictable items. To reduce the entropy of the sequence, a simple
approach is to replace each of the predictable items with the hit
symbol, obtaining an intermediate sequence
S' = "n::kfbb:n:o::ij:bbnkkk:gc". Compared with the original
sequence S, whose entropy is 3.053, the entropy of S' is reduced to
2.854. Encoding S and S' with the Huffman coding technique, the
lengths of the output bit streams are 77 and 73 bits, respectively,
i.e., 4 bits are conserved by the simple approach.

However, the above simple approach does not always minimize the
entropy. Consider the example shown in Fig. 9a: an intermediate
sequence with items 1 and 19 unreplaced has lower entropy than the
one generated by the simple approach. For the example shown in
Fig. 9b, the simple approach even increases the entropy from 2.883
to 2.963.

We define the above problem as the HIR problem and formulate it as
follows:

Definition 3 (HIR problem). Given a sequence
S = {s_i | s_i ∈ Σ, 0 ≤ i < L} and a taglst, an intermediate
sequence is a generation of S, denoted by S' = {s'_i | 0 ≤ i < L},
where s'_i is equal to s_i if taglst[i] = 0; otherwise, s'_i is
equal to either s_i or ':'. The HIR problem is to find the
intermediate sequence Ŝ' such that the entropy of Ŝ' is minimal
over all possible intermediate sequences.

A brute-force method for the HIR problem is to enumerate all
possible intermediate sequences to find the optimal solution.
However, this approach is not scalable, especially when the number
of predictable items is large. Therefore, to solve the HIR problem,
we explore properties of Shannon's entropy to derive three
replacement rules that our Replace algorithm leverages to obtain
the optimal solution. Here, we list the five most relevant
properties to explain the replacement rules. The first four
properties can be obtained from [17], [53], [54], while the fifth,
called the strong concentration property, is derived and proved in
this paper.^5

5. Due to the page limit, the proofs of the strong concentration
rule and Lemmas 1-4 are given in Appendices C and D, respectively,
which can be found on the Computer Society Digital Library at
http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.

Property 1 (Expansibility). Adding a probability with a value of
zero does not change the entropy, i.e.,
e(p_0, p_1, ..., p_{|Σ|−1}, p_{|Σ|}) is identical to
e(p_0, p_1, ..., p_{|Σ|−1}) when p_{|Σ|} is zero.

According to Property 1, we add a new symbol ':' to Σ without
affecting the entropy and denote its probability as p_16, such that
P is {p_0, p_1, ..., p_15, p_16 = 0}.

Property 2 (Symmetry). Any permutation of the probability values
does not change the entropy. For example, e(0.1, 0.4, 0.5) is
identical to e(0.4, 0.5, 0.1).

Property 3 (Accumulation). Moving all the value from one
probability to another, such that the former can be thought of as
being eliminated, does not increase the entropy, i.e.,
e(p_0, p_1, ..., 0, ..., p_i + p_j, ..., p_{|Σ|−1}) is equal to or
less than e(p_0, p_1, ..., p_i, ..., p_j, ..., p_{|Σ|−1}).

With Properties 2 and 3, if all the items of a symbol σ are
predictable, i.e., n(σ) = n_hit(σ), replacing all the items of σ by
':' will not affect the entropy. If there are multiple symbols
having n(σ) = n_hit(σ), replacing all the items of these symbols
can reduce the entropy. Thus, we derive the first replacement rule,
the accumulation rule: Replace all items of every symbol σ in ŝ
with n(σ) = n_hit(σ).

Property 4 (Concentration). For two probabilities, moving a value
from the lower probability to the higher one decreases the entropy,
i.e., if p_i ≤ p_j, then for 0 < Δp ≤ p_i,
e(p_0, p_1, ..., p_i − Δp, ..., p_j + Δp, ..., p_{|Σ|−1}) is less
than e(p_0, p_1, ..., p_i, ..., p_j, ..., p_{|Σ|−1}).

Property 5 (Strong concentration). For two probabilities, moving a
value that is larger than the difference of the two probabilities
from the higher probability to the lower one decreases the entropy,
i.e., if p_i ≥ p_j, then for p_i − p_j < Δp ≤ p_i,
e(p_0, p_1, ..., p_i − Δp, ..., p_j + Δp, ..., p_{|Σ|−1}) is less
than e(p_0, p_1, ..., p_i, ..., p_j, ..., p_{|Σ|−1}).

According to Properties 4 and 5, we conclude that if the difference
of two probabilities increases, the entropy decreases. When the
conditions conform to Properties 4 and 5, we further prove that the
entropy is monotonically reduced as Δp increases, such that
replacing all predictable items of a qualified symbol reduces the
entropy the most, as stated in Lemmas 1 and 2.

Definition 4 (Concentration function). For a probability
distribution P = {p_0, p_1, ..., p_{|Σ|−1}}, we define the
concentration function of k probabilities p_{i_1}, p_{i_2}, ...,
p_{i_k}, and p_j as

f_{i_1 i_2 ... i_k j}(x_1, x_2, ..., x_k) = e(p_0, ..., p_{i_1} - x_1, ..., p_{i_2} - x_2, ..., p_{i_k} - x_k, ..., p_j + \sum_{i=1}^{k} x_i, ..., p_{|\Sigma|-1}).

Lemma 1. If p_i ≤ p_j, then for any x such that 0 ≤ x ≤ p_i, the
concentration function of p_i and p_j is monotonically decreasing.

Lemma 2. If p_i ≥ p_j, then for any x such that
p_i − p_j < x ≤ p_i, the concentration function f_{ij}(x) of p_i
and p_j is monotonically decreasing.

Accordingly, we derive the second replacement rule, the
concentration rule: Replace all predictable items of every symbol σ
in ŝ with n(σ) ≤ n(':') or n_hit(σ) ≥ n(σ) − n(':').

As an extension of the above properties, we also explore the
entropy variation when predictable items of multiple symbols are
replaced simultaneously. Let ŝ' denote a subset of ŝ, where
|ŝ'| ≥ 2. We prove that if replacing the predictable symbols in ŝ'
minimizes the entropy corresponding to the symbols in ŝ', the
number of hit symbols after the replacement must be larger than the
number of remaining items of each predictable symbol after the
replacement, as stated in Lemma 3. To investigate whether the
converse statement holds, we conduct a brute-force experiment.
However, the experimental results show that even under the
condition that n(':') + \sum_{σ' ∈ ŝ'} n_hit(σ') ≥ n(σ) − n_hit(σ)
for every σ in ŝ', replacing the predictable items of the symbols
in ŝ' does not guarantee a reduction of the entropy. Therefore, we
compare the difference of the entropy before and after replacing
the symbols in ŝ' as

gain(ŝ') = \sum_{\sigma \in \hat{s}'} \Big( -p_\sigma \log_2 \frac{p_\sigma}{p_\sigma - \Delta p_\sigma} - \Delta p_\sigma \log_2 (p_\sigma - \Delta p_\sigma) \Big) - p_{':'} \log_2 \frac{p_{':'}}{p_{':'} + \sum_{\sigma \in \hat{s}'} \Delta p_\sigma} + \Big( \sum_{\sigma \in \hat{s}'} \Delta p_\sigma \Big) \log_2 \Big( p_{':'} + \sum_{\sigma \in \hat{s}'} \Delta p_\sigma \Big).

In addition, we prove that once replacing some of the predictable
items of the symbols in ŝ' reduces the entropy, replacing all
predictable items of these symbols reduces the entropy the most,
since the entropy decreases monotonically, as stated in Lemma 4. We
thereby derive the third replacement rule, the multiple symbol
rule: Replace all of the predictable items of every symbol in ŝ' if
gain(ŝ') ≥ 0.

Lemma 3. If there exist 0 ≤ a_n ≤ p_{i_n}, n = 1, ..., k, such that
f_{i_1 i_2 ... i_k j}(a_1, a_2, ..., a_k) is minimal, then
p_j + \sum_{i=1}^{k} a_i ≥ p_{i_n} − a_n for n = 1, ..., k.

Lemma 4. If there exist x_{min_1}, x_{min_2}, ..., x_{min_k} such
that p_j + \sum_{i=1}^{k} x_{min_i} ≥ p_{i_n} − x_{min_n} for
n = 1, 2, ..., k, then f_{i_1 i_2 ... i_k j}(x_1, x_2, ..., x_k) is
monotonically decreasing for x_{min_n} ≤ x_n ≤ p_{i_n},
n = 1, ..., k.

5.2.2 The Replace Algorithm

Based on the observations described in the previous section, we
propose the Replace algorithm, which leverages the three
replacement rules to obtain the optimal solution of the HIR
problem. Our algorithm examines the predictable symbols through
their statistics, which include the number of items and the number
of predictable items of each predictable symbol. The algorithm
first replaces the qualified symbols according to the accumulation
rule. Afterward, since the concentration rule and the multiple
symbol rule are related to n(':'), which increases after every
replacement, the algorithm iteratively replaces the qualified
symbols according to these two rules until all qualified symbols
are replaced. The algorithm thereby replaces qualified symbols and
gradually reduces the entropy toward the optimum. Compared with the
brute-force method, which enumerates all possible intermediate
sequences in exponential time, the Replace algorithm obtains the
optimal solution in O(L) time^6 and is thus more scalable and
efficient. We prove that the Replace algorithm monotonically
reduces the entropy and obtains the optimal solution of the HIR
problem, as stated in Theorem 1. Next, we detail the Replace
algorithm and demonstrate it with an illustrative example.

Theorem 1.^7 The Replace algorithm obtains the optimal solution of
the HIR problem.

Fig. 10 shows the Replace algorithm. The input includes a location
sequence S and a predictor T_g, while the output, denoted by S', is
a sequence in which the qualified items are replaced by ':'.
Initially, Lines 3-9 of the algorithm find the set of predictable
symbols together with their statistics. Then, the algorithm
examines the statistics of the predictable symbols according to the
three replacement rules as follows: First, according to the
accumulation rule, it replaces the qualified symbols in one scan of
the predictable symbols (Lines 10-14). Next, the algorithm
iteratively checks the concentration rule and the multiple symbol
rule in two loops: the first loop, from Line 16 to Line 22, handles
the concentration rule, whereas the second loop, from Line 25 to
Line 36, handles the multiple symbol rule. In our design, since
finding a combination of predictable symbols that makes
gain(ŝ') ≥ 0 hold is more costly, the algorithm is prone to replace
symbols with the

6. The complexity analysis is given in Appendix E, which can be
found on the Computer Society Digital Library at
http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
7. The proof of Theorem 1 is given in Appendix F, which can be
found on the Computer Society Digital Library at
http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.