Data Compression Lec (3)
1-Run-Length Encoding
The idea behind this approach to data compression is this: if a data item
d occurs n consecutive times in the input stream, replace the n
occurrences with the single pair (n, d). The n consecutive occurrences of
a data item are called a run length of n, and this approach to data
compression is called run-length encoding, or RLE. We apply this idea
first to text compression and then to image compression.
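The basic idea of replacing each run with a (length, item) pair can be sketched as follows (a minimal Python sketch; the function name is ours):

```python
def rle_encode(data):
    """Encode a sequence as a list of (run length, item) pairs."""
    pairs = []
    i = 0
    while i < len(data):
        j = i
        # advance j to the end of the current run
        while j < len(data) and data[j] == data[i]:
            j += 1
        pairs.append((j - i, data[i]))
        i = j
    return pairs
```

For example, rle_encode("aaabcc") yields [(3, 'a'), (1, 'b'), (2, 'c')].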
RLE Text Compression
Just replacing 2._all_is_too_well with 2._a2_is_t2_we2 will not
work. Even the string 2._a2l_is_t2o_we2l does not solve this problem.
One way to solve this problem is to precede each repetition with a
special escape character. If we use the character @ as the escape
character, then the string 2._a@2l_is_t@2o_we@2l can be
decompressed unambiguously. However, this string is longer than the
original string, because it replaces two consecutive letters with three
characters. We have to adopt the convention that only three or more
repetitions of the same character will be replaced with a repetition
factor. The main problems with this method are the following:
1. In English text there are not many repetitions. There are many
“doubles” but a “triple” is rare.
2. The character “@” may be part of the text in the input stream,
in which case a different escape character must be chosen.
Sometimes the input stream may contain every possible
character in the alphabet.
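The escape-character convention described above can be sketched as follows. This is a minimal sketch, assuming '@' does not occur in the input and that runs are short enough for a single-digit repetition factor; only runs of three or more characters are replaced:

```python
ESC = "@"  # escape character; assumed absent from the input text

def rle_text_encode(text, min_run=3):
    """Replace each run of min_run or more identical characters with
    ESC + repetition factor + character; shorter runs stay literal."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        n = j - i
        if n >= min_run:
            out.append(f"{ESC}{n}{text[i]}")  # e.g. 'aaaa' -> '@4a'
        else:
            out.append(text[i] * n)  # doubles and singles cost nothing extra
        i = j
    return "".join(out)
```

With the min-3 convention, rle_text_encode("shaaaark") gives "sh@4ark", while a string of only singles and doubles passes through unchanged.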
RLE Image Compression
RLE can be used to compress grayscale images. Each run of pixels of
the same intensity (gray level) is encoded as a pair (run length, pixel
value). The run length usually occupies one byte, allowing for runs of
up to 255 pixels. The pixel value occupies several bits, depending on
the number of gray levels (typically between 4 and 8 bits).
Example 3.1: An 8-bit deep grayscale bitmap that starts with
12, 12, 12, 12, 12, 12, 12, 12, 12, 35, 76, 112, 67, 87, 87, 87,
5, 5, 5, 5, 5, 5, 1, . . .
is compressed into 9, 12, 35, 76, 112, 67, 3, 87, 6, 5, 1, . . . ,
where 9, 3, and 6 are repetition counts. The problem is to distinguish
between a byte containing a grayscale value (such as 12) and one
containing a count (such as 9). Here are some solutions:
1. If the image is limited to just 128 grayscales, we can devote
one bit in each byte to indicate whether the byte contains a
grayscale value or a count.
2. If the number of grayscales is 256, it can be reduced to 255,
with one value reserved as a flag that precedes every count
byte. If the flag is, say, 255, then the sequence above becomes
255, 9, 12, 35, 76, 112, 67, 255, 3, 87, 255, 6, 5, 1, . . . .
3. Again, one bit is devoted to each byte to indicate whether the byte
contains a grayscale value or a count. This time, however, these extra
bits are accumulated in groups of 8, and each group is written on the
output stream preceding (or following) the 8 bytes it "corresponds to."
Example: the sequence 9, 12, 35, 76, 112, 67, 3, 87, 6, 5, 1, . . .
becomes
10000010, 9, 12, 35, 76, 112, 67, 3, 87, 100....., 6, 5, 1, . . .
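Solution 2 (the reserved flag byte) can be sketched as follows. This is a minimal sketch, assuming pixel values are limited to 0..254 so that 255 is free to serve as the flag, and that only runs of three or more pixels are worth encoding:

```python
FLAG = 255  # reserved flag byte; pixel values assumed in 0..254

def rle_gray_encode(pixels, min_run=3):
    """Encode each run of min_run or more equal pixels as
    FLAG, count, value; shorter runs are written literally.
    Runs are capped at 255 pixels so the count fits in one byte."""
    out = []
    i = 0
    while i < len(pixels):
        j = i
        while j < len(pixels) and pixels[j] == pixels[i] and j - i < 255:
            j += 1
        n = j - i
        if n >= min_run:
            out.extend([FLAG, n, pixels[i]])
        else:
            out.extend([pixels[i]] * n)
        i = j
    return out
```

Applied to the bitmap of Example 3.1, this reproduces the flagged sequence 255, 9, 12, 35, 76, 112, 67, 255, 3, 87, 255, 6, 5, 1.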
2-Move-to-Front Coding
The basic idea of this method is to maintain the alphabet A of
symbols as a list where frequently occurring symbols are located near
the front. A symbol s is encoded as the number of symbols that
precede it in this list, and s is then moved to the front of the list.
Example 3.2
Here is an example that illustrates the move-to-front idea. The
alphabet is A = (a, b, c, d, m, n, o, p).
The input stream abcddcbamnopponm is encoded as
C = (0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3)
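The encoding of Example 3.2 can be sketched as follows, assuming the alphabet is supplied in its initial order:

```python
def mtf_encode(stream, alphabet):
    """Move-to-front coding: emit each symbol's current index in the
    list, then move that symbol to the front of the list."""
    lst = list(alphabet)
    codes = []
    for s in stream:
        i = lst.index(s)          # number of symbols preceding s
        codes.append(i)
        lst.insert(0, lst.pop(i)) # move s to the front
    return codes
```

Running mtf_encode("abcddcbamnopponm", "abcdmnop") reproduces the sequence C above: the repeated d, p (and the reversed runs that follow them) land near the front of the list and therefore get small indices.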
3-Huffman coding
Huffman encoding is a way to assign binary codes to symbols that
reduces the overall number of bits used to encode a typical string of
those symbols.
For example, if you use letters as symbols and have details of the
frequency of occurrence of those letters in typical strings, then you
could just encode each letter with a fixed number of bits, as in
ASCII codes. You can do better than this by encoding more
frequently occurring letters, such as e and a, with shorter bit strings,
and less frequently occurring letters, such as q and x, with longer bit
strings.
Any string of letters will then be encoded as a string of bits in which
the letters no longer all have the same code length. To decode such a
string successfully, the shorter codes assigned to letters such as 'e'
cannot occur as a prefix of the longer codes such as that for 'x'.
If you were to assign the code 01 to 'e' and the code 011 to 'x', then if
the bits to decode started as 011... you would not know whether to
decode an 'e' or an 'x'.
The Huffman coding scheme takes each symbol and its weight (or
frequency of occurrence) and generates proper encodings for the
symbols, taking account of their weights, so that higher-weighted
symbols have fewer bits in their encoding.
A Huffman encoding can be computed by first creating a tree of
nodes:
Algorithm Huffman coding
1- Create a leaf node for each symbol and add it to the
priority queue.
2- While there is more than one node in the queue:
a. Remove the node of highest priority (lowest
probability) twice to get two nodes.
b. Create a new internal node with these two nodes as
children and with probability equal to the sum of the
two nodes' probabilities.
c. Add the new node to the queue.
3- The remaining node is the root node and the tree is
complete.
Traverse the constructed binary tree from root to leaves
assigning and accumulating a '0' for one branch and a '1' for
the other at each node. The accumulated zeroes and ones at
each leaf constitute a Huffman encoding for those symbols and
weights:
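The algorithm above can be sketched with Python's heapq module as the priority queue (a minimal sketch; the counter is only a tiebreaker so that equal-weight entries never compare their trees):

```python
import heapq
import itertools

def huffman_codes(weights):
    """Build a Huffman code from a {symbol: weight} mapping.
    Each heap entry is (weight, tiebreak, tree), where a tree is
    either a symbol or a (left, right) pair of subtrees."""
    count = itertools.count()
    heap = [(w, next(count), sym) for sym, w in weights.items()]
    heapq.heapify(heap)                # step 1: one leaf per symbol
    while len(heap) > 1:               # step 2: merge until one node remains
        w1, _, t1 = heapq.heappop(heap)   # two lowest-probability nodes
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(count), (t1, t2)))
    _, _, root = heap[0]               # step 3: the remaining node is the root
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):    # internal node: '0' branch, '1' branch
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # lone-symbol edge case
    walk(root, "")
    return codes
```

Because ties between equal weights may be broken either way, the exact codewords can differ from run to run, but the code lengths (and hence the average code length) are determined by the weights.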
Example: build a codebook for the following symbols
Symbol        A     B     C     D
Probability   0.2   0.3   0.1   0.4
The symbols are sorted by probability and the two smallest are
repeatedly merged:

Symbol   Step 1   Step 2   Step 3
D        0.4      0.4      0.6
B        0.3      0.3      0.4
A        0.2      0.3
C        0.1

First C (0.1) and A (0.2) merge into a node of probability 0.3; that
node merges with B (0.3) into 0.6; finally the 0.6 node and D (0.4)
merge into the root (1.0). Labeling one branch 0 and the other 1 at
each node then gives, for example, the codewords
D = 1, B = 00, A = 010, C = 011.