Data Compression
File Compression
Huffman Tries
[Encoding trie: A = 010, B = 11, C = 00, D = 10, R = 011]

ABRACADABRA

01011011010000101001011011010
File Compression
• Text files are usually stored by representing each character with an 8-bit ASCII code (type man ascii in a Unix shell to see the ASCII encoding).
• The ASCII encoding is an example of a fixed-length encoding, where each character is represented with the same number of bits.
• To reduce the space required to store a text file, we can exploit the fact that some characters are more likely to occur than others.
• A variable-length encoding uses binary codes of different lengths for different characters; thus, we can assign fewer bits to frequently used characters and more bits to rarely used characters (a small sketch of the fixed-length baseline follows).
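As an illustration (not part of the original slides), the short Python sketch below shows what the fixed-length baseline costs for a small string; it assumes nothing beyond the standard library.

    # Fixed-length encoding: every character costs 8 bits, whatever its frequency.
    text = "ABRACADABRA"
    ascii_bits = "".join(format(ord(ch), "08b") for ch in text)
    print(len(ascii_bits))   # 88 bits for an 11-character string
    print(ascii_bits[:16])   # the 8-bit codes of 'A' and 'B': 01000001 01000010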
File Compression: Example

An Encoding Example
• text: java
• encoding: a = “0”, j = “11”, v = “10”
• encoded text: 110100 (6 bits)

How to decode (the problem of ambiguity)?
• encoding: a = “0”, j = “01”, v = “00”
• encoded text: 010000 (6 bits)
• could be "java", or "jvv", or "jaaaa" (see the sketch below)
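To make the ambiguity concrete, here is a small sketch (my own illustration, not from the slides) that enumerates every way a bit string can be split into codewords of a given code. With the second code above it finds several parses, including the three listed.

    def all_decodings(bits, code):
        """Enumerate every way to split `bits` into codewords of `code`."""
        if not bits:
            return [""]
        parses = []
        for ch, word in code.items():
            if bits.startswith(word):
                for rest in all_decodings(bits[len(word):], code):
                    parses.append(ch + rest)
        return parses

    # The ambiguous code: 'a' is a prefix of both 'j' and 'v'.
    print(all_decodings("010000", {"a": "0", "j": "01", "v": "00"}))
    # Several parses come back, among them 'java', 'jvv' and 'jaaaa'.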
Encoding Trie
• To prevent ambiguities in decoding, we require that the encoding satisfies the prefix rule: no code is a prefix of another.
  • a = “0”, j = “11”, v = “10” satisfies the prefix rule
  • a = “0”, j = “01”, v = “00” does not satisfy the prefix rule (the code of 'a' is a prefix of the codes of 'j' and 'v')
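A direct way to test the prefix rule is to compare every pair of codewords. The helper below is a minimal sketch of that check; the function name is my own, not something from the slides.

    def satisfies_prefix_rule(code):
        """True if no codeword is a proper prefix of another codeword."""
        words = list(code.values())
        return not any(
            i != j and y.startswith(x)
            for i, x in enumerate(words)
            for j, y in enumerate(words)
        )

    print(satisfies_prefix_rule({"a": "0", "j": "11", "v": "10"}))  # True
    print(satisfies_prefix_rule({"a": "0", "j": "01", "v": "00"}))  # False: "0" prefixes "01" and "00"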
Encoding Trie (2)
We use an encoding trie to satisfy this prefix rule:
• the characters are stored at the external nodes
• a left child (edge) means 0
• a right child (edge) means 1

[Encoding trie with the resulting codes: A = 010, B = 11, C = 00, D = 10, R = 011]
Example of Decoding

[Decoding with the same trie: A = 010, B = 11, C = 00, D = 10, R = 011]

• encoded text: 01011011010000101001011011010
• text: ABRACADABRA
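Decoding with a prefix code can be sketched without building an explicit trie: read bits into a buffer until it matches a codeword, emit that character, and start over. This is my own illustration of the walk-from-the-root idea on the slide; the code dictionary is read off the trie above.

    def decode(bits, code):
        """Decode `bits` using a prefix code (equivalent to walking the trie:
        0 = go left, 1 = go right, emit on reaching an external node)."""
        by_word = {word: ch for ch, word in code.items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in by_word:        # an external node has been reached
                out.append(by_word[buf])
                buf = ""
        return "".join(out)

    code = {"A": "010", "B": "11", "C": "00", "D": "10", "R": "011"}
    print(decode("01011011010000101001011011010", code))   # ABRACADABRA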
Trie this!

10000111110010011000111011110001010100110100

[Decoding trie: O = 00, R = 10, S = 0100, W = 0101, T = 0110, B = 0111, E = 1100, C = 1101, K = 1110, N = 1111]
Optimal Compression
• An issue with encoding tries is to ensure that the encoded text is as short as possible:

ABRACADABRA
01011011010000101001011011010
29 bits
[First trie: A = 010, B = 11, C = 00, D = 10, R = 011]

ABRACADABRA
001011000100001100101100
24 bits
[Second trie: A = 00, B = 10, R = 11, C = 010, D = 011]
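The two encoded lengths on this slide can be checked by summing codeword lengths over the text. A short sketch of my own, using the codes read off the two tries:

    text = "ABRACADABRA"
    first_trie  = {"A": "010", "B": "11", "R": "011", "C": "00",  "D": "10"}
    second_trie = {"A": "00",  "B": "10", "R": "11",  "C": "010", "D": "011"}
    for code in (first_trie, second_trie):
        print(sum(len(code[ch]) for ch in text))   # 29, then 24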
Construction algorithm
• Given the frequencies of the characters, we wish to compute a trie so that the length of the encoding is the minimum possible.
• Each character is a leaf of the trie.
• The number of bits used to encode a character is its level number (the root is at level 0).
• Thus, if fi is the frequency of the ith character and li is the level of its leaf, we want to find a tree which minimises Σi fi·li (a code sketch follows below).
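The slides develop the construction by example on the following pages. As a hedged illustration, here is one way the greedy "merge the two lowest-frequency subtrees" rule can be coded with a min-heap; the function name huffman_codes and the tuple representation of the trie are my own choices, not something from the slides. Tie-breaking is arbitrary, so the individual codewords may differ from the ones shown later, but the total length does not.

    import heapq

    def huffman_codes(freq):
        """Build a Huffman trie by repeatedly merging the two lowest-frequency
        roots, then read a code off each root-to-leaf path (left = 0, right = 1)."""
        # Heap entries are (frequency, tie-breaker, subtree); the tie-breaker keeps
        # Python from ever comparing two subtrees directly.
        heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, i, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, i, (left, right)))
        _, _, root = heap[0]

        codes = {}
        def walk(node, prefix):
            if isinstance(node, str):          # external node: a character
                codes[node] = prefix or "0"
            else:                              # internal node: left edge 0, right edge 1
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(root, "")
        return codes

    codes = huffman_codes({"A": 5, "B": 2, "R": 2, "C": 1, "D": 1})
    print(sum(len(codes[ch]) for ch in "ABRACADABRA"))   # 23 bits, as on the later slides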
Total weighted external path length
• The quantity Σi fi·li is called the total weighted external path length of a tree.
• We view each leaf as having a weight equal to the frequency of the corresponding character.
• Given weights f1, f2, …, fn we wish to find a tree whose weighted external path length is minimum.
• We denote this minimum by WEPL(f1, f2, …, fn) (a small worked check follows).
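In code, Σi fi·li is just a weighted sum over the leaves. A small sketch of my own, with the leaf levels read off the two tries of the "Optimal Compression" slide:

    def wepl(weighted_leaves):
        """Total weighted external path length: sum of frequency * leaf level."""
        return sum(f * level for f, level in weighted_leaves)

    # (frequency, level) pairs for A, B, R, C, D in the two earlier tries:
    first  = [(5, 3), (2, 2), (2, 3), (1, 2), (1, 2)]
    second = [(5, 2), (2, 2), (2, 2), (1, 3), (1, 3)]
    print(wepl(first), wepl(second))   # 29 24, the encoded lengths seen before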
Huffman Encoding Trie
ABRACADABRA

character   A   B   R   C   D
frequency   5   2   2   1   1

[Greedy merging, always combining the two lowest-weight roots:
 step 1: A:5, B:2, R:2, (C,D):2
 step 2: A:5, (B,R):4, (C,D):2
 step 3: A:5, ((B,R),(C,D)):6]
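The same sequence of forests can be traced with a heap of root frequencies; the sketch below (my own) reproduces the multisets shown above.

    import heapq

    heap = [5, 2, 2, 1, 1]        # root frequencies for A, B, R, C, D
    heapq.heapify(heap)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)   # two smallest roots
        heapq.heappush(heap, merged)
        print(sorted(heap, reverse=True))
    # [5, 2, 2, 2] -> [5, 4, 2] -> [6, 5] -> [11]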
Huffman Encoding Trie (contd.)

[The two remaining roots, A:5 and ((B,R),(C,D)):6, are merged into a single trie of total weight 11; each left edge is labelled 0 and each right edge 1.]
Final Huffman Encoding Trie

[Final trie and codes: A = 0, B = 100, R = 101, C = 110, D = 111]

ABRACADABRA
0 100 101 0 110 0 111 0 100 101 0
23 bits
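Encoding is simply codeword concatenation; with the codes read off this trie, the 23-bit string above can be reproduced (a small sketch of my own):

    def encode(text, code):
        """Concatenate the codeword of each character."""
        return "".join(code[ch] for ch in text)

    code = {"A": "0", "B": "100", "R": "101", "C": "110", "D": "111"}
    bits = encode("ABRACADABRA", code)
    print(bits)        # the 23-bit encoding of ABRACADABRA
    print(len(bits))   # 23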
Another Huffman Encoding Trie
ABRACADABRA

character   A   B   R   C   D
frequency   5   2   2   1   1

[The same frequencies, with ties broken differently:
 step 1: A:5, B:2, R:2, (C,D):2
 step 2: A:5, B:2, (R,(C,D)):4]
Another Huffman Encoding Trie

[step 3: the roots B:2 and (R,(C,D)):4 are merged, leaving A:5 and (B,(R,(C,D))):6]
Another Huffman Encoding Trie

[step 4: the roots A:5 and (B,(R,(C,D))):6 are merged into the final trie of total weight 11]
Another Huffman Encoding Trie

[Final trie and codes: A = 0, B = 10, R = 110, C = 1110, D = 1111]

ABRACADABRA
0 10 110 0 1110 0 1111 0 10 110 0
23 bits
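Both optimal tries use 23 bits in total even though the individual codewords differ. Checking with the same kind of sketch as before (codes read off this trie):

    code = {"A": "0", "B": "10", "R": "110", "C": "1110", "D": "1111"}
    bits = "".join(code[ch] for ch in "ABRACADABRA")
    print(len(bits))   # 23, the same total length as the previous Huffman trie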
Correctness of the algorithm
• Why does this algorithm compute a tree with the minimum weighted external path length?
• We prove this by induction on the number of leaves/characters.
• The claim is true when we have only two leaves.
• Suppose the claim is true when we have n−1 characters.
Correctness of the algorithm (2)
• When we have n characters, in the first step we replace the two characters with the lowest frequencies, f1 and f2, by one character of frequency f1+f2.
• Beyond this point the algorithm behaves as if it had only n−1 characters. Hence, by the induction hypothesis, it computes a tree with the minimum total weighted external path length, i.e. WEPL(f1+f2, f3, …, fn).
• Hence the tree it computes has weighted external path length f1+f2 + WEPL(f1+f2, f3, …, fn): expanding the merged leaf back into sibling leaves of weights f1 and f2 pushes them one level deeper, which adds exactly f1+f2.
Correctness of the algorithm (3)

We now argue that
WEPL(f1, f2, f3, …, fn) = f1+f2 + WEPL(f1+f2, f3, …, fn)

• This follows from the fact that there is an optimum tree in which the two leaves with the lowest weights are siblings, so merging them first loses nothing (a brute-force check follows below).
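For small inputs the recurrence can be checked by brute force: the minimum WEPL over all trees is found by trying every pair of leaves as siblings. The sketch below (my own, exponential-time, only for tiny weight lists) confirms the identity for the ABRACADABRA frequencies.

    from itertools import combinations

    def min_wepl(weights):
        """Minimum weighted external path length over all binary trees whose
        leaves carry the given weights (brute force over which pair is merged)."""
        weights = list(weights)
        if len(weights) <= 1:
            return 0
        best = None
        for i, j in combinations(range(len(weights)), 2):
            rest = [w for k, w in enumerate(weights) if k not in (i, j)]
            cand = weights[i] + weights[j] + min_wepl(rest + [weights[i] + weights[j]])
            best = cand if best is None else min(best, cand)
        return best

    print(min_wepl([5, 2, 2, 1, 1]))            # 23 = WEPL(5, 2, 2, 1, 1)
    print(1 + 1 + min_wepl([5, 2, 2, 1 + 1]))   # 23 = f1 + f2 + WEPL(f1+f2, f3, f4, f5)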
