Formats over Time
                           Exploring UK Web History




Andrew Jackson
UK Web Archive, The British Library



        iPres 2012   |   04-10-2012   |   Toronto

DEBATING OBSOLESCENCE
Rothenberg & Rosenthal On Format Obsolescence



    Jeff Rothenberg:
       “Digital Information Lasts Forever –
        Or Five Years, Whichever Comes First.” (1997)
       “…still apt…” (2012)



    David Rosenthal:
       “when challenged, proponents of [format migration strategies]
        have failed to identify even one format in wide use when
        Rothenberg [made that assertion] that has gone obsolete in the
        intervening decade and a half.” (2010)
       Argues that network effects inhibit obsolescence



    Where is the evidence?

AN EXPERIMENT
UK Web Domain Dataset (1994-2010)




  UK Web Domain Dataset (1994-2010)
     From the Internet Archive
     Millions of websites
     > 2.5 billion resources
     > 400,000 ARC/WARC files
     > 35TB

  Execution at Scale
     Stored on HDFS
     Map-Reduce
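
A minimal sketch of the Map-Reduce pattern named above, written in plain Python rather than on Hadoop. The record shape (a `(mime_type, crawl_year)` pair) and the function names `map_record` and `reduce_counts` are assumptions made for illustration; the slides do not show the actual job code.

```python
from collections import Counter
from itertools import chain


def map_record(record):
    """Map step: emit ((format, year), 1) for one archived resource."""
    # Hypothetical record shape: (mime_type, crawl_year).
    mime_type, year = record
    yield (mime_type, year), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (format, year) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


# Toy input, not taken from the dataset.
records = [("image/png", 2004), ("image/png", 2004), ("application/pdf", 2004)]
totals = reduce_counts(chain.from_iterable(map_record(r) for r in records))
print(totals)
```

On a real cluster the map step would run once per ARC/WARC record and the framework would group the keys before the reduce step; here both steps run in one process.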
Identification Tools




    DROID
       Well-known in digital preservation community

       Format version level identification

       Minor problem concerning file handles

       Only binary signature part (DROID-B) could be embedded

    Apache Tika
       Widely used identification and data extraction tool

       Identifies many formats at the MIME type level

       Easy to embed and extend

          Added ability to extract e.g. software identifiers

       Minor bug concerning identification buffer size
A Common Language For Format Identifiers




    Comparison and combination requires a common model
       Map PRONOM IDs to extended MIME Types

          fmt/18
           becomes
           application/pdf; version=1.4
       Allows easy comparison at sub-type level

       Can easily extend to cover other properties:

          text/plain; charset=UTF-8

          application/pdf; software="Adobe Acrobat 6.0"
       Also extended Tika to output details from PDFs
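
The mapping and the sub-type comparison can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the lookup table holds only the `fmt/18` entry quoted on the slide, the names `PUID_TO_MIME`, `parse_extended_mime` and `same_base_type` are hypothetical, and the parser splits naively on `;`, so it would mishandle a quoted value that itself contains a semicolon.

```python
# Hypothetical one-entry lookup table; only fmt/18 comes from the slide.
PUID_TO_MIME = {"fmt/18": "application/pdf; version=1.4"}


def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base_type, params)."""
    base, *rest = [part.strip() for part in value.split(";")]
    params = {}
    for part in rest:
        if "=" in part:
            key, _, val = part.partition("=")
            # Drop optional straight quotes around the parameter value.
            params[key.strip()] = val.strip().strip('"')
    return base, params


def same_base_type(a, b):
    """Compare two extended MIME types while ignoring their parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


print(parse_extended_mime(PUID_TO_MIME["fmt/18"]))
print(same_base_type("application/pdf; version=1.2", "application/pdf"))
```

Keeping the version, charset or software as parameters means two tools can be compared on the base type alone, even when only one of them reports the extra detail.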
Format Profile Dataset




    Server, Tika & DROID-B format profiles, over time:
     (fields: server type | Tika type | DROID-B type | year | count)

     image/png | image/png | image/png; version=1.0 | 2004 | 102

     application/pdf | application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0" | application/pdf; version=1.2 | 2004 | 1

    CC0 – free to download and reuse
       http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/

       Please cite us and/or let us know if you use it

    Source code of all tools and modifications also available
       https://github.com/openplanets/nanite
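
A hedged sketch of reading one row of a format profile. It assumes five tab-separated fields in the order server type, Tika type, DROID-B type, year, count; both the delimiter and the field order are inferred from the slide rather than from the dataset documentation, and `ProfileRow` and `parse_profile_line` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileRow:
    server_type: str
    tika_type: str
    droid_type: str
    year: int
    count: int


def parse_profile_line(line, sep="\t"):
    """Parse one profile row; assumes exactly five fields split by `sep`."""
    server, tika, droid, year, count = line.rstrip("\n").split(sep)
    return ProfileRow(server, tika, droid, int(year), int(count))


# Example row from the slide, rewritten with the assumed tab delimiter.
row = parse_profile_line(
    "image/png\timage/png\timage/png; version=1.0\t2004\t102"
)
print(row)
```

If the published files use a different delimiter, pass it as `sep`; a row with the wrong number of fields raises `ValueError` instead of being silently misread.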
Results

COMPARING TOOLS
Coverage & Depth

    [Figure: percentage of resources unidentified (axis ticks at 0%, 1%, 10%, 100%) by year, 1996-2010, for DROID-B v.59 and Apache Tika 1.1]

    No format-version-level information from Apache Tika.
Inconsistencies




    Gaps
       37 formats spotted by DROID-B but not Tika

          Notably includes earlier Office formats

       129 formats spotted by Tika but not DROID-B

          But at least 20 are due to not using the full DROID

    Conflicts
       Failed MIME type mapping, e.g. PDF 1.7 (since fixed)

       ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)

       DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…

    Both tools bad at non-HTML/XML text formats
       CSS, scripting languages like JS, CSV, TSV, etc.
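
Gaps and conflicts of this kind can be found with set differences and a per-resource comparison. The sketch below is an illustration only: the function names are hypothetical and the sample values are toy data, not results from the study.

```python
def find_gaps(droid_formats, tika_formats):
    """Return (formats only DROID-B reported, formats only Tika reported)."""
    return droid_formats - tika_formats, tika_formats - droid_formats


def base_type(extended_mime):
    """Strip any '; key=value' parameters from an extended MIME type."""
    return extended_mime.split(";", 1)[0].strip()


def find_conflicts(results):
    """results: iterable of (droid_type, tika_type) pairs, one per resource."""
    return [(d, t) for d, t in results if base_type(d) != base_type(t)]


# Toy data for illustration.
droid_seen = {"application/pdf", "image/gif", "image/x-pict"}
tika_seen = {"application/pdf", "image/gif", "application/zip"}
droid_only, tika_only = find_gaps(droid_seen, tika_seen)
conflicts = find_conflicts([
    ("application/pdf; version=1.4", "application/pdf"),
    ("image/x-pict", "image/jpeg"),
])
print(droid_only, tika_only, conflicts)
```

A version-only difference is not counted as a conflict here, because the comparison is made at the base-type level.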
Results

FORMATS OVER TIME
Image Formats Over Time

    [Figure: percentage of crawl (log scale, 0.00001% to 100%) by year, 1996-2010; series: JPEG, GIF, PNG, ICON, XBM, TIFF]
HTML Versions Over Time

    [Figure: percentage of HTML resources (0% to 100%) by year, 1996-2010; series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0]
PDF Versions Over Time

    [Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010; series: PDF 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
Format Usage Versus Time

    [Figure: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]
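
One way to derive the two quantities on this chart from per-year counts is sketched below. The slide does not define "timespan", so the code assumes it is the last year a format was seen minus the first year; the function name `format_timespans` and the input rows are hypothetical.

```python
def format_timespans(rows):
    """rows: iterable of (format, year, count).

    Returns {format: (timespan_in_years, total_resources)}, where the
    timespan is assumed to be last year seen minus first year seen.
    """
    first, last, total = {}, {}, {}
    for fmt, year, count in rows:
        first[fmt] = min(year, first.get(fmt, year))
        last[fmt] = max(year, last.get(fmt, year))
        total[fmt] = total.get(fmt, 0) + count
    return {fmt: (last[fmt] - first[fmt], total[fmt]) for fmt in total}


# Toy rows for illustration.
spans = format_timespans([
    ("image/gif", 1996, 10),
    ("image/gif", 2010, 5),
    ("image/x-xbitmap", 1997, 3),
    ("image/x-xbitmap", 1999, 1),
])
print(spans)
```

A format seen in only one year gets a timespan of 0 under this definition.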
Results

IMPLEMENTATIONS
PDF Software Over Time

    [Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010; labelled series: Acrobat PDFWriter, Acrobat, Acrobat Distiller]

    Over 2100 Distinct PDF Software IDs
JPEG Hardware Over Time

    [Figure: percentage of hardware IDs (0% to 100%) by year, 1994-2010; labelled series: DS5, MX1700, CYBERSHOT, E990, NIKON D40]

    Over 2100 Distinct JPEG Hardware IDs

CONCLUSIONS
Summary




    Format obsolescence is complex
       Network effects do appear to stabilize formats

       But once-popular formats are fading nevertheless

       More sophisticated approach required

    Please re-use our data, or ask for more
    Firmer conclusions need:
          Richer, more detailed results

          From a wider range of corpora

    This approach only gives creator information
       A different approach will be needed to understand
        resource consumption (e.g. PPT 4, RealAudio 1)
Questions?




             webarchive.org.uk
