•Andrew Jackson
•Web Archiving Technical Lead
•British Library
Unified Characterisation, Please
The Practitioners Have Spoken…
• Quality Assurance (of broken or potentially broken data):
  • Quality assurance, Bit rot, and Integrity
• Appraisal and Assessment:
  • Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
• Identify/Locate Preservation Worthy Data
• Identify Preservation Risks:
  • Obsolescence, preservation risk and business constraint
• Long tail of many other issues:
  • Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
• Plus: Sustainable Tools
Appraisal and Assessment
Conformance, Unknown characteristics, and Unknown file
formats. Identify/Locate Preservation Worthy Data
• Identification
  • Always used to ‘route’ data to software that can understand it.
  • Use minimum information to identify:
    • e.g. header only if possible. “Truncated PDF”, not “UNKNOWN”.
    • e.g. a GIS shapefile with .shp and .shx but a missing .dbf should be reported as such.
• Validation
  • Two modes needed: “Fast fail”, “Log and continue” / Quirks
  • Stop the baseless distinction between “Well formed” and “Valid”
  • Validation is irrelevant to digital preservation assessment:
    • e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
  • We’re on the wrong side of Postel’s Law.
  • Unknown completeness and failure to future-proof:
    • e.g. JHOVE tries to validate versions of PDF it cannot know.
    • e.g. Tools sometimes interpret/migrate data opaquely.
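The identification points above can be sketched in code. This is a minimal illustration only, not how DROID, FIDO or Tika work: the function names `identify_pdf` and `shapefile_report`, the 1 KiB search windows and the result labels are all assumptions made for this example. It inspects only the start and end of the data, and reports a truncated PDF or an incomplete shapefile as such instead of falling back to “UNKNOWN”.

```python
from pathlib import Path

PDF_HEADER = b"%PDF-"
PDF_EOF = b"%%EOF"
WINDOW = 1024  # assumed search window, in bytes


def identify_pdf(data: bytes) -> str:
    """Classify data using only its first and last WINDOW bytes."""
    # The header is expected at (or very near) the start of the file.
    if PDF_HEADER not in data[:WINDOW]:
        return "UNKNOWN"
    # A complete PDF carries an %%EOF marker near the end of the file.
    if PDF_EOF not in data[-WINDOW:]:
        return "Truncated PDF"
    return "PDF"


def shapefile_report(shp_path: Path) -> str:
    """Report which companion files are missing next to a .shp file."""
    missing = [ext for ext in (".shx", ".dbf")
               if not shp_path.with_suffix(ext).exists()]
    if missing:
        return "Shapefile (missing " + ", ".join(missing) + ")"
    return "Shapefile"
```

The point of the design is that the header alone is enough to route the file, and anything learned after that refines the label rather than discarding it.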
Identify Preservation Risks
Obsolescence, preservation risk and business constraint
• Significant Properties are irrelevant here.
  • It’s not really about the content, but about the context.
• Dependency Analysis:
  • What software does this need?
  • Does this file use format features that are not well supported across implementations?
  • What other resources are transcluded?
    • Fonts? cf. OfficeDDT.
    • Remote embeds?
  • Embedded scripts that might mask dependencies?
  • Do some operations require a password?
    • e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
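A very rough sketch of what a dependency check could look like for PDF, assuming a simple scan of the raw bytes for well-known PDF name tokens. The `MARKERS` table and the `dependency_hints` function are hypothetical names for this example, and the approach is deliberately crude: tokens inside compressed object streams will be missed, and a match is only a hint that the question is worth asking, not proof of a dependency.

```python
from typing import Dict, List

# Assumed mapping from PDF name tokens to the dependency question they hint at.
MARKERS: Dict[bytes, str] = {
    b"/Encrypt": "encryption (some operations may need a password)",
    b"/JavaScript": "embedded script",
    b"/URI": "remote reference",
    b"/EmbeddedFile": "embedded file",
}


def dependency_hints(data: bytes) -> List[str]:
    """Return the hints whose marker token occurs anywhere in the raw bytes."""
    return [hint for token, hint in MARKERS.items() if token in data]


sample = b"%PDF-1.4\n1 0 obj << /Encrypt 5 0 R >>\n2 0 obj << /S /URI /URI (http://example.org/) >>\n"
hints = dependency_hints(sample)
```

A real tool would need to parse the document structure to answer these questions reliably; the sketch only shows the shape of the report.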
Sustainable Tools
Our Tools
• Pure-Java Characterisation:
  • JHOVE (‘clean room’ implementation)
  • New Zealand Metadata Extractor (NZME)
  • Apache Tika
• Java-based aggregation of various CLI tools:
  • JHOVE2
  • FITS
• Other Characterisation:
  • XCL – C++/XML ‘clean room’ extended with ImageMagick
  • Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
• Identification:
  • DROID, FIDO, Apache Tika, File
• Visualisation:
  • C3PO, and many non-specialised tools.
Sustainable Tools
Up to date? Working together?
• Software Dependency Management:
  • FITS/JHOVE2 embed old DROID versions, hard to upgrade.
  • Dead dependencies: FITS and FFIdent, NZME and Jflac.
  • Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
  • Embed shared modules instead?
• Software Project Management and Communication:
  • JHOVE, JHOVE2? FITS?
  • JHOVE2 only compiles on Sheila’s branch?
  • Roadmaps, issue management, testing, C.I., etc.
  • Cross-project coordination and bug-fixing?
• Complexity: JHOVE2 and XCL are extremely complex
  • JHOVE2’s Berkeley DB causes checksum failures in tests
  • Tika solves the same problem using SAX
Sustainable Tools
Shared tests?
• Separate projects arise from separate workflows
  • Start by understanding the commonality and finding the gaps?
• Share test cases and compare results?
  • The OPF Format Corpus contains various valid and invalid files.
  • Built by practitioners to test real use cases.
    • e.g. JP2 features, PDF Cabinet of Horrors.
• Do the tools give consistent and complementary results?
  • Let’s find out!
  • cf. Dave Tarrant’s REF for Identification:
    • http://data.openplanetsfoundation.org/ref/
    • http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
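One way to ask whether tools give consistent results is to run each of them over the same files and list the files on which they disagree. The sketch below assumes every tool can be wrapped as a function from bytes to a label; `compare_tools`, `disagreements` and the two stand-in identifiers are hypothetical names for this example, and the wrappers around real tools such as DROID or Tika are left out.

```python
from typing import Callable, Dict, List

Identifier = Callable[[bytes], str]


def compare_tools(samples: Dict[str, bytes],
                  tools: Dict[str, Identifier]) -> Dict[str, Dict[str, str]]:
    """Run every tool over every sample and collect the label each one reports."""
    return {
        name: {tool_name: tool(data) for tool_name, tool in tools.items()}
        for name, data in samples.items()
    }


def disagreements(results: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of samples for which the tools did not all agree."""
    return sorted(name for name, labels in results.items()
                  if len(set(labels.values())) > 1)


# Two stand-in identifiers, for illustration only.
def strict_magic(data: bytes) -> str:
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"


def lenient_magic(data: bytes) -> str:
    return "PDF" if b"%PDF-" in data[:1024] else "UNKNOWN"


samples = {
    "clean.pdf": b"%PDF-1.7\n%%EOF\n",
    "junk-prefix.pdf": b"\r\n\r\n%PDF-1.7\n%%EOF\n",
    "image.gif": b"GIF89a",
}
results = compare_tools(samples, {"strict": strict_magic, "lenient": lenient_magic})
print(disagreements(results))
```

With a shared corpus in place of the three inline samples, the list of disagreements becomes a starting point for cross-project bug reports.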
Bit-mashing as Tool QA
• Bitwise exploration of data sensitivity.
  • One way to compare tools.
  • Helps understand formats.
  • cf. Jay Gattuso’s recent OPF blog.
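The idea can be sketched as follows: flip one bit at a time, re-run a tool on the damaged copy, and record the bit positions where the tool's answer changes. This is an assumed, simplified reading of bit-mashing; `flip_bit`, `bit_mash` and the stand-in `header_check` tool are hypothetical names, and a real run would also need to catch crashes and time-outs in the tool under test.

```python
from typing import Callable, Dict


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutated)


def bit_mash(data: bytes, tool: Callable[[bytes], str]) -> Dict[int, str]:
    """Map each bit position to the tool's output, for flips that change it."""
    baseline = tool(data)
    changed: Dict[int, str] = {}
    for i in range(len(data) * 8):
        outcome = tool(flip_bit(data, i))
        if outcome != baseline:
            changed[i] = outcome
    return changed


# Stand-in tool for illustration: it only looks at the five header bytes.
def header_check(data: bytes) -> str:
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"


sensitive_bits = bit_mash(b"%PDF-1.4\n", header_check)
```

Running the same loop with two different tools and comparing their maps of sensitive bits shows which parts of a file each tool actually inspects.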
Quality Assurance (of broken or potentially broken data)
Quality assurance, Bit rot, and Integrity
• JHOVE let failed TIFF-JP2 through…
• Jpylyzer does better.
• Both fall far short of actual rendering.
Where's the unification?
Where should we work together?
• Shared test corpora and test framework:
  • Start with the OPF Format Corpus?
  • Pull other corpora in by reference:
    • http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
  • Sustainable version of Dave Tarrant’s REF?
  • Extend with bit-mashing to compare tools?
• Aim to coordinate more:
  • Make it clear where to go? (More about OfficeDDT).
  • Consider merging projects?
  • Consider sharing underlying libraries?
  • Consider building Tika modules?
  • Please consider Apache Preflight as base for PDF validation.
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Unified characterisation, please

  • 1. Unified Characterisation, Please
      • Andrew Jackson
      • Web Archiving Technical Lead
      • British Library
  • 2. The Practitioners Have Spoken…
      • Quality Assurance (of broken or potentially broken data):
          • Quality assurance, Bit rot, and Integrity
      • Appraisal and Assessment:
          • Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
          • Identify/Locate Preservation Worthy Data
      • Identify Preservation Risks:
          • Obsolescence, preservation risk and business constraint
      • Long tail of many other issues:
          • Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
      • Plus: Sustainable Tools
  • 3. Appraisal and Assessment
      Conformance, Unknown characteristics, and Unknown file formats. Identify/Locate Preservation Worthy Data
      • Identification
          • Always used to ‘route’ data to software that can understand it.
          • Use minimum information to identify:
              • e.g. header only if possible. “Truncated PDF”, not “UNKNOWN”. GIS shapefiles: .shp, .shx, but with a missing .dbf should be reported as such.
      • Validation
          • Two modes needed: “Fast fail”, “Log and continue” /Quirks
          • Stop baseless distinction between “Well formed” and “Valid”
          • Validation is irrelevant to digital preservation assessment:
              • e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
              • We’re on the wrong side of Postel’s Law.
          • Unknown completeness and failure to future-proof:
              • e.g. JHOVE tries to validate versions of PDF it cannot know.
              • e.g. Tools sometimes interpret/migrate data opaquely.
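As a rough illustration of the identification points on slide 3, the sketch below labels a file from its leading bytes and reports “Truncated PDF” rather than “UNKNOWN” when the header is present but the end-of-file marker is not. It also reports which shapefile sidecar files are missing. The function names, the 1 KiB tail window and the label strings are assumptions made for this example; they are not taken from DROID, FIDO, Tika or any other tool named in the slides.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"   # PDF files begin with this header
PDF_EOF = b"%%EOF"     # and normally end with this marker


def identify(data: bytes) -> str:
    """Return a coarse label using as little of the file as possible."""
    if data.startswith(PDF_MAGIC):
        # The header alone is enough to route the file to a PDF parser.
        # Assumption: a missing %%EOF in the last 1 KiB suggests truncation.
        if PDF_EOF not in data[-1024:]:
            return "Truncated PDF"
        return "PDF"
    return "UNKNOWN"


def shapefile_report(shp: Path) -> str:
    """Report which of the .shx and .dbf companions of a .shp are missing."""
    missing = [ext for ext in (".shx", ".dbf")
               if not shp.with_suffix(ext).exists()]
    if missing:
        return "Shapefile (missing " + ", ".join(missing) + ")"
    return "Shapefile"
```

A real identifier would need many more signatures and a more careful truncation test; the point is only that a partial match can still yield a useful, specific label.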
  • 6. Identify Preservation Risks
      Obsolescence, preservation risk and business constraint
      • Significant Properties are irrelevant here.
          • It’s not really about the content, but about the context.
      • Dependency Analysis:
          • What software does this need?
          • Does this file use format features that are not well supported across implementations?
          • What other resources are transcluded?
              • Fonts? cf. OfficeDDT.
              • Remote embeds?
              • Embedded scripts that might mask dependencies?
          • Do some operations require a password?
              • e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
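The dependency questions on slide 6 can be approximated, very crudely, by scanning a PDF's raw bytes for name tokens that hint at encryption, scripts, remote references or embedded files. This is a heuristic sketch, not a parser: the marker table and the `dependency_hints` function are assumptions for illustration. It will miss names hidden inside compressed object streams or written with hex escapes, may match the same bytes inside ordinary content, and cannot tell ‘harmless’ encryption from the kind that blocks access.

```python
# PDF name tokens and the dependency each one hints at.
# Assumption: a plain byte search is good enough for a first triage pass.
MARKERS = {
    b"/Encrypt": "encryption dictionary",
    b"/JavaScript": "embedded script",
    b"/URI": "remote reference",
    b"/EmbeddedFile": "embedded file",
}


def dependency_hints(data: bytes) -> list[str]:
    """Return the sorted labels of all markers found in the raw bytes."""
    return sorted(label for marker, label in MARKERS.items() if marker in data)
```

Any hit should be treated as a prompt for closer inspection with a real PDF library rather than as a finding in itself.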
  • 7. Sustainable Tools
      Our Tools
      • Pure-Java Characterisation:
          • JHOVE (‘clean room’ implementation)
          • New Zealand Metadata Extractor (NZME)
          • Apache Tika
      • Java-based aggregation of various CLI tools:
          • JHOVE2
          • FITS
      • Other Characterisation:
          • XCL: C++/XML ‘clean room’ extended with ImageMagick
          • Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
      • Identification:
          • DROID, FIDO, Apache Tika, File
      • Visualisation:
          • C3PO, and many non-specialised tools.
  • 8. Sustainable Tools
      Up to date? Working together?
      • Software Dependency Management:
          • FITS/JHOVE2 embed old DROID versions, hard to upgrade.
          • Dead dependencies: FITS and FFIdent, NZME and Jflac.
          • Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
          • Embed shared modules instead?
      • Software Project Management and Communication:
          • JHOVE, JHOVE2? FITS?
          • JHOVE2 only compiles on Sheila’s branch?
          • Roadmaps, issue management, testing, C.I., etc.
          • Cross-project coordination and bug-fixing?
      • Complexity: JHOVE2, XCL, extremely complex
          • JHOVE2 Berkeley DB causes checksum failures in tests
          • Tika solves same problem using SAX
  • 9. Sustainable Tools
      Shared tests?
      • Separate projects arise from separate workflows
          • Start by understanding commonality and finding gaps?
      • Share test cases and compare results?
          • The OPF Format Corpus contains various valid and invalid files.
              • Built by practitioners to test real use cases.
              • e.g. JP2 features, PDF Cabinet of Horrors.
          • Do the tools give consistent and complementary results?
              • Let’s find out!
          • cf. Dave Tarrant’s REF for Identification:
              • http://data.openplanetsfoundation.org/ref/
              • http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
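One way to picture the “share test cases and compare results” idea on slide 9 is a small harness that runs several identification functions over the same corpus and lists the files on which they disagree. Everything here is hypothetical: `compare_tools`, the two toy identifiers and the in-memory corpus stand in for real tools and for a corpus such as the OPF Format Corpus, and a real harness would wrap each tool's actual command line or API.

```python
from typing import Callable, Dict, List, Tuple

Identifier = Callable[[bytes], str]


def compare_tools(
    tools: Dict[str, Identifier], corpus: Dict[str, bytes]
) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Run every tool on every file; return all results plus disagreements."""
    results = {
        name: {tool: fn(data) for tool, fn in tools.items()}
        for name, data in corpus.items()
    }
    # A file is a disagreement when the tools produced more than one label.
    disagreements = sorted(
        name for name, labels in results.items()
        if len(set(labels.values())) > 1
    )
    return results, disagreements


# Toy stand-ins for real identification tools (assumptions for this sketch).
def header_only(data: bytes) -> str:
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"


def header_and_trailer(data: bytes) -> str:
    if not data.startswith(b"%PDF-"):
        return "UNKNOWN"
    return "PDF" if b"%%EOF" in data[-1024:] else "Truncated PDF"


if __name__ == "__main__":
    corpus = {"good.pdf": b"%PDF-1.7\n%%EOF", "cut.pdf": b"%PDF-1.7\n"}
    tools = {"header_only": header_only,
             "header_and_trailer": header_and_trailer}
    _, disagreements = compare_tools(tools, corpus)
    print(disagreements)
```

The disagreement list is the interesting output: each entry is a candidate test case showing where one tool is stricter, more specific, or simply wrong.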
  • 10. Bit-mashing as Tool QA
      • Bitwise exploration of data sensitivity.
      • One way to compare tools.
      • Helps understand formats.
      • cf. Jay Gattuso’s recent OPF blog.
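A minimal sketch of bit-mashing as described on slide 10, assuming it means flipping one bit at a time and recording which flips change a tool's output. The `flip_bit` and `bit_mash` names, and the idea of passing the tool in as a Python callable, are assumptions for illustration; in practice the callable would wrap a run of a real characterisation tool on a temporary copy of the file.

```python
from typing import Callable, List


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    mutated = bytearray(data)
    mutated[byte_index] ^= 1 << offset
    return bytes(mutated)


def bit_mash(data: bytes, tool: Callable[[bytes], str]) -> List[int]:
    """Return the indices of bits whose flip changes the tool's output."""
    baseline = tool(data)
    sensitive = []
    for i in range(len(data) * 8):
        try:
            outcome = tool(flip_bit(data, i))
        except Exception as exc:
            # A crash is itself a result worth recording and comparing.
            outcome = "error: " + type(exc).__name__
        if outcome != baseline:
            sensitive.append(i)
    return sensitive
```

Running the same input through two tools and comparing their sensitive-bit lists shows which regions of a format each tool actually inspects. The cost is one tool run per bit, so this is only practical on small sample files or on a sampled subset of bit positions.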
  • 11. Quality Assurance (of broken or potentially broken data)
      Quality assurance, Bit rot, and Integrity
      • JHOVE let failed TIFF-JP2 through…
      • Jpylyzer does better.
      • Both fall far short of actual rendering.
  • 12. Where's the unification?
      Where should we work together?
      • Shared test corpora and test framework:
          • Start with the OPF Format Corpus?
          • Pull other corpora in by reference:
              • http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
          • Sustainable version of Dave Tarrant’s REF?
          • Extend with bit-mashing to compare tools?
      • Aim to coordinate more:
          • Make it clear where to go? (More about OfficeDDT).
          • Consider merging projects?
          • Consider sharing underlying libraries?
          • Consider building Tika modules?
          • Please consider Apache Preflight as base for PDF validation.