ADOCO:
Facilitating Quality Control in
Mass Digitisation
Georg Petz
LIBRARIES IN THE DIGITAL AGE (LIDA) 2012
18 - 22 June 2012, Zadar, Croatia
georg.petz@onb.ac.at
Austrian Books Online




Austrian Books Online
(Public Private Partnership with Google)

www.onb.ac.at/ev/austrianbooksonline/


Key Data Austrian Books Online (ABO)

• Digitisation of ~600,000 volumes / approx. 200 million pages
• Only public domain material
• Project start
   – Planning and Preparation Phase: July – Dec 2010
   – Operational Project start (Manipulation): Dec 2011
   – Operational Project start (Digitization): March 2011
• ~70 project team members, 20+ in core team
• 7 work packages
• ~65K physical volumes scanned so far


Division of cost and work load

Google:
•   Transport
•   Insurance
•   Scanning
•   OCR
•   Image processing
•   Quality control
•   Google Books

ONB:
•   Provision of Metadata
•   Selection
•   Internal logistics
•   Conservational assessment
•   Barcoding
•   Metadata adjustments
•   Data download and Quality control
•   Data storage & digital preservation
•   Digital Library

Workflow

• Digitisation
• Data Download
• Storage in Pair Tree (https://confluence.ucop.edu/display/Curation/PairTree)
• Quality Control
• Access

ADOCO (Austrian Books Online Download & Control): Data Download, Storage in Pair Tree, Quality Control
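
The Pair Tree storage step can be illustrated with a minimal sketch of how an object identifier is mapped to a directory path, following the published PairTree convention (clean the identifier, then split it into two-character directory names). The function names and the `pairtree_root` default are choices made for this sketch, and the identifiers are the examples from the PairTree specification, not actual ABO volume identifiers.

```python
def clean_identifier(identifier: str) -> str:
    """Apply the two PairTree cleaning steps to an identifier."""
    hex_encoded = '"*+,<=>?\\^|'
    cleaned = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        # Step 1: hex-encode octets outside visible ASCII and a few risky characters.
        if byte < 0x21 or byte > 0x7E or char in hex_encoded:
            cleaned.append("^%02x" % byte)
        else:
            cleaned.append(char)
    # Step 2: single-character substitutions for characters common in identifiers.
    return "".join(cleaned).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Split the cleaned identifier into two-character path segments."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root] + pairs)


print(pairtree_path("ark:/13030/xt12t3"))
```

Because every identifier maps to exactly one path, a volume can be located on the file system without a database lookup.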
Symlink Tree




Download and Quality Assurance – ADOCO

• Method
   –   QA started July 2011
   –   Searching for systematic, not individual errors
   –   Mix of automatic and manual methods
   –   Checking every page manually is impossible given the number of pages

• Tool: ADOCO
   – Downloading volumes
   – Internal viewer with possibility for error annotations
   – Clustering of errors and suggestions of suspicious files for
     manual audit
   – Reporting module and statistics (currently in MySQL)
   – SCAPE collaboration
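
The idea of looking for systematic rather than individual errors can be sketched as grouping error annotations and surfacing only the groups that recur. This is a hedged illustration, not ADOCO's actual clustering: the field names (`error_type`, `batch`), the sample values and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def systematic_error_clusters(annotations, min_count=3):
    """Return (error_type, batch) groups with at least min_count annotations."""
    counts = Counter((a["error_type"], a["batch"]) for a in annotations)
    # most_common() sorts by frequency, so the largest clusters come first.
    return [(group, n) for group, n in counts.most_common() if n >= min_count]


# Hypothetical annotations as they might be entered in a viewer.
annotations = [
    {"volume": "vol-001", "error_type": "cropped_text", "batch": "B1"},
    {"volume": "vol-002", "error_type": "cropped_text", "batch": "B1"},
    {"volume": "vol-003", "error_type": "cropped_text", "batch": "B1"},
    {"volume": "vol-004", "error_type": "blurred_page", "batch": "B2"},
]

print(systematic_error_clusters(annotations))
```

A single blurred page stays below the threshold, while three cropped pages from the same batch are reported as one cluster that may be worth a manual audit.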

QC in typical in-house project vs. ABO

• In-house
   – manual quality control
   – rescan


• ABO
   – automatic and manual quality control
   – no rescan but reprocessing




ADOCO Technology Stack



Layers, top to bottom:

• JSF (Primefaces) | Jersey RESTful WebService
• Spring Framework
• Hibernate | Wrapped CLI-Tools
• Apache Tomcat
• MySQL | NetApp Filer
• Redhat Linux

Book Viewer

• Catalogue / “Quick Search”
• Full text Search
• [Mobile Apps]

Data Access
• JPEG-2000 Master-Files stored redundantly

• Access-Copies generated on the fly

• Digitised Books linked with online catalogue

• URN-Resolver for permanent identification underway
 (OBVSG - Austrian Library Network)

• Searchable and accessible via
    • TEL http://search.theeuropeanlibrary.org/portal/en/index.html
    • Europeana http://www.europeana.eu/portal/



Screencast
SCAPE (Scalable Preservation Environments)

• co-funded by the European Union under FP7

• develop scalable services for planning and execution of
  institutional preservation strategies

• SCAPE Preservation Platform makes use of Hadoop




Hadoop

•   framework for the distributed processing of large data sets across clusters
    of computers

•   overcome limitations of SQL-oriented databases

•   MapReduce paradigm

•   Sequence files:
    possibly compressed,
    containing pairs of writable
    key/values
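
To make the MapReduce paradigm over key/value pairs concrete, here is a minimal in-process sketch in plain Python. It does not use Hadoop; it only mimics the three phases (map, shuffle, reduce) on a handful of records. The record keys, the page texts and the choice of counting words per volume are assumptions made for illustration.

```python
from collections import defaultdict


def map_page(key, value):
    """Map step: emit (volume_id, word_count) for one page record."""
    volume_id = key.split("/")[0]
    yield volume_id, len(value.split())


def reduce_counts(volume_id, counts):
    """Reduce step: sum the word counts of all pages of one volume."""
    return volume_id, sum(counts)


def run_mapreduce(records):
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_page(key, value):
            # Shuffle step: collect all values that share an output key.
            grouped[out_key].append(out_value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical (key, value) records: page path and extracted page text.
records = [
    ("vol-001/00000001.html", "first page of text"),
    ("vol-001/00000002.html", ""),
    ("vol-002/00000001.html", "only three words"),
]

print(run_mapreduce(records))
```

On a cluster the map calls run in parallel on different nodes and the framework performs the shuffle, but the contract of the two functions stays the same.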
Screencast: Loading Books from PairTree into HDFS

•   fs
    The FileSystem (FS) shell is invoked by bin/hadoop fs <args>.
•   jar
    Runs a jar file. Users can bundle their MapReduce code in a jar file and
    execute it using this command.

•   load hOCR files into a SequenceFile in HDFS:
    hadoop jar seqfileutility.jar -m -d /home/onbscs/testdata/abo/samples/small -e html -c NONE

•   source code:
    https://github.com/openplanets/scape/tree/master/tb-lsdr-seqfilecreator
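
Conceptually, loading many small files into one key/value container starts with turning a directory into (key, value) records. The sketch below shows only that first step in plain Python and writes nothing to HDFS. It is not the implementation of seqfileutility.jar, whose flag semantics are not explained on the slide; the function name and the `extension` parameter are assumptions of this sketch.

```python
import os


def collect_records(directory, extension="html"):
    """Yield (relative_path, file_bytes) pairs for matching files, in path order."""
    suffix = "." + extension
    for dirpath, dirnames, filenames in os.walk(directory):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as handle:
                    # Key: path relative to the input directory. Value: raw bytes.
                    yield os.path.relpath(path, directory), handle.read()
```

Packing the records into one large file avoids keeping millions of small page files in HDFS, which is the usual motivation for a container format of this kind.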

        Thank You!




georg.petz@onb.ac.at
www.onb.ac.at/austrianbooksonline
twitter.com/abooksonline

Photographs: Ingrid Oentrich

Editor's notes

  1. Processing consists of cleaning, cropping, and digitally "flattening" pages, in addition to optical character recognition. Volumes are processed shortly after they are scanned, and reprocessed infrequently after that. Analysis consists of organizing processed pages into a complete volume, including the selection of higher-quality pages in cases where there is more than one candidate page, and putting the pages of a volume in the correct order.
  2. Special software (ADOCO – ABO Download and Control) was implemented and is continuously developed to meet the needs of the quality auditing process. ADOCO enables simultaneous, multithreaded downloads. It is based on Primefaces and Spring Webflow, wraps Linux command line tools in Java, and uses a MySQL database for technical and bibliographic metadata. It allows for various searches and views on the relevant volumes. Components:
     – Primefaces: Java-based Ajax framework with JSF components, used for the implementation of the GUI (http://primefaces.org/)
     – Jersey RESTful WebService: JAX-RS (JSR 311) reference implementation for building RESTful web services, used to communicate with other ONB internal systems, e.g. full-text search (http://jersey.java.net/)
     – Spring: application development framework for enterprise Java
     – Hibernate: Java persistence framework for object-relational mapping and for querying databases using HQL and SQL; ADOCO uses SQL instead of HQL when performance is an issue (http://www.hibernate.org/)
     – Wrapped CLI tools: Linux command line tools wrapped in Java (wget, tar, gpg, exiftool for image metadata, md5sum, ...)
     – MySQL: database for technical and bibliographic metadata (http://www.mysql.com/)
     – NetApp Filer: stores jp2, hocr, mets and txt files in PairTree
     – Redhat Linux: Linux distribution
  3. SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.
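
The notes list md5sum among the wrapped command line tools. As a hedged sketch of the kind of integrity check such a tool enables after a download, the following uses Python's standard `hashlib` in place of the external command. The function names and the idea of comparing against a supplied expected checksum are assumptions of this sketch, not a description of ADOCO's code.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_hex(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for large image archives.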