IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




OCR in libraries – some practical remarks

                          Günter Mühlberger
                          Department for Digitisation and Digital Preservation
                          University Innsbruck Library

OCR in Libraries
 • Not an easy chapter...
 • Is the glass half empty or half full?
 • Historical fonts: blackletter, Gothic, Old Cyrillic, ...
 • Major full-text efforts
          – JSTOR (1994)
          – Google (2004)
 • But: many digital libraries still lack integrated full-text

OCR and Digitization
 • OCR changes everything!
 • The workflow has to be adapted at every step
          – Preparation and selection of material
          – Image processing & scanning
          – Quality control
          – Storage and preservation
          – Correction and user involvement
          – Full-text search
          – Web interfaces for digital libraries
 • Significant increase in complexity

Preparation
 • Which material should be scanned? Options:
          – Bound volumes?
          – Microfilm?
          – Loose folios?

Option: Bound volumes
 • Bound volumes
          – Pros:
                      • This is how books, journals and newspapers are held in the library
          – Cons:
                      • Often tight binding, especially with newspapers
                      • Often warping due to humidity
          – Remark:
                      • Technical solution: ScanRobots make handling easier and double the
                        speed compared with manual scanning, e.g. 700–1000 pages per hour
                      • The investment required for ScanRobots should not be underestimated





Option: Microfilm
 • Microfilm
          – Pros:
                      • If a microfilm is available, it is a cheap alternative
                      • Easy option (no handling of volumes)
          – Cons:
                      • Microfilms have the same problems as bound volumes
                      • Microfilms were often produced with minimal quality control
                      • Microfilms from before 1990 are often in poor condition
 • Remark
          – If the microfilm was produced with good quality, then there is no
            significant difference in OCR quality
                      • A case study with BL material will be published on the IMPACT site




Option: Loose folios
 • Pros
          – No tight binding, less warping
          – Extremely fast processing with industrial scanners, at a low price
          – Duplicates can be sent to off-shore providers in large batches
 • Cons
          – Not feasible for material from before 1850: libraries would find it hard to justify
          – Organisational effort to organise duplicates (but completeness has to be checked
            anyway)
 • Remark
          – By far the best option for producing high quality with the fewest resources
          – Especially interesting for newspapers, 20th-century material and grey literature
          – Used e.g. by MOA, JSTOR


Good, bad and ugly images
 • Careful scanning is the alpha and omega
          – ScanRobots and document scanners lower the demands on the operator,
            but individual skill is still decisive
 • The criteria for a good page image are simple:
          – sharp
          – well-defined characters with clear curves
          – clean background, no show-through from the reverse side
          – no warping of the page and no geometric distortion
          – complete capture with a white margin around the text area
          – text lines parallel or at right angles to the page borders
          – no noise introduced by users
 • With perfect images you can wait until OCR technology improves;
   with bad images you will never get good results
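
One of the criteria above, the white margin around the text area, can be checked automatically. The following is only a minimal sketch, assuming the page is already available as a bitonal grid (a list of rows, 1 for ink and 0 for background); the function names and the `min_margin` threshold are illustrative and not part of any IMPACT tool.

```python
def text_bounding_box(page):
    """Return (top, left, bottom, right) of all ink pixels, or None if blank.

    `page` is a list of equally long rows; 1 marks ink, 0 marks background.
    """
    rows = [y for y, row in enumerate(page) if any(row)]
    if not rows:
        return None
    cols = [x for row in page for x, px in enumerate(row) if px]
    return rows[0], min(cols), rows[-1], max(cols)


def has_white_frame(page, min_margin):
    """True if at least `min_margin` background pixels surround the ink."""
    box = text_bounding_box(page)
    if box is None:
        return False  # an empty capture is not a usable page image
    top, left, bottom, right = box
    height, width = len(page), len(page[0])
    margins = (top, left, height - 1 - bottom, width - 1 - right)
    return min(margins) >= min_margin


# Tiny 6x6 example: ink in the centre, two background pixels on every side.
good = [[0] * 6 for _ in range(6)]
good[2][2] = good[3][3] = 1

# Same page, but with ink touching the left edge (text cut off by the scan).
cropped = [row[:] for row in good]
cropped[2][0] = 1

print(has_white_frame(good, min_margin=2))
print(has_white_frame(cropped, min_margin=1))
```

A real check would work on the scanned image itself and would need to tolerate specks and page edges; the sketch only shows the principle.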

Bad print – broken characters
[Figure: scanned examples of broken characters in the words "und" and "wenn"]

Bitonal or 8/24 bit – 300 or 400 ppi – JPEG or TIFF?
 • Bitonal vs. 8/24 bit
          – Rose Holley (D-Lib paper, 2009): greyscale scanning does not lead to better results
          – Experiment: microfilm scanned bitonal or greyscale showed no difference
 • Simple experiments show the opposite
          – Innsbrucker Zeitungsarchiv: bitonal and 24 bit
          – Results are clearly better with colour
 • 300 or 400 ppi resolution
          – Test with a very small font: Word text set in a 4 point font
 • JPEG vs. TIFF RGB
          – Tests with the Treventus ScanRobot, and also with other material, show
            that TIFF RGB images offer no advantage over compressed JPEGs
 • Modern documents with medium-sized fonts can be scanned at 300 ppi
   and bitonal, but documents with small fonts, challenging paper
   quality etc. should be scanned at 400 ppi and 8 or 24 bit and can be
   stored as JPEGs with e.g. 90% compression rate
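
Why resolution matters for small fonts becomes clear from simple arithmetic: a font size in points (1/72 inch) translates into very few pixels. The sketch below computes the nominal font height in pixels and encodes the rule of thumb above as a function. The cut-off `small_font_pt=8` is a hypothetical value (the slides give no number), and treating either a small font or challenging paper as sufficient for the higher setting is likewise an assumption.

```python
POINTS_PER_INCH = 72  # typographic (DTP) points per inch


def font_height_px(font_pt, ppi):
    """Nominal font size (em height) in pixels at a given scan resolution."""
    return font_pt / POINTS_PER_INCH * ppi


def recommend_scan_settings(font_pt, challenging_paper, small_font_pt=8):
    """Apply the rule of thumb from the slide.

    small_font_pt is a hypothetical cut-off for "small fonts"; adjust it
    to your own material.
    """
    if font_pt < small_font_pt or challenging_paper:
        return {"ppi": 400, "bit_depth": "8 or 24 bit", "storage": "JPEG"}
    return {"ppi": 300, "bit_depth": "bitonal"}


for ppi in (300, 400):
    print(f"4 pt at {ppi} ppi: {font_height_px(4, ppi):.1f} px")
print(recommend_scan_settings(font_pt=10, challenging_paper=False))
```

The actual glyphs are smaller than the nominal height, so at 300 ppi the strokes of a 4 pt font are only a few pixels wide.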

Accuracy
 • Is the glass half full or half empty?
          – Rose Holley: below 90% word recognition is a poor result
          – Google: OCR every image, because every correctly recognised word is
            better than nothing
          – Painful errors?
          – Mature users?


 • Character vs. word accuracy
          – Word accuracy is far more informative and much easier to measure:
            every word that a full-text search would find correctly counts as
            correct.
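
The difference between the two measures can be sketched in a few lines. This is one possible reading of the definitions above, not a standard evaluation tool: character accuracy is derived from the edit (Levenshtein) distance, and word accuracy counts a ground-truth word as correct if it also occurs in the OCR output, ignoring case and word order, as a simple full-text search would. Both function names are illustrative.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def character_accuracy(truth, ocr):
    """1 minus the edit distance relative to the ground-truth length."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - levenshtein(truth, ocr) / len(truth))


def word_accuracy(truth, ocr):
    """Share of ground-truth words that a full-text search would find."""
    truth_words = Counter(truth.lower().split())
    ocr_words = Counter(ocr.lower().split())
    total = sum(truth_words.values())
    if total == 0:
        return 1.0
    found = sum(min(n, ocr_words[w]) for w, n in truth_words.items())
    return found / total


truth = "und wenn der Text gut ist"
ocr = "und wcnn der Text gut ist"   # one broken character
print(f"character accuracy: {character_accuracy(truth, ocr):.2%}")
print(f"word accuracy:      {word_accuracy(truth, ocr):.2%}")
```

A single wrong character out of 25 costs one word out of six, which is why word accuracy is always the stricter figure.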

Examples from real-world projects
 • Based on: ABBYY Recognition Server 2
          – Reichstagsprotokolle, 1925
          – Zedler, 1744
          – Coburger Zeitung, 1808
          – Judentum, 1803
          – Eckartshausen, 1792
          – Landesbauernkammer, 1921
          – Galvani, 1793
          – Hieber, 1722
          – Hofmann, 1875
          – Buschendorf, 1805
          – Schreiben, 1689
          – Latin texts

Correction of OCR text
 • Until recently regarded as "absurd"
 • But:
          – Crowd sourcing
          – New technologies
 • Crowd sourcing: figures from the Australian Newspaper Project
          – Correction via a simple editor: line-by-line correction
          – Since August 2008, 6,000 users have contributed
          – 7 million lines in 318,000 articles were corrected
          – At 50 characters per line, this work is worth about 200,000 EUR
            (measured against the prices of service providers)
 • New technologies
          – IBM: CONCERT Tool, LMU: PostCorrection Tool
          – Productivity is expected to improve several-fold compared with
            simple rekeying (by a factor of at least 5)
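
The crowd-sourcing figures can be checked with a short calculation. The inputs below are the numbers from the slide; the derived values (total characters, implied price per 1,000 characters, averages per user and per article) are not stated in the source and are only what the arithmetic yields.

```python
def crowdsourcing_summary(lines, articles, users, chars_per_line, value_eur):
    """Derive totals and averages from the figures quoted on the slide."""
    total_chars = lines * chars_per_line
    return {
        "total_chars": total_chars,
        "eur_per_1000_chars": value_eur / total_chars * 1000,
        "lines_per_user": lines / users,
        "lines_per_article": lines / articles,
    }


summary = crowdsourcing_summary(
    lines=7_000_000,       # corrected lines
    articles=318_000,      # corrected articles
    users=6_000,           # contributing users
    chars_per_line=50,     # the slide's assumption
    value_eur=200_000,     # the slide's estimate
)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

The estimate thus corresponds to 350 million corrected characters, or roughly 0.57 EUR per 1,000 characters.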

What to do with OCR results?
 • Structural enhancement
          – INEX: competition based on OCR files
          – Functional Extension Parser
 • Preservation
          – Complexity increases significantly
          – Output: TXT, PDF, ABBYY XML
          – ALTO format
          – How can users' corrections be integrated?
          – Proposal for extending the ALTO format
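
To make the question of user corrections concrete, here is a minimal sketch that reads words from an ALTO-style file and applies corrections as a separate overlay keyed by the `ID` of each `String` element, so that the original OCR text is preserved. The sample fragment is hand-written and simplified (it is not validated against the ALTO schema), and the overlay approach is only one hypothetical design, not the extension proposed in the talk.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified ALTO-style fragment (assumption: ALTO v2 namespace).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String ID="S1" CONTENT="und" HPOS="100" VPOS="200"
                    WIDTH="60" HEIGHT="30" WC="0.98"/>
            <SP/>
            <String ID="S2" CONTENT="wcnn" HPOS="170" VPOS="200"
                    WIDTH="90" HEIGHT="30" WC="0.41"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_words(alto_xml):
    """Return one dict per String element: id, text, position, confidence."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local_name(el.tag) == "String":
            words.append({
                "id": el.get("ID"),
                "text": el.get("CONTENT"),
                "hpos": float(el.get("HPOS")),
                "vpos": float(el.get("VPOS")),
                "confidence": float(el.get("WC", "0")),
            })
    return words


def apply_corrections(words, corrections):
    """Overlay user corrections (String ID -> new text) without losing OCR text."""
    result = []
    for word in words:
        corrected = dict(word, ocr_text=word["text"])
        if word["id"] in corrections:
            corrected["text"] = corrections[word["id"]]
        result.append(corrected)
    return result


words = read_words(ALTO_SAMPLE)
corrected = apply_corrections(words, {"S2": "wenn"})
print([w["text"] for w in words])
print([(w["text"], w["ocr_text"]) for w in corrected])
```

Keeping the corrections outside the OCR file leaves the preserved master untouched, at the price of having to manage a second data set.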

Digital library applications
 • Full-text search
          – JSTOR, Google, publishers
          – Faceted search (Solr)
 • Indexing through search engines
          – Site XML
 • Visibility of the OCR text
          – User training (learning by doing)
          – Necessary if user correction is to be supported
 • New research fields
          – Text mining
          – Linking of texts
          – Near duplicates, similarity and new identifiers
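
Near-duplicate detection on OCR text can be sketched with word shingles and Jaccard similarity. This is a common textbook technique, chosen here for illustration; the slides do not say which method is meant. The shingle length `k=3` is an arbitrary assumption.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive words (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
ocr_copy = "the quick brown fox jumps over the lazv dog"  # one OCR error
unrelated = "careful scanning is the main prerequisite for good results"

print(jaccard(shingles(original), shingles(ocr_copy)))
print(jaccard(shingles(original), shingles(unrelated)))
```

A single OCR error already removes two of the seven shingles, so thresholds for "near duplicate" have to be set lower for OCR text than for clean text.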

Summary
 • OCR is a "must"
          – For documents of the 19th and 20th centuries, OCR generally provides
            useful or even very good results
          – Before 1800: improvements can be expected from IMPACT
          – Careful and exact scanning is always the main prerequisite, preferably
            at 400 ppi and 8 or 24 bit
          – Test runs with random sets
 • Modern applications
          – Full-text search
          – Visibility of the erroneous text
          – Options for users to correct the text
          – Several export formats (also for end users)
          – Site XML for search engines
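
A test run with a random set can be prepared with the standard library. The sketch below draws a reproducible random sample of page identifiers; the identifier scheme, the collection size and the sample size are all hypothetical.

```python
import random


def draw_test_set(page_ids, sample_size, seed=None):
    """Draw a random sample of pages without replacement.

    A fixed seed makes the selection reproducible, so the same pages can be
    re-run after changing scan or OCR settings.
    """
    pages = list(page_ids)
    if sample_size > len(pages):
        raise ValueError("sample_size exceeds the number of pages")
    rng = random.Random(seed)
    return sorted(rng.sample(pages, sample_size))


# Hypothetical collection of 5,000 page images.
all_pages = [f"page_{n:04d}" for n in range(1, 5001)]
test_set = draw_test_set(all_pages, sample_size=10, seed=2010)
print(test_set)
```

Sampling across the whole collection, rather than taking the first volume, gives a fairer picture of the OCR quality to expect.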

                Thank you for your attention!

Bratislava WS - Mühlberger - OCR in libraries_pdf

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR in libraries – some practical remarks Günter Mühlberger Department for Digitisation and Digital Preservation University Innsbruck Library
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR in Libraries  Not an easy chapter...  Is the glass half empty or half full?  Historical fonts: Black letter, gothic, Old Cyrillic, ...  Great attempts for full-text – JSTOR (1994) – Google (2004)  But: Still many digital libraries without integrated full-text
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR and Digitization  OCR changes everything!  Workflow has to be adopted at all steps – Preparation and selection of material – Image processing & scanning – Quality control – Storage and preservation – Correction and user involvement – Full-text search – Web interfaces for digital libraries  Significant increase in complexity
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 4
  • 5. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 5
  • 6. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 6
  • 7. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 7
  • 8. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 8
  • 9. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 9
  • 10. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 10
  • 11. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 11
  • 12. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 12
  • 13. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 13
  • 14. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Preparation  Which material will be taken for scanning? Options: – Bound volumes? – Microfilm? – Loose folios?
  • 15. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Option: Bound volumes  Bound volumes – Pros:  That’s the way books/journals/newspapers are in the library – Cons:  Often narrow binding, especially with newspapers  Often warping due to humidity – Remark  Technical solution: ScanRobots make life easier and double the speed compared to manual interaction, e.g. 700 – 1000 pages per hour  Investment for ScanRobots must not be underestimated 15
  • 16. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Option: Microfilm  Microfilm – Pros:  If a microfilm is available it is a cheap alternative  Easy option (no handling of volumes) – Cons:  Microfilms have the same problems as bound volumes  Microfilms were often produced with minimum quality control  Microfilms before 1990 are often not in a good condition  Remark – If microfilm was produced with good quality than there is no significant difference in the OCR quality  Case study with BL material will be published on IMPACT site 16
  • 17. Option: Loose folios
      • Pros:
          – No tight binding, less warping
          – Extremely fast throughput with industrial scanners, at a low price
          – Duplicates can be sent to offshore providers in large batches
      • Cons:
          – Not feasible for material before 1850: libraries would find it hard to justify
          – Organisational effort to manage duplicates (but completeness has to be checked anyway)
      • Remark:
          – By far the best option for producing high quality with the fewest resources
          – Especially interesting for newspapers, 20th century material and grey literature
          – Used e.g. by MOA, JSTOR
  • 18. Good, bad and ugly images
      • Careful scanning is the be-all and end-all
          – ScanRobots and document scanners lower the demands on the operator, but individual skill is still decisive
      • Criteria for a good page image are simple:
          – sharp
          – crisp letterforms with clear curves
          – clean background, no show-through from the reverse side
          – no warping of the page and no geometric distortion
          – complete capture with a white margin around the text borders
          – text lines parallel or at right angles to the page borders
          – no marks left by users
      • With perfect images you can wait until OCR technology improves; with bad images you will never get good results
  • 23. Bad print: broken characters
  • 24. [Image: sample of broken print, reading "und wenn" ("and if")]
  • 26. Bitonal or 8/24 bit, 300 or 400 ppi, JPEG or TIFF?
      • Bitonal vs. 8/24 bit
          – Rose Holley, D-Lib paper 2009: greyscale scanning does not lead to better results
          – Experiment: microfilm scanned bitonal or greyscale, no difference
      • Simple experiments show the opposite
          – Innsbrucker Zeitungsarchiv: bitonal and 24 bit
          – Results are clearly better with colour
      • 300 or 400 ppi resolution
          – Very small font: Word text in a 4 point font
      • JPEG vs. TIFF RGB
          – Tests with the Treventus ScanRobot, but also with other material, show no advantage of TIFF RGB images over compressed JPEGs
      • Modern documents with medium-sized fonts can be scanned at 300 ppi and bitonal, but documents with small fonts, challenging paper quality etc. should be scanned at 400 ppi and 8 or 24 bit, and can be stored as JPEGs with e.g. 90% compression rate
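To see why resolution matters for very small type, it helps to work out how many pixels a 4 point font actually gets. The sketch below assumes the modern typographic point of 1/72 inch (historical point systems differ slightly) and uses a hypothetical helper name. The nominal font size is the height of the whole type body, so individual lowercase letters are considerably smaller than the figures printed here.

```python
POINTS_PER_INCH = 72  # assumption: modern DTP point, 1/72 inch


def font_height_px(point_size: float, ppi: int) -> float:
    """Nominal height of the type body in pixels at a given scan resolution."""
    return point_size / POINTS_PER_INCH * ppi


for ppi in (300, 400):
    height = font_height_px(4, ppi)
    print(f"4 pt at {ppi} ppi: about {height:.1f} px")
```

Going from 300 to 400 ppi gives every character one third more pixels in each direction, which is where the extra detail for small fonts comes from.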
  • 27. Accuracy
      • Is the glass half full or half empty?
          – Rose Holley: below 90% word recognition is a poor result
          – Google: OCR every image, so every correctly recognized word is better than nothing
          – Painful errors?
          – Mature users?
      • Character vs. word accuracy
          – Word accuracy says much more and is much easier to measure: each word that would be found correctly in a full-text search can be counted as correct
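The search-oriented definition of word accuracy above can be sketched in a few lines. This is a minimal illustration, not the measure used in any of the cited projects: the function names are hypothetical, tokenisation is a simple lower-cased split on word characters, and word order is ignored because a full-text index does not care about it.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth words that also occur in the OCR output.

    Each ground-truth word counts as correct if the OCR text contains it
    (respecting how often it occurs), i.e. if a search for it would hit.
    """
    truth = Counter(tokenize(ground_truth))
    ocr = Counter(tokenize(ocr_text))
    total = sum(truth.values())
    if total == 0:
        raise ValueError("ground truth contains no words")
    matched = sum(min(count, ocr[word]) for word, count in truth.items())
    return matched / total


# One broken character ("wcnn" instead of "wenn") loses one of five words.
print(word_accuracy("und wenn der Tag kommt", "und wcnn der Tag kommt"))
```

In this example a single wrong character out of 22 still leaves character accuracy at roughly 95%, while word accuracy drops to 80%, which is why the word-level figure says more about what users can actually find.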
  • 28. Examples from real world projects
      • Based on: ABBYY Recognition Server 2
          – Reichstagsprotokolle, 1925
          – Zedler, 1744
          – Coburger Zeitung, 1808
          – Judentum, 1803
          – Eckartshausen, 1792
          – Landesbauernkammer, 1921
          – Galvani, 1793
          – Hieber, 1722
          – Hofmann, 1875
          – Buschendorf, 1805
          – Schreiben, 1689
          – Lateinische Texte
  • 29. Correction of OCR text
      • Until recently regarded as "absurd"
      • But:
          – Crowdsourcing
          – New technologies
      • Crowdsourcing
          – Figures from the Australian Newspaper Project:
          – Correction via a simple editor: line-by-line correction
          – Since August 2008, 6,000 users have contributed
          – 7 million lines in 318,000 articles were corrected
          – At 50 characters per line this is worth about 200,000 EUR (compared to the prices of service providers)
      • New technologies
          – IBM: CONCERT Tool, LMU: PostCorrection Tool
          – Productivity compared to simple rekeying will be enhanced several times over (by a factor of at least 5)
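The crowdsourcing estimate can be retraced with simple arithmetic. The sketch below uses only the figures from the slide; the resulting price per 1,000 characters is derived here for illustration and is not a rate stated in the source.

```python
lines_corrected = 7_000_000    # lines corrected by volunteers (from the slide)
chars_per_line = 50            # assumed average line length (from the slide)
estimated_value_eur = 200_000  # estimated value of the work (from the slide)

total_chars = lines_corrected * chars_per_line
eur_per_1000_chars = estimated_value_eur / (total_chars / 1000)

print(f"{total_chars:,} characters corrected")
print(f"implied rate: about {eur_per_1000_chars:.2f} EUR per 1,000 characters")
```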
  • 30. What to do with OCR results?
      • Structural enhancement
          – INEX: competition based on OCR files
          – Functional Extension Parser
      • Preservation
          – Complexity is significantly increased
          – Output: TXT, PDF, ABBYY XML
          – ALTO format
          – How to integrate corrective actions of users?
          – Proposal for enhancing the ALTO format
  • 31. Digital library applications
      • Full-text search
          – JSTOR, Google, publishers
          – Faceted search (Solr)
      • Indexing through search engines
          – Site XML
      • Visibility of the OCR text
          – User training (by doing)
          – Necessary if correction is to be included
      • New research fields
          – Text mining
          – Linking of texts
          – Near duplicates, similarity and new identifiers
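To give the "near duplicates" item a concrete shape: one common way to compare two OCR texts is to break them into overlapping character n-grams (shingles) and measure the overlap of the two sets with the Jaccard coefficient. The slides do not name a method, so the technique, the function names and the shingle length of 5 are all assumptions made for this sketch.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Set of overlapping character k-grams after whitespace normalisation."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(max(len(normalized) - k + 1, 1))}


def jaccard(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    set_a, set_b = shingles(a, k), shingles(b, k)
    return len(set_a & set_b) / len(set_a | set_b)


original = "Careful scanning is always the main prerequisite"
ocr_copy = "Careful scannlng is always the main prerequisite"  # one OCR error
unrelated = "Microfilms were often produced with minimal quality control"

print(jaccard(original, ocr_copy))   # high: near duplicate despite the error
print(jaccard(original, unrelated))  # low: different text
```

Character shingles are used here instead of whole words because a single OCR error then spoils only a few shingles rather than an entire word.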
  • 32. Summary
      • OCR is a "must"
          – For documents of the 19th and 20th century, OCR generally provides useful or even very good results
          – Before 1800: improvements can be expected from IMPACT
          – Careful and exact scanning is always the main prerequisite, preferably at 400 ppi and 8 or 24 bit
          – Test runs with random sets
      • Modern applications
          – Full-text search
          – Visibility of the erroneous text
          – Options for users to correct the text
          – Several export formats (also for end users)
          – Site XML for search engines
  • 33. Thank you for your attention!