Overview of IMPACT tools
Baseline: FineReader XIX

• First Omnifont OCR for Fraktur and Old European Scripts
• Special dictionaries for Old European languages
• Available as a result of METAe project
• Baseline for IMPACT project

IMPACT project: Improvements at every step
• Image pre-processing
   – Better binarisation
• Layout analysis
   – General quality improvements
   – Better segmentation of historical newspapers
• Recognition
   – New classifiers
   – Better gothic recognition, including new graphemes
   – Old Slavonic script supported
• Export
   – ALTO XML
   – PDF MRC: smaller file size with the same image quality
• Extensibility
   – Extended language support
   – Training API
• Licensing and pricing
   – Special XIX pricing for IMPACT project

Much better quality




Technology availability

• FineReader Engine 9, 10, 11
• Recognition Server 3.0
• FineReader Online (new)
• Cloud OCR SDK (new)

Thank you!




Text Recognition
DoW: “Significantly improved performance and substitution rate especially for historical texts.”
• Improved recognition of pre-trained gothic fonts. New graphemes.
   – Test results: 20% to 30% recognition quality improvement
• New language: Old Church Slavonic
   – Test results: ~5% recognition errors

More: Native ALTO Support




                  Available in

              FineReader Engine > 10 R2
              Cloud OCR SDK
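
To illustrate what ALTO output makes possible downstream, the following minimal sketch reads the text lines out of an ALTO-style fragment with Python's standard library. The sample XML is invented for this example and omits the namespace that real ALTO files declare; only the element and attribute names (TextLine, String, CONTENT) are taken from the ALTO schema. It is not FineReader output.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO-style fragment, made up for illustration.
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Overview"/><SP/><String CONTENT="of"/>
          </TextLine>
          <TextLine>
            <String CONTENT="IMPACT"/><SP/><String CONTENT="tools"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local(tag):
    """Strip an XML namespace prefix such as '{uri}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining its String/@CONTENT values."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local(elem.tag) == "TextLine":
            words = [s.get("CONTENT", "") for s in elem if local(s.tag) == "String"]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

Matching on the local tag name keeps the sketch usable on files that do declare a namespace, whichever ALTO version they use.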


More: PDF MRC

The same image quality with much lower file size


     Original: JP2, 251 KB                         PDF MRC: 53 KB

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.


Tools developed by the
National Center for Scientific Research (NCSR) "Demokritos"

 Image enhancement tools
 Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/

 Border Removal Tool
 It detects and removes noisy black borders as well as noisy text regions. Moreover, it detects the optimal page frames of double page document images.

   IMPACT event: Project outcomes | 26 June 2012: NCSR “Demokritos” Tools, B. Gatos

 Border Removal Tool, image sets:
 (SET-A) 38718 randomly selected historical images with and without noisy black borders.
 (SET-B) 22383 images with noisy black borders, subset of SET-A.
 (SET-C) 3467 double page historical images.

 Page Curl Correction Tool
 It rectifies document images which suffer from warping and perspective distortions that deteriorate the performance of OCR.

 Page Curl Correction Tool, image set: 420 randomly selected historical document images

 OCR evaluation toolkit
 Download: http://www.iit.demokritos.gr/~bgat/OCREval/

 Based on the isri-ocr-evaluation-tools (Information Science Research Institute, University of Nevada; source code from http://code.google.com/p/isri-ocr-evaluation-tools/; code license: Apache License 2.0)

 Measures:
   Character Accuracy
   Word Accuracy
   Rejection Rate
   Characters Marked
   Accuracy after Correction
   Accuracy by Character Class
   Figure of Merit

 OCREval.exe [a,u8,u16] stop_words.txt (or "-") gt.txt (or batchgt.txt) ocr.txt (or batch.txt) character_report.txt word_report.txt overall_report.xml
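
As a rough illustration of the first two measures, the sketch below computes accuracy as (n - errors) / n, where n is the length of the ground truth and errors is the edit distance between ground truth and OCR output. This is an assumption made for the example: it is not the code of OCREval or of the ISRI tools, and the word-level variant in particular is a simplification of how those tools match words. The function names and sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ground_truth, ocr_output):
    """(n - errors) / n, with n the ground-truth length."""
    n = len(ground_truth)
    if n == 0:
        raise ValueError("ground truth must not be empty")
    return (n - edit_distance(ground_truth, ocr_output)) / n


gt = "improved recognition of gothic fonts"
ocr = "improved recognition of gothic fouts"   # one character misread

char_acc = accuracy(gt, ocr)                   # character level
word_acc = accuracy(gt.split(), ocr.split())   # word level
print(f"character accuracy: {char_acc:.4f}, word accuracy: {word_acc:.2f}")
```

A single wrong character costs one error out of 36 at character level but one word out of five at word level, which is why the two figures are reported separately.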

 Word Spotting Application
 Download demo: http://www.iit.demokritos.gr/~bgat/WordSpot/

 It provides an integrated GUI for indexing historical documents without an OCR engine. It allows searching the database for instances of a query keyword using several different methods (list, query by example, free text).

 Word Spotting Application, benchmarking set: French book, BnF, year of publication 1838 (153 pages); German book, BSB, year of publication 1788 (126 pages)

 Scientific impact
 • Part of the PhD thesis of N. Stamatopoulos, who was awarded the "Best PhD Thesis of 2011 Award in the area of Informatics and Telematics" in Greece.
 • Publications in a great number of widely recognized journals and international conferences.
 • IMPACT related publications have 127 citations from non-IMPACT researchers (information calculated from Scopus).
 [Figure: three journals, with impact factors 2.918, 1.091 and 1.474]

IMPACT Project Outcomes, 26 June 2012, The Hague




IMPACT Image and Ground Truth Repository

• Content Management System
• Datasets
   – 667,437 images
   – 52,880 PAGE ground truth files
• Invaluable resource for planning,
  experimentation and evaluation
  relating to digitisation projects

• Access to existing material via CoC
• Further development and usage in
  other projects and domains
  (Europeana Newspapers)

Aletheia
• Semi-automated tool for ground truth and showcase production
• Full PAGE support
• Page border, print space
• Layout regions (incl. metadata)
• Text lines, words and glyphs
• Unicode text at all levels
• Reading order, layers etc.
• Quality control

Layout Evaluation Tool
• Performance evaluation of page segmentation and region classification
• Configuration of scenarios based on different types of errors
   – Miss / partial miss
   – Split
   – Misclassification
   – Merge
   – False detection
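
The error types listed above can be pictured with a small sketch that compares ground-truth regions with detected regions using rectangle overlap. This is only an illustration under simplifying assumptions (axis-aligned boxes, any overlap counts as a correspondence, partial misses are not modelled); it is not the algorithm of the Layout Evaluation Tool, and all names and coordinates are made up.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def overlap_area(a, b):
    """Area shared by two axis-aligned boxes (0 if they do not intersect)."""
    w = min(a.right, b.right) - max(a.left, b.left)
    h = min(a.bottom, b.bottom) - max(a.top, b.top)
    return max(w, 0) * max(h, 0)


def classify(ground_truth, detected):
    """Both arguments map a region name to (Box, region_type)."""
    errors = []
    for g_name, (g_box, g_type) in ground_truth.items():
        hits = [d for d, (d_box, _) in detected.items() if overlap_area(g_box, d_box) > 0]
        if not hits:
            errors.append(("miss", g_name))
        elif len(hits) > 1:
            errors.append(("split", g_name))
        elif detected[hits[0]][1] != g_type:
            errors.append(("misclassification", g_name))
    for d_name, (d_box, _) in detected.items():
        hits = [g for g, (g_box, _) in ground_truth.items() if overlap_area(g_box, d_box) > 0]
        if not hits:
            errors.append(("false detection", d_name))
        elif len(hits) > 1:
            errors.append(("merge", d_name))
    return errors


ground_truth = {
    "heading": (Box(0, 0, 100, 20), "heading"),
    "para1": (Box(0, 30, 100, 60), "text"),
    "para2": (Box(0, 70, 100, 100), "text"),
    "image": (Box(0, 110, 100, 150), "image"),
    "caption": (Box(0, 160, 100, 170), "text"),
}
detected = {
    "d1": (Box(0, 0, 100, 20), "text"),        # heading labelled as text
    "d2": (Box(0, 30, 100, 100), "text"),      # covers both paragraphs
    "d3": (Box(0, 110, 45, 150), "image"),     # left half of the image
    "d4": (Box(55, 110, 100, 150), "image"),   # right half of the image
    "d5": (Box(0, 200, 100, 210), "text"),     # nothing there in the ground truth
}

for kind, region in classify(ground_truth, detected):
    print(f"{kind}: {region}")
```

A scenario in the sense of the slide could then weight these error kinds differently, for example penalising merges more heavily than splits.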

Text Line and Word Segmentation
• Modules for integration in workflows and/or other tools
• Independent of OCR
• Used in
   – Aletheia
   – Dewarping
   – T-OCR
   – Word spotting

Dewarping Tool
• Correction of arbitrary warping artefacts
• Two stage process:
   – Model detection
   – Image transformation
• Improved legibility
• Increased OCR accuracy

More Information

• PRImA
   http://www.primaresearch.org
• IMPACT CoC
   http://www.digitisation.eu

IBM Labs in Haifa




IBM's Highlights
• CONCERT (COoperative eNgine for Correction of ExtRacted Text)
   – IBM's collaborative correction platform announced in August 2010
     http://www-03.ibm.com/press/us/en/pressrelease/32380.wss
   – IBM's Adaptive OCR integrated
   – EE2/EE3 Dictionaries integrated (Dutch, German, Spanish)
• Sophisticated monitoring mechanism integrated in 2011
   – Designed for both malicious users and game users
• CONCERT Games introduction
   – IMPACT Conference 2011
   – Beyond the productivity tools
• Adaptive OCR System
   – Font and language independent

CONCERT
• Adaptive collaborative correction platform
   – Uses the feedback from the users to improve productivity
   – Fully connected to the Adaptive OCR Engine
• Strong emphasis on productivity tools
   – Reduce the time for verification/correction (patented smart-key approach)
   – Motivate volunteers
• Separating data entry process into several complementary tasks
   – Optimized application dedicated to each task
   – Break down the tasks into subtasks
   – Make it suitable for parallel processing
   – Online compilation
• User Monitoring
   – Robust against malicious users and designed for gamification

Adaptive OCR Engine
• Consistent and reliable confidence level
   – Important for quality assurance
• No use of prior knowledge on the font
   – Crazy fonts can be handled
• Good use of the feedback from the users
   – Character and word level
• Robust to distortion
   – Page level distortion and printing variations
• Easy to migrate between books from the same publisher
   – Continuous update
• Not too slow
   – Around 2-3 times slower than OMNI engines

What is next?
• Search Over OCR
   – Beyond transcription
• Improve User Feedback
   – Online advisor
   – Best performers list
• Community building around content
   – Integrate community tools within the platform
• Complete CONCERT Games
   – iPhone/iPad/Android/Desktop
• E-Book creation
   – Fully digital transcription
   – Using original font as option
• Page distortion correction
   – Fully integrate the word-based page distortion correction





The Functional Extension Parser (FEP)
A Document Understanding Platform
Günter Mühlberger
University of Innsbruck
Department for German Language and Literature Studies

Functional Extension Parser
• A book is more than just pure text: it contains a lot of structural metadata
   – These metadata are (often) encoded in the layout of a document
   – Size of characters, position on page, distance to other lines, etc. are used to express structural meaning such as headlines, footnotes, caption lines, etc.
• FEP is a platform to process digitised or born-digital documents and to “understand” the meaning of the layout by using a rules engine
   – Modular approach
   – Images and OCR as input
   – Web GUI for visualisation, editing and correction
   – Evaluation module
   – Export module (METS/ALTO, PDF, ePUB, ...)
• Several pilots
   – EOD Network: Enhancement of PDF eBook of historical books
   – DNB: Extraction of metadata from title pages
   – Internal project: Card index
• FEP is available for free for research applications
   – Commercial deployment is foreseen via the technology transfer platform of the University of Innsbruck
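
The idea of reading structure from layout with a rules engine can be sketched in a few lines. The example below is purely illustrative and is not FEP code: the `Line` fields, the rule set and the thresholds (1.3, 0.85, 0.8) are assumptions chosen for the demonstration. Each rule is a (label, predicate) pair, and the first matching rule wins.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    font_size: float  # character height in points
    top: float        # vertical position: 0.0 = top of page, 1.0 = bottom


def make_rules(body_size):
    """Ordered (label, predicate) pairs; thresholds are hypothetical."""
    return [
        ("headline", lambda ln: ln.font_size >= 1.3 * body_size),
        ("footnote", lambda ln: ln.font_size <= 0.85 * body_size and ln.top >= 0.8),
        ("paragraph", lambda ln: True),  # fallback rule
    ]


def label_lines(lines):
    """Label each line relative to the page's median font size."""
    body_size = median(ln.font_size for ln in lines)
    rules = make_rules(body_size)
    return [next(label for label, test in rules if test(ln)) for ln in lines]


page = [
    Line("Chapter I", 18, 0.10),
    Line("It was a dark night.", 11, 0.20),
    Line("The wind blew.", 11, 0.25),
    Line("And on it went.", 11, 0.30),
    Line("1 See the first edition.", 8, 0.92),
]

labels = label_lines(page)
for ln, label in zip(page, labels):
    print(f"{label:10s} {ln.text}")
```

Keeping the rules as data rather than as nested conditionals is what makes such a setup modular: a new document type only needs a different rule list.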





Postcorrection Tool
Ludwig-Maximilians-Universität München
Centrum für Informations- und Sprachverarbeitung




The Hague, June 26th 2012
Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz,
Thorsten Vobl

Postcorrection Tool as Carrier of IMPACT Technology

• Advanced language technology used for improvement of interactive postcorrection
   – Lexica, matching tool, profiler integrated as background technology
   – Document centric knowledge from unsupervised analysis of OCRed document used for detection of error classes and suggested corrections
   – Batch mode for corrections of many errors in "one shot"
• Rich graphical user interface to let users fully benefit from "knowledge" on document derived error classes

Improved Lexicon Technology

• Valid historical words not marked as errors
• Historical variants proposed as correction candidates

Improved Workflow - Batch Processing

• Strings with identical error patterns corrected as batch
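
One way to picture batch correction is to group suggested corrections by the character confusion they share, so that a whole group can be accepted at once. The sketch below assumes the simplest case, a single substituted character per token; the function names and the German sample tokens are invented, and this is not the Postcorrection Tool's implementation.

```python
from collections import defaultdict


def error_pattern(ocr_token, correction):
    """Return 'x->y' if the strings differ by exactly one substituted character."""
    if len(ocr_token) != len(correction):
        return None
    diffs = [(a, b) for a, b in zip(ocr_token, correction) if a != b]
    if len(diffs) != 1:
        return None
    wrong, right = diffs[0]
    return f"{wrong}->{right}"


def group_by_pattern(suggestions):
    """Map each error pattern to the (ocr_token, correction) pairs showing it."""
    batches = defaultdict(list)
    for ocr_token, correction in suggestions:
        pattern = error_pattern(ocr_token, correction)
        if pattern is not None:
            batches[pattern].append((ocr_token, correction))
    return dict(batches)


def apply_batch(tokens, batch):
    """Replace every token that has a correction in the accepted batch."""
    mapping = dict(batch)
    return [mapping.get(tok, tok) for tok in tokens]


# Hypothetical suggestions, e.g. produced by matching OCR tokens against a lexicon.
suggestions = [("uud", "und"), ("fein", "sein"), ("eiue", "eine"), ("daf", "das")]
batches = group_by_pattern(suggestions)

tokens = ["uud", "fein", "Haus", "eiue"]
corrected = apply_batch(tokens, batches["u->n"])
print(batches)
print(corrected)
```

Accepting the "u->n" batch fixes both affected tokens in one step, while "fein" stays untouched until its own batch is reviewed.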

Improved Selection of Correction Candidates

• Ranking of candidates through document specific language and error profile
Lexica and Strategies and Tools for Lexicon Building & Lexicon Deployment in OCR and Retrieval

Historical Language

Available through the IMPACT Centre of Competence: www.digitisation.eu
Lexica for 9 languages


Example:

·Dutch IR lexicon:  (will be available as web service)
     WNT (Dictionary of the Dutch language), 475,498 distinct word forms, 215,180 lemmata,
     and 558,438 distinct lemma/word form combinations, with 1,636,709 attestations.
·Dutch OCR lexicon (1)
     Based on a large historical corpus from the DBNL (Digitale Bibliotheek voor de
     Nederlandse Letteren, Digital Library for Dutch Literature).
·Dutch OCR lexicon (2)
           Based on the IR lexicon (WNT) [best results in CCS experiment]
·Dutch Named Entity lexicon (persons, locations, organizations)
·Set of historical spelling variation rules
     Using a reference set of 10,000 pairs of historical words taken from the WNT-based lexicon,
     to which parallel modern words have been added
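
How spelling variation rules link a historical form to a modern one can be shown with a toy example. The four rewrite rules and the tiny lexicon below are illustrative assumptions, not the IMPACT rule set, which was derived from the reference pairs mentioned above.

```python
# Hypothetical rewrite rules (historical -> modern) and a toy modern lexicon.
RULES = [("ae", "aa"), ("gh", "g"), ("ck", "k"), ("y", "ij")]
MODERN_LEXICON = {"jaar", "geen", "boek", "mijn"}


def candidates(word, rules):
    """Forms reachable by applying each rule, in order, or skipping it."""
    forms = {word}
    for old, new in rules:
        forms |= {form.replace(old, new) for form in forms}
    return forms


def modernise(word, rules=RULES, lexicon=MODERN_LEXICON):
    """Return the modern lexicon entries that the rules can reach from `word`."""
    return sorted(candidates(word.lower(), rules) & lexicon)


for historical in ["jaer", "gheen", "boeck", "myn", "huis"]:
    print(historical, "->", modernise(historical))
```

Filtering the generated forms against a lexicon keeps the rules from producing non-words; a word that is neither in the lexicon nor reachable by a rule simply yields no candidate.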
Lexica in OCR: Results
for 9 languages
Lexica in OCR:
External Dictionary interface

· Function of the tool: enable addition of (IMPACT) special lexica to the FineReader engine
· Availability: free for non-commercial use
Lexica in OCR:
OCR Evaluation Tool

· Function of the tool: evaluate influence of 
  lexica on OCR, measure OCR (word) 
  accuracy
· Availability: free for non‐commercial use, 
  Impact Interoperability Framework
Lexica in Retrieval:
BlackLab

· Function of the tool: enable use of lexica and linguistic annotation in a Lucene-based search engine
  ·   Deployment of IMPACT IR lexica in query expansion
  ·   Keyword-in-context display of search results
  ·   Grouping and sorting of search results
  ·   Support for linguistic annotation (lemma, part of speech, etc)
  ·   Exploitation of named entity tagging
  ·   Incremental indexing
  ·   Supports CLARIN standards (CQL)

· Availability: free for non‐commercial use
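
Two of the features listed above, query expansion from an IR lexicon and keyword-in-context display, can be sketched together. The code is a standalone illustration and does not use or reproduce the BlackLab API; the lexicon entry, function names and sample sentence are made up.

```python
# Hypothetical IR lexicon: modern lemma -> attested historical word forms.
LEXICON = {"jaar": {"jaar", "jaer", "iaer"}}


def expand(query, lexicon):
    """Expand a modern query term to all word forms listed for it."""
    return lexicon.get(query, {query})


def kwic(tokens, query_forms, width=2):
    """Return (left context, hit, right context) for every matching token."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() in query_forms:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


tokens = "in het jaer 1650 was het een goed jaar".split()
results = kwic(tokens, expand("jaar", LEXICON))
for left, hit, right in results:
    print(f"{left:>10} [{hit}] {right}")
```

Searching for the modern form "jaar" also finds the historical spelling "jaer", which a plain string match would miss.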
Toolbox for
lexicon building and deployment

· Function of the tools: support lexicon 
  building for historical language
  ·   Corpus-based lexicon building
  ·   Lexicon-building from dictionary quotations
  ·   Spelling variation: pattern extraction and matching
  ·   Lemmatization

· Availability: free for non‐commercial use
CoBaLT corpus-based
lexicon building tool

· Function of the tool: productivity tool  for 
  IR lexicon building from corpora
  · Both for pre-lemmatized and unannotated corpora
  · Web-based
· Availability: free for non‐commercial use
Dictionary attestation tool


· Function of the tool: productivity tool  for IR 
  lexicon building from dictionary quotations 
  · Manual evaluation and correction of large quantities of
    automatically matched occurrences of a headword (scripts
    available) in the quotations of the particular article in a
    comprehensive dictionary
  · High productivity: revision speed (OED) 400-600
    entries/hour
  · Web-based
· Availability: free for non‐commercial use
Spelling variation
and lemmatization tools

· Function of the tools: 
  · match (inflected) historical spellings of words to
    modern dictionary headword form
  · Derive spelling variation patterns from example
    material
· Availability: free for non‐commercial use
NERT
named entity recognition tool

· Function of the tool: mark up named 
  entities in text
· Stanford NE recognizer extended to handle spelling variation and OCR errors
· Availability: free for non‐commercial use
Named entity
matching tool

· Function of the tool: match named entities 
  with historical spelling to standard form
· Availability: free for non‐commercial use
Named entity
attestation tool

· Function of the tool: manual evaluation 
  and correction of automatically matched 
  occurrences of Named Entities in text 
  material
· Web‐based
· Availability: free for non‐commercial use
Background / Use case

  A diverse set of tools relevant to OCR & digitisation being
  developed by IMPACT and also others
But:
  Mix of prototypes, robust commercial solutions, 3rd party tools
  Variety of programming languages, environments, backgrounds
Solution:
  Wrapping of tools as web services allows:
  - quick & easy integration → uniform access
  - flexible, loose coupling of separate modules
  - platform independent execution via the web

    IMPACT Interoperability Framework (IIF)
Overview: Service Oriented Architecture

IMPACT Interoperability Framework Technologies:
  - Java
  - REST/SOAP
  - Apache Maven
  - Apache Tomcat
  - Apache Axis2
  - Apache Synapse
  - Taverna Engine


Interoperability through open technology standards (OASIS/W3C)
Workflow example


  • OCR task = processing pipeline
  • Building blocks = individual tools
  • Workflow = interaction between tools (mashups)
  • Collaboration with
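
The pipeline view of a workflow can be sketched as a chain of tools that share one uniform interface. In the sketch below each tool is a plain Python function standing in for what would be a web-service call in the IIF; the tool names, the document dictionary and its fields are assumptions for illustration only, not part of the framework.

```python
from functools import reduce


# Each stand-in tool takes a document dict and returns an updated copy.
def binarise(doc):
    return {**doc, "steps": doc["steps"] + ["binarisation"]}


def segment(doc):
    return {**doc, "steps": doc["steps"] + ["layout analysis"]}


def recognise(doc):
    return {**doc, "steps": doc["steps"] + ["ocr"], "text": "recognised text"}


def run_workflow(doc, tools):
    """Feed the output of each tool into the next one, in order."""
    return reduce(lambda current, tool: tool(current), tools, doc)


result = run_workflow({"image": "page_001.tif", "steps": []},
                      [binarise, segment, recognise])
print(result["steps"])
```

Because every building block accepts and returns the same structure, tools can be swapped or reordered without touching the others, which is the loose coupling the service wrapping is meant to provide.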
Availability & Takeup




Availability
    IIF components released as open source (Apache 2.0 License):
    https://github.com/impactcentre/interoperability-framework
    Community workflow resources (CC License):
    http://www.myexperiment.org/groups/235
Takeup:
    SCAPE – Scalable Preservation Environments
    DDMAL – Digital Music Archives & Libraries Laboratory Montreal

Más contenido relacionado

Similar a IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demokritos, University of Salford, IBM, Uni-versity of Innsbruck, LMU University of Munich, INL Institute for Dutch Lexicology and KB

The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesMichael Day
 
IMPACT Demo Dag at KB
IMPACT Demo Dag at KBIMPACT Demo Dag at KB
IMPACT Demo Dag at KBcneudecker
 
An Experimental Workflow Development Platform for Historical Document Digitis...
An Experimental Workflow Development Platform for Historical Document Digitis...An Experimental Workflow Development Platform for Historical Document Digitis...
An Experimental Workflow Development Platform for Historical Document Digitis...cneudecker
 
IMPACT at OCR Summit
IMPACT at OCR SummitIMPACT at OCR Summit
IMPACT at OCR Summitcneudecker
 
Experimental Workflow Development in Digitisation
Experimental Workflow Development in DigitisationExperimental Workflow Development in Digitisation
Experimental Workflow Development in Digitisationcneudecker
 
The British Library, London: Old Maps Online
The British Library,  London: Old Maps OnlineThe British Library,  London: Old Maps Online
The British Library, London: Old Maps OnlinePetr Pridal
 
Archaeology & cultural heritage application working group part 2
Archaeology & cultural heritage application working group part 2Archaeology & cultural heritage application working group part 2
Archaeology & cultural heritage application working group part 2Manolis Vavalis
 
Centre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerCentre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerBiblioteca Nacional de España
 
FOCUS K3D Newsletter (Feb 2010)
FOCUS K3D Newsletter (Feb 2010)FOCUS K3D Newsletter (Feb 2010)
FOCUS K3D Newsletter (Feb 2010)FOCUS K3D
 
3D-ICONS - D5.1: Report on 3D Publication Formats Suitable for Europeana
3D-ICONS - D5.1: Report on 3D Publication Formats Suitable for Europeana3D-ICONS - D5.1: Report on 3D Publication Formats Suitable for Europeana
3D-ICONS - D5.1: Report on 3D Publication Formats Suitable for Europeana3D ICONS Project
 
Wonderland @ Cattid - Sun's Virtual Workplace
Wonderland @ Cattid - Sun's Virtual WorkplaceWonderland @ Cattid - Sun's Virtual Workplace
Wonderland @ Cattid - Sun's Virtual Workplacevincenzo de simone
 
DURAARK Preserving Architectural Knowledge
DURAARK Preserving Architectural KnowledgeDURAARK Preserving Architectural Knowledge
DURAARK Preserving Architectural KnowledgeRuggero Lancia
 
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...lindlar
 
Building new knowledge from distributed scientific corpus: HERBADROP & EUROPE...
Building new knowledge from distributed scientific corpus: HERBADROP & EUROPE...Building new knowledge from distributed scientific corpus: HERBADROP & EUROPE...
Building new knowledge from distributed scientific corpus: HERBADROP & EUROPE...Nuno Freire
 
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCES
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCESGAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCES
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCESAlexander Lavrov
 
FOCUS K3D Project Flyer
FOCUS K3D Project FlyerFOCUS K3D Project Flyer
FOCUS K3D Project FlyerFOCUS K3D
 

IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demokritos, University of Salford, IBM, University of Innsbruck, LMU University of Munich, INL Institute for Dutch Lexicology and KB

  • 1. Overview of IMPACT tools
  • 3. Baseline: FineReader XIX. First omnifont OCR for Fraktur and Old European scripts; special dictionaries for Old European languages; available as a result of the METAe project; baseline for the IMPACT project.
  • 4. IMPACT project: Improvements at every step. Image pre-processing: better binarisation. Layout analysis: general quality improvements; better segmentation of historical newspapers. Recognition: new classifiers; better gothic recognition, including new graphemes; Old Slavonic script supported. Export: ALTO XML; PDF MRC (smaller file size with the same image quality). Extensibility: extended language support; training API. Licensing and pricing: special XIX pricing for the IMPACT project.
  • 6. Technology availability: FineReader Engine 9, 10 and 11; Recognition Server 3.0; FineReader Online (new); Cloud OCR SDK (new).
  • 8. Text Recognition. Improved recognition of pre-trained gothic fonts, with new graphemes; test results: 20% to 30% recognition quality improvement. DoW: “Significantly improved performance and substitution rate especially for historical texts.” New language: Old Church Slavonic; test results: ~5% recognition errors.
  • 9. More: Native ALTO Support. Available in FineReader Engine > 10 R2 and Cloud OCR SDK.
  • 10. More: PDF MRC. The same image quality with much lower file size. Original: JP2, 251 KB; PDF MRC: 53 KB.
  • 12. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos". Image enhancement tools: Border Removal Tool (download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/ ). It detects and removes noisy black borders as well as noisy text regions. Moreover, it detects the optimal page frames of double-page document images. IMPACT event: Project outcomes | 26 June 2012: NCSR “Demokritos” Tools, B. Gatos
  • 13. Image enhancement tools: Border Removal Tool (download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/ ). (SET-A) 38718 randomly selected historical images with and without noisy black borders. (SET-B) 22383 images with noisy black borders, a subset of SET-A. (SET-C) 3467 double-page historical images.
  • 14. Image enhancement tools: Page Curl Correction Tool (download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/ ). It rectifies document images which suffer from warping and perspective distortions that deteriorate the performance of OCR.
  • 15. Image enhancement tools: Page Curl Correction Tool (download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/ ). 420 randomly selected historical document images.
  • 16. OCR evaluation toolkit (download: http://www.iit.demokritos.gr/~bgat/OCREval/ ). Character Accuracy; Word Accuracy; Rejection Rate; Characters Marked; Accuracy after Correction; Accuracy by Character Class; Figure of Merit. Based on the isri-ocr-evaluation-tools (Information Science Research Institute, University of Nevada; source code from http://code.google.com/p/isri-ocr-evaluation-tools/ ; code license: Apache License 2.0). Command line: OCREval.exe [a,u8,u16] stop_words.txt (or “-”) gt.txt (or batchgt.txt) ocr.txt (or batch.txt) character_report.txt word_report.txt overall_report.xml
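To make the first two measures concrete, here is a minimal Python sketch of character and word accuracy computed from an edit distance between ground truth and OCR output. It is an illustration only, not the OCREval or ISRI implementation: the function names are hypothetical, and defining accuracy as (n - errors) / n over the ground-truth length n is a simplifying assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """(n - errors) / n, where n is the length of the ground truth."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


def character_accuracy(gt_text, ocr_text):
    # Compare the two texts character by character.
    return accuracy(gt_text, ocr_text)


def word_accuracy(gt_text, ocr_text):
    # Assumption: words are whitespace-separated tokens.
    return accuracy(gt_text.split(), ocr_text.split())


if __name__ == "__main__":
    gt = "the quick brown fox"
    ocr = "the qu1ck brown f0x"
    print(character_accuracy(gt, ocr), word_accuracy(gt, ocr))
```

With this definition the accuracy can become negative when the OCR output contains many spurious insertions, which is one reason real toolkits report several complementary measures.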
  • 17. Word Spotting Application (download demo: http://www.iit.demokritos.gr/~bgat/WordSpot/ ). It provides an integrated GUI for indexing historical documents without an OCR engine. It allows searching the database for instances of a query keyword using several different methods (list, query by example, free text).
  • 18. Word Spotting Application (download demo: http://www.iit.demokritos.gr/~bgat/WordSpot/ ). Benchmarking set: French book, BnF, year of publication 1838 (153 pages); German book, BSB, year of publication 1788 (126 pages).
  • 19. Scientific impact. Part of the PhD thesis of N. Stamatopolous, who was awarded the "Best PhD Thesis of 2011 Award in the area of Informatics and Telematics" in Greece. Publications in a great number of widely recognized journals and international conferences (impact factors shown: 2.918, 1.091, 1.474). IMPACT-related publications have 127 citations from non-IMPACT researchers (information calculated from Scopus).
  • 21. IMPACT Project Outcomes, 26 June 2012, The Hague. IMPACT Image and Ground Truth Repository: Content Management System; datasets (667,437 images; 52,880 PAGE ground truth files); invaluable resource for planning, experimentation and evaluation relating to digitisation projects; access to existing material via CoC; further development and usage in other projects and domains (Europeana Newspapers).
  • 22. Aletheia: semi-automated tool for ground truth and showcase production; full PAGE support; page border, print space; layout regions (incl. metadata); text lines, words and glyphs; Unicode text at all levels; reading order, layers etc.; quality control.
  • 23. Layout Evaluation Tool: performance evaluation of page segmentation and region classification; configuration of scenarios based on different types of errors (miss / partial miss, split, misclassification, merge, false detection).
  • 24. Text Line and Word Segmentation: modules for integration in workflows and/or other tools; independent of OCR; used in Aletheia, dewarping, T-OCR and word spotting.
  • 25. Dewarping Tool: correction of arbitrary warping artefacts; two-stage process (model detection, image transformation); improved legibility; increased OCR accuracy.
  • 26. More information: PRImA http://www.primaresearch.org ; IMPACT CoC http://www.digitisation.eu
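The error types named on the Layout Evaluation Tool slide can be illustrated with a small Python sketch. It is not the tool's algorithm: regions are simplified to axis-aligned rectangles, detected regions are assumed not to overlap each other, and the `Region` class and `evaluate` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    x0: int
    y0: int
    x1: int
    y1: int
    kind: str  # e.g. "text" or "image"

    def area(self):
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)


def overlap(a, b):
    """Area of the intersection of two rectangles (0 if they are disjoint)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0


def evaluate(ground_truth, detected):
    """Return a list of (error type, region index) pairs."""
    errors = []
    for i, g in enumerate(ground_truth):
        hits = [d for d in detected if overlap(g, d) > 0]
        if not hits:
            errors.append(("miss", i))
            continue
        # Assumes detected regions do not overlap each other.
        if sum(overlap(g, d) for d in hits) < g.area():
            errors.append(("partial miss", i))
        if len(hits) > 1:
            errors.append(("split", i))
        if any(d.kind != g.kind for d in hits):
            errors.append(("misclassification", i))
    for j, d in enumerate(detected):
        hits = [g for g in ground_truth if overlap(g, d) > 0]
        if not hits:
            errors.append(("false detection", j))
        elif len(hits) > 1:
            errors.append(("merge", j))
    return errors
```

An evaluation scenario in the sense of the slide could then weight these error types differently, for example penalising merges more heavily than splits when the output feeds an OCR engine; the weights themselves would be a further assumption.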
  • 28. IBM Labs in Haifa: IBM's Highlights. CONCERT (COoperative eNgine for Correction of ExtRacted Text): IBM's collaborative correction platform, announced in August 2010 ( http://www-03.ibm.com/press/us/en/pressrelease/32380.wss ); IBM's Adaptive OCR integrated; EE2/EE3 dictionaries integrated (Dutch, German, Spanish). Sophisticated monitoring mechanism integrated 2011: designed for both malicious users and game users. CONCERT Games introduction: IMPACT Conference 2011; beyond the productivity tools. Adaptive OCR System: font and language independent.
  • 29. CONCERT. Adaptive collaborative correction platform: uses the feedback from the users to improve productivity; fully connected to the Adaptive OCR Engine. Strong emphasis on productivity tools: reduce the time for verification/correction; patented smart-key approach; motivate volunteers. Separating the data entry process into several complementary tasks: optimized application dedicated to each task. Break down the tasks into subtasks: make it suitable for parallel processing; online compilation. User monitoring: robust against malicious users and designed for gamification.
  • 30. Adaptive OCR Engine. Consistent and reliable confidence level: important for quality assurance. No use of prior knowledge on the font: crazy fonts can be handled. Good use of the feedback from the users: character and word level. Robust to distortion: page-level distortion and printing variations. Easy to migrate between books from the same publisher: continuous update. Not too slow: around 2-3 times slower than OMNI engines.
  • 31. What is next? Search over OCR: beyond transcription. Improve user feedback: online advisor; best performers list. Community building around content: integrate community tools within the platform. Complete CONCERT Games: iPhone/iPad/Android/Desktop. E-book creation: fully digital transcription; using original font as option. Page distortion correction: fully integrate the word-based page distortion correction.
  • 33. The Functional Extension Parser (FEP): A Document Understanding Platform. Günter Mühlberger, University of Innsbruck, Department for German Language and Literature Studies
  • 34. Functional Extension Parser. A book is more than just pure text: it contains a lot of structural metadata; these metadata are (often) encoded in the layout of a document; size of characters, position on page, distance to other lines, etc. are used to express structural meaning such as headlines, footnotes, caption lines, etc. FEP is a platform to process digitised or born-digital documents and to “understand” the meaning of the layout by using a rules engine: modular approach; images and OCR as input; web GUI for visualisation, editing and correction; evaluation module; export module (METS/ALTO, PDF, ePUB, ...). Several pilots: EOD Network (enhancement of PDF eBooks of historical books); DNB (extraction of metadata from title pages); internal project (card index). FEP is available for free for research applications; commercial deployment is foreseen via the technology transfer platform of the University of Innsbruck.
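The idea of deriving structural meaning from layout with a rules engine can be sketched in a few lines of Python. This is a toy illustration, not FEP's rule language or feature set: the `Line` fields, the thresholds and the rule list are all assumptions chosen for the example.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float  # in points
    y: float          # vertical position: 0.0 = top of page, 1.0 = bottom


# Ordered rules: the first rule whose predicate holds assigns the label.
# Each predicate receives the line and the page's typical (median) font size.
RULES = [
    ("headline", lambda ln, body: ln.font_size >= 1.3 * body and ln.y < 0.5),
    ("footnote", lambda ln, body: ln.font_size <= 0.85 * body and ln.y > 0.8),
    ("page number", lambda ln, body: ln.text.strip().isdigit()
                                     and (ln.y < 0.05 or ln.y > 0.95)),
]


def classify(lines):
    """Label every line of one page; unmatched lines become 'paragraph'."""
    body = statistics.median([ln.font_size for ln in lines])
    labels = []
    for ln in lines:
        label = next((name for name, pred in RULES if pred(ln, body)), "paragraph")
        labels.append(label)
    return labels
```

Keeping the rules as data rather than as hard-coded branches mirrors the modular approach described on the slide: a new document type would only need a different rule list.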
  • 36. Postcorrection Tool. Ludwig-Maximilians-Universität München, Centrum für Informations- und Sprachverarbeitung. The Hague, June 26th 2012. Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz, Thorsten Vobl
  • 37. Postcorrection Tool as Carrier of IMPACT Technology: advanced language technology used for improvement of interactive postcorrection; lexica, matching tool and profiler integrated as background technology; document-centric knowledge from unsupervised analysis of the OCRed document used for detection of error classes and suggested corrections; batch mode for correction of many errors in “one shot”; rich graphical user interface to let users fully benefit from “knowledge” on document-derived error classes.
  • 38. Improved Lexicon Technology: valid historical words not marked as errors; historical variants proposed as correction candidates.
  • 39. Improved Workflow - Batch Processing: strings with identical error patterns corrected as batch.
  • 40. Improved Selection of Correction Candidates: ranking of candidates through document-specific language and error profile.
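Batch correction by error pattern can be pictured with the following Python sketch. It assumes that a list of (OCR token, suggested correction) pairs is already available and defines a pattern naively as what remains after stripping the common prefix and suffix; both function names are hypothetical and this is not the Postcorrection Tool's actual profiler.

```python
from collections import defaultdict


def error_pattern(ocr, correct):
    """Return (ocr_part, correct_part) after removing common prefix and suffix."""
    limit = min(len(ocr), len(correct))
    p = 0
    while p < limit and ocr[p] == correct[p]:
        p += 1
    s = 0
    while s < limit - p and ocr[-1 - s] == correct[-1 - s]:
        s += 1
    return ocr[p:len(ocr) - s], correct[p:len(correct) - s]


def group_by_pattern(pairs):
    """Group (ocr, correction) pairs that share one error pattern."""
    groups = defaultdict(list)
    for ocr, correct in pairs:
        if ocr != correct:  # tokens without an error need no correction
            groups[error_pattern(ocr, correct)].append((ocr, correct))
    return dict(groups)
```

A user interface built on such groups could show every token affected by, say, the pattern ("rn", "m") and let the user accept or reject all of them in one step.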
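One way to picture ranking by a document-specific language and error profile is the Python sketch below. The scoring formula (sum of smoothed log relative frequencies) is an assumption made for illustration, not the method used in the tool, and `rank_candidates` and its inputs are hypothetical.

```python
import math


def rank_candidates(candidates, word_counts, pattern_counts):
    """Order correction candidates from most to least plausible.

    candidates: list of (word, patterns) where patterns is a tuple of
        (ocr_part, correct_part) error patterns that explain the OCR token.
    word_counts: how often each word occurs in this document (language profile).
    pattern_counts: how often each error pattern occurs in it (error profile).
    """
    total_words = sum(word_counts.values())
    total_patterns = sum(pattern_counts.values())

    def score(candidate):
        word, patterns = candidate
        # Add-one smoothing keeps unseen words and patterns from scoring -inf.
        s = math.log((word_counts.get(word, 0) + 1) / (total_words + 1))
        for p in patterns:
            s += math.log((pattern_counts.get(p, 0) + 1) / (total_patterns + 1))
        return s

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    # Candidates for the OCR token "tbe" (illustrative data).
    cands = [("tube", (("", "u"),)), ("the", (("b", "h"),)), ("toe", (("b", "o"),))]
    print(rank_candidates(cands, {"the": 50, "tube": 1}, {("b", "h"): 20, ("", "u"): 1}))
```

Because both profiles are counted on the document being corrected, a candidate that is frequent in this book and reachable through an error this book often shows is preferred over one that is merely close in spelling.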
  • 42. Lexica and Strategies and Tools for Lexicon Building & Lexicon Deployment in OCR and Retrieval. Historical Language. Available through the IMPACT Centre of Competence: www.digitisation.eu
  • 43. Lexica for 9 languages. Example: · Dutch IR lexicon (will be available as web service): WNT (Dictionary of the Dutch language), 475,498 distinct word forms, 215,180 lemmata, and 558,438 distinct lemma/word form combinations, with 1,636,709 attestations · Dutch OCR lexicon (1): based on a large historical corpus from the DBNL (Digitale Bibliotheek voor de Nederlandse Letteren, Digital Library for Dutch Literature) · Dutch OCR lexicon (2): based on the IR lexicon (WNT) [best results in CCS experiment] · Dutch Named Entity lexicon (persons, locations, organizations) · Set of historical spelling variation rules: using a reference set of 10,000 pairs of historical words taken from the WNT-based lexicon, to which parallel modern words have been added
  • 44. Lexica in OCR: Results for 9 languages
  • 45. Lexica in OCR: External Dictionary interface · Function of the tool: enable addition of (IMPACT) special lexica to the FineReader engine · Availability: free for non-commercial use
  • 46. Lexica in OCR: OCR Evaluation Tool · Function of the tool: evaluate influence of  lexica on OCR, measure OCR (word)  accuracy · Availability: free for non‐commercial use,  Impact Interoperability Framework
  • 47. Lexica in Retrieval: BlackLab · Function of the tool: enable use of lexica and linguistic annotation in a Lucene-based search engine · Deployment of IMPACT IR lexica in query expansion · Keyword-in-context display of search results · Grouping and sorting of search results · Support for linguistic annotation (lemma, part of speech, etc.) · Exploitation of named entity tagging · Incremental indexing · Supports CLARIN standards (CQL) · Availability: free for non-commercial use
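Query expansion with an IR lexicon and a keyword-in-context display can be sketched as follows. This is plain Python over a token list, not BlackLab's or Lucene's API; the lexicon entry and the historical Dutch word forms are illustrative assumptions.

```python
def expand_query(lemma, lexicon):
    """Return the lemma plus all historical word forms the lexicon lists for it."""
    return {lemma} | set(lexicon.get(lemma, ()))


def kwic(tokens, terms, width=3):
    """Keyword-in-context hits as (left context, keyword, right context)."""
    wanted = {t.lower() for t in terms}
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() in wanted:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # Hypothetical lexicon entry: modern lemma -> attested historical forms.
    lexicon = {"wereld": ["werelt", "weereld"]}
    text = "de gantsche werelt is vol van wonderen".split()
    for hit in kwic(text, expand_query("wereld", lexicon), width=2):
        print(hit)
```

A search for the modern lemma then also finds the historical spellings, which is the effect the slide attributes to deploying the IR lexica in query expansion.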
  • 48. Toolbox for lexicon building and deployment · Function of the tools: support lexicon  building for historical language · Corpus-based lexicon building · Lexicon-building from dictionary quotations · Spelling variation: pattern extraction and matching · Lemmatization · Availability: free for non‐commercial use
  • 49. CoBaLT corpus-based lexicon building tool · Function of the tool: productivity tool  for  IR lexicon building from corpora · Both for pre-lemmatized and unannotated corpora · Web-based · Availability: free for non‐commercial use
  • 51. Dictionary attestation tool · Function of the tool: productivity tool for IR lexicon building from dictionary quotations · Manual evaluation and correction of large quantities of automatically matched occurrences (matching scripts available) of a headword in the quotations of its article in a comprehensive dictionary · High productivity: revision speed (OED) 400-600 entries/hour · Web-based · Availability: free for non‐commercial use
  • 53. Spelling variation and lemmatization tools · Function of the tools: · Match (inflected) historical spellings of words to modern dictionary headword forms · Derive spelling variation patterns from example material · Availability: free for non‐commercial use
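Matching historical spellings to modern forms with rewrite patterns can be sketched as follows. This is an illustrative approach, not the IMPACT tools' algorithm: the rule format (historical substring, modern substring), the two sample rules, the step limit, and the function name are assumptions.

```python
def modern_candidates(historical, rules, modern_lexicon, max_steps=2):
    """Apply up to `max_steps` substring rewrites and keep forms found in the lexicon.

    `rules` is a list of (historical_substring, modern_substring) pairs with
    non-empty historical substrings.
    """
    seen = {historical}
    frontier = {historical}
    for _ in range(max_steps):
        next_frontier = set()
        for form in frontier:
            for old, new in rules:
                start = form.find(old)
                while start != -1:
                    # Rewrite one occurrence at a time so partial rewrites are kept.
                    rewritten = form[:start] + new + form[start + len(old):]
                    if rewritten not in seen:
                        seen.add(rewritten)
                        next_frontier.add(rewritten)
                    start = form.find(old, start + 1)
        frontier = next_frontier
    return sorted(seen & set(modern_lexicon))


# Hypothetical patterns of the kind that could be derived from example pairs.
rules = [("sch", "s"), ("ae", "aa")]
modern_lexicon = {"mens", "jaar", "huis"}
matches = modern_candidates("mensch", rules, modern_lexicon)
```

Checking the rewritten forms against a modern lexicon filters out rewrites that do not yield a real word.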
  • 54. NERT named entity recognition tool · Function of the tool: mark up named entities in text · Extension of the Stanford NE recognizer with support for spelling variation and OCR errors · Availability: free for non‐commercial use
  • 55. Named entity matching tool · Function of the tool: match named entities  with historical spelling to standard form · Availability: free for non‐commercial use
  • 56. Named entity attestation tool · Function of the tool: manual evaluation  and correction of automatically matched  occurrences of Named Entities in text  material · Web‐based · Availability: free for non‐commercial use
  • 59. Background / Use case: A diverse set of tools relevant to OCR & digitisation is being developed by IMPACT and by others. But: a mix of prototypes, robust commercial solutions, and 3rd party tools; a variety of programming languages, environments, backgrounds. Solution: wrapping of tools as web services allows: - quick & easy integration → uniform access - flexible, loose coupling of separate modules - platform-independent execution via the web → IMPACT Interoperability Framework (IIF)
  • 60. Overview: Service Oriented Architecture. IMPACT Interoperability Framework technologies: Java, REST/SOAP, Apache Maven, Apache Tomcat, Apache Axis2, Apache Synapse, Taverna Engine. Interoperability through open technology standards (OASIS/W3C)
  • 61. Workflow example • OCR task = processing pipeline • Building blocks = individual tools • Workflow = interaction between tools (mashups) • Collaboration with
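The view of an OCR task as a pipeline of interchangeable building blocks can be sketched generically. In the IIF the blocks are web services orchestrated by a workflow engine; the plain Python functions below are hypothetical stand-ins, and none of the names come from the framework itself.

```python
from functools import reduce


def run_workflow(steps, document):
    """Feed the document through each tool in order; each tool returns an updated copy."""
    return reduce(lambda doc, step: step(doc), steps, document)


# Hypothetical stand-ins for wrapped tools (real ones would be service calls).
def binarise(doc):
    return {**doc, "binarised": True}


def recognise(doc):
    return {**doc, "text": "hnis"}  # pretend OCR output with one error


def post_correct(doc):
    return {**doc, "text": doc["text"].replace("hnis", "huis")}


page = run_workflow([binarise, recognise, post_correct], {"image": "page-001.tif"})
```

Because every step has the same shape (document in, document out), a tool can be swapped or reordered without touching the others, which is the loose coupling the framework aims for.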
  • 62. Availability & Takeup. Availability: IIF components released as open source (Apache 2.0 License): https://github.com/impactcentre/interoperability-framework · Community workflow resources (CC License): http://www.myexperiment.org/groups/235 · Takeup: SCAPE – Scalable Preservation Environments; DDMAL – Digital Music Archives & Libraries Laboratory Montreal