IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Structural analysis of documents
Functional Extension Parser (FEP)

Günter Mühlberger
University Innsbruck Library (UIBK)




Agenda
      Introduction
      Features
        – What do we recognise with the structural analysis?
      Benefits
        – Why is structural analysis useful?
      Architecture
        – How does it work?
      Results
        – How good are we?
      Roadmap
        – When will it come into being?
      Business
        – Which offers will be available?




Introduction
      Document understanding platform
      Try to enhance and exploit the logical structure of documents for
        – Display
        – Navigation
        – Retrieval
      Enhance OCR output with structural metadata
        – Fully automated processing
        – Interactive correction




IMPACT EVA/MINERVA 12th Nov. 2008




Features
      General
        – We can recognise any structural element that has a layout
          representation, e.g. region, size, typeface, distance to other elements
        – Focus in IMPACT: basic features that are typical for all documents
        – Rules set can be extended or specialised for other datasets
                      E.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        – The better the OCR, the better our structural analysis
      Basic features for books
        –     Page numbers
        –     Running titles (headers)
        –     Print space
        –     Footnotes
        –     Signature marks
        –     Headings (within the running text)
        –     Table of contents entries (additional to headings)
        –     Front/Body/Back
        –     Paragraphs




      Print space
      Headings
      Footnotes








      Running title (header)
      Page number
      Signature mark








      Table of contents
        – Entries are linked with the headings in
          the running text and with the page
          numbers
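The linking of table of contents entries to headings could be sketched roughly as follows. This is only an illustration under assumptions: the `TocEntry` and `Heading` records, the `link_entry` function and the similarity threshold of 0.6 are hypothetical and not taken from FEP. Fuzzy string comparison is used because OCR errors mean that a ToC entry and its heading rarely match exactly.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Heading:
    text: str
    printed_page: int  # page number as printed in the book
    image_index: int   # position of the scanned image in the volume


@dataclass
class TocEntry:
    title: str
    printed_page: int  # page number given in the table of contents


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; tolerant of single OCR errors."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entry(entry: TocEntry, headings: List[Heading],
               min_similarity: float = 0.6) -> Optional[Heading]:
    """Return the heading a ToC entry most likely points to, or None."""
    # Prefer headings on the page the ToC entry names; otherwise try all.
    same_page = [h for h in headings if h.printed_page == entry.printed_page]
    candidates = same_page or headings
    best = max(candidates, key=lambda h: similarity(entry.title, h.text),
               default=None)
    if best is not None and similarity(entry.title, best.text) >= min_similarity:
        return best
    return None


headings = [
    Heading("Introduction", 1, 9),
    Heading("Of the Origin of Language", 17, 25),
]
# The ToC entry contains an OCR error ("Languagc") but still links correctly.
target = link_entry(TocEntry("Of the Origin of Languagc", 17), headings)
print(target.image_index if target else "no link")
```

Returning `None` instead of a weak match corresponds to a ToC entry without a link, which can then be left for interactive correction.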








Benefits (1)
      Display
        – A correct print space allows page images to be displayed centred
          (no jumping between pages)
      Search & retrieval
        – Scoring of results
                      Could take into account structural data (headings, footnotes)
        – Noise reduction
                      Front, body, back are separated, text from the front is often misleading
                      Running titles repeat the same words
                      Footnotes can be included or excluded
        – Faceted search
                      Results can be displayed for running text, footnotes, headings
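As a minimal sketch of how structural labels could feed noise reduction and facets in a search index: the label names, the `NOISE_TYPES` set and the `build_facets` function below are assumptions made for illustration, not part of FEP's output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical labels for elements that repeat on every page and add noise.
NOISE_TYPES = {"running_title", "page_number", "signature_mark"}


def build_facets(lines: Iterable[Tuple[str, str]],
                 include_footnotes: bool = True) -> Dict[str, List[str]]:
    """Group (label, text) pairs by label, dropping noise elements."""
    facets: Dict[str, List[str]] = defaultdict(list)
    for label, text in lines:
        if label in NOISE_TYPES:
            continue  # repeated words would distort the scoring of results
        if label == "footnote" and not include_footnotes:
            continue  # footnotes can be included or excluded
        facets[label].append(text)
    return dict(facets)


page = [
    ("running_title", "HISTORY OF TYROL"),
    ("heading", "Chapter II"),
    ("running_text", "In the year 1809 the valley was occupied."),
    ("footnote", "1) See vol. I, p. 12."),
    ("page_number", "17"),
]
print(build_facets(page, include_footnotes=False))
```

Each key of the returned dictionary can then be indexed as a separate field, so that a query can be restricted to running text, footnotes or headings.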







Benefits (2)
      Navigation
        – Page numbers allow usage of original table of contents
        – Original table of contents can be linked with headings/page numbers in
          the book
      Document editing
        –     Further markup (e.g. TEI) is supported
        –     Manual preparation for Print-on-Demand is eased (print space)
        –     Selective OCR correction can be applied,
              e.g. only headings, running text or footnotes could be fed to CONCERT
      Document matching
        – Contributions or footnotes can be matched with existing bibliographical
          databases






      Improved display in the
      Internet and PDF








      Refinement of full-text
      search
      Facets for e.g.
        – Running text
        – Footnotes
        – Headings
      Less noise
        – Running titles,
          signature marks
          excluded from search








      Clickable table of contents
      entries
        – Google style
      Selective OCR correction
        – Correct only ToC,
          headings, footnotes, etc.








      Matching of documents
      with external sources
        – Match footnotes with
          library catalogues
          (bibliographies)
        – Match table of contents
          entries and headings with
          bibliographies








      Improved editing
        – Alternating print spaces
          for Print on Demand
        – Further processing for
          TEI editions etc.








Architecture
      Input
        – Results from OCR processing on word level (coordinates)
        – E.g. ALTO file, ABBYY XML file or Google HTML
      Output
        – Structural annotations for recognized text features, e.g. page numbers,
          running titles, headings, etc.
        – E.g. XML, ALTO, METS, TEI, etc.
      General workflow
        –     OCR result files are parsed (FEP general XML format)
        –     Rules set is applied to the dataset (rules are managed by rules engine)
        –     Results are stored in a database
        –     Export on various levels is provided
      Optional
        – Online or offline correction (GUI)
        – Adaptation of rules set
        – Quality assurance on basis of ground truth
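The first workflow step, reading word-level coordinates from an OCR result file, might look like this for ALTO input. The sketch relies on the `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes from the ALTO schema; the `Word` record and `parse_alto_words` are hypothetical names, and FEP's own general XML format is not shown in the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: float
    y: float
    width: float
    height: float


def local_name(tag: str) -> str:
    """Strip the XML namespace so any ALTO schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def parse_alto_words(xml_text: str) -> List[Word]:
    """Collect every ALTO String element as a word with its coordinates."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append(Word(
                text=element.get("CONTENT", ""),
                x=float(element.get("HPOS", 0)),
                y=float(element.get("VPOS", 0)),
                width=float(element.get("WIDTH", 0)),
                height=float(element.get("HEIGHT", 0)),
            ))
    return words


# Minimal hand-written ALTO fragment, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="17" HPOS="512" VPOS="80" WIDTH="40" HEIGHT="28"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

for word in parse_alto_words(SAMPLE):
    print(word)
```

Converting every input format into one such record type is what allows a single rules set to run on ALTO, ABBYY XML or HTML input alike.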





The FEP Core
      Based on Jess, an expert-system-like rule engine for Java
      Both manually crafted rules and rules obtained by machine learning
      Uses fuzzy logic to deal with uncertainty

Typical rules:
      IF there is a numeral in the first line of the page AND this numeral is centred
      THEN this numeral may be the page number.

      IF there is a numeral in the first line of the page AND this numeral is at the
      right-hand side of the page AND this numeral is an odd number
      THEN this numeral may be the page number.

      IF there is a numeral in the first line of the page AND this numeral is at the
      left-hand side of the page AND this numeral is an even number
      THEN this numeral may be the page number.
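The three rules above can be approximated in a few lines of Python. This is not Jess syntax and not FEP's actual rule base: the `Token` record, the position tolerance and the confidence values 0.8 and 0.9 are invented for illustration. The "may be" in each rule is modelled as a confidence, and the rules are combined with a fuzzy OR (maximum).

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x_center: float  # horizontal centre of the token, same unit as page width
    line_index: int  # 0 = first line of the page


def page_number_score(token: Token, page_width: float,
                      tolerance: float = 0.15) -> float:
    """Confidence in [0, 1] that the token is the page number."""
    # All three rules require an Arabic numeral in the first line of the page.
    is_numeral = token.text.isascii() and token.text.isdigit()
    if token.line_index != 0 or not is_numeral:
        return 0.0
    value = int(token.text)
    relative_x = token.x_center / page_width
    scores = [0.0]
    if abs(relative_x - 0.5) <= tolerance:
        scores.append(0.8)  # rule 1: centred numeral
    if relative_x >= 1 - 2 * tolerance and value % 2 == 1:
        scores.append(0.9)  # rule 2: right-hand side and odd
    if relative_x <= 2 * tolerance and value % 2 == 0:
        scores.append(0.9)  # rule 3: left-hand side and even
    return max(scores)      # fuzzy OR over all rules that fired


print(page_number_score(Token("17", 900, 0), page_width=1000))
print(page_number_score(Token("18", 900, 0), page_width=1000))
```

An even number on the right-hand side fires no rule and scores 0.0, which shows how the odd/even constraint filters out other numerals in the first line.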





Results
      Basic rules set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, Precision, F-Measure
        – E.g. a book has 10 heading lines. We find 12 lines; 8 of them are
          correct, 4 are false.
        – Recall                = 8 of 10                   = 0.80
        – Precision             = 8 of 12                   = 0.67
        – F-Measure             = 2*0.80*0.67/(0.80+0.67)   = 0.73
      More explanations
        – Important: We are counting lines, not structural items!
                      E.g. if a heading consists of two lines (often in different type sizes), we have
                      to find both to succeed
        – Differences between training and evaluation sets are marginal
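The line-based measures can be reproduced with a short function. The function and parameter names are chosen for this example only; the counts are those of the heading example (8 correct lines, 4 false lines, 2 missed lines), and intermediate values are not rounded.

```python
from typing import Tuple


def precision_recall_f1(true_positives: int, false_positives: int,
                        false_negatives: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for line-level counts."""
    found = true_positives + false_positives      # lines the system labelled
    relevant = true_positives + false_negatives   # lines in the ground truth
    precision = true_positives / found if found else 0.0
    recall = true_positives / relevant if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# 10 heading lines in the book, 12 found, 8 of them correct.
p, r, f = precision_recall_f1(true_positives=8, false_positives=4,
                              false_negatives=2)
print(round(r, 2), round(p, 2), round(f, 2))  # 0.8 0.67 0.73
```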





  Results on Evaluation Set

                        Recall      Precision   F-measure
Running text            0.99        0.98        0.98
Footnotes               0.83        0.89        0.86
Page numbers            0.97        1.00        0.98
Running titles          0.97        1.00        0.98
Headings                0.85        0.80        0.82
Signature marks         0.68        0.89        0.77




Roadmap
      Summer 2011: Beta version
        – Integration into IMPACT Interoperability Platform
        – Basic rules set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full-featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation








Business offers
      Web-service for processing single volumes and correction
        – Will be integrated into eBooks-on-Demand EOD Network
        – 30 libraries are already uploading their images to the OCR server in
          Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of rules set
        – For specific datasets much more can be detected than just the basic
          features
        – E.g. journals with a fixed structure over many years or parliamentary
          papers, dissertations, research papers, etc.
      Onsite installations
        – Not our focus, but could be done for very large datasets or due to legal
          requirements (e.g. Google images)





Results: TOC

      25 TOC entries in total
      22 TOC entries are completely correct
      1 TOC entry was missed
      2 TOC entries are grouped incorrectly
      1 TOC entry has no link
      1 TOC entry has a wrong link








             Thank you for your attention!




 
La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...
La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...
La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...
 
Data privacy in library authority files: a survey
Data privacy in library authority files: a surveyData privacy in library authority files: a survey
Data privacy in library authority files: a survey
 
Perfil de RDA de la BNE. Resumen de cambios
Perfil de RDA de la BNE. Resumen de cambiosPerfil de RDA de la BNE. Resumen de cambios
Perfil de RDA de la BNE. Resumen de cambios
 
RDA. Autoridades. Fundamentos. Identificación de entidades. Relaciones
RDA. Autoridades. Fundamentos. Identificación de entidades. RelacionesRDA. Autoridades. Fundamentos. Identificación de entidades. Relaciones
RDA. Autoridades. Fundamentos. Identificación de entidades. Relaciones
 
RDA: el nuevo texto
RDA: el nuevo textoRDA: el nuevo texto
RDA: el nuevo texto
 
Pleno del Real Patronato. Biblioteca Nacional de España
Pleno del Real Patronato. Biblioteca Nacional de EspañaPleno del Real Patronato. Biblioteca Nacional de España
Pleno del Real Patronato. Biblioteca Nacional de España
 
Objetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de España
Objetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de EspañaObjetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de España
Objetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de España
 
Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...
Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...
Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...
 
Evaluación actuaciones 2018. Planificación actuaciones 2019
Evaluación actuaciones 2018. Planificación actuaciones 2019Evaluación actuaciones 2018. Planificación actuaciones 2019
Evaluación actuaciones 2018. Planificación actuaciones 2019
 
Dirección Técnica. Objetivos 2019
Dirección Técnica. Objetivos 2019Dirección Técnica. Objetivos 2019
Dirección Técnica. Objetivos 2019
 
Evaluación 2018. Objetivos 2019
Evaluación 2018. Objetivos 2019Evaluación 2018. Objetivos 2019
Evaluación 2018. Objetivos 2019
 
Evaluación actuaciones 2018. Dirección Cultural
Evaluación actuaciones 2018. Dirección CulturalEvaluación actuaciones 2018. Dirección Cultural
Evaluación actuaciones 2018. Dirección Cultural
 
Pleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos Aramburo
Pleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos AramburoPleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos Aramburo
Pleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos Aramburo
 
Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...
Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...
Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...
 
VIAF GDPR
VIAF GDPRVIAF GDPR
VIAF GDPR
 
Renacer prensa historica
Renacer prensa historicaRenacer prensa historica
Renacer prensa historica
 
RDA y Linked data (Ricardo Santos Muñoz)
RDA y Linked data (Ricardo Santos Muñoz)RDA y Linked data (Ricardo Santos Muñoz)
RDA y Linked data (Ricardo Santos Muñoz)
 
Desarrollo actual de RDA (Pilar Tejero López)
Desarrollo actual de RDA (Pilar Tejero López)Desarrollo actual de RDA (Pilar Tejero López)
Desarrollo actual de RDA (Pilar Tejero López)
 

Último

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Último (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

  • 1. Structural analysis of documents: Functional Extension Parser (FEP)
      Günter Mühlberger, University Innsbruck Library (UIBK)
      IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Agenda
      Introduction
      Features
        – What do we recognise with the structural analysis?
      Benefits
        – Why is structural analysis useful?
      Architecture
        – How does it work?
      Results
        – How good are we?
      Roadmap
        – When will it come into being?
      Business
        – Which offers will be available?
  • 3. Introduction
      Document understanding platform
      Try to enhance and exploit the logical structure of documents for
        – Display
        – Navigation
        – Retrieval
      Enhance OCR output with structural metadata
        – Fully automated processing
        – Interactive correction
      IMPACT EVA/MINERVA 12th Nov. 2008
  • 4. Features
      General
        – We can recognise all structural elements that have some layout representation, e.g. region, size, typeface, distance to other elements, etc.
        – Focus in IMPACT: basic features which are typical for all documents
        – The rules set can be extended or adapted to other datasets, e.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        – The better the OCR, the better our structural analysis
      Basic features for books
        – Page numbers
        – Running titles (headers)
        – Print space
        – Footnotes
        – Signature marks
        – Headings (within the running text)
        – Table of contents entries (additional to headings)
        – Front/Body/Back
        – Paragraphs
  • 5. Print space, headings, footnotes
  • 6. Running title (header), page number, signature mark
  • 7. Table of contents
        – Linked with headings in the running text or with page numbers
  • 8. Benefits (1)
      Display
        – A correct print space allows images to be displayed centred (no flipping between pages)
      Search & retrieval
        – Scoring of results: could take structural data into account (headings, footnotes)
        – Noise reduction: front, body and back are separated (text from the front is often misleading); running titles repeat the same words; footnotes can be included or excluded
        – Faceted search: results can be displayed for running text, footnotes, headings
  • 9. Benefits (2)
      Navigation
        – Page numbers allow use of the original table of contents
        – The original table of contents can be linked with headings/page numbers in the book
      Document editing
        – Further markup (e.g. TEI) is supported
        – Manual preparation for Print-on-Demand is eased (print space)
        – Selective OCR correction can be applied, e.g. only headings, running text or footnotes could be fed to CONCERT
      Document matching
        – Contributions or footnotes can be matched with existing bibliographical databases
  • 10. Improved display on the Internet and in PDF
  • 11. Refinement of full-text search
      Facets for e.g.
        – Running text
        – Footnotes
        – Headings
      Less noise
        – Running titles, signature marks excluded from search
  • 12. Clickable table of contents entries (Google style). Selective OCR correction: correct only ToC, headings, footnotes, etc.
  • 13. Matching of documents with external sources
        – Match footnotes with library catalogues (bibliographies)
      Clickable table of contents
        – Match table of contents entries and headings with bibliographies
  • 14. Improved editing
        – Alternating print spaces for Print on Demand
        – Further processing for TEI editions etc.
  • 15. Architecture
      Input
        – Results from OCR processing on word level (coordinates)
        – E.g. ALTO file, ABBYY XML file or Google HTML
      Output
        – Structural annotations for recognised text features, e.g. page numbers, running titles, headings, etc.
        – E.g. XML, ALTO, METS, TEI, etc.
      General workflow
        – OCR result files are parsed (FEP general XML format)
        – The rules set is applied to the dataset (rules are managed by a rules engine)
        – Results are stored in a database
        – Export on various levels is provided
      Optional
        – Online or offline correction (GUI)
        – Adaptation of the rules set
        – Quality assurance on the basis of ground truth
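To make the input side of the architecture concrete, here is a minimal sketch of reading word-level OCR results (text plus coordinates) from an ALTO file. It is not the FEP parser: the function name `read_words`, the dictionary layout and the sample snippet are assumptions made for illustration. Only the ALTO `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes come from the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment used only for this illustration.
ALTO_SNIPPET = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Chapter" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="20"/>
    <String CONTENT="17" HPOS="400" VPOS="50" WIDTH="20" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def read_words(alto_xml):
    """Return one dict per OCR word with its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare the local tag name so that any ALTO namespace version matches.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
            "width": float(element.get("WIDTH", 0)),
            "height": float(element.get("HEIGHT", 0)),
        })
    return words


words = read_words(ALTO_SNIPPET)
```

A word list of this kind is the sort of data a rules set could then be applied to, since the rules refer to position, size and content of text elements.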
  • 17. The FEP Core
        – Based on an expert-system-like rule engine for Java (Jess)
        – Both manually crafted rules and rules obtained by machine learning
        – Uses fuzzy logic to deal with uncertainty
      Typical rules:
        IF there is a numeral in the first line of the page
        AND this numeral is centred
        THEN this numeral may be the page number

        IF there is a numeral in the first line of the page
        AND this numeral is at the right-hand side of the page
        AND this numeral is an odd number
        THEN this numeral may be the page number

        IF there is a numeral in the first line of the page
        AND this numeral is at the left-hand side of the page
        AND this numeral is an even number
        THEN this numeral may be the page number
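The three typical rules can be sketched in plain Python as follows. This is an illustration only, not the Jess rule base: the `Token` class, the normalised `x_center` coordinate and the `margin` threshold that separates "left", "centre" and "right" are all assumptions. The sketch returns crisp candidates, whereas the slide's "may be" and the fuzzy logic mentioned above suggest that a real rule would attach a degree of confidence instead.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str        # OCR word
    x_center: float  # horizontal centre, 0.0 = left page edge, 1.0 = right page edge
    line_index: int  # 0 for the first line of the page


def position(token, margin=0.2):
    """Classify a token as 'left', 'centre' or 'right' (margin is an assumed threshold)."""
    if token.x_center < 0.5 - margin:
        return "left"
    if token.x_center > 0.5 + margin:
        return "right"
    return "centre"


def page_number_candidates(tokens):
    """Return the numerals that at least one of the three rules accepts."""
    candidates = []
    for token in tokens:
        # Shared condition of all three rules: a numeral in the first line.
        if token.line_index != 0:
            continue
        if not (token.text.isascii() and token.text.isdigit()):
            continue
        number = int(token.text)
        side = position(token)
        if (
            side == "centre"                          # rule 1: centred
            or (side == "right" and number % 2 == 1)  # rule 2: right and odd
            or (side == "left" and number % 2 == 0)   # rule 3: left and even
        ):
            candidates.append(number)
    return candidates
```

For example, `page_number_candidates([Token("17", 0.9, 0)])` yields `[17]`, while the same numeral on the left-hand side is rejected because odd page numbers are expected on the right.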
  • 18. Results
      Basic rules set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, precision, F-measure
        – E.g. a book has 10 lines with headings. We find 12 lines; 8 of them are correct, 4 are false.
        – Recall = 8 of 10 = 0.8
        – Precision = 8 of 12 ≈ 0.67
        – F-measure = 2 × 0.8 × 0.67 / (0.8 + 0.67) ≈ 0.73
      More explanations
        – Important: we are counting lines, not structural items! E.g. a heading may consist of two lines (often with different typeface sizes); we have to find both to succeed
        – Differences between training and evaluation sets are marginal
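The line-based measures from the example can be computed with a few lines of Python. This is a minimal sketch; the function name `line_scores` and its parameter names are chosen here for illustration and are not part of FEP.

```python
def line_scores(true_lines, found_lines, correct_lines):
    """Return (recall, precision, f_measure) for line-level counts.

    true_lines    -- lines carrying the feature in the ground truth
    found_lines   -- lines the analysis marked as carrying the feature
    correct_lines -- found lines that are also in the ground truth
    """
    recall = correct_lines / true_lines if true_lines else 0.0
    precision = correct_lines / found_lines if found_lines else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# The heading example: 10 true lines, 12 found, 8 of them correct.
recall, precision, f_measure = line_scores(10, 12, 8)
```

Carried through without intermediate rounding, the example gives an F-measure of 8/11, about 0.727.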
  • 19. Results on evaluation set
                          Recall   Precision   F-measure
        Running text      0.99     0.98        0.98
        Footnotes         0.83     0.89        0.86
        Page numbers      0.97     1.00        0.98
        Running titles    0.97     1.00        0.98
        Headings          0.85     0.80        0.82
        Signature marks   0.68     0.89        0.77
  • 20. Roadmap
      Summer 2011: Beta version
        – Integration into the IMPACT Interoperability Platform
        – Basic rules set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full-featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation
  • 21. Business offers
      Web service for processing single volumes and correction
        – Will be integrated into the eBooks-on-Demand (EOD) Network
        – 30 libraries are already uploading their images to the OCR server in Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of rules set
        – For specific datasets much more can be detected than just the basic features
        – E.g. journals with a fixed structure over many years, parliamentary papers, dissertations, research papers, etc.
      Onsite installations
        – Not our focus, but could be done for very large datasets or due to legal requirements (e.g. Google images)
  • 26. Results: TOC
        – 25 TOC entries in total
        – 22 TOC entries are completely correct
        – 1 TOC entry was missed
        – 2 TOC entries are grouped incorrectly
        – 1 TOC entry has no link
        – 1 TOC entry has a wrong link
  • 27. Thank you for your attention!