The document discusses a structural analysis tool called the Functional Extension Parser (FEP). It can recognize structural elements in documents such as page numbers, headings, footnotes, and tables of contents. This information adds context and can improve search, navigation, and other functions. The FEP follows a rule-based approach and currently achieves over 80% accuracy for common book elements. It is being integrated into the IMPACT project and will soon be available as a web service.
Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Structural analysis of documents
Functional Extension Parser (FEP)
Günter Mühlberger
University Innsbruck Library (UIBK)
Agenda
Introduction
Features
– What do we recognise with the structural analysis?
Benefits
– Why is structural analysis useful?
Architecture
– How does it work?
Results
– How good are we?
Roadmap
– When will it come into being?
Business
– Which offers will be available?
Introduction
Document understanding platform
Try to enhance and exploit the logical structure of documents for
– Display
– Navigation
– Retrieval
Enhance OCR output with structural metadata
– Fully automated processing
– Interactive correction
IMPACT EVA/MINERVA 12th Nov. 2008
Features
General
– We are able to recognise all structural elements which have some layout
representation: e.g. region, size, typeface, distance to other elements, etc.
– Focus in IMPACT: Basic features which are typical for all documents
– Rules set can be extended or specialised for other datasets
E.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
– The better the OCR, the better our structural analysis
Basic features for books
– Page numbers
– Running titles (headers)
– Print space
– Footnotes
– Signature marks
– Headings (within the running text)
– Table of contents entries (additional to headings)
– Front/Body/Back
– Paragraphs
Print space
Headings
Footnotes
Running title (header)
Page number
Signature mark
Table of contents
– (linked with headings in
the running text,
respectively page
numbers)
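The linking idea can be pictured with the minimal Python sketch below. It is not the FEP implementation: the `TocEntry` and `Heading` records, the `link_toc` function and the 0.6 similarity threshold are assumptions made purely for illustration. Each table of contents entry is compared with the headings found on the page carrying the same printed page number, using a fuzzy string comparison so that small OCR errors do not break the link.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class TocEntry:
    title: str
    page_number: int  # printed page number given in the table of contents


@dataclass
class Heading:
    text: str
    page_number: int  # printed page number recognised on the heading's page
    image_index: int  # position of the scanned image within the volume


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; tolerant of small OCR errors."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_toc(entries, headings, threshold=0.6):
    """Map each ToC title to the image index of its best-matching heading.

    Entries without a sufficiently similar heading on the stated page
    are mapped to None (no link). The threshold is an assumed value.
    """
    links = {}
    for entry in entries:
        candidates = [h for h in headings if h.page_number == entry.page_number]
        best = max(candidates,
                   key=lambda h: similarity(entry.title, h.text),
                   default=None)
        if best is not None and similarity(entry.title, best.text) >= threshold:
            links[entry.title] = best.image_index
        else:
            links[entry.title] = None
    return links


# Made-up example: one heading contains an OCR error, one entry has no match.
toc = [TocEntry("Introduction", 1), TocEntry("Results", 17)]
found = [Heading("Introductlon", 1, 9), Heading("Roadmap", 20, 28)]
links = link_toc(toc, found)
```

Restricting candidates to the stated page keeps the comparison cheap, at the cost of failing when the page number itself was misrecognised; a real system would probably need a fallback for that case.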
Benefits (1)
Display
– A correct print space allows images to be displayed centred (no jumping
between pages)
Search & retrieval
– Scoring of results
Could take into account structural data (headings, footnotes)
– Noise reduction
Front, body, back are separated, text from the front is often misleading
Running titles repeat the same words
Footnotes can be included or excluded
– Faceted search
Results can be displayed for running text, footnotes, headings
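As a rough illustration of the search benefits listed above, the following Python sketch shows how structural labels might be used at indexing time. The `Line` record, the label names and the choice of which labels count as noise are all hypothetical; the slides do not describe how a search engine would consume the FEP output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    label: str  # hypothetical label names, e.g. "running_text", "footnote"


# Labels treated as noise for full-text search (an assumption for this sketch).
NOISE_LABELS = {"running_title", "signature_mark", "page_number"}


def build_facets(lines):
    """Group searchable lines by structural label, dropping noise labels."""
    facets = defaultdict(list)
    for line in lines:
        if line.label not in NOISE_LABELS:
            facets[line.label].append(line.text)
    return dict(facets)


def search(facets, term, labels=None):
    """Return (label, text) hits for term, optionally restricted to some facets."""
    term = term.lower()
    hits = []
    for label, texts in facets.items():
        if labels is not None and label not in labels:
            continue
        hits.extend((label, t) for t in texts if term in t.lower())
    return hits


# Made-up page content for demonstration only.
page = [
    Line("History of Tyrol", "running_title"),
    Line("Chapter 3: The History of Tyrol", "heading"),
    Line("The history of the region begins ...", "running_text"),
    Line("1) See the older history of the region.", "footnote"),
    Line("42", "page_number"),
]
facets = build_facets(page)
all_hits = search(facets, "history")
heading_hits = search(facets, "history", labels={"heading"})
```

The running title repeats the search term on every page, so excluding it removes one spurious hit per page, while the `labels` argument restricts results to a single facet such as headings.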
Benefits (2)
Navigation
– Page numbers allow usage of original table of contents
– Original table of contents can be linked with headings/page numbers in
the book
Document editing
– Further mark up (e.g. TEI) is supported
– Manual preparation for Print-on-Demand is eased (print space)
– Selective OCR correction can be applied:
– E.g. only headings, running text, footnotes could be fed to CONCERT
Document matching
– Contributions or footnotes can be matched with existing bibliographical
databases
Improved display on the
Internet and in PDF
Refinement of full-text
search
Facets for e.g.
– Running text
– Footnotes
– Headings
Less noise
– Running titles,
signature marks
excluded from search
Clickable table of contents
entries
– Google style
Selective OCR correction
– Correct only ToC,
headings, footnotes, etc.
Matching of documents
with external sources
– Match footnotes with
library catalogues
(bibliographies)
– Match table of contents
entries and headings with
bibliographies
Improved editing
– Alternating print spaces
for Print on Demand
– Further processing for
TEI editions etc.
Architecture
Input
– Results from OCR processing on word level (coordinates)
– E.g. ALTO file, ABBYY XML file or Google HTML
Output
– Structural annotations for recognized text features, e.g. page numbers,
running titles, headings, etc.
– E.g. XML, ALTO, METS, TEI, etc.
General workflow
– OCR result files are parsed (FEP general XML format)
– Rules set is applied to the dataset (rules are managed by rules engine)
– Results are stored in a database
– Export on various levels is provided
Optional
– Online or offline correction (GUI)
– Adaptation of rules set
– Quality assurance on basis of ground truth
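To make the input step more concrete, here is a minimal Python sketch that reads word-level coordinates from an ALTO file. It relies only on the ALTO `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes; the `Word` record, the `parse_alto_words` function and the sample document are assumptions for illustration and say nothing about the FEP general XML format itself.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float
    y: float
    width: float
    height: float


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix, e.g. '{ns}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def parse_alto_words(xml_text: str):
    """Collect every ALTO String element as a Word with its coordinates."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            words.append(Word(
                text=elem.get("CONTENT", ""),
                x=float(elem.get("HPOS", 0)),
                y=float(elem.get("VPOS", 0)),
                width=float(elem.get("WIDTH", 0)),
                height=float(elem.get("HEIGHT", 0)),
            ))
    return words


# Made-up single-word ALTO document for demonstration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page WIDTH="2000" HEIGHT="3000"><PrintSpace>
    <TextBlock><TextLine>
      <String CONTENT="17" HPOS="980" VPOS="120" WIDTH="40" HEIGHT="30"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

words = parse_alto_words(SAMPLE)
```

Matching on the local element name rather than a fixed namespace lets the same function read files from different ALTO schema versions.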
The FEP Core
Based on an expert-system-like rule engine for Java (Jess)
Both manually crafted rules and rules obtained by machine learning
Uses fuzzy logic to deal with uncertainty
Typical rules:
IF there is a numeral in the first line of the page AND this numeral is centred
THEN this numeral may be the page number
IF there is a numeral in the first line of the page AND this numeral is at the
right hand side of the page AND this numeral is an odd number THEN this
numeral may be the page number
IF there is a numeral in the first line of the page AND this numeral is at the
left hand side of the page AND this numeral is an even number THEN this
numeral may be the page number.
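The three rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not Jess code and not the FEP rule set: the `Token` record, the membership functions `centred`, `right_side` and `left_side`, and their numeric breakpoints are all assumed. Each rule yields a degree of belief between 0 and 1 instead of a hard yes or no, and the rules are combined with a fuzzy OR (maximum).

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    line_index: int  # 0 = first line of the page
    x_center: float  # horizontal centre of the token, in pixels


def clamp(value: float) -> float:
    """Limit a value to the interval [0, 1]."""
    return max(0.0, min(1.0, value))


def centred(rel: float) -> float:
    """1.0 in the middle of the page, falling to 0.0 at +/-15% of the width."""
    return clamp(1.0 - abs(rel - 0.5) / 0.15)


def right_side(rel: float) -> float:
    """0.0 up to 60% of the page width, rising to 1.0 at 90%."""
    return clamp((rel - 0.6) / 0.3)


def left_side(rel: float) -> float:
    """1.0 up to 10% of the page width, falling to 0.0 at 40%."""
    return clamp((0.4 - rel) / 0.3)


def page_number_score(token: Token, page_width: float) -> float:
    """Degree (0..1) to which the token may be the page number."""
    is_numeral = token.text.isascii() and token.text.isdigit()
    if token.line_index != 0 or not is_numeral:
        return 0.0
    rel = token.x_center / page_width
    number = int(token.text)
    rule_centred = centred(rel)
    rule_right_odd = right_side(rel) if number % 2 == 1 else 0.0
    rule_left_even = left_side(rel) if number % 2 == 0 else 0.0
    # Fuzzy OR: any single rule firing is enough to support the hypothesis.
    return max(rule_centred, rule_right_odd, rule_left_even)


odd_right = page_number_score(Token("17", 0, 1850.0), 2000.0)
odd_left = page_number_score(Token("17", 0, 150.0), 2000.0)
even_centre = page_number_score(Token("18", 0, 1000.0), 2000.0)
```

An odd numeral on the left side scores zero here, which mirrors the "may be" wording of the rules: a candidate that matches none of them is simply not supported, while other rules or later evidence could still revise the result.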
Results
Basic rules set
– General features for books from 1700 to 2000
– Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
– All books were manually annotated (ground truth)
Recall, Precision, F-Measure
– E.g. 10 lines with headings in a book. We find 12 lines, 8 of them are
correct, 4 are false.
– Recall = 8 of 10 = 0.80
– Precision = 8 of 12 ≈ 0.67
– F-Measure = 2*0.80*0.67/(0.80+0.67) ≈ 0.73
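The same calculation can be written as a short Python function. The function name and its parameters are chosen here for illustration; it computes all three measures directly from the line counts, without rounding intermediate values.

```python
def precision_recall_f(true_lines: int, found_lines: int, correct_lines: int):
    """Return (recall, precision, F-measure) from line counts.

    true_lines:    lines carrying the feature in the ground truth
    found_lines:   lines the system labelled with the feature
    correct_lines: labelled lines that are also in the ground truth
    """
    recall = correct_lines / true_lines if true_lines else 0.0
    precision = correct_lines / found_lines if found_lines else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# The heading example: 10 true lines, 12 found, 8 of them correct.
recall, precision, f_measure = precision_recall_f(10, 12, 8)
print(f"recall={recall:.2f} precision={precision:.2f} f={f_measure:.2f}")
```

Because the F-measure is the harmonic mean of the two values, it stays closer to the lower of precision and recall than an arithmetic mean would.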
More explanations
– Important: We are counting lines, not structural items!
E.g. if a heading consists of two lines (often with different typeface sizes), we have
to find both to succeed
– Differences between training and evaluation sets are marginal
Results on Evaluation Set
                 Recall  Precision  F-measure
Running text     0.99    0.98       0.98
Footnotes        0.83    0.89       0.86
Page numbers     0.97    1.00       0.98
Running titles   0.97    1.00       0.98
Heading          0.85    0.80       0.82
Signature marks  0.68    0.89       0.77
Roadmap
Summer 2011: Beta version
– Integration into IMPACT Interoperability Platform
– Basic rules set: books from 1700 to 1900
End of the year: Version 1.0
– Full featured version
– Enhanced online correction interface
– FEP as a service, not as a product for local installation
Business offers
Web-service for processing single volumes and correction
– Will be integrated into eBooks-on-Demand EOD Network
– Already 30 libraries are uploading their images to the OCR server in
Innsbruck
– FEP will be an additional service for general material
– Similar offers can be made to other libraries or networks as well
Adaptation of rules set
– For specific datasets much more can be detected than just the basic
features
– E.g. journals with a fixed structure over many years or parliamentary
papers, dissertations, research papers, etc.
Onsite installations
– Not our focus, but could be done for very large datasets or due to legal
requirements (e.g. Google images)
Results: TOC
25 TOC entries in total
22 TOC entries are completely correct
1 TOC entry was missed
2 TOC entries are grouped incorrectly
1 TOC entry has no link
1 TOC entry has a wrong link
Thank you for your attention!