IMPACT Final Event 26-06-2012 - Automated metadata extraction from title pages. Report on the FEP pilot at the German National Library by Christa Schöning-Walter (DNB)
IMPACT final event – The Hague – 26 June 2012
Metadata extraction from title pages
Evaluation of the FEP pilot at the German National Library
Christa Schöning-Walter

Outline
1. Institutional background
2. IMPACT test case
3. Strategic goals
4. Preliminary work
5. Results
6. Perspective
1 2 | IMPACT event | June 26, 2012 |

The German National Library (DNB) – some facts and figures (I)
− Legal deposit: Collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc from 1913
− Bibliographic services:
  − National Bibliography
  − Authority files
  − Bibliographic standards
− 2 sites: Leipzig, Frankfurt am Main

The German National Library (DNB) – some facts and figures (II)
− Collection size (January 2012): 27 million media units
− Daily input: 1,500 physical units (each with 2 copies)
− Since 2006: Collection mandate includes non-physical media (online publications)
  − DNBG = Law regarding the German National Library
  − PflAV = Legal Deposit Regulation
− Since 2009: Considerations on and implementation of automated cataloguing processes

Target of the IMPACT scenario
Opening questions (summer 2011):
− Can a rule engine successfully extract metadata from the title pages of simply structured monographic publications?
− Does this help to accelerate the cataloguing processes when no machine-readable metadata is available from other sources?
Test case: Theses
− 14,000 print units annually
− simple structure !?

Starting point
Since January 2012:
− Experimental application studies in collaboration with the University of Innsbruck
− Using the rule-based exploitation features of FEP (Functional Extension Parser)
What is FEP?
− Software platform for analysing the logical structure of documents
− Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)

Strategic goals
In particular:
− Making descriptive cataloguing less time-consuming and literature processing of printed media faster by
  − Partial digitisation
  − Automated metadata extraction
  − Result transfer into the bibliographic record
  − Quality check and completion of cataloguing by the staff
Generally:
− Gaining experience in the area of automated metadata extraction / automated cataloguing

Conceptual design of the workflow
Example: http://d-nb.info/1017138931
[Workflow diagram; labels recovered from the slide: Accession (printed media units), Stack, Service partner, Scan service (title page + ToC), OCR output / Indexing, Repository, Data Provider, OAI-Harvester, FEP results, Cataloguing, Bibliographic record, Quality check, Statistics]

The idea:
Taking bibliographic data over from metadata mining tools.
Illustration: University of Innsbruck

The Objective: Automated exploitation of descriptive bibliographic data
− Specification, implementation, evaluation and gradual improvement of
  − Appropriate structure types
  − Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc)
  − Expert rules
  − Etc

Preliminary work (I)
− Specification of the bibliographic statements to be mined from the title page

Attribute          Value
Publication year   2010
Language code      /1ger
                   /1eng
Creator            <last name>,<first name>
Title              <full title>:<additional title information>/<author statement>
Size               30 cm
                   21 cm
Theses statement   <city name>, <corporate body name>, <type of publication>,<year of graduation>

Preliminary work (II)
− Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)
− Exploring typical structural patterns / regularities etc, such as
  − Prefixes
  − Phrases
  − Notation
  − Position
Expert rules

Examples of indicating phrases to find out the creator:
von
von <Verfasser> vorgelegte Dissertation
von Herrn/Frau:
vorgelegt von(:)
vorgelegt JJJJ von
vorgelegt dem Fachbereich ... von
Name:
Name des Verfassers:
Name der Verfasserin:
verfasst von(:)
eingereicht von
...
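
The indicating phrases listed above could be turned into a simple pattern match. The following is a minimal sketch, not the FEP implementation: the function name `find_creator` is hypothetical, only a subset of the phrases is used (the bare "von" is left out as too ambiguous), and the name is assumed to follow the phrase as "first name(s) last name" on the same or the next line.

```python
import re

# Subset of the indicating phrases from the slide. Variants with a colon come
# before their colon-free forms so that the colon is consumed when present.
PHRASES = [
    "Name der Verfasserin:",
    "Name des Verfassers:",
    "vorgelegt von:",
    "vorgelegt von",
    "verfasst von:",
    "verfasst von",
    "eingereicht von",
    "Name:",
]

# Phrase, optional whitespace (including a line break), then the rest of the line.
PATTERN = re.compile(
    r"(?:%s)\s*(?P<name>[^\n]+)" % "|".join(re.escape(p) for p in PHRASES),
    re.IGNORECASE,
)


def find_creator(ocr_text):
    """Return the creator as '<last name>,<first name>' or None if nothing matches."""
    match = PATTERN.search(ocr_text)
    if match is None:
        return None
    parts = match.group("name").split()
    if len(parts) < 2:
        return None
    # Assumption: the last token is the surname. Academic titles or surnames
    # with several parts would need additional rules.
    first, last = " ".join(parts[:-1]), parts[-1]
    return f"{last},{first}"


sample = "Inaugural-Dissertation\nvorgelegt von\nErika Mustermann\naus Leipzig"
print(find_creator(sample))
```

The output notation follows the Creator row of the attribute table. A real rule set would also have to handle the phrases with embedded placeholders, such as "vorgelegt JJJJ von".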

Preliminary work (III)
Choosing / preparing dictionaries for tagging, matching and mapping purposes:
− List of universities entitled to award degrees (identifying the corporate bodies)
− Name Authority File subset (identifying personal names)
− List of academic degrees

Theses statement items (examples):
…
Berlin, ESCP Europe Wirtschaftshochschule
Berlin, Freie Univ.
Berlin, Humboldt-Univ.
Berlin, Steinbeis-Hochsch.
Berlin, Techn. Univ.
Berlin, Univ. der Künste
…

Academic degrees (examples):
…
M.A.    Master of Arts / Magister Artium
M.Sc.   Master of Science
M.Eng.  Master of Engineering
LL.M.   Master of Laws / Legum Magister
M.F.A.  Master of Fine Arts
M.Mus.  Master of Music
M.Ed.   Master of Education
…

Preliminary work (IV)
− Setting up a sample of documents for evaluation purposes:
  − 1,000 theses from several universities
  − Publication year: 2010 – 2011
  − Different dimensions (A- and B-size)
  − Scans: 300 dpi, bitonal
  − Transfer format: PDF (in future: XML files)
− Ground truth determination:
  − Manual region tagging on image files (done in Vietnam with the Aletheia tool)
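
As an illustration of how such dictionaries might be used for tagging and mapping, here is a minimal sketch. It is not taken from FEP: the function names are hypothetical, and the full-length title-page spellings on the left-hand side of the university dictionary are assumptions. Only the normalised forms and the degree abbreviations come from the slide.

```python
# Title-page spelling (assumed, lower case) -> normalised form from the slide.
UNIVERSITIES = {
    "freie universität berlin": "Berlin, Freie Univ.",
    "humboldt-universität zu berlin": "Berlin, Humboldt-Univ.",
    "technische universität berlin": "Berlin, Techn. Univ.",
    "universität der künste berlin": "Berlin, Univ. der Künste",
}

# Long form (lower case) -> abbreviation, as listed on the slide.
DEGREES = {
    "master of arts": "M.A.",
    "magister artium": "M.A.",
    "master of science": "M.Sc.",
    "master of engineering": "M.Eng.",
    "master of laws": "LL.M.",
    "master of education": "M.Ed.",
}


def tag(text, dictionary, label):
    """Return (position, label, normalised form) for every dictionary hit."""
    lowered = text.lower()
    hits = []
    for variant, normalised in dictionary.items():
        position = lowered.find(variant)
        if position != -1:
            hits.append((position, label, normalised))
    return hits


def tag_title_page(text):
    """Tag corporate bodies and degrees, ordered by position on the page."""
    return sorted(
        tag(text, UNIVERSITIES, "corporate_body") + tag(text, DEGREES, "degree")
    )


page = "Dissertation zur Erlangung des Grades Master of Science\nTechnische Universität Berlin"
for position, label, value in tag_title_page(page):
    print(position, label, value)
```

Plain substring lookup is used here for brevity. OCR errors and spelling variants would call for fuzzier matching, which is one reason dictionary maintenance matters.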

Document processing in brief
− Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc)
− Input of expert rules
− Rule engine: Stepwise proceeding taking intermediary results into account
Illustration: University of Innsbruck

Results
Second test phase with a revised list of universities (June 2012):
[Results figure not preserved in this transcript]
(1) total conformity
(2) complete title + noise (just to be deleted by the staff)
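
The stepwise proceeding of a rule engine that takes intermediary results into account can be sketched as a small fact store plus rules that are re-run until nothing new is derived. This is only an illustration under stated assumptions, not FEP's rule language or API: all function names, the single dictionary entry and the short form "Diss." are hypothetical.

```python
import re

# Hypothetical dictionary entry: title-page spelling -> normalised form.
CORPORATE_BODIES = {"Technische Universität Berlin": "Berlin, Techn. Univ."}


def rule_corporate_body(facts):
    for spelling, normalised in CORPORATE_BODIES.items():
        if spelling in facts["ocr_text"]:
            return {"corporate_body": normalised}
    return {}


def rule_year(facts):
    match = re.search(r"\b(?:19|20)\d{2}\b", facts["ocr_text"])
    return {"year": match.group(0)} if match else {}


def rule_publication_type(facts):
    # "Diss." as the short form is an assumption made for this sketch.
    return {"publication_type": "Diss."} if "Dissertation" in facts["ocr_text"] else {}


def rule_theses_statement(facts):
    # Fires only once the other rules have produced their intermediary results.
    needed = ("corporate_body", "publication_type", "year")
    if not all(key in facts for key in needed):
        return {}
    # Separators follow the notation of the attribute table on the slide.
    return {"theses_statement": "{corporate_body}, {publication_type},{year}".format(**facts)}


def run_rules(rules, initial_facts):
    """Apply all rules repeatedly until no rule contributes a new fact."""
    facts = dict(initial_facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_facts = {k: v for k, v in rule(facts).items() if k not in facts}
            if new_facts:
                facts.update(new_facts)
                changed = True
    return facts


# The dependent rule is listed first on purpose: it fails in the first pass
# and succeeds in the second, once the intermediary facts exist.
RULES = [rule_theses_statement, rule_corporate_body, rule_year, rule_publication_type]
result = run_rules(RULES, {"ocr_text": "Dissertation\nTechnische Universität Berlin\n2011"})
print(result["theses_statement"])
```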
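
The two result categories named on the Results slide suggest a simple comparison of each extracted title against the ground truth. The sketch below is an assumption about how such a comparison might look, not the evaluation procedure actually used: the function names and the third category "other" are hypothetical.

```python
from collections import Counter


def normalise(text):
    """Collapse runs of whitespace so that line breaks do not count as differences."""
    return " ".join(text.split())


def classify(extracted, ground_truth):
    extracted, ground_truth = normalise(extracted), normalise(ground_truth)
    if extracted == ground_truth:
        return "total conformity"
    if ground_truth and ground_truth in extracted:
        # The full title was found, plus surplus text the staff would delete.
        return "complete title + noise"
    return "other"  # hypothetical catch-all, not a category from the slide


pairs = [
    ("Zur  Theorie der Netze", "Zur Theorie der Netze"),
    ("Zur Theorie der Netze Inaugural-Dissertation", "Zur Theorie der Netze"),
    ("Zur Theorie", "Zur Theorie der Netze"),
]
tally = Counter(classify(extracted, truth) for extracted, truth in pairs)
print(tally)
```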

Forecast: Feasibility study
− Technical and organisational requirements: Operational aspects, technical workflow, interfaces etc
− Further functional enhancements needed:
  − Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc
  − Taking additional facts into account: Ground truth etc
  − Additional expert rules (?)
  − Additional functions: Language guesser, document size etc
  − Customising FEP (?)

New ideas
− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc)
Target:
− Improvement of the results of current automated subject cataloguing projects, such as
  − Thematic classification by machine learning techniques
  − Subject headings obtainment by text analysis techniques
Reducing the noise via preceding structure analysis processes
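
The "language guesser" named among the additional functions could, in its simplest form, count stopwords. The sketch below is purely illustrative: the word lists are small, hand-picked assumptions, the function name is hypothetical, and only the two language codes from the attribute table (ger, eng) are covered.

```python
import re

# Tiny, hand-picked stopword lists (assumptions for this sketch only).
STOPWORDS = {
    "ger": {"der", "die", "das", "und", "zur", "des", "von", "eine", "einer", "für"},
    "eng": {"the", "of", "and", "for", "to", "a", "on", "with", "an", "by"},
}


def guess_language(text):
    """Return 'ger' or 'eng', or None when the evidence is missing or tied."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        code: sum(token in words for token in tokens)
        for code, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0 or list(scores.values()).count(scores[best]) > 1:
        return None
    return best


print(guess_language("Untersuchungen zur Struktur der Proteine und ihrer Funktion"))
print(guess_language("A study of the structure and function of proteins"))
```

Title pages carry little running text, so a production version would likely need larger word lists or character n-gram statistics.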

Thank you for your attention.

Christa Schöning-Walter
Staff position ’Automated Cataloguing’
c.schoening@dnb.de

Sandra Hamm
Project leader
s.hamm@dnb.de

German National Library
Digital Services
Frankfurt am Main, Germany