Presentation 'Transkribus. A research infrastructure for transcribing, recognizing and searching archival documents', given at the study day Googelen door Archieven on 13 October at the Nationaal Archief.
Transkribus | Günter Mühlberger
1. Transkribus.
A research infrastructure for
transcribing, recognizing and searching
archival documents
Günter Mühlberger
University of Innsbruck,
Digitisation and Digital Preservation Group
3. Voorts Hooge Mogende heeren, Inde practyque vande decisie by uw Hoge
voorts hoomge Mogende heeren, Inde pra tiqune vande decasie by ende hoge
4. Mo: gedaen in t'voorJaer over t'vinden vande middelen syn verscheyden differenten
Mo: gedaen in t' voorJaer over t' vinden vande middelen syn verscheyden differenten
5. voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende
voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende
8. Archives are starting to digitise their holdings.
BTW: Documents in archives are unique, have never been published before and contain extremely
interesting content!
15. READ
• H2020 e-Infrastructure Project
• Duration: 1.1.2016 – 30.6.2019
• Budget: EUR 8.2 million grant
• Coordinated by University of Innsbruck
• 14 partners and more than 20 institutions connected through a Memorandum of
Understanding
• Main objectives
• Foster research in Pattern Recognition, Machine Learning, Natural Language
Processing, Digital Humanities
• Set up a service platform (“Transkribus”) to make the technology available to
archives, scholars and the public
• Transform this research infrastructure into a permanent service
18. Documents in Transkribus
• Private
• All documents in Transkribus are private by default: visible only to the “owner” of the
document
• Local
• For simple operations only; the full range of services is available only for remote documents
• Remote
• Standard mode
• Stored on the servers of the University of Innsbruck
• Upload of documents
• HTTP
• PDF
• FTP
• METS Link
• Direct download from repository
23. Researchers can go “shopping” and collect documents from various
repositories and digital libraries in their private Transkribus collection
24. Transcribe text in a reliable, secure, and machine readable way = create
a scholarly transcription
And use the text to train the HTR engines
28. Finished? Write an email…
The training process will be made available to users as well
(but this will take some time due to a lack of resources in Innsbruck)
29. HTR engine(s) – current implementation
• Hidden Markov Models - HMM (already available)
• Training takes some hours
• Recognition takes 20-60 minutes for one page
• Strong limitations on dictionary and resolution of images
• Recurrent Neural Networks - RNN (coming soon)
• Training takes some days
• Recognition takes less than 60 seconds per page (!)
• No limitation on resolution of images
• Free choice of dictionary – less dependency
• Main limitations for both HTR engines
• Layout Analysis (“line finder”)
• Need for dictionaries
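To make the per-page recognition times above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes pages are processed one after another on a single machine, and the collection size of 1,000 pages is a made-up figure for illustration, not a number from the slides.

```python
def recognition_hours(pages: int, seconds_per_page: float) -> float:
    """Total recognition time in hours, assuming strictly sequential processing."""
    return pages * seconds_per_page / 3600


PAGES = 1000  # hypothetical collection size, not taken from the slides

hmm_low = recognition_hours(PAGES, 20 * 60)   # HMM at 20 minutes per page
hmm_high = recognition_hours(PAGES, 60 * 60)  # HMM at 60 minutes per page
rnn_max = recognition_hours(PAGES, 60)        # RNN upper bound: 60 seconds per page

print(f"HMM: {hmm_low:.0f} to {hmm_high:.0f} hours")
print(f"RNN: under {rnn_max:.1f} hours")
```

Under these assumptions the difference is between weeks and less than a day of computing time for the same collection.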
33. What to do with the automatically recognized
text?
• Measure results
• Correct the text
• Search in the full-text
• Invite people to support you in transcribing
36. Correct text
• Character Error Rates
• Above 20%: correction takes as long as keying, but readers who have
difficulty deciphering the script may still benefit
• Between 10% and 20%: correction is faster than keying, but experienced readers prefer to key
• Below 10%: correction is much faster, and even experienced readers will
accept correction instead of keying
• Currently typical figures are 10% CER
• Under lab conditions significantly better results are already possible
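The Character Error Rate (CER) behind these thresholds is commonly computed as the character-level edit distance between a reference transcription and the recognized text, divided by the length of the reference. The following Python sketch illustrates that common definition; it is not necessarily the exact procedure used inside Transkribus. The sample pair is the reference line and recognized line from slide 5, and the handling of a CER of exactly 10% is an assumption, since the slide does not say which band it falls in.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete a reference character
                            curr[j - 1] + 1,          # insert a hypothesis character
                            prev[j - 1] + (r != h)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(ref, hyp) / len(ref)


def correction_advice(rate: float) -> str:
    """Map a CER to the rule-of-thumb bands from the slide."""
    if rate > 0.20:
        return "correction takes as long as keying"
    if rate >= 0.10:  # assumption: exactly 10% is counted in the middle band
        return "correction is faster, but experienced readers prefer to key"
    return "correction is much faster than keying"


reference = "voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende"
recognized = "voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende"

rate = cer(reference, recognized)
print(f"CER = {rate:.1%}: {correction_advice(rate)}")
```

A single line is of course too small a sample for a meaningful figure; in practice the rate would be measured over many lines or pages.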
37. Search full-text
• Private search: you only get results from collections of which you
are a member
• Faceted search
• Configurable search
41. Share your documents among your working group, colleagues, students
and volunteers…
43. Export documents
• Various formats
• XML (PAGE)
• METS (Metadata Encoding and Transmission Standard – LoC)
• ALTO (Analyzed Layout and Text Object – LoC)
• DOCX
• TEI (Text Encoding Initiative)
• PDF
• Excel
• …
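As an illustration of what can be done with an exported file, the sketch below reads the transcribed lines out of a PAGE XML document using only the Python standard library. The sample document is hand-written and heavily simplified (a real export also carries coordinates and metadata), and the namespace shown is that of the 2013-07-15 PAGE schema; an actual export may declare a different schema version, so treat both as assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema; a real export may use another version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hand-written minimal sample, not an actual Transkribus export.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>voorgevallen tusschen Stadt en Lande</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <TextEquiv><Unicode>als namentl ouer den voet ende</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def text_lines(page_xml: str) -> list[str]:
    """Return the text of every TextLine, in document order."""
    root = ET.fromstring(page_xml.strip())
    lines = []
    for line in root.iterfind(".//pc:TextLine", NS):
        # Only the line's own TextEquiv is read, not those of nested elements.
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text)
    return lines


for text in text_lines(SAMPLE):
    print(text)
```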
45. How to access services via machines?
Services in Transkribus are accessible via a REST interface
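The slides do not list the REST endpoints, so the following Python sketch only shows the general shape of such a client: log in with a form-encoded POST, then reuse the session for further calls. The base URL is a placeholder, and the endpoint paths, parameter names and cookie name are hypothetical; the actual values should be taken from the Transkribus documentation. The requests are built but deliberately not sent.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Placeholder address for illustration only, NOT the real service URL.
BASE_URL = "https://example.org/rest/"


def login_request(user: str, password: str) -> Request:
    """Build (but do not send) a form-encoded POST to a hypothetical login endpoint."""
    body = urlencode({"user": user, "pw": password}).encode("ascii")
    return Request(urljoin(BASE_URL, "auth/login"), data=body, method="POST")


def collections_request(session_id: str) -> Request:
    """Build a GET for a hypothetical collection listing, authenticated by a session cookie."""
    req = Request(urljoin(BASE_URL, "collections/list"), method="GET")
    req.add_header("Cookie", f"JSESSIONID={session_id}")  # cookie name is an assumption
    req.add_header("Accept", "application/json")
    return req


req = login_request("alice@example.org", "secret")
print(req.get_method(), req.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would send it once the placeholder URL and paths have been replaced with the real ones.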
47. What will come next?
• Table editor
• eLearning Interface
• Web-interface for simplified transcription (crowd-sourcing)
• Text2Image matching tool
• ScanApp
• …
54. Web-interface
Every document in Transkribus will also be accessible via a web interface
suitable for involving volunteers and the crowd
56. txt2img tool
Many printed or digital editions are already available.
Automatically matching their text to the page images may simplify the production of training data
(only good matches will be used for training).
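The idea of keeping only good matches can be sketched with a simple text-to-text comparison. This is a simplification and not the txt2img tool itself, which has to align text with images. Here each automatically recognized line is compared with the lines of an existing edition using `difflib` from the Python standard library, and a pair is kept only if its similarity reaches a threshold. The threshold of 0.9 is an arbitrary assumption; the sample lines are those shown on slides 3 to 5.

```python
from difflib import SequenceMatcher

EDITION = [
    "Voorts Hooge Mogende heeren, Inde practyque vande decisie by uw Hoge",
    "Mo: gedaen in t'voorJaer over t'vinden vande middelen syn verscheyden differenten",
    "voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende",
]
HTR = [
    "voorts hoomge Mogende heeren, Inde pra tiqune vande decasie by ende hoge",
    "Mo: gedaen in t' voorJaer over t' vinden vande middelen syn verscheyden differenten",
    "voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende",
]


def match_lines(htr_lines, edition_lines, threshold=0.9):
    """Pair each HTR line with its most similar edition line; keep good matches only."""
    if not edition_lines:
        return []
    accepted = []
    for h_idx, htr in enumerate(htr_lines):
        # Score this HTR line against every edition line and take the best one.
        scored = [(SequenceMatcher(None, htr, ed).ratio(), e_idx)
                  for e_idx, ed in enumerate(edition_lines)]
        score, e_idx = max(scored)
        if score >= threshold:  # weaker matches are not used as training data
            accepted.append((h_idx, e_idx, score))
    return accepted


for h_idx, e_idx, score in match_lines(HTR, EDITION):
    print(f"HTR line {h_idx} -> edition line {e_idx} (similarity {score:.2f})")
```

Raising the threshold yields less but cleaner training material; lowering it does the opposite.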
57. ScanApp
Researchers can use mobile phones as document scanners
(images are sent directly to their private Transkribus collection, and archives may
benefit from this too)
59. Try out?
We are happy to support you in setting up test projects
Concluding a Memorandum of Understanding is a simple way to take
part in the project!
60. Credits
Hubert Alisade Hilde Boe Laurant Bolli Max Bryan Elaine Charwat Vincent
Christlein Sebastian Colutto Hervé Déjean Barbara Denicolo Markus Diem
Felix Dietrich Reko Etelävuori Stefan Fiel Basilis Gatos Beat Gnädinger Tobias
Grüning Vili Haukkovaara Gerhard Heyer Tobias Hodel Frederic Kaplan
Maria Kallio Istvan Kecskemeti Florian Kleber Roger Labahn Eva Lang Sören
Laube Gundram Leifert Georgios Louloudis Philip Kahle Rory McNicholl
Jean-Luc Meunier Johannes Michael Hannes Obermair Moises Pastor
Nathanael Philipp Hannelore Putz George Retsinas Veronica Romero Joan
Andreu Sanchez Robert Sablatnig Christian Sieber Giorgos Sfikas Philip
Schofield Louise Seaward Nikolaos Stamatopolous Tobias Strauss Melissa
Terras Alejandro Hector Toselli Enrique Vidal Mauricio Villegas Max
Weidemann Welf Wustlich Herbert Wurster and many, many more!
61. Thank you for your attention!
More information on the project and the Transkribus platform
http://read.transkribus.eu/
http://transkribus.eu/
http://transkribus.eu/wiki/
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under
grant agreement No 674943.