Universität Innsbruck
Christoph-Probst-Platz, Innrain 52
6020 Innsbruck
http://info.uibk.ac.at
User-driven correction of OCR errors.
Combining crowdsourcing and information retrieval
technology
Günter Mühlberger, Johannes Zelger
David Sagmeister, Albert Greinöcker
Universität Innsbruck / Höhere Technische
Bundeslehranstalt Anichstraße - Innsbruck
Günter Mühlberger | Universitätsbibliothek Innsbruck | Abt. f. Digitalisierung & elektr. Archivierung | DATech 2014 - Madrid
• Introduction
• Crowdsourcing approaches for OCR correction
• Our approach
• Evaluation
• Future work
Agenda
Introduction
• Digitisation of historical printed material
– Google: Billions of files, libraries: Millions of files
– Still hard to get access to these files
• OCR quality
– There is only little reliable data on the accuracy of OCR on large-scale datasets
– E.g. we do not know „how good“ the Google collection is as a whole, or per
language, per century, decade or year, per text type, etc.
• Tanner (2009)
– Evaluated OCR accuracy on British newspapers
– Differences per newspaper are stronger than per publishing date
– Overall we are speaking about 10% to 40% Word Error Rate, with an average of
22% WER for standard words and 31% for significant words
– Evaluation done within the IMPACT project has shown similar figures
Digitisation and OCR quality
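For reference, word error rate is conventionally computed as the word-level edit distance between the OCR output and the ground truth, divided by the number of ground-truth words. This is the standard definition and an assumption here: the slides do not state how the cited figures were measured.

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

Here S, D and I are the numbers of substituted, deleted and inserted words, and N is the number of words in the ground truth. With purely illustrative numbers, a 100-word passage with S = 15, D = 4 and I = 3 gives WER = 22/100 = 22%.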
• What does this mean for the end-user?
– End-users are either searching a collection or are reading an interesting item
(which they may have found by searching).
– But for reading a page/book they have the original image – so the full-text is much
less important for them
• If we take the figures from above:
– End-users will miss e.g. 20% or 30% of all occurrences of a search term which
would be interesting for them simply because the OCR is wrong.
• Maybe acceptable to occasional users, but surely not for humanities
researchers or family historians: They want to get „all relevant
occurrences“
– What is “relevant” is decided by the user, some may be interested just within a
specific time period, or periodical, or collection of documents
– Note: Not all words are frequent in all collections („London“ is rare in a Tyrolean
newspaper collection whereas it is frequent in a British newspaper collection)
End-users and OCR quality
Crowd sourcing for OCR
• OCR as an „ideal“ field for crowd-sourcing
– Simple to realize: Provide link between image and text and let the
user correct it
• Three (and a half) main approaches
– reCAPTCHA
– Australian National Library (Newspaper Digitization Project)
– National Library of Finland (gamification)
– IBM: CONCERT (Collaborative Correction Platform)
Approaches
reCAPTCHA
Australian National Library
Australian National Library
National Library of Finland: Digitalkoot
IBM CONCERT (COoperative eNgine for Correction of ExtRacted Text)
• OCR correction with the support of the crowd does work (but not
always)!
• In the case of reCAPTCHA and DigitalKoot users have no influence on
what they correct (de-motivating)
– reCAPTCHA is successful due to the sheer size of interactions
• User specific benefit is provided mainly by the approach of the Australian
National Library
– User reads the text carefully when editing
– Finds corrected words immediately after submitting correct text
– Can decide what to correct
• Power users vs. crowd users
– A very small segment of all users carries out the actual work
– Australia: Top 6 users corrected about 25% of the texts
– Transcribe Bentham project: Top 7 users produced 70% of all transcripts
Conclusion
Proposed approach
• Let's combine searching and crowd-based correction!
• Provide users with a powerful instrument to correct exactly
those words they are interested in (i.e. searching for)
• Relieve users from actually editing words, but let them just
approve or reject the results of the OCR engine
Searching AND correcting
Search interface
• User has the chance to
– select the Edit Distance (ED): 0-2
– display already approved words
– search only within the index (without showing word snippets)
• In this way users can play around and
– have influence on the recall of the system
– see the index (which is very helpful to get an impression of the OCR
errors)
– see what already has been done
Search interface: Features
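The effect of the ED setting on recall can be sketched in a few lines. This is a minimal Python illustration, not the Java/Lucene implementation described in the slides: it uses plain Levenshtein distance as a stand-in for Lucene's fuzzy search, and the toy index with its counts, as well as the function names, are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (unit cost for insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_index_lookup(index: Counter, term: str, max_ed: int) -> dict:
    """Return the index tokens within max_ed edits of term, with their counts."""
    return {
        token: count
        for token, count in index.items()
        if levenshtein(term, token) <= max_ed
    }


# Hypothetical toy index: token -> number of occurrences.
index = Counter({
    "Feuerwehr": 5, "Fenerwehr": 1, "Feuerwerk": 2, "Feuerwehren": 3, "Wasser": 4,
})

print(fuzzy_index_lookup(index, "Feuerwehr", 0))  # exact matches only
print(fuzzy_index_lookup(index, "Feuerwehr", 2))  # also OCR variants and near neighbours
```

Raising the distance pulls in OCR variants such as "Fenerwehr", but also other valid words such as "Feuerwerk", which is why the index view matters for judging what a fuzzy search actually retrieved.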
Result page: Features
• Users see the word snippets of their search
• Buttons
– Select all as „false“ or „correct“
• Red: A word snippet does not represent the correct text
• Green: A word snippet represents the correct text (match between search term and
word snippet)
– Deselect all
– Reverse selection
– Save
• Save
– Green word snippets: The text is either approved (if it is the same as in the
OCR text) or the wrong OCR text is replaced by the correct search term
– Red word snippets: Nothing is changed on the OCR text
Features
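The save rule for green and red snippets can be written down as a small sketch. The `Snippet` class, its `approved` flag and the `save` function are hypothetical names introduced for illustration only; the slides do not show the actual data model.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    ocr_text: str           # token as recognised by the OCR engine
    approved: bool = False  # hypothetical flag standing in for the correction status


def save(snippets: list[Snippet], search_term: str, green: set[int]) -> None:
    """Apply a save: green snippets are approved or corrected, red ones are left alone."""
    for position, snippet in enumerate(snippets):
        if position not in green:
            continue  # red: nothing is changed in the OCR text
        if snippet.ocr_text != search_term:
            snippet.ocr_text = search_term  # wrong OCR text replaced by the search term
        snippet.approved = True


snippets = [Snippet("Feuerwehr"), Snippet("Fenerwehr"), Snippet("Feuerwerk")]
save(snippets, "Feuerwehr", green={0, 1})
print(snippets)
```

In this example the first snippet is merely approved, the second is corrected and approved, and the third (a different valid word) stays untouched.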
Result page (2)
• Result sets (on the left hand side)
– 150 word snippets are currently shown in the standard view
– Can be parametrized
– Currently ordered by file path (other criteria could be word
confidence)
• Index (on the right hand side)
– All index terms are listed which are „behind“ a fuzzy search
– The number of occurrences is shown for this result set
– User gets an overview of „which tokens are behind these snippets“
– User is able to decide quickly which tokens are „real“ words
Additional features
• Improve precision
– Search with ED0
– All word snippets should display the search term
– Those which do not are classical OCR errors
– Snippets that are selected get the status „approved“
– Those which are errors are currently just deselected (and not marked
as false)
• Approvals are directly written into the ALTO file
– Correction status: true „approved“
Correction strategies (1)
Example 1: Search for „nelle“
OCR errors
neue nelle neue nelle
Select correct word images = green = approved
• Search for a word with ED1 or ED2
– The number of hits (and word snippets) increases significantly
– Sometimes more, sometimes less, depending very much on the
search string and the length of the string
• Strategy
– One may go through all word snippets and deselect wrong ones or
select correct ones → takes some time and is boring
• But
– Due to ED2 many other correct words are included in the result set
• Therefore another correction strategy may be more interesting
Correction strategy (2): Improve recall
• Recommended method
– Go for all tokens representing „real words“ which appear in the index
on the right hand side
– By clicking on a word of the index an ED0 search is triggered
– In many cases ED0 searches retrieve good results with just a few
OCR errors → approval is very simple and fast
• Once the „real words“ are done, the only word snippets that still
appear are those with „real“ OCR errors of the search term, which
are our real objective to correct
Correction strategies (3)
Example: Search for „Feuerwehr“ ED2
„Feuerwehr“ (fire brigade)
Feuenvehr Fenerwehr
Feuerwehr, Feuermeh
Feuerwehr- Feuerweh,
Feuerwehr. Feuerwerk
Feuerweh Feuerwehren
Feuerwehr-, Feuerwehr^
Feuerweh? Feuerwehr
Feuermehr Feueràhr
Feuerwert Feuerweihe
Fenerwchr
• Examples of erroneous
words in red
• These words are the „rest“
which appears after having
approved the „real“ words
(green)
• They will finally be replaced
by the correct word:
• In ALTO: correction status
true: substitute: Feuerwehr
Validating „real“ words from the index
• Those which were approved in the steps before are hidden from
the user.
– But users are able to see them if interested or if they want to do a
final check
– Overwriting is possible, status has to be changed
• Therefore the final correction screen now shows, instead of all 324
word snippets for “Feuerwehr” ED2, only those which were not
approved before.
Repeated search for „Feuerwehr“ ED2
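Hiding approved snippets in a repeated search amounts to a simple filter over the result set. The sketch below assumes a hypothetical record layout (a dict with `token` and `approved` keys) and a hypothetical `show_approved` switch for the final check; neither is taken from the actual tool.

```python
def pending_snippets(results: list[dict], show_approved: bool = False) -> list[dict]:
    """Filter a result set so that already approved snippets are hidden by default."""
    if show_approved:
        return list(results)  # final check: show everything again
    return [snippet for snippet in results if not snippet["approved"]]


# Hypothetical ED2 result set for "Feuerwehr" after the "real" words were approved.
results = [
    {"token": "Feuerwehr", "approved": True},
    {"token": "Feuerwehren", "approved": True},
    {"token": "Feuerwerk", "approved": True},
    {"token": "Fenerwehr", "approved": False},
    {"token": "Feuermehr", "approved": False},
]

print([snippet["token"] for snippet in pending_snippets(results)])
```

Only the two misrecognised tokens remain, which are exactly the candidates to be replaced by the search term.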
Finally the „real“ OCR errors are replaced by the
correct word
• Test set
– From the Europeana Newspaper Project
– 16,000 pages from the Tessmann Library, several millions are waiting to get
indexed
– METS/ALTO files
• Standard technology
– Java, JavaScript (Ajax), Lucene
• Images are cropped on the fly
– „Hardest“ task: takes some seconds on a 4-core machine
– First batch of 150 snippets is done immediately, second batch preprocessed in the
background
• A test set is available online
– http://dbis-faxe.uibk.ac.at/Website%202.0/CorrectionServlet
– Attention: Not a stable link!
Implementation
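The batching idea (serve the first 150 snippets at once, prepare the next batch in the background) could look roughly like this. It is a Python sketch under stated assumptions, not the authors' Java code: `crop_batch` is a hypothetical placeholder for the expensive image cropping, and a single background thread is assumed.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 150  # snippets per batch, as stated on the slide


def crop_batch(hits: list, start: int) -> list:
    """Hypothetical placeholder for cropping word images; returns one batch of hits."""
    return hits[start:start + BATCH_SIZE]


def iter_batches(hits: list):
    """Yield batches, preparing the next one in the background while one is shown."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(crop_batch, hits, 0)
        start = 0
        while start < len(hits):
            batch = future.result()  # wait for the batch prepared earlier
            start += BATCH_SIZE
            if start < len(hits):
                # Start cropping the following batch before handing this one out.
                future = pool.submit(crop_batch, hits, start)
            yield batch


for batch in iter_batches(list(range(324))):
    print(len(batch))
```

The caller only ever waits for the first batch; every later batch is already being prepared while the user works through the current one.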
• Our method provides the chance to improve precision and
recall of search terms in a rather quick and straightforward
way.
• Fuzzy search makes it possible to increase the recall of search terms
significantly and to „correct“ erroneous terms quickly
• No need to edit text: only typing a search term once and then
clicking on the index terms for new searches
• Snowball system since approved words are stored
permanently and are reused for the next correction sessions
as well
Conclusion
Evaluation
• Currently not enough data for providing good figures on the
evaluation of the tool – implementation in real world scenario will be
necessary
• But: Doan, A. et al. 2011. Crowdsourcing systems on the World-Wide
Web. Communications of the ACM.
• Four main criteria for crowd sourcing projects
(1) How to recruit and retain users?
(2) What contributions can users make?
(3) How to combine user contributions to solve the target problem?
(4) How to evaluate users and their contributions?
Evaluation
• Users are searching anyway!
• Those who are searching have a specific interest!
• Satisfaction will be higher if precision and especially recall are higher for
noisy OCR text
→ motivation should be there
• Power users of the archive may be willing to contribute a good deal of
their time to improve the full-text search
→ working power should be there
• Our tool piggybacks on the search interface and can be integrated in a
simple way (e.g. as an extra tab for a search which is performed anyway, so
users may try out what is behind it)
• Searching the index provides useful insights to the user
→ learning curve (get to know your full-text archive!)
(1) How to recruit and retain users?
• Contributions of users are
– Improve precision
– Improve recall by correcting OCR errors of search terms
– All these words are significant and meaningful to a user
• Only a small portion of words is interesting!
– Text contains a lot of words which are not meaningful or are very seldom
part of a search
– Austrian Newspapers Online: 50% of all full-text searches go for person
names, 20% for geo-names, only a small portion for keywords
– This means that the corrections/approvals done by the user with our method
are more valuable than correcting running text
– The whole number of corrected words may not be so high, but these should
be significant and relevant words
(2) What contributions can users make?
• Storage of contributions
– All contributions are stored in two ways:
• The Lucene index is immediately updated so that the next search already takes
benefit from approvals/corrections
• Approvals/corrections are directly stored in the OCR XML files (in this case ALTO):
Words are either marked as „correction status true“ („approved“) or the new
alternative of the word is included as well.
• Main benefit for the next user
– The next user will see which word snippets are already approved (are
shown in blue and gray) – in other words: The contributions are visible to
everyone though they are distributed among large amounts of text
– This should give users the feeling that someone has already worked in this
field as well
(3) How to combine user contributions to solve the
target problem?
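One possible way of recording an approval or correction in an ALTO file is sketched below. It assumes that the "correction status" on the slides corresponds to the `CS` attribute of an ALTO `String` element and that the corrected word is stored in an `ALTERNATIVE` child; the slides do not show the exact encoding, nor whether `CONTENT` itself is overwritten. The fragment omits the ALTO namespace for brevity, and the `WC` values are made up.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment with hypothetical word confidences.
ALTO_FRAGMENT = """
<TextLine>
  <String ID="w1" CONTENT="Fenerwehr" WC="0.61"/>
  <String ID="w2" CONTENT="Feuerwehr" WC="0.93"/>
</TextLine>
"""


def approve(string_el: ET.Element, correct_text: str) -> None:
    """Mark a String as manually checked; record the corrected word if it differs."""
    if string_el.get("CONTENT") != correct_text:
        # Assumption: the corrected word is added as an ALTERNATIVE child.
        alternative = ET.SubElement(string_el, "ALTERNATIVE")
        alternative.text = correct_text
    string_el.set("CS", "true")  # CS = correction status


line = ET.fromstring(ALTO_FRAGMENT)
# Copy the matches into a list first, since approve() may add child elements.
for string_el in list(line.iter("String")):
    approve(string_el, "Feuerwehr")

print(ET.tostring(line, encoding="unicode"))
```

Real ALTO files declare a version-specific namespace, so element lookups would need to be namespace-aware there.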
• Have not tackled this field so far
• Strategy could be
– Randomly select approved or corrected words and provide them to
other users for review
– If specific users provided too many errors a log file could be utilized
to reset the correction status within the ALTO files
(4) How to evaluate users and their contributions?
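Since this part is described as not yet tackled, the following is only a sketch of how the two proposed ideas (random review samples and a log-based reset) might fit together. The `LogEntry` layout, the in-memory `status` dict standing in for the ALTO correction status, and both function names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class LogEntry:
    user: str
    word_id: str
    old_status: bool  # correction status before the user's action


def sample_for_review(log: list[LogEntry], k: int, seed=None) -> list[LogEntry]:
    """Randomly pick up to k logged contributions to show to other users."""
    rng = random.Random(seed)
    return rng.sample(log, min(k, len(log)))


def rollback_user(log: list[LogEntry], status: dict[str, bool], user: str) -> None:
    """Reset the correction status of every word touched by the given user."""
    for entry in reversed(log):  # undo the most recent actions first
        if entry.user == user:
            status[entry.word_id] = entry.old_status


log = [
    LogEntry("anna", "w1", False),
    LogEntry("ben", "w2", False),
    LogEntry("ben", "w3", False),
]
status = {"w1": True, "w2": True, "w3": True}

review = sample_for_review(log, k=2, seed=0)
rollback_user(log, status, "ben")
print(status)
```

After the rollback only the word approved by "anna" keeps its status; the two words touched by "ben" are back to unapproved.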
Future work
• Improve user interface
– Allow marking word snippets also as „false“
• Release as Open Source package
– Will be done during 2014
– Java, Ajax, Lucene: only open-source components
• Implementation of the tool in a real world scenario
• Include an edit distance that is more meaningful for OCR errors than the
fuzzy search of Lucene
– E.g. larger ED than 2, but based on typical OCR problems (c-e, etc.)
• Use the data for machine learning
– For all word snippets metadata such as title of the publication, size of the print,
language, date of printing, etc. is available
– Use it to discriminate „hard“ cases by asking users to go for specific sets (which are
selected automatically)
Further work and improvements
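An OCR-aware edit distance of the kind proposed here could make substitutions between visually similar characters cheaper than other edits. The sketch below is one possible formulation, not a planned design: only the c-e pair is named on the slide, while the other pairs and the cost of 0.25 are assumptions chosen for illustration.

```python
# Hypothetical confusion pairs and cost; real values would be learned from corrections.
CONFUSABLE = {frozenset("ce"), frozenset("un"), frozenset("li")}
CONFUSION_COST = 0.25


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a by b: cheap for typical OCR confusions."""
    if a == b:
        return 0.0
    return CONFUSION_COST if frozenset((a, b)) in CONFUSABLE else 1.0


def ocr_distance(source: str, target: str) -> float:
    """Weighted Levenshtein distance with reduced cost for confusable characters."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, char_s in enumerate(source, start=1):
        current = [float(i)]
        for j, char_t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,                                   # deletion
                current[j - 1] + 1.0,                                # insertion
                previous[j - 1] + substitution_cost(char_s, char_t),  # substitution
            ))
        previous = current
    return previous[-1]


print(ocr_distance("Fenerwchr", "Feuerwehr"))  # two typical confusions (n-u, c-e)
print(ocr_distance("Feuerwerk", "Feuerwehr"))  # a different valid word
```

Under plain edit distance both tokens are two edits away from "Feuerwehr". With the weighted variant the OCR error scores much lower than the valid neighbour, so a threshold could admit errors with several typical confusions while keeping other real words out.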
Thank you for your attention!
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 

Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

  • 1. User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology
      Günter Mühlberger, Johannes Zelger, David Sagmeister, Albert Greinöcker
      Universität Innsbruck / Höhere Technische Bundeslehranstalt Anichstraße - Innsbruck
      Universität Innsbruck, Christoph-Probst-Platz, Innrain 52, 6020 Innsbruck, http://info.uibk.ac.at
  • 2. Agenda
      • Introduction
      • Crowdsourcing approaches for OCR correction
      • Our approach
      • Evaluation
      • Future work
      Günter Mühlberger | Universitätsbibliothek Innsbruck | Abt. f. Digitalisierung & elektr. Archivierung | DATech 2014 - Madrid
  • 3. Introduction
  • 4. Digitisation and OCR quality
      • Digitisation of historical printed material
          – Google: billions of files; libraries: millions of files
          – Still hard to get access to these files
      • OCR quality
          – There is little reliable data on OCR accuracy for large-scale datasets
          – E.g. we do not know “how good the Google collection” is as a whole, or per language, per century, decade or year, per text type, etc.
      • Tanner (2009)
          – Evaluated OCR accuracy on British newspapers
          – Differences between newspapers are larger than differences between publishing dates
          – Overall the Word Error Rate (WER) ranges from 10% to 40%, with an average of 22% WER for standard words and 31% for significant words
          – Evaluation done within the IMPACT project has shown similar figures
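The Word Error Rate figures quoted above can be made concrete with a small sketch. It uses one common definition of WER (word-level edit distance divided by the number of reference words); this is an assumption, since the slides do not say how Tanner (2009) computed the figures, and the example sentence is invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing in the OCR output
                            curr[j - 1] + 1,      # extra word in the OCR output
                            prev[j - 1] + cost))  # word kept or substituted
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one of six words is misrecognised.
ground_truth = "Die Feuerwehr war rasch zur Stelle"
ocr_output = "Die Feuenvehr war rasch zur Stelle"
wer = word_error_rate(ground_truth, ocr_output)
```

With a WER in the range quoted on the slide, roughly one word in five would differ from the ground truth in this sense.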
  • 5. End-users and OCR quality
      • What does this mean for the end-user?
          – End-users are either searching a collection or reading an interesting item (which they may have found by searching).
          – For reading a page or book they have the original image, so the full text is much less important for them
      • If we take the figures from above:
          – End-users will miss e.g. 20% or 30% of all occurrences of a search term that would be interesting for them, simply because the OCR is wrong.
      • This may be acceptable to occasional users, but surely not to humanities researchers or family historians: they want to get “all relevant occurrences”
          – What is “relevant” is decided by the user; some may be interested only in a specific time period, periodical, or collection of documents
          – Note: not all words are frequent in all collections (“London” is rare in a Tyrolean newspaper collection, whereas it is frequent in a British newspaper collection)
  • 6. Crowdsourcing for OCR
  • 7. Approaches
      • OCR as an “ideal” field for crowdsourcing
          – Simple to realize: provide a link between image and text and let the user correct it
      • Three (and a half) main approaches
          – reCAPTCHA
          – Australian National Library (Newspaper Digitization Project)
          – National Library of Finland (gamification)
          – IBM: CONCERT (Collaborative Correction Platform)
  • 8. reCAPTCHA
  • 9. Australian National Library
  • 10. Australian National Library
  • 11. National Library of Finland: Digitalkoot
  • 12. IBM CONCERT (COoperative eNgine for Correction of ExtRacted Text)
  • 13. Conclusion
      • OCR correction with the support of the crowd does work (but not always)!
      • In the case of reCAPTCHA and Digitalkoot, users have no influence on what they correct (de-motivating)
          – reCAPTCHA is successful due to the sheer number of interactions
      • A user-specific benefit is provided mainly by the approach of the Australian National Library
          – The user reads the text carefully when editing
          – Finds corrected words immediately after submitting the corrected text
          – Can decide what to correct
      • Power users vs. crowd users
          – A very small segment of all users carries out the actual work
          – Australia: the top 6 users corrected about 25% of the texts
          – Transcribe Bentham project: the top 7 users produced 70% of all transcripts
  • 14. Proposed approach
  • 15. Searching AND correcting
      • Let's combine searching and crowd-based correction!
      • Provide users with a powerful instrument to correct exactly those words they are interested in (i.e. searching for)
      • Relieve users of actually editing words; let them just approve or reject the results of the OCR engine
  • 16. Search interface
  • 17. Search interface: Features
      • The user has the chance to
          – select the Edit Distance (ED): 0-2
          – display already approved words
          – search only within the index (without showing word snippets)
      • In this way users can play around and
          – influence the recall of the system
          – see the index (which is very helpful to get an impression of the OCR errors)
          – see what has already been done
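The effect of the ED setting on recall can be illustrated with a minimal sketch. It is not the authors' Java/Lucene implementation: the index is a plain in-memory counter, the distance is ordinary Levenshtein distance, and the term counts are invented. The cap of two edits mirrors the 0-2 range offered in the interface.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def fuzzy_terms(index: Counter, query: str, max_ed: int) -> dict:
    """Return all index terms within max_ed edits of the query, with their counts."""
    if not 0 <= max_ed <= 2:
        raise ValueError("edit distance must be 0, 1 or 2")
    return {term: n for term, n in index.items()
            if levenshtein(query, term) <= max_ed}


# Hypothetical index terms and occurrence counts.
index = Counter({"Feuerwehr": 210, "Fenerwehr": 2, "Feuenvehr": 3,
                 "Feuerwerk": 12, "Feuer": 40})

ed0 = fuzzy_terms(index, "Feuerwehr", 0)
ed1 = fuzzy_terms(index, "Feuerwehr", 1)
ed2 = fuzzy_terms(index, "Feuerwehr", 2)
```

Raising the distance pulls in OCR variants such as "Fenerwehr" and "Feuenvehr", but also legitimate neighbours such as "Feuerwerk", which is why the interface lets the user inspect the index terms behind a fuzzy search.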
  • 18. Result page: Features
  • 19. Features
      • Users see the word snippets of their search
      • Buttons
          – Select all as “false” or “correct”
              • Red: the word snippet does not represent the correct text
              • Green: the word snippet represents the correct text (match between search term and word snippet)
          – Deselect all
          – Reverse selection
          – Save
      • Save
          – Green word snippets: the text is either approved (if it is the same as the OCR text) or the wrong OCR text is replaced by the correct search term
          – Red word snippets: nothing is changed in the OCR text
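The Save rule described above can be sketched as follows. The `Snippet` class, its field names, and the status labels are hypothetical and chosen only for illustration; the slides do not show the tool's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snippet:
    ocr_text: str                         # token as produced by the OCR engine
    status: str = "open"                  # "open", "approved" or "corrected"
    corrected_text: Optional[str] = None  # set only when the OCR text was wrong


def save(snippets, green_ids, search_term):
    """Apply the user's selection: green snippets are approved or corrected,
    red (unselected) snippets are left untouched."""
    for i, snippet in enumerate(snippets):
        if i not in green_ids:
            continue  # red: nothing is changed in the OCR text
        if snippet.ocr_text == search_term:
            snippet.status = "approved"
        else:
            snippet.status = "corrected"
            snippet.corrected_text = search_term


# The user searched for "nelle" and marked snippets 0 and 2 as green.
snippets = [Snippet("nelle"), Snippet("neue"), Snippet("nclle")]
save(snippets, green_ids={0, 2}, search_term="nelle")
```

The user never types a correction: the search term itself serves as the replacement text for every green snippet whose OCR text differs from it.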
  • 20. Result page (2)
  • 21. Additional features
      • Result sets (on the left-hand side)
          – 150 word snippets are currently shown in the standard view
          – The number can be parametrized
          – Currently ordered by file path (another criterion could be word confidence)
      • Index (on the right-hand side)
          – All index terms “behind” a fuzzy search are listed
          – The number of occurrences is shown for this result set
          – The user gets an overview of “which tokens are behind these snippets”
          – The user is able to decide quickly which tokens are “real” words
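A minimal sketch of how the two panels could be derived from one list of hits, assuming each hit is a dictionary with hypothetical `token` and `path` keys. The default batch size of 150 and the ordering by file path come from the slide; everything else is illustrative.

```python
from collections import Counter


def build_result_page(hits, batch_size=150):
    """Return the first batch of snippets (ordered by file path) and the
    index panel: every token in the result set with its occurrence count."""
    ordered = sorted(hits, key=lambda hit: hit["path"])
    batch = ordered[:batch_size]
    index_panel = Counter(hit["token"] for hit in hits).most_common()
    return batch, index_panel


# Hypothetical hits of a fuzzy search.
hits = [
    {"token": "Feuerwehr", "path": "b/0002.xml"},
    {"token": "Feuenvehr", "path": "a/0001.xml"},
    {"token": "Feuerwehr", "path": "a/0003.xml"},
]
batch, index_panel = build_result_page(hits, batch_size=2)
```

Sorting by a word-confidence value instead of the path would only require a different `key` function.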
  • 22. Correction strategies (1)
      • Improve precision
          – Search with ED0
          – All word snippets should display the search term
          – Those which do not are classical OCR errors
          – Snippets that are selected get the status “approved”
          – Those which are errors are currently just deselected (and not marked as false)
      • Approvals are written directly into the ALTO file
          – Correction status: true (“approved”)
  • 23. Example 1: Search for “nelle”
  • 24. OCR errors
      [Figure: word snippets labelled “neue” and “nelle”]
  • 25. Select correct word images = green = approved
  • 26. Correction strategy (2): Improve recall
      • Search for a word with ED1 or ED2
          – The number of hits (and word snippets) increases significantly
          – Sometimes more, sometimes less, depending very much on the search string and its length
      • Strategy
          – One may go through all word snippets and deselect wrong ones or select correct ones → takes some time and is boring
      • But due to ED2 many other correct words are included in the result set
      • Therefore another correction strategy may be more interesting
  • 27. Correction strategies (3)
      • Recommended method
          – Go through all tokens representing “real words” which appear in the index on the right-hand side
          – Clicking on a word in the index triggers an ED0 search
          – In many cases ED0 searches retrieve good results with just a few OCR errors → approval is very simple and fast
      • Once the “real words” are done, only word snippets with “real” OCR errors of the search term remain, and correcting these is the actual objective
  • 28. Example: Search for “Feuerwehr”, ED2
  • 29. “Feuerwehr” (fire brigade)
      Tokens shown: Feuenvehr Fenerwehr Feuerwehr, Feuermeh Feuerwehr- Feuerweh, Feuerwehr. Feuerwerk Feuerweh Feuerwehren Feuerwehr-, Feuerwehr^ Feuerweh? Feuerwehr Feuermehr Feueràhr Feuerwert Feuerweihe Fenerwchr
      • Examples of erroneous words are shown in red
      • These words are the “rest” which remains after the “real” words have been approved (green)
      • They will finally be replaced by the correct word
      • In ALTO: correction status true: substitute: Feuerwehr
  • 30. Validating “real” words from the index
  • 31. Repeated search for “Feuerwehr”, ED2
      • Snippets which were approved in the steps before are hidden from the user.
          – But users are able to see them if interested or if they want to do a final check
          – Overwriting is possible; the status has to be changed
      • Therefore, instead of all 324 word snippets for “Feuerwehr” (ED2), the final correction screen now shows only those which were not approved before.
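The hiding of already approved snippets, which makes the recommended "real words first" workflow pay off, might look like this. The dictionary layout and status values are assumptions for illustration, and the four snippets stand in for the 324 mentioned on the slide.

```python
def visible_snippets(snippets, show_approved=False):
    """Return the snippets to display for a repeated search.

    Approved snippets are hidden by default, but can be shown again
    for a final check."""
    if show_approved:
        return list(snippets)
    return [s for s in snippets if s["status"] != "approved"]


# Hypothetical ED2 result for "Feuerwehr" after the "real" words
# were approved through ED0 searches from the index panel.
snippets = [
    {"ocr": "Feuerwehr", "status": "approved"},
    {"ocr": "Feuerwerk", "status": "approved"},
    {"ocr": "Feuenvehr", "status": "open"},
    {"ocr": "Fenerwchr", "status": "open"},
]
todo = visible_snippets(snippets)
```

What is left in `todo` are the candidates for genuine OCR errors of the search term.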
  • 32. Finally the “real” OCR errors are replaced by the correct word
  • 33. Implementation
      • Test set
          – From the Europeana Newspaper Project
          – 16,000 pages from the Tessmann Library; several million more are waiting to be indexed
          – METS/ALTO files
      • Standard technology
          – Java, JavaScript (Ajax), Lucene
      • Images are cropped on the fly
          – “Hardest” task: takes some seconds on a 4-core machine
          – The first batch of 150 snippets is produced immediately; the second batch is preprocessed in the background
      • A test set is available online
          – http://dbis-faxe.uibk.ac.at/Website%202.0/CorrectionServlet
          – Attention: not a stable link!
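The batching idea (serve one batch of snippets while the next one is being cropped) can be sketched in a few lines. This is only an illustration of the pattern in Python, not the authors' Java code: `crop_snippet` is a placeholder that returns a label instead of cropping an image, and the hit layout is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def crop_snippet(hit):
    """Placeholder for the real image crop; returns a label instead of pixels."""
    return f"{hit['path']}@{hit['box']}"


def crop_batch(batch):
    return [crop_snippet(hit) for hit in batch]


def serve_batches(hits, size=150):
    """Yield cropped batches one by one. While the caller works on one batch,
    the following batch is already being cropped in a background thread."""
    chunks = [hits[i:i + size] for i in range(0, len(hits), size)]
    if not chunks:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(crop_batch, chunks[0])
        for next_chunk in chunks[1:]:
            current = future.result()                  # wait for the batch in progress
            future = pool.submit(crop_batch, next_chunk)  # start the next one right away
            yield current
        yield future.result()


# Five hypothetical hits, served in batches of two.
hits = [{"path": f"p{i}.xml", "box": (i, 0, 10, 10)} for i in range(5)]
result = list(serve_batches(hits, size=2))
```

The user perceives only the cropping time of the first batch, since later batches are usually ready by the time they are requested.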
  • 34. Conclusion
      • Our method provides the chance to improve precision and recall of search terms in a rather quick and straightforward way.
      • Fuzzy search makes it possible to increase the recall of search terms significantly and to “correct” erroneous terms quickly
      • No need to edit text: the user types a search term once and then clicks on the index terms for new searches
      • Snowball system, since approved words are stored permanently and are reused in subsequent correction sessions as well
  • 35. Evaluation
  • 36. Evaluation
      • Currently there is not enough data to provide good figures on the evaluation of the tool; implementation in a real-world scenario will be necessary
      • But: Doan, A. et al. 2011. Crowdsourcing systems on the World-Wide Web. Communications of the ACM.
      • Four main criteria for crowdsourcing projects
          (1) How to recruit and retain users?
          (2) What contributions can users make?
          (3) How to combine user contributions to solve the target problem?
          (4) How to evaluate users and their contributions?
  • 37. (1) How to recruit and retain users?
      • Users are searching anyway!
      • Those who are searching have a specific interest!
      • Satisfaction will be higher if precision and especially recall are higher for noisy OCR text → motivation should be there
      • Power users of the archive may be willing to contribute a good deal of their time to improve the full-text search → working power should be there
      • Our tool piggybacks on the search interface and can be integrated in a simple way (e.g. as an extra tab next to the search that users perform anyway, so they may try out what is behind it)
      • Searching the index provides useful insights to the user → learning curve (get to know your full-text archive!)
  • 38. (2) What contributions can users make?
      • Contributions of users are
          – Improve precision
          – Improve recall by correcting OCR errors of search terms
          – All these words are significant and meaningful to a user
      • Only a small portion of words is interesting!
          – Text contains a lot of words which are not meaningful or are very rarely part of a search
          – Austrian Newspapers Online: 50% of all full-text searches are for person names, 20% for geographic names, and only a small portion for keywords
          – This means that the corrections/approvals made by users with our method are more valuable than corrections of running text
          – The total number of corrected words may not be so high, but these should be significant and relevant words
  • 39. (3) How to combine user contributions to solve the target problem?
      • Storage of contributions
          – All contributions are stored in two ways:
              • The Lucene index is updated immediately, so that the next search already benefits from approvals/corrections
              • Approvals/corrections are stored directly in the OCR XML files (in this case ALTO): words are either marked as “correction status true” (“approved”), or the new alternative of the word is included as well.
      • Main benefit for the next user
          – The next user will see which word snippets are already approved (they are shown in blue and gray); in other words, the contributions are visible to everyone even though they are distributed across large amounts of text
          – This should give users the feeling that someone else has already worked in this field as well
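How an approval or a correction might be recorded in an ALTO file can be sketched with Python's standard XML library. The fragment below is a simplified, namespace-free stand-in for a real ALTO document, and the use of a `CS` attribute on `String` plus an `ALTERNATIVE` child element is an assumption about the encoding; the slides only state that a correction status and the new alternative are stored.

```python
import xml.etree.ElementTree as ET

# Simplified ALTO-like fragment (real files carry a namespace and coordinates).
ALTO = """<alto><Layout><Page><PrintSpace><TextBlock><TextLine>
<String ID="w1" CONTENT="Feuerwehr"/>
<String ID="w2" CONTENT="Feuenvehr"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""


def apply_decision(root, string_id, search_term):
    """Mark a word as reviewed; if the OCR text differs from the search term,
    store the search term as an alternative reading (assumed encoding)."""
    for element in root.iter("String"):
        if element.get("ID") == string_id:
            element.set("CS", "true")  # correction status: true
            if element.get("CONTENT") != search_term:
                alternative = ET.SubElement(element, "ALTERNATIVE")
                alternative.text = search_term
            return element
    raise KeyError(string_id)


root = ET.fromstring(ALTO)
approved = apply_decision(root, "w1", "Feuerwehr")   # OCR text was right
corrected = apply_decision(root, "w2", "Feuerwehr")  # OCR text was wrong
```

Keeping the original `CONTENT` value next to the alternative means a wrong correction can later be undone without consulting the image again.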
  • 40. (4) How to evaluate users and their contributions?
      • This field has not been tackled so far
      • A strategy could be
          – Randomly select approved or corrected words and present them to other users for review
          – If specific users produced too many errors, a log file could be used to reset the correction status within the ALTO files
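The review strategy outlined above has not been implemented by the authors, so the following is purely a sketch of one way it could work. The log layout, the function names, and the 20% error threshold are all hypothetical.

```python
import random
from collections import Counter


def sample_for_review(log, k, seed=None):
    """Randomly pick up to k logged approvals/corrections for a second opinion."""
    rng = random.Random(seed)
    return rng.sample(log, min(k, len(log)))


def unreliable_users(log, reviews, max_error_rate=0.2):
    """reviews maps a log entry id to True (confirmed) or False (rejected)."""
    checked, wrong = Counter(), Counter()
    for entry in log:
        verdict = reviews.get(entry["id"])
        if verdict is None:
            continue  # this entry was not reviewed
        checked[entry["user"]] += 1
        if not verdict:
            wrong[entry["user"]] += 1
    return {user for user in checked
            if wrong[user] / checked[user] > max_error_rate}


def entries_to_reset(log, bad_users):
    """Ids of all contributions whose correction status should be reset."""
    return [entry["id"] for entry in log if entry["user"] in bad_users]


# Hypothetical log and review verdicts.
log = [
    {"id": 1, "user": "anna"}, {"id": 2, "user": "anna"},
    {"id": 3, "user": "ben"}, {"id": 4, "user": "ben"},
]
reviews = {1: True, 2: True, 3: False, 4: True}
bad = unreliable_users(log, reviews)
reset_ids = entries_to_reset(log, bad)
```

A real deployment would probably require a minimum number of reviewed entries per user before resetting anything.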
  • 41. Future work
  • 42. Further work and improvements
      • Improve the user interface
          – Allow users to mark word snippets also as “false”
      • Release as an Open Source package
          – Will be done during 2014
          – Java, Ajax, Lucene: only open-source components
      • Implementation of the tool in a real-world scenario
      • Include an edit distance that is more meaningful for OCR errors than the fuzzy search of Lucene
          – E.g. a larger ED than 2, but based on typical OCR problems (c-e, etc.)
      • Use the data for machine learning
          – For all word snippets, metadata such as title of the publication, size of the print, language, date of printing, etc. is available
          – Use it to discriminate “hard” cases by asking users to work on specific sets (which are selected automatically)
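An OCR-aware edit distance of the kind proposed above could be approximated by a weighted Levenshtein distance in which typical confusions cost less than arbitrary substitutions. In this sketch only the c-e pair comes from the slide; the u-n pair and the cost of 0.3 are illustrative assumptions, and multi-character confusions (such as "rw" read as "nv") are not handled.

```python
# Character pairs that OCR engines tend to confuse, with a reduced cost.
# c-e is named on the slide; u-n and the value 0.3 are assumptions.
CONFUSIONS = {frozenset("ce"): 0.3, frozenset("un"): 0.3}


def ocr_distance(a: str, b: str, confusions=CONFUSIONS) -> float:
    """Levenshtein distance where known OCR confusions are cheap substitutions."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                substitution = 0.0
            else:
                substitution = confusions.get(frozenset((ca, cb)), 1.0)
            curr.append(min(prev[j] + 1.0,              # deletion
                            curr[j - 1] + 1.0,          # insertion
                            prev[j - 1] + substitution))
        prev = curr
    return prev[-1]


likely_error = ocr_distance("Feuerwehr", "Fenerwchr")  # u-n and e-c confusions
real_word = ocr_distance("Feuerwehr", "Feuerwerk")     # a different word
```

Under plain Levenshtein distance both tokens are two edits away from "Feuerwehr"; with the weighted variant the probable OCR error scores well below the genuinely different word, so a threshold could separate them.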
  • 43. Thank you for your attention!