Simon Tanner
Blog: simon-tanner.blogspot.co.uk
Twitter: @SimonTanner
www.slideshare.net/KDCS/
King’s Digital Consultancy Services, www.digitalconsultancy.net
Digitisation Doctor Optical Character Recognition
Optical Character Recognition guidance and advice

Published in: Technology, Business
Transcript of "Digitisation Doctor Optical Character Recognition"

  1. Simon Tanner. Blog: simon-tanner.blogspot.co.uk. Twitter: @SimonTanner. www.slideshare.net/KDCS/
  2. King’s Digital Consultancy Services, www.digitalconsultancy.net
  3. Deciding whether Optical Character Recognition is feasible (PDF document), created for the Oxford University Digital Library:
       www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf
     Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive:
       www.dlib.org/dlib/july09/munoz/07munoz.html
     www.impact-project.eu
  4. Uniformity
     Language
     Text alignment
     Complexity of alignment
     Lines, graphics and pictures
     Handwriting
  5. Evaluating OCR accuracy is about more than just character-to-character accuracy rates
       • Character accuracy rates are misleading (more later…)
     It is also about assessing the functionality enabled through the OCR’s output
       • Search accuracy
       • Volume of hits returned
       • Ability to structure searches and results
       • Accuracy of result ranking
       • Amount of correction required to achieve the required performance
  6. Consider this scenario:
       • 1,000 words with 5,000 characters (an average of 5 per word), excluding spaces
       • 90% character accuracy means:
           4,500 characters correct
           Possibly a maximum of 900 words correct (90%)
           Possibly a minimum of 500 words correct (50%)
           Reality is somewhere in between
       • Depending on the number of “significant words”, the search results could still be almost 100% or near zero
  7. [Chart: y-axis 50 to 100; x-axis 1801 to 1900 by decade. Series: characters, words, words with capital letter start, significant words, each with a polynomial trend line (Poly.).]
  8. OCR Results
       OCR Engine    % characters correct    % words correct    No. of corrections
       FineReader    91.1                    70.9               110
       PrimeOCR      93.95                   79.1               79
       Total number of characters = 2109
       Total number of words = 379
       I am petfood, God toil! uttedy-toverthroW, at feaft; $gy abafe Mens affections tp; and seal for all Party-making Notions amdngft CfiriftiansybefGieirie will raife his,Church to that prof- perous, flourilhing State prophefied of, and prOmifed in the Scrip* tures. There mult be more Love, and Charity, and Unanimity amongft Chriftians,.
  9. OCR Results
       OCR Engine    % characters correct    % words correct    No. of corrections
       FineReader    73.7                    57.5               31
       PrimeOCR      75.9                    62.37              28
       Total number of characters = 411
       Total number of words = 73
       A THEATRE erein be reprc-fented as wel the miferies & calamities tijat foiioto tht too* e^jr alfo the greate toyts and plefures tobtcf) tbe fatrfc faltooenio^ An Argument both profitable and dele&able, to all that finccrcly loue the word of Codt. *Deuifedby S. hhnv&n~ derlS^oodt. s 3^ Scene and allowed according to the order appointed. , ^ Imprinted at London by Henry Bynncman* Anno Domini. CVM PHIT
  10. Simon Tanner. Blog: simon-tanner.blogspot.co.uk. Twitter: @SimonTanner
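
The scenario on slide 6 (1,000 words, 5,000 characters, 90% character accuracy) can be checked with a few lines of arithmetic. The sketch below is not from the slides: the function name word_accuracy_bounds and its parameters are hypothetical, and it assumes every word has the same length and that a word counts as wrong as soon as one of its characters is wrong.

```python
import math


def word_accuracy_bounds(n_words, chars_per_word, char_accuracy):
    """Return (min_correct_words, max_correct_words) for a given character accuracy.

    Hypothetical helper for illustration only. Assumes uniform word length
    and that a single wrong character makes the whole word wrong.
    """
    total_chars = n_words * chars_per_word
    # round() guards against floating-point noise such as 499.9999999999999
    wrong_chars = round(total_chars * (1 - char_accuracy))

    # Best case: errors cluster together and fill whole words.
    fewest_wrong_words = math.ceil(wrong_chars / chars_per_word)
    # Worst case: every error lands in a different word.
    most_wrong_words = min(n_words, wrong_chars)

    return n_words - most_wrong_words, n_words - fewest_wrong_words


# The scenario from the slide: 1,000 words, 5 characters each, 90% accuracy.
low, high = word_accuracy_bounds(1000, 5, 0.90)
print(f"between {low} and {high} words correct")
```

With the slide's figures this gives a range of 500 to 900 correct words, matching the minimum and maximum stated there. The width of that range is why a character accuracy rate alone says little about how well searches over the text will work.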