20140408 digital newspapers collections [idlc kuala lumpur]
1. digital newspaper
collections:
if you build one, who will
visit?
Frederick Zarndt
IFLA Newspapers Section
frederick@frederickzarndt.com
@cowboyMontana
hashtag #IFLAnewspaper
3. why digitize newspapers?
“News is only the first rough
draft of history.”
Alan Barth writing for 1943
Washington Post
Wikipedia contributors, “Alan Barth,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Alan_Barth (accessed March 2014).
5. • newspapers are deteriorating
• microfilm is dissolving
• no storage space or space is too expensive
9. the principal reason to digitize newspapers
is to provide non-destructive, universal
access to newspapers for as many users as
possible
10. Photo by DAVID ILIFF. License: CC-BY-SA 3.0
reading rooms by the numbers*
Monthly average
                              Visitors       Requests for newspapers
Country       Population      Reading room   Microform   Print
Australia     22,876,000      5,130          345         240
France        65,350,000      3,000          2,000       1,000
Netherlands   16,847,000      NA             NA          NA
New Zealand   4,414,000       NA             NA          NA
Norway        4,985,000       600            400         NA
Singapore     5,184,000       NA             300         NA
UK            62,262,000      2,000          6,900       4,816
USA           313,292,000     NA             NA          NA
*numbers from 2012
11. physical versus digital
monthly averages 2012
                              requests for newspapers   digitised historical newspapers
country       population      paper + microform         unique visitors
Australia     22,876,000      585                       150,000
(not named)   37,692,000      NA                        12,800
(not named)   5,405,000       NA                        NA
France        65,350,000      3,000                     22,000
Netherlands   16,847,000      NA                        50,000
New Zealand   4,414,000       NA                        83,333
Norway        4,985,000       400                       1,500
Singapore     5,184,000       300                       12,400
UK            62,262,000      11,716                    NA
USA           313,292,000     NA                        NA
15. national: a single (national) library which
funds and manages a national newspapers
digitization program.
• Papers Past, National Library of New
Zealand
• Newspaper SG, National Library of
Singapore
• Historiallinen Sanomalehtikirjasto,
National Library of Finland
• and others …
programs
16. national: centrally funded and centrally
managed program with several participants.
strict standards for participants.
• National Digital Newspaper Program
(Library of Congress)
• Australian Newspaper Digitisation
Program
programs
17. cooperative: organizations collaborate to
achieve a common goal but digitization
programs are managed separately. flexible
standards.
• Europeana newspapers
• Digital Public Library of America
programs
18. individual: organization digitizes on its own.
may, but more usually does not, follow open
standards. all commercial organizations.
• ProQuest Historical Newspapers
• Newspapers.com
• Newsbank
• many others…
programs
19. • the design of a digitization
program requires careful thought
and must be adapted to local
circumstances
• determine principal or targeted
user demographic and use cases
• ask those who have gone before
• join the IFLA Newspapers
Section! (ask me how)
programs
Image courtesy of Donald Zolan.
21. as of Mar 2014
library                                   collection                     ~size (pages)   dates
National Library of Australia             Trove                          12,668,000      1803-1995
California Digital Newspaper Collection   CDNC                           545,000         1846-2012
National Library of Finland               Historical Newspaper Library   3,006,000       1771-1919
Bibliotheque nationale de France          Gallica                        2,200,000       1293-2000
Koninklijke Bibliotheek                   Historische Kranten            9,000,000       1618-1995
National Library of New Zealand           Papers Past                    3,109,000       1839-1945
National Library of Norway                NBDigital Aviser               12,000,000      1763-2012
Singapore National Library                Newspaper SG                   2,400,000       1831-2009
British Library                           British Newspaper Archive      7,598,000       1710-1954
Library of Congress                       Chronicling America            7,293,000       1836-1922
digital historic newspaper collections
32. Newspaper collection
user survey
• California Digital Newspaper Collection and
Cambridge Public Library published a user
survey in Mar 2013
• 604 (CDNC) / 32 (Cambridge) responses
• surveys are (mostly) identical except for
organization name
37. • 72% visit UDN for genealogical research
• 20% visit for various other types of historical research
• 87% find obituaries useful
• Over 60% find the other genealogical article types (birth
and wedding announcements) useful
• Only 7% do not find genealogical articles useful
• Many are writing family histories and consequently also
look for general background information
• Older content is much more highly valued than more recent
content (see more detailed explanation that follows)
• 44% find smaller, rural papers more useful, while only 15%
find larger, metropolitan papers more useful
Utah Digital Newspapers:
2012 user survey
John Herbert and Randy Olsen. Small town papers: still delivering the news.
WLIC 2012, Helsinki, Finland. http://conference.ifla.org/past-wlic/2012/119-herbert-en.pdf
38. “The ‘typical’ Trove user is a very well educated,
highly paid, English speaking employed woman
aged fifty or over, with a significant or primary
interest in family or local history, who visits the
Trove website very frequently. Users of Trove
newspapers are older than the average Trove
user; only 13% of newspaper users are under 40
years of age.”
Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian
newspapers, and the crowd. WLIC 2013, Singapore.
http://library.ifla.org/245/1/153-ayres-en.pdf
Engaged users: who are they?
39. “Many of Trove’s user engagement features are
very popular. More than 100,000 users have
registered to date, and more than 2 million tags
and nearly 60,000 comments had been added…
[Trove] text correction, however, stands head and
shoulders above any other user engagement
features.”
Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian
newspapers, and the crowd. WLIC 2013, Singapore.
http://library.ifla.org/245/1/153-ayres-en.pdf
Engaged users: who are they?
40. Crowdsourcing is the practice of obtaining
needed services, ideas, or content by soliciting
contributions from a large group of people,
and especially from an online community,
rather than from traditional employees or
suppliers. ... [It] is different from ordinary
outsourcing since it is a task or problem that is
outsourced to an undefined public rather than
a specific, named group.
Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed March 17, 2013)
42. Deaths. lln»rieff, Esq. of <c .. Qn.
Sunday, the till. greatly Drandrellt, of
Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn
l j j j i l F i i j ' 1 1 f H a v o d i v y d ,
Carnarvonshire, S ; **" *- ' « ' March
Oxford, F. Tfovmeud, Uerald. » • V .
•On Tncsdav last, Mr. Charles.
IWilinson, this 8 ; had vf thesis#,, a week
ago, which tcrminate<i'iu his death. . / ' ■
O'i Sunday, dJst nit. at. AsbtCnvHall,
mar Lancaster, Mr.,Geo. Worn ick,
many years house'steward hit late Once
The Hamilton and Brandon. He locked
himself h»oWn'r«wte<: soon. twelve
o'clock" that dny, and fii»-d a loaded pistol
" t h r o u g h I n s b e a d , 1 w h i c h
instantaneously killed him. Coronet's
Verdict, shot himself in a temporary fit of
Friday week,
raw OCR text
Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
newspaper image
43. Accuracy
• Edwin Klijn (Koninklijke Bibliotheek, the Netherlands)
reports raw OCR character accuracies of 68% for early 20th
century newspapers
• Rose Holley (National Library of Australia) reports that raw
OCR character accuracy varied from 71% to 98% on a sample
of Trove digitized newspapers
Rose Holley. How good can it get? Analysing and improving OCR accuracy
in large scale historic newspaper digitisation programs. D-Lib Magazine.
March/April 2009.
Edwin Klijn. The current state-of-art in newspaper digitization. D-Lib
Magazine. January/February 2008.
44. uncorrected OCR accuracy by
newspaper title
title                                                               OCR character accuracy   ~OCR word accuracy*
PRP  Pacific Rural Press 1871 - 1922                                92.6%                    68.1%
SFC  San Francisco Call 1890 - 1913                                 92.6%                    68.1%
LAH  Los Angeles Herald 1873 - 1910                                 88.7%                    54.9%
LH   Livermore Herald 1877 - 1899                                   88.6%                    54.6%
DAC  Daily Alta California 1841 - 1891                              88.2%                    53.4%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880   86.5%                    48.4%
SN   Sausalito News 1885 - 1922                                     70.4%                    17.3%
*Word accuracy assumes average word length is 5 characters
45. OCR accuracy by newspaper title
title                                                               OCR character accuracy   corrected accuracy
PRP  Pacific Rural Press 1871 - 1922                                92.6%                    99.3%
SFC  San Francisco Call 1890 - 1913                                 92.6%                    99.6%
LAH  Los Angeles Herald 1873 - 1910                                 88.7%                    99.1%
LH   Livermore Herald 1877 - 1899                                   88.6%                    99.9%
DAC  Daily Alta California 1841 - 1891                              88.2%                    99.9%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880   86.5%                    98.3%
SN   Sausalito News 1885 - 1922                                     70.4%                    100.0%
46. corrected accuracy by
newspaper title
title              OCR character accuracy   ~OCR word accuracy*   corrected accuracy   ~corrected word accuracy*
PRP  1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
SFC  1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
LAH  1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
LH   1877 - 1899   88.6%                    54.6%                 99.9%                99.5%
DAC  1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
CFJ  1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
SN   1885 - 1922   70.4%                    17.3%                 100.0%               100.0%
*Word accuracy assumes average word length is 5 characters
47. correction accuracy
by user
user   average OCR accuracy   correction accuracy
A      70.4%                  100.0%
B      87.1%                  99.5%
C      95.4%                  99.5%
D      86.5%                  98.3%
E      95.3%                  100.0%
F      91.0%                  100.0%
G      91.0%                  99.8%
H      90.5%                  99.0%
I      96.6%                  99.8%
J      94.8%                  100.0%
K      86.8%                  99.3%
48. How does low text accuracy affect search recall?
The Facts
• Average uncorrected OCR character accuracy of the CDNC
sample data is ~89%
• Average length of an English word is 5 characters
• Average word accuracy is 89% x 89% x 89% x 89% x 89% =
55.8% - round up to 60% or 6 out of 10 words correct
Accuracy
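The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration, assuming that character errors are independent, so a word is correct only when every one of its characters is. The function name word_accuracy and its parameters are hypothetical, not taken from the slides.

```python
def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Estimate the chance that a whole word is recognised correctly.

    Assumes each character is right with probability `char_accuracy`,
    independently of its neighbours (a simplifying assumption).
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


# ~89% character accuracy and a 5-character average word
print(f"{word_accuracy(0.89):.1%}")  # 55.8%
```

Real OCR errors tend to cluster (a smudged column, a torn page), so the independence assumption is a rough rule of thumb rather than a measurement.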
50. Accuracy
The Facts
• Average corrected character accuracy of the CDNC sample
data is ~99.4%
• Average word accuracy of CDNC corrected text is 99.4% x
99.4% x 99.4% x 99.4% x 99.4% = 97.0%
52. A search for “Arndt” at Chronicling America gives
10,267 results*
• If Chronicling America text accuracy is 55.8% (same as
uncorrected CDNC sample), then 8,133 instances of
“Arndt” were not found
• If text accuracy is 97.0%, then 317 instances of “Arndt”
were not found
Accuracy
* Search performed 31 Oct 2012
53. Accuracy
Suppose the word/name is longer than 5
characters?
The Facts
• Assume that average uncorrected / corrected OCR
character accuracy is ~89% / ~99% same as CDNC.
name         name length   raw text accuracy   corrected text accuracy
Eklund       6             49.7%               94.2%
Kennedy      7             44.2%               93.25%
Espinosa     8             39.4%               92.3%
Bonaparte    9             35%                 91.4%
Chatterjee   10            31.2%               90.4%
54. Accuracy
name         number of search results   missing results with raw text accuracy   missing results with corrected text accuracy
Eklund       2,951                      2,987                                    182
Kennedy      360,723                    455,392                                  26,111
Espinosa     1,918                      2,950                                    160
Bonaparte    44,664                     82,947                                   4,203
Chatterjee   19                         42                                       2
Chronicling America searches done 19-Mar-2013
(6,025,474 pages from 1836 to 1922).
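The "missing results" figures appear consistent with a simple estimate: if a search finds only the correctly recognised occurrences, the true number of occurrences is roughly the result count divided by the word accuracy, and the remainder was missed. The Python sketch below applies that estimate to the names and accuracies listed on the slides. The function name missing_results and the rows layout are hypothetical, and the formula is an inference from the numbers, not something the slides state.

```python
def missing_results(found: int, word_accuracy: float) -> int:
    """Estimate how many occurrences a search failed to return.

    Assumes `found` is the fraction `word_accuracy` of all true
    occurrences, so total = found / word_accuracy.
    """
    if not 0.0 < word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be in (0, 1]")
    return round(found / word_accuracy - found)


# (name, search results, raw word accuracy, corrected word accuracy),
# copied from the slides
rows = [
    ("Eklund", 2951, 0.497, 0.942),
    ("Kennedy", 360723, 0.442, 0.9325),
    ("Espinosa", 1918, 0.394, 0.923),
    ("Bonaparte", 44664, 0.35, 0.914),
    ("Chatterjee", 19, 0.312, 0.904),
]

for name, found, raw, corrected in rows:
    print(f"{name:<11}{missing_results(found, raw):>9,}"
          f"{missing_results(found, corrected):>9,}")
```

The same call with the “Arndt” count, missing_results(10267, 0.558), lands within one of the 8,133 quoted on that slide.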
57. • “I enjoy the correction - it’s a great way to learn more
about past history and things of interest whilst doing a
‘service to the community’ by correcting text for the
benefit of others.”
• “I have recently retired from IT and thought that I could
be of some assistance to the project. It benefits me and
other people. It helps with family research.”
Rose Holley. Many Hands Make Light Work. National Library of
Australia March 2009.
motivation
Trove users’ report
58. “I am interested in all kinds of history. I have pursued genealogy
as a hobby for many years. I correct text at CDNC because I see it
as a constructive way to contribute to a worthwhile project.
Because I am interested in history, I enjoy it.”
Wesley, California
Personal communications with CDNC text correctors.
motivation
CDNC users’ report
59.
“I only correct the text on articles of local interest - nothing at
state, national or international level, no advertisements, etc. The
objective is to be able to help researchers to locate local people,
places, organizations and events using the on-line search at
CDNC. I correct local news & gossip, personal items, real estate
transactions, superior court proceedings, county and local board
of supervisors meetings, obituaries, birth notices, marriages,
yachting news, etc.”
Ann, California
Personal communications with CDNC text correctors.
motivation
CDNC users’ report
60. “I have always been interested in history, especially the
development of the American West, and nothing brings it alive
better than newspapers of the time. I believe them to be an
invaluable source of knowledge for us and future generations.”
David, United Kingdom
motivation
CDNC users’ report
Personal communications with CDNC text correctors.
61. CDNC is an excellent source of information matching my
personal interest in such topics as sea history, development
of shipbuilding, clippers and other ships etc. ...
Unfortunately, the quality of text ... is rather poor I’m afraid.
This is why I started to do all corrections necessary for
myself ... and to leave the corrected text for use of others. ....
I am not doing this very regularly as this is just my hobby
and pleasure.
Jerzey, Poland
motivation
CDNC users’ report
Personal communications with CDNC text correctors.
62. As an amateur historical researcher my time for research is very
limited. Making time to travel to archives, libraries, and historical
societies does not happen as often as I would like. The Cambridge
Public Library’s online newspaper collection has been an invaluable
resource and it is fun. I am very grateful for all the help I have
received over the years from so many research organizations.
Correcting text has several benefits. It makes it much more likely that
I will find a story if I decide to search for it in the future. It is a way of
saying ‘thank you’ to the Cambridge Library for having such a great
resource available and maybe I can make the next person’s research a
little easier. It is my own little historical preservation project.
Cambridge Historical Newspapers Text Corrector
motivation
Cambridge users’ report
Personal communications with Cambridge text correctors.
64. “when someone transcribes a document, they are
actually better fulfilling the mission of a cultural
heritage organization than someone who simply stops
by to flip through the pages”
HTMBSBO benefit
Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).
65. “in addition to increasing search accuracy or lowering
the costs of document transcription, crowdsourcing is
the single greatest advancement in getting people using
and interacting with library collections”
HTMBSBO benefit
Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).
66. conclusions
Conclusion of the Sonata
for piano #32, opus 111 by
Ludwig van Beethoven
• newspaper digitization may be difficult but
there are many, many examples of successful
digitization programs. ask for help! and join
the IFLA Newspapers Section!
• digital newspaper collections are the most used
digital library collections
• benefits of crowdsourced text correction and
tagging are multi-faceted: data accuracy,
patron engagement, increased web traffic
• know your user community!!
67. • Library of Congress National Digital Newspaper
Program http://www.loc.gov/ndnp/
• Australian Newspaper Digitisation Program
http://www.nla.gov.au/content/newspaper-digitisation-program
• IFLA Newspapers Section Digitisation projects
and best practices http://www.ifla.org/node/6777
• ICON: International Coalition on Newspapers
http://icon.crl.edu/digitization.htm
68. Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).
70. Become a member of the IFLA Newspapers
Section! See http://www.ifla.org/membership
or ask me.
Frederick Zarndt, Secretary
IFLA Newspapers Section
frederick@frederickzarndt.com
71. ?!
Frederick Zarndt
Secretary, IFLA Newspapers Section
frederick@frederickzarndt.com
Photo held by John Oxley Library, State Library of Queensland. Original from
Courier-mail, Brisbane, Queensland, Australia.