1. CROWDSOURCING IN THE DIGITALKOOT PROJECT
Majlis Bremer-Laamanen
IMPACT 24TH OF OCTOBER, 2011
Microtask.com:
Digitalkoot: Making Old Archives Accessible Using Crowdsourcing, by Otto Chrons and Sami Sundell
Discussions: Managing Director Harri Holopainen, harri@microtask.com
2. The Centre for Preservation and Digitisation: statistics
• Established in 1990
• Digitisation started in 1998
• Over 50 employees
• Yearly average (past three years):
– Microfilm production: 1.3 million exposures
– Digitisation: 1.3 million pages
– Audio digitisation and cataloguing: music, 1,300 unique cassettes and their sleeves
– Conservation: 10,000-15,000 units
3. ENRICHING CONTENT
(http://digi.nationallibrary.fi, http://www.doria.fi/handle/10024/4194)
• Newspapers -> 2 million pages, the Historical Newspaper Library
• Journals -> 2.7 million pages, freely available up to 1910, available in all legal deposit libraries up to 1944
• Books -> travel, novels, dissertations from the 17th century, Save the Book
• Ephemera -> industrial price lists
• Sound -> national sound archive, C-cassettes
• Interest groups: the creators, users, contributors of the material
4. Context for mass digitisation and crowdsourcing
[Diagram: Client and Accessibility around the Centre for Preservation and Digitisation workflow: Transferring physical objects -> Preparation for digitisation -> Digitisation -> Post-processing -> Temporary storage for digitised objects -> Physical objects retrieval]
Mass digitisation activities in the most cost-effective manner:
Newspapers, books, journals, ephemera, audio:
• Logistics for physical items
• Process for digital objects: network services and long-term preservation
• Metadata (METS, ALTO): captured throughout the process
• Metadata development: User experience and crowdsourcing
• Customizing of the tracking systems (CCS, Item Tracking, Scan Client)
• Operational environment: scaling architecture and implementation
5. DIGITALKOOT
DIGI = TO DIGITISE
TALKOOT = PEOPLE GATHERING TO WORK TOGETHER
VOLUNTARILY (WITHOUT PAYMENT)
FIRST EXPERIENCE 2011:
DIGITALKOOT: correction of OCR by gamification, turning useful
activities into games: "THE MOLE HUNT" by Microtask.com.
– People can spend hours on games
– Turning useful activities into games
– Activities can be rewarded with scores, achievements and social benefits
From February 8th to September 15th, 2011: about 80,000
visitors, 4,000 hours of effective game time. More than 5 million
tasks.
6. CHALLENGES
Meaningful tasks without breaking the flow of the game
Real-time feedback – many simultaneous players doing
the same task
Build a bridge to save the moles from falling down =>
– Correct typing gives you a block for the bridge
– Incorrect typing is punished by an explosion
15. GAMIFICATION CHALLENGES
Balancing game play elements with task completion speed and
accuracy
Keep the motivation of people and enlarge the audience
Introduction of meaningful tasks into the game without breaking
game play mechanisms
Instant feedback on players' actions (simultaneous players)
• pressure to adapt to varying feedback situations/latencies
16. POSITIVE EFFECT OF VERIFICATION
"The wisdom of the crowds"
• includes answers from possible spammers
Game start: verification tasks only
Accurate work shown => share of verification tasks lowered in phases, never to zero
Verification tasks are created automatically:
• A randomly selected task is sent to several players: all have to
agree on the result => verification task
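The two rules on this slide can be sketched in a few lines. This is an illustration only, not Microtask's implementation: the phase table, the function names and the minimum of three players are assumptions, since the slides only state that the share of verification tasks falls in phases without reaching zero, and that a task becomes a verification task when several players all agree.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical phase table: (accurate answers shown so far, share of verification tasks).
# The numbers are illustrative assumptions; the last share is deliberately above zero.
PHASES = [(0, 1.0), (10, 0.5), (50, 0.2), (200, 0.05)]


def verification_share(accurate_answers: int) -> float:
    """Return the share of verification tasks for a player with this track record."""
    share = PHASES[0][1]  # a new player starts with verification tasks only
    for needed, phase_share in PHASES:
        if accurate_answers >= needed:
            share = phase_share
    return share


def make_verification_task(
    task_id: str, answers: Sequence[str], min_players: int = 3
) -> Optional[Tuple[str, str]]:
    """Promote a task to a verification task only if all players typed the same word.

    min_players is an assumed value for "several players".
    """
    distinct = {answer.strip() for answer in answers}
    if len(answers) >= min_players and len(distinct) == 1:
        return task_id, distinct.pop()
    return None  # too few answers, or disagreement: no known answer yet
```

For example, `make_verification_task("w1", ["koira", "koira", "koira"])` yields a task with the known answer "koira", while a single dissenting answer keeps the task out of the verification pool.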
17. VERIFICATION OF THE OCR
Players and their pace cannot be synchronized.
Verification tasks in the task stream:
• The share fed to players varies according to the number of active players
• The system knows the answer: the game play is improved by fast
feedback
• Downside: no new information produced
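A minimal sketch of how verification tasks might be mixed into a player's task stream. The slides do not say in which direction the share moves with the number of active players; the sketch assumes that fewer active players means more verification tasks, because fewer simultaneous answers are available for fast feedback. All names and default values are hypothetical.

```python
import random
from typing import List, Tuple


def verification_fraction(
    active_players: int, base: float = 0.1, ceiling: float = 0.8, crowd_size: int = 20
) -> float:
    """Assumed rule: the share of verification tasks shrinks as more players are active."""
    if active_players < 1:
        raise ValueError("need at least one active player")
    scale = min(active_players, crowd_size) / crowd_size
    return base + (ceiling - base) * (1.0 - scale)


def next_task(
    rng: random.Random,
    active_players: int,
    new_tasks: List[str],
    verification_tasks: List[Tuple[str, str]],
) -> Tuple[str, object]:
    """Pick either a verification task (answer known) or a new OCR task."""
    use_verification = verification_tasks and (
        not new_tasks or rng.random() < verification_fraction(active_players)
    )
    if use_verification:
        return "verification", rng.choice(verification_tasks)
    return "new", new_tasks.pop(0)


def instant_feedback(known_answer: str, typed: str) -> str:
    """The answer is known, so feedback needs no waiting for other players."""
    return "block" if typed.strip() == known_answer else "explosion"
```

The trade-off named on the slide shows up directly: every verification task returned by `next_task` gives immediate feedback through `instant_feedback`, but corrects no new word.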
18. USERS: February 8th to March 31st, 2011
31,816 visitors, 4,768 players, 2,740 hours of game time, 2.5 million
tasks.
1% via Internet, 99% via Facebook
Half of the users were men.
Game time per player: from seconds to over 100 hours (altogether).
Median time: 9 minutes.
Women: more than 13 minutes, and 54% of the tasks
The four hardest-working players were all men
19. ACCURACY
OCR system 0.8 confident of its accuracy => human correction in 30%
Random selection of 2 articles:
• 1,467 words: Digitalkoot result only 14 mistakes, versus 228 in the OCR
• 516 words: Digitalkoot result 1 mistake, versus 118 in the OCR
• => well over 99% accuracy possible by gamification
Spammer play:
• One player who spent 1.5 hours on 5,692 tasks was detected by the verification
system, and only 4 of the tasks were accepted
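The figures on this slide can be turned into accuracy rates with simple arithmetic. The sketch below assumes that each counted mistake affects exactly one word, which the slide does not state; the function name is illustrative.

```python
def word_accuracy(total_words: int, mistakes: int) -> float:
    """Share of words that are correct, assuming one mistake equals one wrong word."""
    return 1 - mistakes / total_words


# (words in article, mistakes in raw OCR, mistakes after Digitalkoot), taken from the slide
samples = [(1467, 228, 14), (516, 118, 1)]

for words, ocr_mistakes, game_mistakes in samples:
    print(
        f"{words} words: OCR {word_accuracy(words, ocr_mistakes):.2%}, "
        f"Digitalkoot {word_accuracy(words, game_mistakes):.2%}"
    )

# Share of the spammer's 5,692 tasks that passed verification
spam_acceptance = 4 / 5692
print(f"Spammer tasks accepted: {spam_acceptance:.2%}")
```

Under that assumption the two articles move from roughly 84% and 77% word accuracy in the raw OCR to just over 99% after the game, and fewer than one in a thousand of the spammer's tasks got through.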
20. Enriching Digitisation Production Processes, METS Profiles: a new development platform
[Diagram: from SOURCE MATERIAL (PHYSICAL COLLECTIONS) to DIGITAL RESOURCE (DIGITAL COLLECTIONS)]
• SOURCE MATERIAL: Newspapers, Serials, Books, Parchments, Notes, Maps, Audio
• CATALOGUING: Descriptive metadata MARC21/MODS; two bibliographic records
• SCANNING: Administrative/technical metadata MIX/PREMIS
• POST PROCESSING: Structural metadata METS, ALTO
• COMPREHENSIVE LEVEL OF MARK UP: Articles, Illustrations, Poems
• METS EXPORT: Standards & OAI-PMH compliant METS SIP packages
• Packages include: JPEG2000, OCR TXT as ALTO XML, PDF, JPEG(150), METSXML, MARCXML
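Because the OCR text is stored as ALTO XML, words can be picked out for human correction by their word confidence. The sketch below reads the `WC` attribute that ALTO defines on `String` elements and flags words below a threshold. The 0.8 default echoes the figure on the accuracy slide, but using it as a selection threshold is an assumption, as are the function name and the trimmed sample document (which is not schema-valid ALTO).

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Trimmed, illustrative ALTO fragment; real files carry IDs, coordinates and styles.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Helsingissä" WC="0.97"/>
    <String CONTENT="sanomalehtl" WC="0.41"/>
    <String CONTENT="1868" WC="0.80"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def words_needing_review(alto_xml: str, threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return (word, confidence) pairs whose OCR word confidence is below the threshold."""
    root = ET.fromstring(alto_xml)
    flagged = []
    for element in root.iter():
        # Compare local names so the sketch works across ALTO namespace versions.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        # Assumption: a word without a WC value is treated as zero confidence.
        confidence = float(element.get("WC", "0"))
        if confidence < threshold:
            flagged.append((element.get("CONTENT", ""), confidence))
    return flagged
```

Calling `words_needing_review(SAMPLE_ALTO)` flags only the misrecognised "sanomalehtl"; such words would be the natural candidates for game tasks.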
21. IN THE MEDIA
- By March 31st, over 30 articles all around the world: New York
Times…
- Television appearances ongoing
- Helsingin Sanomat: HS talkoot, using the National Library's
digitised newspaper material (Historical Newspaper Library) >
advertising Digitalkoot, e.g. on September 15th
- Influenced user interest
=> stabilisation at 300 individual users per week
22. NEXT
1) Marking of articles and/or images
2) Indexing articles and/or images
23. KUVATALKOOT
Goal: sophisticated user experience
Collections discovery and reuse of digital content by researchers and people at large:
Researchers will get better systematic coverage of images and articles in published printed material.
[Image: Luonnon-kirja ala-alkeiskouluin tarpeeksi / Z. Topelius, 1868]