A Bimodal Crowdsourcing Platform for
Demographic Historical Manuscripts
Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré
Computer Vision Center - Centre for Demographic Studies
Universitat Autònoma de Barcelona
2
Index
• Introduction
• 5CofM project: The Barcelona Marriage Licenses
• Bi-modal Crowdsourcing Platform
• Contents view
• Labeling view
• Running experience
• Generalization to other kinds of documents
• Conclusions
3
5CofM: Barcelona Marriage Licenses
5CofM project: Five Centuries of Marriages
• Advanced Grant – European Research Council.
• 2011 – 2016.
• Partners:
• Universitat Autònoma de Barcelona (UAB)
• Centre for Demographic Studies (CED).
• Computer Vision Center (CVC).
• Aim:
This project is based on the data mining of the Llibres d'Esposalles conserved at the
Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books
of marriage license records, with information on approximately 610,000 unions
celebrated in over 250 parishes of the Diocese between 1451 and 1905.
4
The Barcelona Marriage Licenses
The Marriage Licenses contain information about:
– The couple (groom/bride)
– Their parents
– Their occupation (job)
– The place of origin
– The parish (church) where they married
– The fee that was paid (depending on their social class)
[Figure: a marriage license record annotated with the labels NAME, DATE, JOB, PLACE and FEE]
5
The Barcelona Marriage Licenses
[Figure: sample pages labeled "Index" and "Marriage Licenses"]
6
The Barcelona Marriage Licenses
“Llibres d’esposalles” from the Archives of the Barcelona Cathedral
• 244 books
• From 1451 to 1905
• Approximately 550,000 marriage licenses
Ground truth
• From volume 69
• 50 documents
• 20 classes
[Figure: sample pages with labeled regions: index (husband's surname), marriage license, fee]
7
The Barcelona Marriage Licenses: Continuity
[Figure: sample pages from 1481 (volume 3), 1601 (volume 61), 1729 (volume 127) and 1860 (volume 200), each labeled with the same regions: husband's surname, marriage license, fee]
8
The Barcelona Marriage Licenses: Fees
Marriage license fees for the two-year period that starts on
the first of May, 1627 and ends on the last day of April, 1629:
– Dukes, Marquises, Counts and Viscounts: 12 ll
– Noble knights and Lords of vassals: 2 ll 6 s
– Knights, Honored Citizens and Bourgeois: 1 ll 4 s
– Merchants, Notaries of Barcelona, Shopkeepers of distinguished materials, Chemists and Druggists: 12 s
– Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists: 6 s
– The rest: 4 s
– The poor ones, for the love of God: -
9
The Barcelona Marriage Licenses
CED objectives (scholars)
– Genealogic tree
• Ancestors / descendants
– Immigration / Emigration
• Family names appear / disappear
• French surnames (descendants)
– Population (by num. of marriages)
• Plagues, epidemics, baby boom
– Parish churches
• Neighborhood is/becomes rich/poor
– Evolution of a family name
• Jobs, fees (higher or lower)
– Relationships between families
• Strategic, commercial reasons
CVC objectives (computer scientists)
– Layout analysis
• Text-line segmentation
– Word Spotting
• Query by example
• Query by string
– Handwriting Recognition
– Syntactic analysis
10
Document Image Analysis: Tasks
• Layout analysis: to detect (crop) records, lines, words for subsequent recognition.
• Full transcription: to convert images to editable text.
• Word spotting: given a query word, to locate visually similar word snippets at image level.
dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon=
[Figure: a page segmented into blocks, lines and words]
12
Technical architecture
[Diagram: three spaces (Image Space, Transcription Space, Contextual Knowledge Space) connected by the steps scanning, HW recognition, crowdsourcing, data mining (harmonization, record linkage) and exploitation]
13
Crowdsourcing platform
• Manual transcription → tedious and time-consuming task
• Crowdsourcing Platform (Divide & Conquer)
• Split the work into a large number of small, simple tasks and distribute them
• Crowdsourcing architecture:
• Image space (digitized documents)
• Transcription space (extraction of information)
• Contextual space (semantic meaning)
14
Crowdsourcing platform
• Web-based application: Integration of two points of view
• Contents view: Semantic information → demographic research
• Labeling view: Ground-truthing → document analysis research
http://www.cvc.uab.es/5cofm/
15
Crowdsourcing platform: Administration
Administration: Managing documents and Users
16
Crowdsourcing platform: User login
17
Contents view (semantics): Form filling
18
Contents view (semantics): Form filling (Indices)
19
Contents view (semantics): Checking correction
Check for possible spelling errors (words that appear only once?)
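The check mentioned above (flagging words that appear only once) can be sketched in a few lines. This is only an illustration under stated assumptions, not the platform's implementation: the function name `find_singletons` and the sample surnames are hypothetical.

```python
from collections import Counter

def find_singletons(values):
    """Return the values that occur exactly once (case-insensitive).

    A value seen only once across many records may be a spelling error
    (or simply a rare name), so it is flagged for human review rather
    than corrected automatically.
    """
    counts = Counter(v.strip().lower() for v in values)
    return [v for v in values if counts[v.strip().lower()] == 1]

# Hypothetical transcribed surnames; "Teixidr" is a deliberate typo.
surnames = ["Teixidor", "Ferrer", "Teixidor", "Teixidr", "Ferrer"]
suspects = find_singletons(surnames)
```

A flagged word is only a candidate: a person would still have to decide whether it is an error or a genuinely rare name.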
20
Contents view (semantics): Record Linkage
• Record Linkage → Genealogical tree
• A batch process searches for links between individuals:
• Parents' marriage, brothers'/sisters' marriages
• The search allows spelling variations
• String edit distance (Levenshtein), with different costs for substitutions
• Useful for harmonization of names, surnames…
• The expert decides the correct linkage from the candidates
Year  Bride     Father          Mother   Year  Groom            Bride   Similarity
1638  Jeronima  Lluis Teixidor  Paula    1606  Lluis Teixidor   Paula   1
1638  Joana     Nicolau Ferrer  Antiga   1613  Nicolau Ferrera  Antiga  0.95
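A Levenshtein distance with different substitution costs, as mentioned above, could look like the following sketch. The specific cheap letter pairs, the cost values and the length-based normalization into a similarity score are all assumptions made for illustration; the slides do not specify how the platform computes its scores.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance where the substitution cost depends on the letter pair."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]

# Hypothetical pairs of letters treated as cheap spelling variants.
CHEAP_PAIRS = {frozenset("iy"), frozenset("sz"), frozenset("bv")}

def name_sub_cost(x, y):
    if x == y:
        return 0.0
    return 0.5 if frozenset((x, y)) in CHEAP_PAIRS else 1.0

def similarity(a, b):
    """Map the distance to [0, 1]; 1 means identical (assumed normalization)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_levenshtein(a, b, name_sub_cost) / longest
```

With this normalization, `similarity("Ferrer", "Ferrera")` is 1 - 1/7, roughly 0.86, which differs from the 0.95 shown in the table; the platform's own scoring presumably weighs things differently. Candidates above a chosen threshold would then be shown to the expert, who makes the final decision.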
22
Labeling view (annotation): Transcription (lines)
Literal transcription → Ground truth for handwriting recognition methods
23
Labeling view (annotation): Word Labeling
Word meta-data:
• Bounding box (coordinates)
• Category (e.g. groom's name, occupation…)
• The system establishes the correspondence automatically → the user validates!
Integrated platform: the contents view is put into correspondence with the labeling view
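The word meta-data described above (a bounding box plus a category, validated by the user) could be represented roughly as follows. The class name, field names, category string and pixel coordinates are hypothetical; the slides do not describe the platform's actual data model.

```python
from dataclasses import dataclass

@dataclass
class WordLabel:
    # Bounding box in image pixel coordinates (top-left corner plus size).
    x: int
    y: int
    width: int
    height: int
    transcription: str       # literal text of the word
    category: str            # semantic class, e.g. "groom_name"
    validated: bool = False  # set once a user confirms the automatic match

    def contains(self, px, py):
        """True if the pixel (px, py) falls inside the bounding box."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)

# Hypothetical label proposed by the system for one word in a line image.
word = WordLabel(x=120, y=40, width=95, height=28,
                 transcription="Hieronym", category="groom_name")

# The user reviews the proposed correspondence and confirms it.
word.validated = True
```

Keeping the `validated` flag separate from the automatic proposal makes it easy to tell which labels a person has already checked.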
25
Running Experience
ADVANTAGES
• Digital source
• Not necessary to go to the Archive
• No timetable limitations
• Parallelization
• Many users work simultaneously
• Centralization
• Easier management of images, users, database...
• Easy to see “who works on what”
• Automatic control
• The system forces users to fill in some fields and raises warnings
• Useful for detection of spelling errors (auto-correction)
26
Running Experience
ADVANTAGES
• Security
• Frequent back-up
• Users can view the documents assigned to them, but cannot download them
• Monitoring
• The administrator can monitor the users' work and provide feedback
• Visualization and comfort
• Drag (move), zoom in/out
DISADVANTAGES
• An Internet connection is always needed
• If the system is down (e.g. for maintenance) → no one can work
Generalization to other demographic manuscripts
• The platform has been adapted for census documents
Conclusions
• Web-based crowdsourcing platform for demographic manuscripts
• Integrates the needs of demographers and computer scientists
Future directions
• Improve validation
• Combine the output of several users
• Compare with the output of document analysis techniques
• Mobile-based applications
• For crowdsourcing → Faster ground-truth generation
• For browsing and searching → User-friendly interfaces
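One simple way to combine the output of several users, listed above as a future direction, is a majority vote over their answers for the same task. The sketch below is an assumption about how this might be done, not a method described in the slides; the function name and the agreement threshold are hypothetical.

```python
from collections import Counter

def majority_vote(answers, min_agreement=0.5):
    """Return the most frequent answer if enough users agree, else None.

    answers: the transcriptions given by different users for one task.
    min_agreement: fraction of users that must be exceeded (assumed value).
    """
    if not answers:
        return None
    counts = Counter(a.strip() for a in answers)
    best, votes = counts.most_common(1)[0]
    return best if votes / len(answers) > min_agreement else None

# Hypothetical answers from five users for the same word image.
consensus = majority_vote(["Ferrer", "Ferrer", "Ferrera", "Ferrer", "Ferre"])
```

Tasks for which no answer reaches the threshold return `None` and could be routed to an expert for review.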
Crowdsourcing on mobile devices
Initial set: 29 pages
Redundancy: each task is solved by different people (R = 5)
• Task 1, page layout: R · 30 s/T · 1 T/P · 29 P
• Task 2, bounding box: R · 30 s/T · 18 T/P · 29 P
• Task 3, word segmentation: R · 10 s/T · 360 T/P · 29 P
s/T = seconds per task, T/P = tasks per page, P = pages, R = redundancy
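Reading the dots in the expressions above as multiplication, the total effort per task is R × (seconds per task) × (tasks per page) × (pages). The sketch below evaluates this for the three tasks with R = 5 and 29 pages; the function and variable names are hypothetical, and the interpretation as a simple product is an assumption.

```python
def total_seconds(redundancy, seconds_per_task, tasks_per_page, pages):
    """Total crowd effort in seconds for one task type."""
    return redundancy * seconds_per_task * tasks_per_page * pages

R = 5        # redundancy, from the slide
PAGES = 29   # initial set of pages, from the slide

# (seconds per task, tasks per page) for each task type, from the slide.
TASKS = {
    "page layout": (30, 1),
    "bounding box": (30, 18),
    "word segmentation": (10, 360),
}

effort = {name: total_seconds(R, s, t, PAGES) for name, (s, t) in TASKS.items()}
total_hours = sum(effort.values()) / 3600
```

Under this reading, word segmentation dominates the workload: 522,000 s (145 hours), against 78,300 s for bounding boxes and 4,350 s for page layout.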
32
Browsing the marriage licenses on a mobile device
33
Thank you!!

Datech2014 - Session 5 - Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts

  • 1. A Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré Computer Vision Center - Centre for Demographic Studies Universitat Autònoma de Barcelona
  • 2. 2 Index  Introduction  5CofM project: The Barcelona Marriage Licenses  Bi-modal Crowdsourcing Platform  Contents view  Labeling view  Running experience  Generalization to other kind of documents  Conclusions
  • 3. 3 5CofM: Barcelona Marriage Licenses 5CofM project: Five Centuries of Marriages • Advanced Grant – European Research Council. • 2011 – 2016. • Partners: • Universitat Autònoma de Barcelona (UAB) • Centre for Demographic Studies (CED). • Computer Vision Center (CVC). • Aim: This project is based on the data-mining of the Llibres d'Esposalles conserved at the Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books of marriage licenses records, with information of approximately 610.000 unions celebrated in over 250 parishes of the Diocese between 1451 and 1905.
  • 4. 4 The Barcelona Marriage Licenses The Marriage Licenses contain information about: – The couple (groom/bride) – Their parents – Their occupation (job) – The place of origin – The parish (church) where they married – The fee that was paid (depending on their social class) NAME DATE JOB PLACE FEE NAME NAME
  • 5. 5 The Barcelona Marriage Licenses Index Marriage Licenses
  • 6. 6 The Barcelona Marriage Licenses “Llibres d’esposalles” from the Archives of the Barcelona Cathedral • 244 books • From 1451 to 1905 • Approximately 550.000 marriages licenses Ground truth • From the volume 69 • 50 documents • 20 classes Index License marriage Husband’s surname License marriage Fee 6
  • 7. 7 The Barcelona Marriage Licenses: Continuity 1481: volume 3 1601: volume 61 Marriage license Husband’s surname 1729: volume 127 1860: volume 200 Fee Marriage license Fee Husband’s surname Marriage license Fee Husband’s surname Marriage license Fee
  • 8. 8 The Barcelona Marriage Licenses: Fees Marriage licenses fees for the two year period that starts on the first of May, 1627 and ends on the last day of April, 1629 Dukes, Marquises, Counts and Viscounts. Noble knights and Lords of vassals. Knights, Honored Citizens and Bourgeoisies. Merchants, Notaries of Barcelona, Shopkeepers of distinguish materials, Chemists and Druggists. Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists. The rest. The poor ones for the love of God. 12 ll 2ll 6s 1ll 4s 12s 6s 4s -
  • 9. 9 CED objectives (scholars) – Genealogic tree • Ancestors / descendants – Immigration / Emigration • Family names appear / disappear • French surnames (descendants) – Population (by num. of marriages) • Plagues, epidemics, baby boom – Parish churches • Neighborhood is/becomes rich/poor – Evolution of a family name • Jobs, fees (higher or lower) – Relationships between families • Strategic, commercial reasons CVC objectives (computer scientists) – Layout analysis • Text-line segmentation – Word Spotting • Query by example • Query by string – Handwriting Recognition – Syntactic analysis The Barcelona Marriage Licenses
  • 10. Document Image Analysis: Tasks. Layout analysis: detecting (cropping) records, lines and words for subsequent recognition. Full transcription: converting images to editable text. Word spotting: given a query word, locating visually similar word snippets at image level. [Figure labels: BLOCKS, LINES, WORDS; sample transcribed line: “dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon=”]
  • 12. Technical architecture. [Diagram labels: Image Space; Transcription Space; Contextual knowledge Space; Scanning; HW recognition; Crowdsourcing; Data mining (Harmonization, Record linkage); exploitation]
  • 13. Crowdsourcing platform. Manual transcription → tedious and time-consuming task. Crowdsourcing platform (divide and conquer): split and distribute a large number of small, simple tasks. Crowdsourcing architecture: image space (digitized documents); transcription space (extraction of information); contextual space (semantic meaning).
  • 14. Crowdsourcing platform. Web-based application: integration of two points of view. Contents view: semantic information → demographic research. Labeling view: ground-truthing → document analysis research. http://www.cvc.uab.es/5cofm/
  • 18. Contents view (semantics): Form filling (Indices)
  • 19. Contents view (semantics): Checking correction. Check for possible spelling errors (words that appear only once?)
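The "words that appear only once" heuristic on the slide above can be sketched in a few lines. This is a minimal illustration, not the platform's code: the function name `words_seen_once` and the sample surnames are hypothetical, and comparison is assumed to be case-insensitive.

```python
from collections import Counter

def words_seen_once(values):
    """Return the entries that occur exactly once (case-insensitive).

    Such singletons are only candidates for spelling errors; a rare but
    correct surname would be flagged too, so a person still has to decide.
    """
    counts = Counter(v.strip().lower() for v in values)
    return [v.strip() for v in values if counts[v.strip().lower()] == 1]

# Hypothetical transcribed surnames; "Ferrera" occurs only once.
surnames = ["Ferrer", "Teixidor", "Ferrer", "Ferrera", "Teixidor"]
suspects = words_seen_once(surnames)
print(suspects)
```

A flagged word is a prompt for review rather than a confirmed error, which matches the question mark on the slide.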
  • 20. Contents view (semantics): Record Linkage. Record linkage → genealogical tree. A batch process searches for links between individuals: parents’ marriage, brothers’/sisters’ marriages. The search allows spelling variations: string edit distance (Levenshtein), with different costs for substitutions; useful for harmonization of names, surnames… The expert decides the correct linkage from the candidates. Example candidates (Year, Bride, Father, Mother | Year, Groom, Bride | Similarity): 1638, Jeronima, Lluis Teixidor, Paula | 1606, Lluis Teixidor, Paula | 1; 1638, Joana, Nicolau Ferrer, Antiga | 1613, Nicolau Ferrera, Antiga | 0.95.
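The record-linkage slide mentions a Levenshtein edit distance with different costs for substitutions. The sketch below shows one way such a distance could look. The cost table in `substitution_cost` (cheap swaps for a few letter pairs) and the length normalization in `similarity` are assumptions made for illustration; the slides do not give the actual costs or the similarity formula, so this sketch is not expected to reproduce the 0.95 shown in the example.

```python
def substitution_cost(a: str, b: str) -> float:
    """Hypothetical cost table: a few commonly confused letters are cheap to swap."""
    if a == b:
        return 0.0
    cheap_pairs = {frozenset("sz"), frozenset("iy"), frozenset("bv"), frozenset("ck")}
    return 0.5 if frozenset((a, b)) in cheap_pairs else 1.0

def weighted_edit_distance(s: str, t: str) -> float:
    """Levenshtein distance with unit insert/delete cost and weighted substitutions."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]  # distance from "" to t[:j]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]                         # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # delete cs
                curr[j - 1] + 1.0,                        # insert ct
                prev[j - 1] + substitution_cost(cs, ct),  # substitute or match
            ))
        prev = curr
    return prev[-1]

def similarity(s: str, t: str) -> float:
    """Assumed normalization: 1 minus distance over the longer string's length."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(s, t) / longest

# Names taken from the slide's example rows.
print(similarity("Lluis Teixidor", "Lluis Teixidor"))
print(similarity("Nicolau Ferrer", "Nicolau Ferrera"))
```

Candidate pairs would then be ranked by similarity and shown to the expert, who makes the final linkage decision as described on the slide.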
  • 22. Labeling view (annotation): Transcription (lines). Literal transcription → ground truth for handwriting recognition methods
  • 23. Labeling view (annotation): Word Labeling. Word meta-data: bounding box (coordinates); category (e.g. groom’s name, occupation…). The system does the automatic correspondence → the user validates! Integrated platform: puts the contents view and the labeling view into correspondence.
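The word meta-data on this slide (bounding box plus category, proposed by the system and validated by the user) might be modelled as follows. Everything here is a hypothetical sketch: the `WordLabel` class, the `(x, y, width, height)` box layout, the field names such as `groom_occupation`, and the exact-match rule in `propose_categories` are assumptions, not the platform's actual data model or matching method. The sample words come from the transcribed line on the Document Image Analysis slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class WordLabel:
    text: str
    bbox: Tuple[int, int, int, int]   # assumed layout: (x, y, width, height) in pixels
    category: Optional[str] = None    # proposed by the system, e.g. "groom_name"
    validated: bool = False           # set to True once a user confirms the proposal

def propose_categories(words: List[WordLabel], form_fields: Dict[str, str]) -> List[WordLabel]:
    """Give each word the category of the first form field that contains it."""
    lookup: Dict[str, str] = {}
    for category, value in form_fields.items():
        for token in value.lower().split():
            lookup.setdefault(token, category)
    for word in words:
        word.category = lookup.get(word.text.lower())
    return words

# Contents view: form filled in by a transcriber (hypothetical field names).
form = {"groom_name": "Hieronym", "groom_surname": "Ponsich", "groom_occupation": "corder"}
# Labeling view: segmented words of the line, with made-up pixel boxes.
words = [
    WordLabel("Hieronym", (10, 5, 80, 20)),
    WordLabel("Ponsich", (95, 5, 70, 20)),
    WordLabel("corder", (170, 5, 60, 20)),
    WordLabel("de", (235, 5, 20, 20)),
]
for w in propose_categories(words, form):
    print(w.text, w.bbox, w.category, w.validated)
```

Words with no matching form field keep `category=None`, and `validated` stays `False` until a user confirms, mirroring the "system proposes, user validates" flow.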
  • 25. Running Experience. ADVANTAGES. Digital source: not necessary to go to the Archive; no timetable limitations. Parallelization: many users work simultaneously. Centralization: easier management of images, users, database...; easy to see “who works on what”. Automatic control: the system forces users to fill in some fields and raises warnings; useful for detecting spelling errors (auto-correction).
  • 26. Running Experience. ADVANTAGES. Security: frequent back-ups; users can view the documents assigned to them, but not download them. Monitoring: the administrator can monitor the users’ work and provide feedback. Visualization and comfort: drag (move), zoom in/out. DISADVANTAGES. An Internet connection is always needed. If the system is down (e.g. maintenance) → no one can work.
  • 28. Generalization to other demographic manuscripts. The platform has been adapted for census documents.
  • 30. Conclusions. Web-based crowdsourcing platform for demographic manuscripts. Integrates the needs of demographers and computer scientists. Future directions. Improve validation: combine the output of several users; compare with the output of document analysis techniques. Mobile-based applications: for crowdsourcing → faster ground-truth generation; for browsing and searching → user-friendly interfaces.
  • 31. Crowdsourcing on mobile devices. Initial set: 29 pages (P). Redundancy: each task is solved by different people (R = 5). Legend: s/T = seconds per task; T/P = tasks per page. Task 1, page layout: R · 30 s/T · 1 T/P · 29 P. Task 2, bounding box: R · 30 s/T · 18 T/P · 29 P. Task 3, word segmentation: R · 10 s/T · 360 T/P · 29 P.
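Reading the "·" in the expressions above as multiplication (an assumption; the slide does not spell out the totals), the overall crowd effort per task is redundancy × seconds per task × tasks per page × pages. The short calculation below works this out for the three tasks; the function and variable names are illustrative only.

```python
REDUNDANCY = 5   # R: number of different people solving each task
PAGES = 29       # P: pages in the initial set

# task name -> (seconds per task, tasks per page), as given on the slide
TASKS = {
    "Page layout": (30, 1),
    "Bounding box": (30, 18),
    "Word segmentation": (10, 360),
}

def total_seconds(seconds_per_task, tasks_per_page, pages=PAGES, redundancy=REDUNDANCY):
    """Total crowd time in seconds: R * s/T * T/P * P."""
    return redundancy * seconds_per_task * tasks_per_page * pages

totals = {name: total_seconds(s, t) for name, (s, t) in TASKS.items()}
for name, seconds in totals.items():
    print(f"{name}: {seconds} s = {seconds / 3600:.2f} h")
```

Under this reading, word segmentation dominates the effort by a wide margin, since it combines the largest number of tasks per page with the same redundancy.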
  • 32. Browsing the marriage licenses on a mobile device