1. Linked Humanities Data:
The Next Frontier?
A Case-Study in Historical Census Data
Albert Meroño-Peñuela
Knowledge Representation & Reasoning Group
29-10-2012
2. The Dutch historical censuses
(1795-1971)
29-10-2012 Linked Humanities Data: The Next Frontier? 2
3. The Dutch historical censuses
(1795-1971)
4. The Dutch historical censuses
(1795-1971)
• Population, Houses and Occupation censuses
• 507 Excel files
• 2,288 tables
• 33,283 annotated cells
6. Heterogeneity: semantic
• Variable meaning
– Plaatselijke indeling / Kom, buiten de kom + Wijk + Naam / Plaats (local division / centre, outside the centre + district + name / place)
– Variable design (age 14-18, 19-20 vs. 14-15, 16-20)
• Variable values
– RomschKatholik, RomsKatholic, VaticanChristelijk
– Change in municipalities, occupations
7. (Current) Harmonization
• Manually create a (more general) translation table using standard classification systems (CS)
– Map occupation literals to HISCO codes
– Map municipality literals to Amsterdamse Code (AC) codes
• Cons
– Expensive
– Detail/specificity loss
– Process is non-repeatable
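A minimal sketch of what such a translation table amounts to, assuming a plain lookup from source literals to standard codes. The literals, counts, and codes below (`OCC-001`, `OCC-002`) are made-up placeholders, not real HISCO entries. It also illustrates the detail loss listed above: two distinct source literals collapse into one code, and anything missing from the table drops out of the totals.

```python
from collections import defaultdict

# Hypothetical translation table: source literal -> standard code.
# The codes are placeholders for illustration, not real HISCO codes.
OCCUPATION_TABLE = {
    "schoenmaker": "OCC-001",
    "schoenmakersknecht": "OCC-001",  # more specific literal, same code
    "dakdekker": "OCC-002",
}


def harmonize(rows, table):
    """Sum counts per standard code.

    rows: iterable of (literal, count) pairs as found in a census table.
    Returns (totals per code, list of literals with no mapping).
    """
    totals = defaultdict(int)
    unmapped = []
    for literal, count in rows:
        code = table.get(literal.strip().lower())
        if code is None:
            unmapped.append(literal)  # needs a manual decision
        else:
            totals[code] += count
    return dict(totals), unmapped


# Made-up example rows.
rows = [
    ("Schoenmaker", 12),
    ("Schoenmakersknecht", 3),
    ("Dakdekker", 5),
    ("Tolgaarder", 1),
]
totals, unmapped = harmonize(rows, OCCUPATION_TABLE)
print(totals)
print(unmapped)
```

After harmonization the distinction between the first two literals can no longer be recovered from `totals`, which is why the table itself has to be kept and versioned if the process is to be repeatable.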
8. Additional requirements
• Errors: non-destructive update of values
• Provenance: record who did what, when, why
• Data model: do not commit to a specific one
• Linkage: enrich the dataset by linking it to others (e.g. labour strikes, book publications in NL)
• Publication: open data for researchers
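The first two requirements above (non-destructive updates and provenance) can be sketched together. This is only an illustration under the assumption that each cell keeps its transcribed value untouched and stores corrections as an append-only list; the class names, fields, and values are hypothetical and not part of the census dataset.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Correction:
    """One provenance record: who changed what, when, and why."""
    value: int
    who: str
    why: str
    when: datetime


@dataclass
class Cell:
    """A census cell whose original value is never overwritten."""
    original: int
    corrections: list = field(default_factory=list)

    def correct(self, value, who, why, when=None):
        # Append-only: earlier corrections and the original stay available.
        when = when or datetime.now(timezone.utc)
        self.corrections.append(Correction(value, who, why, when))

    @property
    def current(self):
        # Latest correction wins; fall back to the transcribed value.
        return self.corrections[-1].value if self.corrections else self.original


# Made-up example values.
cell = Cell(original=1530)
cell.correct(1350, who="editor-1", why="digit transposition in source table")
print(cell.original, cell.current)
```

Because nothing is deleted, a researcher who disagrees with a correction can still query the original value or any intermediate state.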
9. Census RDF: arch
• RDF Data Cube Vocabulary (cell data)
• D2S Vocabulary (layout data)
• Open Annotation Core Data Model (annotation data)
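As a rough illustration of the cell-data layer, the sketch below writes one census cell as an RDF Data Cube observation in N-Triples, using only the standard library. The `qb:` and `sdmx-measure:` namespaces are the published ones; the `http://example.org/census/` namespace, the dimension names, and the cell values are assumptions made for this example and do not reflect the actual dataset.

```python
QB = "http://purl.org/linked-data/cube#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/census/"  # hypothetical namespace


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def observation_triples(table_id, row, col, dimensions, count):
    """Describe one spreadsheet cell as a qb:Observation.

    dimensions: dict of hypothetical dimension name -> value name,
    both minted in the EX namespace for this sketch.
    """
    obs = iri(f"{EX}{table_id}/cell/{row}-{col}")
    triples = [
        (obs, iri(RDF + "type"), iri(QB + "Observation")),
        (obs, iri(QB + "dataSet"), iri(EX + table_id)),
    ]
    for prop, value in dimensions.items():
        triples.append((obs, iri(EX + prop), iri(EX + value)))
    literal = f'"{count}"^^{iri(XSD + "integer")}'
    triples.append((obs, iri(SDMX_MEASURE + "obsValue"), literal))
    return triples


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Made-up cell: table, position, dimensions, and count are illustrative.
triples = observation_triples(
    "table-1", 12, 4,
    {"occupation": "shoemaker", "municipality": "amsterdam"},
    37,
)
print(to_ntriples(triples))
```

Keeping the row and column in the observation IRI is one possible way to tie cell data back to the layout layer; the actual linking scheme is not specified here.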
10. Census RDF: cell data
11. Census RDF: layout data
13. Querying the RDF’d census
14. Not ready-to-publish RDF
• Disconnected graphs (but 279,136 possible variable mappings!)
• Complex & non-homogeneous SPARQL queries
• Contradictory annotation statements
• Drifted concepts
– Tile setter -> roof repairer
– Shoemaker (works with leather) -> shoemaker (owns a company)
15. New challenges
• Dynamic ontologies
– Different concept formalizations depending on the time frame
– Subjective definitions (contested concepts)
• Partitions and counting
– Cannot merge counts of non-aligned concepts
– Infer individuals?
• Format round-tripping
– On-demand XLS, CSV, RDF, RDB conversions with(out) data loss
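The "Partitions and counting" problem can be made concrete with the age brackets mentioned earlier (14-18, 19-20 vs. 14-15, 16-20). The sketch below assumes both censuses partition the same overall range into contiguous, inclusive brackets, and merges each one up to the boundaries they share. The counts are made up for illustration.

```python
def merge_to_common_brackets(a, b):
    """Regroup two partitions onto their shared bracket boundaries.

    a, b: lists of (low, high, count), contiguous and inclusive,
    covering the same overall range.
    Returns the two regrouped lists, which then have identical brackets.
    """
    if a[0][0] != b[0][0] or a[-1][1] != b[-1][1]:
        raise ValueError("partitions cover different ranges")
    # Only upper bounds present in both partitions are safe cut points.
    common_highs = {hi for _, hi, _ in a} & {hi for _, hi, _ in b}

    def regroup(brackets):
        out, low, total = [], brackets[0][0], 0
        for _, hi, count in brackets:
            total += count
            if hi in common_highs:
                out.append((low, hi, total))
                low, total = hi + 1, 0
        return out

    return regroup(a), regroup(b)


# Made-up counts for the two bracket designs.
census_a = [(14, 18, 120), (19, 20, 40)]
census_b = [(14, 15, 50), (16, 20, 115)]
merged_a, merged_b = merge_to_common_brackets(census_a, census_b)
print(merged_a, merged_b)
```

Here the only shared boundary is 20, so both censuses collapse to a single 14-20 bracket: the counts become comparable, but the finer age detail is lost, and recovering it would require inferring individuals.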