CEDAR & PRELIDA
Preservation of Linked Socio-Historical Data
Albert Meroño-Peñuela
@albertmeronyo
PRELIDA consolidation workshop @ ISWC, 17-10-2014
CEDAR: Harmonizing Historical Census
Data in the Semantic Web
CEDAR: Source Historical Data
Dutch Historical Censuses (1795-1971)
[Public Historical Statistical Data]
From scans to spreadsheets
CEDAR goal: cross queries
[Figure: timeline of census years 1795, 1830, 1889, 1930 and 1971, with a "?" for the cross-year query (through ~3K tables)]
Towards 5-star Census Data
>1 year ago
1 year ago
• Web publishable
• Machine processable
• Dynamic schema
• Easily link with other
datasets
Why with semantic technology?
• Web publishable, human & machine readable
• Finer granularity level (cell level)
• Statistical comparability by leveraging
semantic descriptions
• Provenance
• Harmonization through linkage to other
datasets (the 5th star)
RDF Data Cube
“There are many situations where it would be useful to
be able to publish multi-dimensional data, such as
statistics, on the web in such a way that they can be
linked to related data sets and concepts.”
RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist of
dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status =
measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in
Newport in the period 2004-2006 is 76.7 years”
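The QB structure above can be sketched in a few lines of Python. This is only an illustration of the dimension / measure / attribute split, not the vocabulary itself: the class name `Observation`, the helper `select` and the key names are assumptions made for the example, and the female value is an illustrative placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One cube cell: dimensions locate it, the measure is what was observed,
    attributes qualify how to read the value (hypothetical helper class)."""
    dimensions: dict
    measure: str
    value: float
    attributes: dict = field(default_factory=dict)


def select(cube, **fixed):
    """Return the observations whose dimensions match every fixed value."""
    return [
        obs for obs in cube
        if all(obs.dimensions.get(name) == wanted for name, wanted in fixed.items())
    ]


cube = [
    # The observation quoted on the slide.
    Observation(
        dimensions={"period": "2004-2006", "region": "Newport", "sex": "male"},
        measure="lifeExpectancy",
        value=76.7,
        attributes={"unit": "years", "status": "measured"},
    ),
    # Illustrative second cell (value is a placeholder, not from the slides).
    Observation(
        dimensions={"period": "2004-2006", "region": "Newport", "sex": "female"},
        measure="lifeExpectancy",
        value=80.7,
        attributes={"unit": "years", "status": "measured"},
    ),
]

males = select(cube, sex="male", region="Newport")
```

Fixing some dimensions and leaving others free is what makes cross-table queries possible once every table shares the same dimension properties.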
CEDAR Integrator
https://github.com/CEDAR-project/Integrator
Raw data
cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
rdfs:label "K17" ;
tablink:value "12.0" ;
tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-K4 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-K5 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
tablink:sheet cedar:BRT_1889_08_T1-S0 .
Harmonization Rules as Open
Annotations
cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
oa:serializedAt "2014-09-24"^^xsd:date ;
oa:serializedBy
<https://github.com/CEDAR-project/Integrator> ;
prov:wasGeneratedBy
cedar:BRT_1889_08_T1-S0-mapping-activity .
cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
sdmx-dimension:sex sdmx-code:sex-F .
Harmonized RDF Data Cube
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
cedar:population "12"^^xsd:decimal ;
maritalstatus:maritalStatus
maritalstatus:single ;
cedarterms:occupationPosition cedarterms:job-D ;
sdmx-dimension:sex sdmx-code:sex-F ;
cedarterms:occupation hisco:88030 ;
sdmx-dimension:refArea gg:11150 ;
prov:wasDerivedFrom
cedar:BRT_1889_08_T1-S0-K17 ;
prov:wasGeneratedBy
cedar:BRT_1889_08_T1-S0-K17-activity .
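The step from raw cell to harmonized observation can be sketched as follows. This is a minimal sketch under stated assumptions, not the Integrator's actual code: the function `harmonize`, the dictionary layout and the "-h" suffix convention are inferred from the three snippets above, and only the single rule shown on the slides (header cell K4 maps to sex = female) is included.

```python
# Raw cell, reduced to two of its header cells for brevity.
raw_cell = {
    "id": "BRT_1889_08_T1-S0-K17",
    "value": "12.0",
    "dimensions": ["BRT_1889_08_T1-S0-K4", "BRT_1889_08_T1-S0-A8"],
}

# Harmonization rules keyed by the header cell they annotate (the oa:hasTarget);
# each rule carries the (property, code) pair found in the annotation body.
rules = {
    "BRT_1889_08_T1-S0-K4": ("sdmx-dimension:sex", "sdmx-code:sex-F"),
}


def harmonize(cell, rules):
    """Apply every matching rule to a raw cell (hypothetical helper).

    Returns the harmonized observation plus the header cells that no rule
    covers, i.e. the dimensions that are still missing.
    """
    observation = {
        "id": cell["id"] + "-h",
        "cedar:population": float(cell["value"]),
        "prov:wasDerivedFrom": cell["id"],
    }
    unmapped = []
    for header in cell["dimensions"]:
        if header in rules:
            prop, code = rules[header]
            observation[prop] = code
        else:
            unmapped.append(header)
    return observation, unmapped


observation, unmapped = harmonize(raw_cell, rules)
```

Keeping the rules as separate annotations, rather than hard-coding them, means a rule can be corrected and the cube regenerated without touching the raw data.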
Classification Systems and
Concept Schemes
• Some missing harmonized dimensions!
• Encode all variables and their values using concept
schemes
• Some already exist
– Which ones? How many of them?
– Where?
– By whom?
– Are they used at all? Can I reuse them?
• Some need to be created
– Manual and expert knowledge based
– Can we do it automatically? Or assist the process?
[Figure: linked datasets: Dutch Historical Censuses (CEDAR), Dutch Ships and Sailors, Gemeentegeschiedenis.nl, HISCO, ICONCLASS, Dutch Historical Religions, Dutch Historical House Types]
Existing dimensions
• HISCO
http://historyofwork.iisg.nl/
• Gemeentegeschiedenis.nl
Existing LSD dimensions
• P1: Discoverability? How to discover
dimensions created by others?
• P2: Reusability? How often are dimensions
reused? Can we reuse dimensions created by
others?
• P3: Relevance? What’s the size of LSD?
LSD Dimensions
http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions
Hourly JSON-LD dumps
http://lsd-dimensions.org/
Existing LSD dimensions
• P1: Discoverability? How to discover dimensions created by others?
Answer: LSD Dimensions
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others?
Answer: Logarithmic law / probably yes
• P3: Relevance? What’s the size of LSD?
Answer: ~7.9% of the LOD cloud
Creating new LSD Dimensions
• CEDAR needs concept schemes for
– Historical religious denominations (i.e. religions in
the NL in 18th-20th c.)
– Historical occupations (id.)
– Historical building types (id.)
https://github.com/CEDAR-project/TabCluster
TabCluster
Leverages
● Lexical properties
○ Hierarchical clustering in Python scipy
○ String distances
● Semantic properties (LOD tagging)
○ skos:Concept of most frequent cluster-term
○ Closest common skos:broader skos:Concept of all
cluster-terms
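The lexical half of this approach can be sketched without external dependencies. The slide names scipy's hierarchical clustering; the version below swaps in a small pure-Python single-linkage loop and `difflib` similarity as the string distance, so it is a stand-in for the idea rather than TabCluster's implementation. The function names, the 0.4 threshold and the sample terms are assumptions.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """String distance in [0, 1]: 0 for identical strings, 1 for no overlap."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(terms, threshold=0.4):
    """Agglomerative single-linkage clustering of strings.

    Repeatedly merges the two closest clusters until the closest pair is
    further apart than `threshold`.
    """
    clusters = [[term] for term in terms]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest member pair.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical occupation labels as they might appear in census headers.
groups = cluster(["bakker", "smid", "bakkers", "smit"])
```

The semantic half (tagging each cluster with a skos:Concept) would then pick a label per group; that lookup depends on an external LOD source and is left out here.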
Compatibility? Remixability? Reusability?
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity
and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics
(SemStats) ISWC 2014.
Concept Drift
Census classification of occupations as of 1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of occupations as of 1889
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of occupations as of 1899
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
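One simple way to make the drift between two census years visible is to compare the trees leaf by leaf. The sketch below assumes each classification is reduced to the two levels described on the slides (occupation groups at depth 1, occupations as leaves); the function names and all group and occupation labels are invented for illustration and do not come from the census data.

```python
def leaf_to_group(tree):
    """Invert {group: [occupations]} into {occupation: group}."""
    return {leaf: group for group, leaves in tree.items() for leaf in leaves}


def drift(old_tree, new_tree):
    """Report leaves that appear, disappear, or change group between two years."""
    old = leaf_to_group(old_tree)
    new = leaf_to_group(new_tree)
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "moved": sorted(leaf for leaf in shared if old[leaf] != new[leaf]),
    }


# Hypothetical two-level classifications for two census years.
classification_1859 = {
    "crafts": ["baker", "smith", "weaver"],
    "trade": ["merchant"],
}
classification_1889 = {
    "crafts": ["baker", "smith"],
    "industry": ["weaver", "machinist"],
    "trade": ["merchant"],
}

changes = drift(classification_1859, classification_1889)
```

A leaf that keeps its name but changes group, like "weaver" here, is the kind of change that breaks naive cross-year comparisons.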
Concept Drift
[Figure: upper ontologies (HISCO, AC) placed above year-dependent ontologies for 1859, 1869 and 1879; the final build of the slide adds two question marks]
Preserving CEDAR
• DANS-EASY as backend (http://easy.dans.knaw.nl/)
• Archived objects: Turtle snapshots
– 20 GB uncompressed, 200 MB compressed (per snapshot)
– Versioning (stats on current release)
• Users still need to
– SPARQL the data => bring up the endpoint on demand
– Run analytics on the data => outsource statistical
analysis
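Packaging one snapshot for deposit can be sketched as below. It is a sketch only: it compresses a Turtle string with gzip and records a checksum and sizes for versioning, and it does not talk to DANS-EASY. The function name, the record layout and the version label are assumptions. For scale, the figures on the slide (20 GB down to 200 MB) correspond to a compression ratio of about 100:1.

```python
import gzip
import hashlib


def archive_snapshot(turtle_text, version):
    """Compress a Turtle snapshot and describe it for versioned archiving."""
    raw = turtle_text.encode("utf-8")
    packed = gzip.compress(raw)
    return {
        "version": version,
        # Checksum of the uncompressed data, so a restored copy can be verified.
        "sha256": hashlib.sha256(raw).hexdigest(),
        "raw_bytes": len(raw),
        "packed_bytes": len(packed),
        "ratio": len(raw) / len(packed),
        "payload": packed,
    }


# A tiny, highly repetitive stand-in for a real snapshot.
line = 'cedar:BRT_1889_08_T1-S0-K17 tablink:value "12.0" .\n'
snapshot = line * 500
record = archive_snapshot(snapshot, version="example-1")

# Restoring is the inverse: decompress, then check the digest.
restored = gzip.decompress(record["payload"])
```

The archived file is static; querying it with SPARQL still means loading a restored snapshot into a triple store, which is the "endpoint on demand" point above.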
Thank you
Questions, suggestions, comments most
welcome
@albertmeronyo
http://www.cedar-project.nl
http://krr.cs.vu.nl/
http://easy.dans.knaw.nl/
http://lsd-dimensions.org/
Me in 6 tweets
http://www.albertmeronyo.org
• Background: Computer Science, Web hacker, AI & Law
• PhD candidate at the VU University Amsterdam, DANS,
and eHumanities group (KNAW)
• Topic: Semantic Web for the Humanities
• CEDAR project (2012-2015): harmonized historical
Dutch censuses in the Semantic Web
• Problem: statistical data publishing, concept drift and
dynamics of meaning
• Last paper: What is Linked Historical Data? (EKAW
2014)

Editor's notes

  1. Good afternoon everybody. I’m Albert Meroño. It’s a great pleasure to be here today; thanks to the organisers for the invitation. Today I’m going to talk a bit about the preservation of linked socio-historical data, and about the work we’ve been doing at the CEDAR project to publish socio-historical data on the Semantic Web. We also study the pros and cons of using semantic technologies to enhance the research methodologies of historians and social scientists. The interesting thing about preservation and CEDAR is a double angle: we re-publish PRESERVED data (from the 18th c.), and at the same time we think about how to PRESERVE that re-publication (preserve the Linked Data).
  2. These things are in the archive
  3. The things in the archive change. The availability of new technology forces us to open the archive, take the data out of it, do something to it, and store the new version.
  4. 2 problems: layout interpretation, and semantic alignment
  5. We like 5 star datasets. Historians also like 5 star datasets. HOWEVER, they still want their non-standard formats for data diving. Data diving guides their research and suggests new research questions.
  6. This is super cool. NOW, how do we connect with the archive to produce it?....
  7. From the ARCHIVE to RDF Data Cube TURTLE
  8. Work in progress on
  9. Interesting – they explain change explicitly, linking together metadata from different periods of time and map shapes.
  10. To what extent can we build these classifications automatically?
  11. ………………… BUT ALL DONE?
  12. Archiving the serialization of such semantic-statistic relationships?
  13. CHANGE OVER TIME