by Albert Meroño, presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
CEDAR & PRELIDA: Preservation of Linked Socio-Historical Data
1. CEDAR & PRELIDA
Preservation of Linked Socio-Historical Data
Albert Meroño-Peñuela
@albertmeronyo
PRELIDA consolidation workshop @ ISWC, 17-10-2014
9. • Web publishable
• Machine processable
• Dynamic schema
• Easily link with other datasets
10. Why with semantic technology?
• Web publishable, human & machine readable
• Finer granularity (down to cell level)
• Statistical comparability by leveraging semantic descriptions
• Provenance
• Harmonization through linkage to other datasets (the 5th star)
11. RDF Data Cube
“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”
14. RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist of dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
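The example observation above can be sketched in Turtle as a qb:Observation. The `ex:` namespace and property names below are illustrative stand-ins, not actual CEDAR or ONS identifiers:

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Hypothetical namespace for this sketch
@prefix ex:  <http://example.org/statistics/> .

ex:obs-newport-2004-2006-male a qb:Observation ;
    qb:dataSet        ex:lifeExpectancyDataset ;
    # Dimensions: time period, region, sex
    ex:refPeriod      ex:period-2004-2006 ;
    ex:refArea        ex:newport ;
    ex:sex            ex:male ;
    # Measure: life expectancy
    ex:lifeExpectancy "76.7"^^xsd:decimal ;
    # Attribute: unit of measure
    ex:unitMeasure    ex:years .
```

Each component of the slide's observation sentence maps onto one dimension, measure, or attribute property on the observation resource.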
19. Classification Systems and Concept Schemes
• Some harmonized dimensions are missing!
• Encode all variables and their values using concept schemes
• Some already exist
– Which ones? How many of them?
– Where?
– By whom?
– Are they used at all? Can I reuse them?
• Some need to be created
– Manually, based on expert knowledge
– Can we do it automatically? Or assist the process?
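Encoding a variable and its values as a concept scheme is typically done with SKOS. A minimal sketch in Turtle, assuming a hypothetical `ex:` namespace and made-up religion codes (not actual CEDAR resources):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
# Hypothetical namespace for this sketch
@prefix ex:   <http://example.org/cedar/> .

ex:religionScheme a skos:ConceptScheme ;
    skos:prefLabel "Historical religious denominations"@en .

ex:dutchReformed a skos:Concept ;
    skos:inScheme  ex:religionScheme ;
    skos:prefLabel "Nederlands Hervormd"@nl , "Dutch Reformed"@en ;
    # Linking to a concept in another (hypothetical) dataset supports
    # discoverability and reuse of existing schemes
    skos:closeMatch <http://example.org/other-dataset/reformed> .
```

Dimension values in the cube can then point at these concepts instead of plain strings, which is what makes cross-dataset comparison possible.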
23. Existing LSD dimensions
• P1: Discoverability? How to discover dimensions created by others?
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others?
• P3: Relevance? What’s the size of LSD?
29. Existing LSD dimensions
• P1: Discoverability? How to discover dimensions created by others? → LSD Dimensions
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? → Logarithmic law / probably yes
• P3: Relevance? What’s the size of LSD? → ~7.9% of the LOD cloud
30. Creating new LSD Dimensions
• CEDAR needs concept schemes for
– Historical religious denominations (i.e. religions in the Netherlands in the 18th-20th c.)
– Historical occupations (id.)
– Historical building types (id.)
41. Preserving CEDAR
• DANS-EASY as backend (http://easy.dans.knaw.nl/)
• Archived objects: Turtle snapshots
– 20 GB uncompressed, 200 MB compressed (per snapshot)
– Versioning (stats on current release)
• Users still need to
– SPARQL the data => bring up the endpoint on demand
– Run analytics on the data => outsource statistical analysis
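Once the on-demand endpoint is up, users can query the archived cubes directly. A sketch of such a query, assuming a hypothetical dataset URI and year property (the actual CEDAR identifiers may differ):

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>
# Hypothetical namespace for this sketch
PREFIX ex: <http://example.org/cedar/>

# Count observations per census year in a (hypothetical) harmonized dataset
SELECT ?year (COUNT(?obs) AS ?n)
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet ex:censusDataset ;
       ex:censusYear ?year .
}
GROUP BY ?year
ORDER BY ?year
```

Heavier statistical analysis would be outsourced rather than run on the endpoint itself, as the slide notes.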
42. Thank you
Questions, suggestions, comments most welcome
@albertmeronyo
http://www.cedar-project.nl
http://krr.cs.vu.nl/
http://easy.dans.knaw.nl/
http://lsd-dimensions.org/
43. Me in 6 tweets
http://www.albertmeronyo.org
• Background: Computer Science, Web hacker, AI & Law
• PhD candidate at the VU University Amsterdam, DANS, and the eHumanities group (KNAW)
• Topic: Semantic Web for the Humanities
• CEDAR project (2012-2015): harmonized historical Dutch censuses on the Semantic Web
• Problem: statistical data publishing, concept drift and dynamics of meaning
• Latest paper: What is Linked Historical Data? (EKAW 2014)
Speaker notes
Good afternoon everybody. I’m Albert Meroño. It’s a great pleasure to be here today, thanks to the organisers for the invitation…
Today I’m going to talk a bit about the preservation of linked socio-historical data, and about the work we’ve been doing in the CEDAR project to publish socio-historical data on the Semantic Web. We study the pros and cons of using semantic technologies to enhance the research methodologies of historians and social scientists.
The interesting thing about preservation and CEDAR is the double angle:
What we do is re-publish PRESERVED data (from the 18th c.)
At the same time we think about how to PRESERVE that re-publication (preserve the Linked Data)
These things are in the archive
The things in the archive change.
The availability of new technology forces us to open the archive, take the data out of it, do something to it, and store the new version.
2 problems: layout interpretation, and semantic alignment
We like 5 star datasets. Historians also like 5 star datasets. HOWEVER, they still want their non-standard formats for data diving. Data diving guides their research and suggests new research questions.
This is super cool. NOW, how do we connect with the archive to produce it?....
From the ARCHIVE to RDF Data Cube TURTLE
Work in progress on
Interesting – they explain change explicitly, linking together metadata from different periods of time and map shapes.
To what extent can we build these classifications automatically?
… but is it all done?
Archiving the serialization of such semantic-statistic relationships?