Data standardization process for social sciences and humanities
1. dans.knaw.nl
DANS is an institute of KNAW and NWO
Data standardization process
for social sciences and humanities
Vyacheslav Tykhonov
Senior Information Scientist
Data Archiving and Networked Services
(DANS-KNAW, Netherlands)
Time Machine conference 2018
3. Common problems in data management
The data standardization process plays a key role in the data
management plan of any organization, but the current situation in
research data management is very complex:
• too much data chaos in datasets
• no data transparency
• sometimes no standards available
• no provenance information attached to data
• homonyms, synonyms, generalizations, specializations,
spelling variations and mistakes, and language versions all
complicate keyword-based search and retrieval of
information
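The last point can be illustrated with a minimal sketch: mapping free-text keywords onto a single concept identifier so that synonyms, spelling variants and language versions resolve to the same entry. The vocabulary, the `cv:` identifiers and the function names below are invented for illustration and do not come from any real CESSDA or DANS vocabulary.

```python
import unicodedata
from typing import Optional

# Hypothetical mini vocabulary: concept ID -> accepted labels
# (synonyms, spelling variants, language versions). All values are invented.
VOCABULARY = {
    "cv:unemployment": ["unemployment", "joblessness", "werkloosheid"],
    "cv:labour": ["labour", "labor", "arbeid"],
}


def normalize(text: str) -> str:
    """Strip accents, fold case and trim surrounding whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold().strip()


# Reverse index: normalized label -> concept ID.
LABEL_INDEX = {
    normalize(label): concept_id
    for concept_id, labels in VOCABULARY.items()
    for label in labels
}


def standardize(keyword: str) -> Optional[str]:
    """Return the concept ID for a keyword, or None if it is not in the vocabulary."""
    return LABEL_INDEX.get(normalize(keyword))


for raw in ["Labor", " Werkloosheid ", "inflation"]:
    print(repr(raw), "->", standardize(raw))
```

Keywords that return `None` are the ones that would still need manual curation or a vocabulary extension; spelling mistakes are not handled by this sketch.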
4. Controlled vocabulary and thesaurus
• Linked data is one step forward (or actually backward in the right
direction) toward solving some of the standardization problems.
• With shared controlled vocabularies (CVs) created and
maintained by experts in various domains, digital items can
be annotated with them and easily retrieved by other experts
from the same domain without having to be a librarian. Adoption
by a critical mass is a clear indication that a vocabulary is
good enough.
• A thesaurus is a semantic network of unique concepts, including
relationships between synonyms, broader and narrower
(parent/child) contexts, and other related concepts. A thesaurus
provides the hierarchy for controlled vocabularies.
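As a rough sketch of the thesaurus structure described above, the following models concepts with synonyms and broader (parent) links, and derives the narrower (child) links from them. The class, the concept IDs and the labels are assumptions made up for this example, not part of any existing thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str
    pref_label: str
    alt_labels: list = field(default_factory=list)  # synonyms
    broader: list = field(default_factory=list)     # parent concept IDs
    related: list = field(default_factory=list)     # non-hierarchical links


# Hypothetical three-level thesaurus; IDs and labels are invented.
THESAURUS = {
    c.concept_id: c
    for c in [
        Concept("c1", "economy"),
        Concept("c2", "labour market", broader=["c1"]),
        Concept("c3", "unemployment", alt_labels=["joblessness"], broader=["c2"]),
    ]
}


def narrower(concept_id: str) -> list:
    """Children of a concept: every concept that names it as broader."""
    return sorted(
        c.concept_id for c in THESAURUS.values() if concept_id in c.broader
    )


def ancestors(concept_id: str) -> list:
    """All broader concepts, nearest first (breadth-first walk up the hierarchy)."""
    seen = []
    queue = list(THESAURUS[concept_id].broader)
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(THESAURUS[current].broader)
    return seen


print(narrower("c1"))
print(ancestors("c3"))
```

Storing only the broader links and deriving the narrower ones keeps the two directions from drifting apart, which is one common design choice; a real thesaurus may store both explicitly.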
5. Timbuctoo datastore
• Developed by Huygens ING (KNAW, Netherlands)
for academic research in the Arts and Humanities,
which often yields complex and heterogeneous
data.
• Lives up to academic standards for working with
such content: the infrastructure accommodates
different views on a subject and leaves the
interpretation of the data to the researcher.
• Keeps track of data provenance and does not
impose a certain research methodology on its
6. DataverseEU data repository
Dataverse is a data repository developed by Harvard IQSS.
The DataverseEU project is funded by CESSDA, a consortium for promoting the results of social science
research and supporting international research cooperation. We are developing a multilingual web interface,
localizing metadata fields, and have developed a data standardization technique based on APIs for the CESSDA
CVs, Topic Classification and CESSDA CV Manager services.
DataverseEU countries:
• Hungary (TARKI)
• Sweden (SND)
• Slovenia (ADP)
• Germany (GESIS)
• France (Sciences Po)
• Austria (AUSSDA)
• United Kingdom (UKDA)
• Italy (UniData)
• Belgium (SODA)
• Latvia (LSZDA)
• Netherlands (DANS-KNAW)
7. SKOS RDF vocabularies are perfect input for Timbuctoo
We import thesauri delivered as SKOS RDF, for example:
The Timbuctoo API endpoint delivers back JSON suitable for web
applications.
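To give a feel for the step from SKOS RDF to JSON, here is a minimal sketch that reads a few SKOS statements written as N-Triples and groups them per concept. The `example.org` URIs and labels are invented, the tiny regex parser handles only the simple triples shown, and the resulting JSON shape is an assumption for illustration, not the actual Timbuctoo response format.

```python
import json
import re

SKOS = "http://www.w3.org/2004/02/skos/core#"  # the SKOS namespace

# Invented sample data in N-Triples; the example.org URIs are placeholders.
NT = """
<http://example.org/cv/unemployment> <http://www.w3.org/2004/02/skos/core#prefLabel> "unemployment"@en .
<http://example.org/cv/unemployment> <http://www.w3.org/2004/02/skos/core#altLabel> "joblessness"@en .
<http://example.org/cv/unemployment> <http://www.w3.org/2004/02/skos/core#broader> <http://example.org/cv/labour-market> .
"""

# Matches: <subject> <predicate> (<object-uri> | "literal"[@lang]) .
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"([^"]*)"(?:@([\w-]+))?)\s*\.$'
)


def skos_to_json(ntriples: str) -> dict:
    """Group SKOS statements by concept URI and by SKOS property name."""
    concepts = {}
    for line in ntriples.strip().splitlines():
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip anything this simplified parser cannot read
        subject, predicate, obj_uri, literal, lang = match.groups()
        if not predicate.startswith(SKOS):
            continue
        key = predicate[len(SKOS):]  # e.g. "prefLabel", "broader"
        value = {"uri": obj_uri} if obj_uri else {"value": literal, "lang": lang}
        concepts.setdefault(subject, {}).setdefault(key, []).append(value)
    return concepts


print(json.dumps(skos_to_json(NT), indent=2))
```

For real vocabularies a full RDF library would be the safer choice; the point here is only that the `broader` link survives as a URI rather than being flattened into a label.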
10. Record in Dublin Core from Dataverse OAI-PMH endpoint
Here is a problem: the values are standardized, but…
we've just lost the link to the controlled vocabularies in the Knowledge Graph!
We need a Linked Data repository (Timbuctoo) to keep all relations
alive.
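The loss described here can be sketched with a small Dublin Core record of the kind an OAI-PMH endpoint returns in `oai_dc` format. The record content below is invented, and the `CONCEPT_URIS` lookup is a hypothetical stand-in for what a Linked Data store would preserve; only the two XML namespaces are the standard ones.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace

# Invented sample record; a real Dataverse export will contain more fields.
RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example survey dataset</dc:title>
  <dc:subject>unemployment</dc:subject>
  <dc:subject>labour market</dc:subject>
</oai_dc:dc>"""


def subjects(record_xml: str) -> list:
    """Return the dc:subject values: plain strings, with no concept URI attached."""
    root = ET.fromstring(record_xml)
    return [element.text for element in root.findall(f"{{{DC}}}subject")]


# Hypothetical label -> concept URI table; example.org URIs are placeholders.
CONCEPT_URIS = {"unemployment": "http://example.org/cv/unemployment"}


def relink(labels: list) -> dict:
    """Try to restore the vocabulary link from the bare label (None if unknown)."""
    return {label: CONCEPT_URIS.get(label) for label in labels}


labels = subjects(RECORD)
print(labels)
print(relink(labels))
```

Re-linking by label after the fact is fragile, since any label missing from the table comes back as `None`. Keeping the concept URIs in the repository from the start avoids that guesswork.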
11. Standardized RDF harvested by Timbuctoo
All relations are exported, available in the Knowledge Graph,
and ready for further querying and exploration:
12. Timbuctoo GraphiQL data exploration tool
All semantic relations can be found in the @context field, and the PIDs of linked datasets in @data:
Standardized metadata is exported to the Linked Open Data (LOD) Cloud!