1. 5 years of Dataverse evolution
Slava Tykhonov
Senior Information Scientist (DANS-KNAW)
Research & Innovation meeting
26.01.2021
2. Dataverse based Clio Infra collaboration platform (2015)
Clio Infra functionality based on the Dataverse solution:
- teams curate, share and analyze research datasets collaboratively
- team members can share the responsibility of collecting data on specific variables (for example, countries) and inform each other about changes and additions
- the dataset version control system can track changes in datasets
- other researchers can download their own copy of the data if the dataset is published as Open Data
- a flexible metadata store (Dataverse) is connected to the research dataset storage by a data processing engine
4. DANS Dataverse 3.x migration (2016)
Basic DataverseNL services:
• Federated login for Netherlands institutions
• Persistent Identifier services (DOI and Handle)
• Integration with archival systems
Applications:
• Modern and historical world map visualisations
• Data API and Geo API services for projects with data
• Panel dataset constructor
• Time series plots
• Treemaps
• Pie and chart visualizations
• Descriptive statistics tools
5. Major challenges to provide services for researchers
● Maintenance concerns: who will be in charge after the project is finished?
● Infrastructure problems: how do we install and run tools for researchers?
● Various interoperability issues: how do we enable data exchange between different systems and services?
Plus software updates and bug fixing, licences, technical staff training, legal aspects and so on...
6. The influence of APIs standards on innovation
Source: V. Tykhonov “API Economy”
7. Interoperability in EOSC
● Technical interoperability is defined as the “ability of different information technology systems and software applications to communicate and exchange data”. Systems should be able “to accept data from each other and perform a given task in an appropriate and satisfactory manner without the need for extra operator intervention”.
● Semantic interoperability is “the ability of computer systems to transmit data with unambiguous, shared meaning. Semantic interoperability is a requirement to enable machine computable logic, inferencing, knowledge discovery, and data federation between information systems”.
● Organisational interoperability refers to the “way in which organisations align their business processes, responsibilities and expectations to achieve commonly agreed and mutually beneficial goals. Focus on the requirements of the user community by making services available, easily identifiable, accessible and user-focused”.
● Legal interoperability covers “the broader environment of laws, policies, procedures and cooperation agreements”
Source: EOSC Interoperability Framework v1.0
9. DANS Data Stations - Future Data Services
Dataverse is an API-based data platform and a key framework for Open Innovation!
10. Dataverse architecture in a nutshell
Basic components: database (PostgreSQL), search index (Solr) and web application (Glassfish/Payara)
Simple but powerful! How about maintenance?
12. The Cathedral and the Bazaar
“The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary
(abbreviated CatB) is an essay, and later a book, by Eric S. Raymond on software engineering methods,
based on his observations of the Linux kernel development process and his experiences managing an
open source project, fetchmail. It examines the struggle between top-down and bottom-up design.”
Wikipedia
Some important points:
● Smart data structures and dumb code works a lot better than the other way
around
● When writing gateway software of any kind, take pains to disturb the data
stream as little as possible—and never throw away information unless the
recipient forces you to!
● Any tool should be useful in the expected way, but a truly great tool lends itself
to uses you never expected
13. Principle of good enough
The principle of good enough or "good enough" principle is a rule in software and systems design. It
indicates that consumers will use products that are good enough for their requirements, despite the
availability of more advanced technology.
Wikipedia
The KISS principle ("Keep It Simple, Stupid") provides a series of design rules, among them:
● Separate mechanisms from policy
● Write simple programs
● Write transparent programs
● Value developer time over machine time
● Make data complicated when required, not the program
● Build on potential users' expected knowledge
● Write programs which fail in a way that is easy to diagnose
● Prototype software before polishing it
● Make the program and protocols extensible
14. What should be simplified to make Dataverse “good enough”?
“One-liner” installation requirements include:
● even users without any technical knowledge should be able to install it
● simple, clear and transparent infrastructure ready for integration (Docker based)
● a reverse proxy and load balancer should be set up, both locally and on a remote host, to run the Dataverse website (Nginx/Traefik)
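The Docker-based infrastructure mentioned above could be sketched as a compose file; this is a hypothetical outline, not the project's actual configuration, and the image names, versions and credentials are all assumptions:

```yaml
# Hypothetical docker-compose sketch of the basic Dataverse stack
# behind a Traefik reverse proxy (image tags and ports are assumptions).
version: "3"
services:
  traefik:
    image: traefik:v2.4
    ports: ["80:80", "443:443"]
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: dataverse
      POSTGRES_PASSWORD: secret   # placeholder, never commit real secrets
  solr:
    image: solr:8
  dataverse:
    image: gdcc/dataverse:latest  # image name is an assumption
    depends_on: [postgres, solr]
```

With something like this in place, the "one-liner" installation goal reduces to a single `docker-compose up`.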
Q: How do we cross the chasm?
A: Let’s try to capture the mainstream!
15. Using Dataverse to fight against COVID-19
1300+ people registered in the organization
16. Jupyter integration: dataset conversion to pandas dataframes
Can AI researchers read and reuse data directly from Dataverse in a collaborative way?
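In a notebook, an ingested tabular file can be pulled through the Dataverse data access API and turned into a dataframe. A minimal sketch, assuming the file was ingested (so the API serves it as tab-delimited text); the server URL and file id are examples, not real identifiers:

```python
import io
import urllib.request

import pandas as pd

def parse_tabular(text: str) -> pd.DataFrame:
    """Parse the tab-delimited archival format that Dataverse serves
    for ingested tabular files."""
    return pd.read_csv(io.StringIO(text), sep="\t")

def datafile_to_dataframe(server: str, file_id: int) -> pd.DataFrame:
    """Download an ingested tabular file via the Dataverse data access API
    (the file id would come from the dataset's file listing)."""
    url = f"{server}/api/access/datafile/{file_id}"
    with urllib.request.urlopen(url) as resp:
        return parse_tabular(resp.read().decode("utf-8"))

# e.g. in a notebook (hypothetical file id):
# df = datafile_to_dataframe("https://demo.dataverse.org", 1234)
```

From there the dataset is an ordinary pandas dataframe, ready for collaborative analysis.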
17. Crossing the chasm...
Technology adoption requires further automation of all processes.
Our goal is to deliver production ready Dataverse for the European Open Science
Cloud (EOSC):
● SSHOC project: Docker/Kubernetes, common CI/CD pipeline, integration
tests, previewers, language localization, external tools
● EOSC Synergy Software Quality Assurance (SqaaS) pipeline integration
● CLARIAH - leveraging the metadata schema with the CLARIN community, CLARIN tools integration, developing common pipelines
● FAIRsFAIR - enabling FAIR Data Points (FDP) in Dataverse
● ODISSEI - using Dataverse as a data registry
18. Services in European Open Science Cloud (EOSC)
● EOSC requires at least maturity level 8
● we need the highest quality of software to be accepted as a service
● clear and transparent evaluation of services is essential
● evidence of technical maturity is the key to success
● a limited warranty will make it possible to stop out-of-warranty services
19. Running Dataverse in production on Cloud
[Architecture diagram: users connect through an HTTP(S) load balancer to the Dataverse service running on a Kubernetes Engine cluster; the cluster node hosts the Dataverse, PostgreSQL, Solr and email relay deployments plus a Certbot cronjob, with container images pulled from a container registry.]
20. Dataverse Kubernetes
Project maintained by Oliver Bertuch (FZ Jülich) and available in the Global Dataverse Community Consortium (GDCC) GitHub
Google Cloud, Amazon AWS and Microsoft Azure platforms are supported
Open Source, community pull requests are welcome
http://github.com/IQSS/dataverse-kubernetes
21. SQA process with Selenium tests for Dataverse
Selenium IDE allows you to create and replay all UI tests in your browser.
Shared tests can be reused by the community to increase reproducibility.
SQA for service maturity = unit tests + integration tests
Source: SSHOC project, data repositories task WP5.2
22. CI/CD pipeline with SQAaaS (S)
[Pipeline diagram: git push → webhook → Jenkins pipeline (Jenkinsfile) → SQAaaS checks → Docker image → GCP container registry → Kubernetes deployment]
1. Developer pushes code to GitHub
2. Jenkins receives notification - build trigger
3. Jenkins clones the workspace
4. (S) Runs SQA tests and does FAIRness check
5. (S) Issuing digital badge according to the results
6. (S) SQAaaS API triggers appropriate workflow
7. Creates docker image if success
8. Pushes new docker image to container registry
9. Updates the kubernetes deployment
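The nine steps above could be captured in a declarative Jenkinsfile along these lines; this is a hedged sketch, not the project's actual pipeline, and the stage names, the SQA script and the registry path are all invented:

```groovy
// Hypothetical Jenkinsfile sketch of the CI/CD flow (paths and names invented)
pipeline {
  agent any
  stages {
    stage('Checkout') { steps { checkout scm } }                 // step 3: clone the workspace
    stage('SQA')      { steps { sh './run-sqa-checks.sh' } }     // steps 4-6: SQAaaS tests, badge, workflow
    stage('Build')    { steps { sh 'docker build -t eu.gcr.io/project/dataverse:$BUILD_NUMBER .' } }   // step 7
    stage('Push')     { steps { sh 'docker push eu.gcr.io/project/dataverse:$BUILD_NUMBER' } }          // step 8
    stage('Deploy')   { steps { sh 'kubectl set image deployment/dataverse dataverse=eu.gcr.io/project/dataverse:$BUILD_NUMBER' } }  // step 9
  }
}
```

A failing SQA stage stops the pipeline before any image is built, which is exactly the gate the digital badge is meant to certify.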
Source: EOSC Synergy project
23. Data Commons is essential for integrations
Mercè Crosas, “Harvard Data Commons”
25. Our goals to increase Dataverse interoperability
Provide a custom FAIR metadata schema for European research communities:
● CESSDA metadata (Consortium of European Social Science Data Archives)
● Component MetaData Infrastructure (CMDI) metadata from CLARIN
linguistics community
Connect metadata to ontologies and CVs:
● link metadata fields to common ontologies (Dublin Core, DCAT)
● define semantic relationships between (new) metadata fields (SKOS)
● select available external controlled vocabularies for the specific fields
● provide multilingual access to controlled vocabularies
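The semantic layer above can be illustrated in RDF: a metadata value mapped to a Dublin Core property and linked to a multilingual SKOS concept. A hypothetical Turtle sketch (all URIs and labels invented):

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <https://example.org/cv/> .

# A dataset keyword expressed as a Dublin Core subject pointing to a
# controlled-vocabulary concept with labels in two languages.
<https://example.org/dataset/1> dct:subject ex:agriculture .

ex:agriculture a skos:Concept ;
    skos:prefLabel "Agriculture"@en, "Landbouw"@nl ;
    skos:broader ex:economy .
```

The multilingual `skos:prefLabel` values are what make language-aware term suggestion possible.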
26. One metadata field can be linked to many ontologies
The language switch in Dataverse will change the language of the suggested terms!
27. The FAIR Signposting Profile
Herbert Van de Sompel
https://hvdsomp.info
Two levels of access to Web resources:
● level one provides a concise, minimal set of links by value in the HTTP header
● level two delivers a complete, comprehensive set of links by reference in a standalone document (link set)
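A level-one response advertises links by value in the HTTP `Link` header, while `rel="linkset"` points to the level-two document. A sketch using the Signposting link relations (`cite-as`, `describedby`, `linkset`); the URLs below are invented:

```http
HTTP/1.1 200 OK
Link: <https://doi.org/10.5072/EXAMPLE>; rel="cite-as",
      <https://example.org/api/datasets/export?format=schema.org&doi=10.5072/EXAMPLE>; rel="describedby"; type="application/ld+json",
      <https://example.org/linkset/10.5072/EXAMPLE>; rel="linkset"; type="application/linkset+json"
```

A crawler can thus discover the persistent identifier and machine-readable metadata from a single HEAD request.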
28. Dataverse meta(data) in FAIR Data Point (FDP)
● RESTful web service that enables data owners to expose their datasets using rich machine-readable metadata
● provides standardized descriptions (RDF-based metadata) using controlled vocabularies and ontologies
● FDP spec is public
Source: FDP
The goal is to run an FDP on the Dataverse side (DCAT, CVs) and provide metadata export in RDF!
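The RDF an FDP exposes is typically DCAT. A minimal sketch of how a Dataverse dataset and its downloadable file could be described; the URIs, title and publisher are invented for illustration:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical DCAT description of a dataset with one distribution.
<https://example.org/dataset/1> a dcat:Dataset ;
    dct:title "Example research dataset"@en ;
    dct:publisher <https://example.org/org/example-archive> ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:downloadURL <https://example.org/api/access/datafile/42> ;
        dct:format "text/tab-separated-values"
    ] .
```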
30. Dataverse localization with Weblate
● a service to connect files to Weblate in order to translate them in a structured way
● several options for project visibility: accept translations by the crowd, or only give access to a select group of translators
● Weblate indicates untranslated strings, strings with failing checks, and strings that need approval
● when new strings are added with an upgrade of Dataverse, Weblate can indicate which strings are new and untranslated
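The files in question are key-value translation bundles, the format Weblate works on. A hypothetical pair of entries (the keys and strings are invented, not actual Dataverse bundle keys):

```properties
# Bundle.properties (English source strings)
dataset.publish=Publish Dataset
dataset.versions=Versions

# Bundle_nl.properties (Dutch translations managed in Weblate)
dataset.publish=Dataset publiceren
dataset.versions=Versies
```

Because every string has a stable key, Weblate can diff bundles across Dataverse upgrades and flag exactly which keys still lack a translation.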
34. Make Data Count
Make Data Count is part of a broader effort with the Research Data Alliance (RDA) Data Usage Metrics Working Group, which helped to produce a specification called the COUNTER Code of Practice for Research Data.
The following metrics can be downloaded directly from the DataCite hub for datasets hosted by Dataverse
installations:
● Total Views for a Dataset
● Unique Views for a Dataset
● Total Downloads for a Dataset
● Unique Downloads for a Dataset
● Citations for a Dataset (via Crossref)
The Dataverse Metrics API is a powerful source for BI tools used for Data Landscape monitoring.
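Such monitoring can be fed from the Metrics API endpoints under `/api/info/metrics`. A sketch, assuming the standard `{"status": ..., "data": {"count": N}}` response envelope; the server URL is an example:

```python
import json
import urllib.request

def metrics_url(server: str, metric: str, to_month: str = "") -> str:
    """Build a Dataverse Metrics API URL, e.g. /api/info/metrics/downloads,
    optionally limited to a cumulative month with /toMonth/YYYY-MM."""
    url = f"{server}/api/info/metrics/{metric}"
    if to_month:
        url += f"/toMonth/{to_month}"
    return url

def extract_count(payload: str) -> int:
    """Pull the count out of the standard {"status": ..., "data": {"count": N}} envelope."""
    return json.loads(payload)["data"]["count"]

# Fetching would look like this (network call, not run here):
# with urllib.request.urlopen(metrics_url("https://demo.dataverse.org", "downloads")) as r:
#     total_downloads = extract_count(r.read().decode())
```

A BI tool can poll these counts on a schedule and chart them alongside other repositories' figures.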
35. Metrics for BI and integration with Apache Superset
Source: Apache Superset (Open Source)
37. Apache Airflow for Dataverse pipelines
● intended for acyclic processes, i.e. data-processing workflows with a point of "completion"
● a DAG (Directed Acyclic Graph) is a collection of all the tasks, organized in a way that reflects their relationships and dependencies
● an absolutely essential component for harvesting and depositing data
● the Airflow dashboard gives a clear overview of the status of all running processes
On the roadmap of ODISSEI project!
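The scheduling idea behind a DAG can be sketched without Airflow itself: every task runs only after its dependencies have completed. A toy example with hypothetical harvest-and-deposit task names, where Python's standard-library `graphlib` stands in for Airflow's scheduler:

```python
from graphlib import TopologicalSorter

# Hypothetical harvesting-and-depositing pipeline: each task maps to the
# list of tasks it depends on, exactly as an Airflow DAG would express it.
dag = {
    "harvest_oai": [],
    "validate_metadata": ["harvest_oai"],
    "map_to_dataverse_json": ["validate_metadata"],
    "deposit_via_api": ["map_to_dataverse_json"],
    "notify": ["deposit_via_api"],
}

def execution_order(graph: dict) -> list:
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())

print(execution_order(dag))
```

Airflow adds retries, scheduling and the monitoring dashboard on top, but the dependency model is exactly this.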
38. Conclusion
Thanks to its open architecture and use of open standards, the Dataverse team has managed to attract the best people, create a strong community and, finally, build a product completely aligned with the principles of Open Innovation.
Fit for the future and community-driven, it has every chance to “cross the chasm” and become a prominent FAIR data repository on all continents.
Dataverse already has a very rich ecosystem for technological innovation that will allow the integration of tools which don't exist yet.
“Any tool should be useful in the expected way, but a truly great tool
lends itself to uses you never expected”...