https://zbmed.github.io/damalos DaMaLOS 2020 workshop at ISWC 2020
The swings and roundabouts of a decade of fun and games with Research Objects
Over the past decade we have been working on the concept of Research Objects (http://researchobject.org). We worked at the 20,000 feet level on a new form of FAIR+R scholarly communication and at the 5 feet level on a packaging approach for assembling all the elements of a research investigation, like the workflows, data and tools associated with a publication. Even in our earliest papers [1] we recognised that there was more to success than just using Linked Data. We swung from fancy RDF models and SHACL that specialists could use to simple RO-Crates and JSON-LD that mortal programmers could use. Success depends on a use case need, a community and tools, which we found particularly in the world of computational workflows right from the outset. All along there have been politics, and that has recently got even hotter. All fun and games!
[1] Sean Bechhofer, et al (2013) Why Linked Data is Not Enough for Scientists. FGCS 29(2):599-611. https://doi.org/10.1016/j.future.2011.08.004
1. The swings and roundabouts
of a decade of fun and games
with Research Objects
Carole Goble
The University of Manchester
Researchobject.org
carole.goble@manchester.ac.uk
DaMaLOS workshop 02 Nov 2020 at ISWC 2020
3. Our RO start up – what, why and how…
ROs in the Large – The Vision
A new form of Scholarly Communication.
RDM support throughout the research cycle.
ROs in the Small – The Implementation
Packaging digital components.
Referencing physical components.
4. Our World of FAIR Thematic Research Infrastructures
(aka Cyberinfrastructure) – Biology, Biodiversity
“facilities that provide resources and services for
research communities to conduct research and
foster innovation… they may be single-sited,
distributed, or virtual.
• major scientific equipment or sets of instruments
• collections, archives or scientific data
• computing systems and communication networks
• any other research and innovation infrastructure
of a unique nature which is open to external
users”
European Commission
Research
Software
Engineers
https://ec.europa.eu/info/research-and-innovation/strategy/european-research-infrastructures_en
5. FAIR Data Commons
Diverse Research Objects – models, data, pipelines, lab
protocols and SOPs, provenance... citable, exchangeable,
publishable, preserved, executable objects and
collections of objects.
Assemble and share large scale, multi-element datasets. Secure referencing
and moving of sensitive data. Zoo of
catalogues & resources.
Across 13 Research Infrastructures.
Reproduce, port, share, and execute analytics & pipelines
NIH Data
Commons
6. FAIR Digital Objects
PIDs + Metadata
Sounds like Linked
Data!
The FAIR Guiding Principles for
scientific data management and
stewardship
Scientific Data 3, 160018 (2016)
doi:10.1038/sdata.2016.18
7. Motivation
in 2007.
Still is.
The different objects that are the research
…..documentation
Structured, interrelated objects in context
….documentation
The narrative paper
8. From Manuscripts to Research Objects
“An article about computational science in a scientific publication is not the scholarship itself, it is
merely advertising of the scholarship. The actual scholarship is the complete software
development environment, [the complete data] and the complete set of instructions which
generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995
research outcomes more than just
publications
data, software, models, workflows, SOPs,
lab protocols are first class citizens of
scholarship
added information required to make
research FAIR and Reproducible (FAIR+R) …
9. Do Research
Project Research
Infrastructure Services
Assemble
Methods, Materials
Analyse
Results
Quality
Assessment
Track and Credit
Disseminate
Deposit &
Licence
Public Marketplace
Services
Publish
Share
Results
Any
research
product
Selected
products
Manage
Results
ROs all the way along,
not just the end
Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015
Ainsworth and Buchan: e-Labs and Work Objects: Towards Digital Health Economies, EuropeComm 2009, pp 205-216
Experiment
Observe, Simulate
10. Digital Objects
electric butterflies
Digital twins
Actionable knowledge units
FAIR Digital Objects
courtesy Dimitris Koureas
Coordinator DiSSCo EU Research
Infrastructure
Specimen object image
courtesy of Alex Hardisty
11. Digital Objects as First Class Entities
FAIR Digital Object Framework
A Knowledge Graph of FDOs
• Schwardmann (2020), Digital Objects – FAIR Digital Objects: Which Services Are Required? Data Science Journal
• EOSC Interoperability Framework Draft (2020)
• Hardisty A, et al (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure. RIO 6: e54280.
• DONA Digital Object Architecture, Digital Object Interface Protocol (2018)
• https://fairdigitalobjectframework.org/
12. From Manuscripts to FAIR+R Research Objects
research objects related and
bundled together …
one shareable, cite-able,
exchangeable resource that can be
versioned and snapshot …
metadata describing context
and content of objects
dependencies, versions,
relationships, provenance …
enough to be reproducible
virtual objects , links to physical objects (people, specimens, equipment)
integrated view over fragmented & scattered specialised repositories
internal content and references
13. Bigger on the inside than the outside
Packaging
internal content and references
From Manuscripts to FAIR+R Research Objects
15. Packaging of Digital Objects
Driver: Computational Workflows
http://wf4ever.org/
Preservation of computational workflows in data-intensive science
• Workflow-centric Research Objects
• Computational workflows, provenance of executions,
interconnections between workflows and related
resources (e.g., datasets, publications, etc.), social
aspects in the experiments.
• Wf-centric RO creation & management best practices
• analysis and management of decay in workflows.
16. Methods
(..)
De novo assembly and binning
Raw reads from each run were first assembled with SPAdes v.3.10.0 with option --meta.
Thereafter, MetaBAT 2 (v.2.12.1) was used to bin the assemblies using a minimum contig
length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of
coverage required for the binning was inferred by mapping the raw reads back to their
assemblies with BWA-MEM v.0.7.16 and then calculating the corresponding read depths
of each individual contig with samtools v.1.5 (‘samtools view -Sbu’ followed by ‘samtools
sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The
QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.7
using the lineage_wf workflow and calculated as: level of completeness − 5 ×
contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from
INFERNAL v.1.1.2 (options -Z 1000 --hmmonly --cut_ga) using the Rfam covariance
models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the
sum of all non-overlapping hits. Each gene was considered present if more than 80% of the
expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified
with tRNAscan-SE v.2.0 using the bacterial tRNA model (option -B) and default
parameters. Classification into high- and medium-quality MAGs was based on the criteria
defined by the minimum information about a metagenome-assembled genome (MIMAG)
standards (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S
rRNA genes, and at least 18 tRNAs; medium: ≥50% completeness and <10% contamination).
(...)
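The CheckM quality score and the MIMAG thresholds quoted in the Methods above reduce to a few lines of arithmetic. A toy sketch of that classification step (the function name and the example values are hypothetical, not the paper's actual code):

```python
def mimag_quality(completeness, contamination, rrnas_present=False, trnas=0):
    """Classify a MAG by the MIMAG criteria quoted above.

    QS = completeness - 5 * contamination (the CheckM convention).
    high:   >90% complete, <5% contaminated, 5S/16S/23S rRNAs present, >=18 tRNAs
    medium: >=50% complete, <10% contaminated
    """
    qs = completeness - 5 * contamination
    if completeness > 90 and contamination < 5 and rrnas_present and trnas >= 18:
        quality = "high"
    elif completeness >= 50 and contamination < 10:
        quality = "medium"
    else:
        quality = "low"
    return qs, quality

# Hypothetical MAG: 95% complete, 2% contaminated, all rRNAs found, 20 tRNAs
print(mimag_quality(95.0, 2.0, rrnas_present=True, trnas=20))  # (85.0, 'high')
```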
(..)
Assignment of MAGs to reference databases
Four reference databases were used to classify the set of MAGs recovered from the human
gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR
comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination)
retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the
HGG. From the RefSeq database, we used all the complete bacterial genomes available (n
= 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053
eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August
2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly
available as of August 2018, including those deposited in the Integrated Microbial
Genomes and Microbiomes (IMG/M) database. For each database, the function ‘mash
sketch’ from Mash v.2.0 was used to convert the reference genomes into a MinHash
sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG
and the set of references was calculated with ‘mash dist’ to find the best match (that is, the
reference genome with the lowest Mash distance). Subsequently, each MAG and its closest
relative were aligned with dnadiff v.1.3 from MUMmer 3.23 to compare each pair of
genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.
(..)
https://doi.org/10.1038/s41586-019-0965-1
Data pipeline & analysis
reporting & reproducibility
17. Data pipeline & analysis
reporting & reproducibility
https://doi.org/10.1038/s41586-019-0965-1
Lots of scripts &
datasets
18. EBI’s MGnify
metagenomics workflows as
workflows using a WfMS
Workflow description
Input data files
Command line tools,
containerised tools,
workflows
Output data files
19. Standards-based metadata framework for bundling resources (physically
and logically) with context into citable reproducible packages.
researchobject.org Linked Data!!
Bechhofer et al (2013) Why linked data is not enough for scientists. https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge. https://eprints.soton.ac.uk/268555/
20. A Research Object bundles and relates digital resources
of a scientific experiment/investigation + context
Data used and results produced in experimental study
Methods employed to produce and analyse that data
Provenance and settings for the experiments
People, specimens, equipment etc involved in the
investigation
Annotations about these resources, to improve
understanding and interpretation
Package
21. Manifest
Construction
Manifest
Identification
to locate things
Aggregates
to link things together
Annotations
about things & their
relationships
Package
Research Objects => Metadata Objects
Manifest
Description
Type Checklists
what should be there
Provenance
where it came from
Versioning
its evolution
Dependencies
what else is needed
Manifest
Base Profile,
Profiles for
RO types
22. We are in a Semantic Web Conference…
Linked Data Middleware
• Manifests described using Linked Data
– Identifiers to resources, including people (orcid)
– OWL / RDF / SPARQL / JSON-LD
• Mishmash of specialized ontologies
– Construct the manifest itself
• W3C Web Annotation Vocabulary
• OAI-ORE Object Reuse and Exchange
– Describe manifest content
• Wf4Ever RO ontology, Wf4Ever ROEvo …
• Dublin Core, FOAF, SIOC, Creative Commons, PROV, PAV…
• RDF shapes (SHACL, ShEx)
– Capture requirements, expectations and validate profiles
– Hard to express checklists
25. European Open Science Cloud
Interoperability Framework
EOSC Interoperability Framework (v1.0)
May 2020, Draft for community consultation
Chair: Oscar Corcho, UPM
2021 we start to combine
26. Used? Yes…
NIH Data Commons
transferring and archiving
very large HTS datasets
in a location-independent way
keep the context of data
content together when
it's scattered. Scalability
https://doi.org/10.1109/BigData.2016.7840618
A framework for
standardizing and
sharing computations
and analyses generated
from High-throughput
genome sequencing.
Standardized as IEEE
2791-2020
https://doi.org/10.1371/journal.pbio.3000099
Virtualized collaborative
working environment for
Earth Science researchers
to share resources (data,
workflows), ideas,
knowledge, and results.
NIH DataSTAGE RO
Composer to
exchange between
Seven Bridges
Platform genomics
platform and the
Mendeley Data
repository
https://doi.org/10.1016/j.future.2019.03.046
30. Machine-processable
Standards
Low tech
Graceful degradation
Commodity tooling
Incremental
Multi-platform
Technology Independent
Keep it
simple
E X A M P L E S
Developer
friendliness
Desiderata & Norms
Balance and prioritise
• "just enough complexity" or “just
enough standards” so…
• sufficient extra benefits from what
already exists (Linked Data,
vocabularies, tooling, validation,
transformation)
• without compromising the developer
entry-level experience so much that
they would rather do their own thing.
31. Research Object Tensions
Research Infrastructures sit in the middle
Academic
Viewpoint
Infrastructure
Viewpoint
Green field site
Theoretical purity
Use latest thing
Proof of concept
Sophistication
Narrow developer audience
Strive for super generic
The end
Exposing the tech
Pre-existing platforms
Practicality
Use things that work
Production
Simplicity
Wide developer audience
Several specific ones are OK!
The means
Hiding the tech
32. Ladder Model of OSS Adoption
(adapted from Carbone P., Value Derived from Open
Source is a Function of Maturity Levels)
[FLOSS@Syracuse]
"it's better, initially, to
make a small number of
users really love you than
a large number kind of
like you"
Paul Buchheit
paulbuchheit.blogspot.com
33. Not really mortal developer friendly
• “Easy to make, hard to use…”
• Daunting Linked Data tech stack
• Being too clever
– Infer what is in the object and what
kind of object it is
– Massive reuse of ontologies
• Make developers (and researchers)
lives easier not more demanding….
34. Developer friendliness matters
Reinvent with fewer features
Easy to understand and simple
conceptually…
… with strong opinionated
guide to current best practices
… using software stacks
widely used on the Web
35. Indeed.
Linked Data is not enough.
Research Infrastructures:
“digital technologies (hardware,
software), resources (data, services,
digital libraries, standards), comms
(protocols, access rights, networks),
people and organisational
structures”
36. Community, Use Driver, Tools
Developer friendly – so possible
Incentives – so rewarding
Adoption path – so acceptable
Metadata capture
Early benefit
Exchange, reproducibility, executable objects
Portability between platforms, Archiving
Platform & user buy-in & consensus
Passionate, dedicated leadership
Active engaged community, seed Support
Linked Data and a Spec is not enough
and sometimes too much
Guides
Reference examples
39. Swing Back to Basics
Peter Sefton
BagIt data profile
+ schema.org
+ JSON-LD annotations
Semantic Web world vs Real World
DataCrate
from the Open Repository community
"As a researcher…I’m a bit b****y fed up
with Data Management", Cameron Neylon
40. Be Humble. Archivists and library people know the
importance of metadata and standards…
… and for things to work 5, 10, 20 years later.
End-users need to have their own way to
“bypass the system”…
…. their field, repositories, institutions, journals
etc. will always be lagging behind the curve
Most who want to make their data FAIR …
… do not have the resources or knowledge to
start championing all of this to all levels & need
tools and ramps.
http://www.lisbdnet.com/
https://ischools.org/
42. RO-Crate
https://w3id.org/ro/crate
RO-Crate is a community effort to establish a lightweight approach to packaging research data
with their metadata.
It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal
metadata description accessible and practical for use in a wider variety of situations, from an
individual researcher working with a folder of data, to large data-intensive computational
research environments.
RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective
strengths, but also to draw on lessons learned from those projects and similar research data
packaging efforts. For more details, see RO-Crate background.
The RO-Crate specification details how to capture a set of files and resources as a dataset with
associated metadata – including contextual entities like people, organizations, publishers, funding,
licensing, provenance, workflows, geographical places, subjects and repositories.
A growing list of RO-Crate tools and libraries simplify creation and consumption of RO-Crates,
including the graphical interface Describo.
Join the RO-Crate community to help shape the specification, or to get help with using it!
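As a flavour of what "lightweight" means here, a minimal ro-crate-metadata.json can be produced with nothing but the standard library. This is a sketch following the RO-Crate 1.1 conventions; the dataset name, description and date are invented:

```python
import json

# Minimal RO-Crate Metadata File: a flattened JSON-LD graph with two
# entities - the metadata file descriptor and the Root Data Entity ("./").
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example crate",                      # invented
            "description": "A folder of research data.",  # invented
            "datePublished": "2020-11-02",                # invented
            "license": {"@id": "https://spdx.org/licenses/CC-BY-4.0"},
        },
    ],
}

with open("ro-crate-metadata.json", "w") as f:
    json.dump(crate, f, indent=2)
```

Dropping this file into the top of a data folder is essentially all it takes to make that folder an RO-Crate; tools like Describo generate the same structure interactively.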
43.
44. Reinvent as Lightweight Underware
Linked data but Developer Friendly
Easy to understand and simple
conceptually…
… with strong opinionated guide to
current best practices
… using software stacks widely used
on the Web: BagIt data profile, schema.org, JSON-LD, JSON Schema
Flattened compacted JSON-LD, no need for RDF libraries
Example driven rather than strict specification
Implementers add additional metadata using schema.org
types and properties
Data entities are files/directories or web resources
Boundness of elements is explicit
Single graph, data structure depth 1
Swung a bit far….
and swung back…
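The "no need for RDF libraries" point above is concrete: because the JSON-LD is flattened and compacted, a consumer can treat @graph as a plain list of dicts, index it by @id, and follow references like pointers. A minimal sketch (the crate content here is hand-written; the ORCID is the well-known example identifier):

```python
# Consume flattened, compacted JSON-LD with plain dicts - no RDF library.
metadata = {
    "@graph": [
        {"@id": "./", "@type": "Dataset",
         "author": {"@id": "https://orcid.org/0000-0002-1825-0097"}},
        {"@id": "https://orcid.org/0000-0002-1825-0097",
         "@type": "Person", "name": "Josiah Carberry"},
    ]
}

# Index every entity by its @id, then dereference {"@id": ...} links locally.
entities = {e["@id"]: e for e in metadata["@graph"]}
root = entities["./"]
author = entities[root["author"]["@id"]]
print(author["name"])  # Josiah Carberry
```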
45. Tooling!
How can I use it?
While we’re mostly focusing on the specification, some tools already exist for working with
RO-Crates:
● Describo interactive desktop application to create, update and export RO-Crates for
different profiles. (~ beta)
● CalcyteJS is a command-line tool to help create RO-Crates and HTML-readable
rendering (~ beta)
● ro-crate - JavaScript/NodeJS library for RO-Crate rendering as HTML. (~ beta)
● ro-crate-js - utility to render HTML from RO-Crate (~ alpha)
● ro-crate-ruby Ruby library to consume/produce RO-Crates (~ alpha)
● ro-crate-py Python library to consume/produce RO-Crates (~ planning)
These applications use or expose RO-Crates:
● Workflow Hub imports and exports Workflow RO-Crates
● OCFL-indexer NodeJS application that walks the Oxford Common File Layout on the
file system, validates RO-Crate Metadata Files and parses them into objects registered in
Elasticsearch. (~ alpha)
● ONI indexer
● ocfl-tools
● ocfl-viewer
● Research Object Composer is a REST API for gradually building and depositing
Research Objects according to a pre-defined profile. (RO-Crate support alpha)
● … (yours?)
https://uts-eresearch.github.io/describo/
46. • RDF and schema.org
but you don't need to
know.
• Extend RO-Crate ….
– Add your own schema.org
types and properties.
– Add in your own ontologies
…and it still works!
https://arkisto-platform.github.io/case-studies/
Under-ware
47. Driver! Profile for workflows
https://about.workflowhub.eu/Workflow-RO-Crate/
48. Infrastructure families
Driver! Profile for workflows
On-boarding developers
Web and dev friendly
Workflow-RO-Crate profiles
RO in practice External references – logically & physically contained –
versions, snapshots, multi-typed, active, multi-
stewarded, multi-authored, governance…
49. More than plain JSON, Just Enough Linked Data
Retain benefits of Linked Data in the toolbox
• querying, graph stores, vocabularies, clickable URIs
as identifiers
• customization and conventions
Plus all the other stuff a developer expects
• documentation, examples, libraries, tools
• simplifications rather than generalizations (less
flexibility frees up developers)
• “Just enough standards” cf. schema.org
Linked Data “exotics” there for when the time is right
if needed by the right people.
50. Keep your eye on the target…..
How do we make RO’s normative?
– Propaganda and incentive models to scientists,
target the Research Infrastructures to deliver.
– Digital library community allies!
Developer friendliness matters
– Underware, incremental, ramps, embed, metadata
automation, persuasive design
Linked Data has a role
– As a means but it is not an end.
– A simpler version of Linked Data makes an adoption
path (cf. Knowledge Graphs, schema.org, JSON-LD)
FAIR principles for Research Objects….
– Unifying the vision with the practical
51. Acknowledgements
Barend Mons
Sean Bechhofer
Matthew Gamble
Raul Palma
Jun Zhao
Mark Robinson
Alan Williams
Norman Morrison
Stian Soiland-Reyes
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
Daniel Garijo
Catarina Martins
Iain Buchan
Michael Crusoe
Rob Finn
Stuart Owen
Finn Bacall
Bert Droebeke
Laura Rodríguez Navas
Ignacio Eguinoa
Carl Kesselman
Ian Foster
Kyle Chard
Vahan Simonyan
Ravi Madduri
Raja Mazumder
Gil Alterovitz
Denis Dean II
Durga Addepalli
Wouter Haak
Anita de Waard
Paul Groth
Oscar Corcho
Peter Sefton
Eoghan Ó Carragáin
Frederik Coppens
Jasper Koehorst
Simone Leo
Nick Juty
LJ Garcia Castro
Karl Sebby
Alexander Kanitz
Ana Trisovic
http://researchobject.org
Gavin Kennedy
Mark Graves
José María Fernández
Jose Manuel Gomez-Perez
Jason A. Clark
Salvador Capella-Gutierrez
Alasdair J. G. Gray
Kristi Holmes
Giacomo Tartari
Hervé Ménager
Paul Walk
Brandon Whitehead
Erich Bremer
Mark Wilkinson
Jen Harrow
And many more....
Editor's notes
More information on our website: https://zbmed.github.io/damalos
I stopped living in research land a long time ago…
RIs not academic …
https://datascience.nih.gov/CancerResearchDataCommons
We make the rocket launchers for the rocket science
We are the microscope, telescope, datascope instrument makers
The Data-Driven Open Science, e-Laboratories
Inside and outside
Elsewhere, on-date
Within, during
Reproducible Research Environment
Integrated infrastructure for producing and working with reproducible research.
Reproducible Research Publication Environment
Distributing and reviewing; academic credit; legal licensing etc.
Such a merge between research infrastructure and research marketplace would overcome known "cultural barriers" and the "methodological barriers" mentioned above. In the RI scope, research products publishing would:
be facilitated by the very ICT services that are generating them;
take advantage of research activity awareness and product interlinking;
support access rights issues that fit the need of the community; and
subsume the major costs of storing and curating products. In other words, by bringing marketplace features within the RI, researchers can finally achieve their best effort in terms of scholarly communication.
Science 2.0 Repositories: Time for a Change in Scholarly Communication
Massimiliano Assante, Leonardo Candela, Donatella Castelli, Paolo Manghi and Pasquale Pagano, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Italy. {assante, candela, castelli, manghi, pagano}@isti.cnr.it DOI: 10.1045/january2015-assante
Beyond packaging and not just at the end!
A knowledge graph of the butterfly
FDO Framework
PIDs
Byte-stream described by metadata
https://www.dona.net/digitalobjectarchitecture
https://fairdigitalobjectframework.org/
FDO types and type definition registries
https://www.dona.net/sites/default/files/2018-11/DOIPv2Spec_1.pdf
https://www.eoscsecretariat.eu/sites/default/files/eosc-interoperability-framework-v1.0.pdf
https://www.rd-alliance.org/group/gede-group-european-data-experts-rda/wiki/gede-digital-object-topic-group
A Knowledge Graph of the investigation
Not just encapsulated
References
RO in practice
Interpretation, Comparison, Portability, Reuse, Credit, Citation
Interpretation, Comparison, Preservation, Repair, Portability, Reuse, Execution
Active Research, Evolving codes, New data, Portability, Reuse, Execution
A knowledge graph about a workflow.
Most workflows in papers are ad-hoc, artisan
So workflows are freedom ….
Move fast, break grow it
Bring enterprise scale
Its so brittle, data science
So we came up with ROs
ROs combine containers and incremental metadata
Like DataONE Packages
Its all about the metadata
The first implementation of Research Objects in 2009 for sharing workflows in myExperiment https://doi.org/10.1093/nar/gkq429 was based on RDF ontologies http://eprints.soton.ac.uk/id/eprint/267787, building on Dublin Core, FOAF, SIOC, Creative Commons, OAI-ORE and (later) DBPedia to form myExperiment ontologies for describing social networking, attribution and creditation, annotations, aggregation packs, experiments, view statistics, contributions, and workflow components. http://web.archive.org/web/20091115080336/http://rdf.myexperiment.org/ontologies Programmatic access to Research Objects was facilitated with an RDF endpoint that exposed individual myExperiment resources, also queriable from a SPARQL endpoint, both using the myExperiment vocabularies and RDF formats RDF/XML and Turtle.
Re-use of existing vocabularies. Making lots of new vocabularies. - too many??
Stack becomes very large to learn and use.
RDF as its own worst enemy.
Need for simplicity. Cater for "web developer".
JSON-LD
Howard Ratner, Chair STM Future Labs Committee, CEO; EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US). STM Innovations Seminar 2012
projects
But really friends, and 1 platform and 1 project
Championing through bubble projects
Desiderata, norms
But did we really do this?
But it's about prioritizing, so say we use a production-ready database but put a bleeding-edge research object application on top, like how Peter Sefton's team are putting things quickly together. but of course it's way simplified
kind of "just enough complexity" (or just enough standards!) to get sufficient extra benefits from what already exists (Linked Data, vocabularies, tooling, validation, transformation etc) without compromising the developer entry-level experience too much that they rather do their own thing (another triangle wheel).
We started in theory, but we work in practice
ROs sit in the middle
Inspection states
Conform to a profile – STILL do not have profile validation…
Heath Robinson
Base Type
nested ROs?
specialised Workflow-RO-Crate profiles for different WfMS
RO-Crate -> WRO-Crate -> Galaxy-WRO-Crate
JSon schema
Data Crate examples…. From
You can also point to how RDF graphs have been reinvented as knowledge graphs - just the same thing with fewer features, and then suddenly it took off
14:45: Linked Data and Semantic Web are the same
Stian, 14:45: as it became understandable!
Stian, 14:45: and schema.org vs. all the mishmash of specialized ontologies
Sometimes its too much
For Practice, A Village
Sustainability – get some traction….
Sponsored by BioExcel
University of Technology Sydney
"As a researcher…I’m a bit bloody fed up with Data Management" https://cameronneylon.net/blog/as-a-researcher-im-a-bit-bloody-fed-up-with-data-management/
Hence desktop solutions like Describo are also important. Not sure if you mentioned Describo!
Library & Information Science Network
We did oversimplify to just the DataCrate!
Data structure Depth 1
Adopt software stacks widely used on the Web
BagIt data profile, schema.org, JSON-LD, JSON Schema
Flattened JSON-LD, no need for RDF libraries
A self-described container using LD as a core principle
Data entities are files/directories or web resources
Easy to understand and simple conceptually
Data structure depth limited to 1
Boundness of elements is explicit
Strong opinionated guide to current best practices
Example driven rather than strict specification
Implementers can add additional metadata using other schema.org types and properties
KGs are simplified and then could be adopted.
We swung too far and threw away the remote referencing – hence RO-Crate 1.1
Next RO-Crate and containers…
Conceptual Definition
A key premise of RO-Crate is the existence of a wide variety of resources on the Web that can help describe research. As such RO-Crate relies on concepts from the Web and in particular linked data.
Linked Data as a core principle
Linked Data [cite] is a core principle of RO-Crate, and IRIs are therefore used to identify the RO-Crate, its constituent parts and metadata descriptions, as well as to the properties and classes used in the metadata. IRIs are a generalization of URIs (which include well-known http/https URLs), permitting international Unicode characters without %-encoding [RFC3987].
Using Linked Data, where consumers can follow the links for more (ideally both human- or machine-readable) information, the RO-Crate can then be sufficiently self-described and related using global identifiers, without needing to recursively fully describe every referenced entity.
The base of Linked Data and shared vocabularies also means multiple RO-Crates and other Linked Data resources can be indexed, combined, queried, integrated or transformed using existing Semantic Web technologies like SPARQL and knowledge graph triple stores.
RO-Crate is a self-described container
An RO-Crate is defined as a self-described Root Data Entity, which describes and contains data entities, which are further described using contextual entities. A data entity is primarily a file (bytes on disk somewhere) or a directory (set of named files); while a contextual entity exists outside the information system (e.g. a Person) and its bytes are primarily metadata.
The Root Data Entity is a directory, the RO-Crate Root, identified by the presence of the RO-Crate Metadata File ro-crate-metadata.json. This is a JSON-LD file that describes the RO-Crate, its content and related metadata using Linked Data.
The minimal requirements for the root data entity metadata is name, description and datePublished, as well as a contextual entity identifying its license; but additional metadata is frequently added depending on the purpose of the particular RO-Crate.
Because the RO-Crate is just a set of related files and directories, it can be stored, transferred or published in multiple ways, such as BagIt [RFC8493], Oxford Common File Layout [OCFL], downloadable ZIP archives in Zenodo or dedicated online repositories, as well as published directly on the web, e.g. using GitHub Pages. Combined with Linked Data identifiers this caters for a diverse set of storage and access requirements across different scientific domains, e.g. metagenomics workflows producing ~100 GB of genome data, or cultural heritage records with access restrictions for personally identifiable data.
Data Entities are described using Contextual Entities
RO-Crate distinguishes between data and contextual entities in a similar way to the HTTP terminology of information (data) and non-information (contextual) resources [HttpRange-14]. While the data entities in the most common scenario are files and directories located by relative IRI references within the RO-Crate Root, they can also be web resources identified with absolute http/https IRIs.
As both types of entities are identified by IRIs, their distinction is allowed to be blurry; data entities can be located anywhere and be complex, and the contextual entities can have a Web presence beyond their description inside the RO-Crate. For instance <https://orcid.org/0000-0002-1825-0097> is primarily an identifier for a person, but is secondarily also a web page that describes that person and their academic work.
Any particular IRI might appear as a contextual entity in one RO-Crate, and a data entity in another; their distinction is that data entities can be considered to be contained or captured by the RO-Crate, while the contextual entities are mainly explaining the RO-Crate and its entities. It follows that an RO-Crate should have one or more data entities, although this is not a formal requirement.
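The data/contextual split above can be made concrete with a small @graph fragment: a file inside the crate (a data entity with a relative IRI) described by a Person (a contextual entity with an absolute IRI). The file name is hypothetical; the ORCID is the example identifier quoted earlier in the text.

```python
# Illustrative @graph fragment: a data entity (a file inside the crate,
# located by a relative IRI) described by a contextual entity (a Person,
# identified by an absolute ORCID IRI). The file and its name are made up.
graph = [
    {
        "@id": "results/table.csv",  # data entity: bytes within the RO-Crate Root
        "@type": "File",
        "name": "Result table",
        "author": {"@id": "https://orcid.org/0000-0002-1825-0097"},
    },
    {
        "@id": "https://orcid.org/0000-0002-1825-0097",  # contextual entity
        "@type": "Person",
        "name": "Josiah Carberry",
    },
]

# A simple heuristic: relative IRIs point inside the crate (data entities),
# absolute http/https IRIs point outside it (here, a contextual entity).
data_entities = [e for e in graph
                 if not e["@id"].startswith(("http://", "https://"))]
print([e["@id"] for e in data_entities])
```

As the text notes, the same ORCID IRI could appear as a data entity in another crate; the split is per-crate, not intrinsic to the IRI.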
Best Practice Guide rather than strict specification
RO-Crate builds on Linked Data standards and community-evolved vocabularies like JSON-LD and schema.org, but rather than enforce additional constraints and limits using perhaps intimidating technologies like RDF shapes (SHACL, ShEx), RO-Crate aims instead to build a set of best practice guides on how to apply the existing standards in a common way to describe research outputs and their provenance.
As such the RO-Crate specification [RO-Crate 1.1] can be seen as an opinionated and example-driven guide to writing schema.org metadata as JSON-LD, and leaves it open for implementers to add additional metadata using other schema.org types and properties, or even using other Linked Data vocabularies/ontologies or their own ad-hoc terms.
This does not mean that RO-Crate is without constraints: crucially, the specification is quite strict about the style of the flattened JSON-LD, to facilitate lightweight developer consumption and generation of the RO-Crate Metadata File without the need for RDF libraries.
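The payoff of the strict flattened style can be sketched in a few lines of plain-JSON code: because every entity sits at the top level of @graph and cross-references are always {"@id": ...} objects, a developer can index entities in one dict and follow references without any RDF library. The entities below are illustrative.

```python
# Sketch of consuming flattened RO-Crate JSON-LD with plain JSON tooling.
# Every entity is at the top level of @graph, so one dict indexes them all,
# and {"@id": ...} references resolve by lookup. Example entities are made up.
doc = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "./", "@type": "Dataset",
         "license": {"@id": "https://spdx.org/licenses/CC-BY-4.0"}},
        {"@id": "https://spdx.org/licenses/CC-BY-4.0",
         "@type": "CreativeWork", "name": "CC BY 4.0"},
    ],
}

index = {e["@id"]: e for e in doc["@graph"]}           # one lookup table
license_ref = index["./"]["license"]["@id"]            # a reference, not a copy
license_entity = index[license_ref]                    # follow it by lookup
print(license_entity["name"])
```

With arbitrary (non-flattened) JSON-LD, the same license could appear nested, as a bare string, or in another RDF serialization entirely; the strict style removes those degrees of freedom.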
Checking Simplicity
As previously discussed, one aim of RO-Crate is to be conceptually simple. This simplicity has been repeatedly checked and confirmed through a community process; see, for example, the discussion in GitHub issue #71 on ad-hoc vocabularies.
To further verify this idea, we have formalized the RO-Crate definition (see Appendix X). An important result of this exercise is that the underlying data structure of RO-Crate is a depth-limited tree of one level. The formalization also emphasizes the boundedness of the structure: elements are either specifically identified as part of the RO-Crate or typed as references external to the structure (i.e. contextual entities).
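The one-level-tree property has a simple operational reading: in a well-flattened @graph, nested objects are only {"@id": ...} references, never full entities. The following is an informal check of that property, a sketch rather than the formalization itself; the example graphs are made up.

```python
# Informal check of the "depth-limited tree of one level" property:
# any nested object inside an entity must be a bare {"@id": ...} reference.
def is_flat(graph):
    for entity in graph:
        for value in entity.values():
            values = value if isinstance(value, list) else [value]
            for v in values:
                if isinstance(v, dict) and set(v.keys()) != {"@id"}:
                    return False  # a nested entity, not a reference
    return True

flat = [{"@id": "./", "author": {"@id": "#alice"}},
        {"@id": "#alice", "@type": "Person", "name": "Alice"}]
nested = [{"@id": "./", "author": {"@type": "Person", "name": "Alice"}}]
print(is_flat(flat), is_flat(nested))
```

The first graph keeps the Person as its own top-level entity and passes; the second embeds it inside the Dataset and fails.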
Active objects, versioning, nesting, references
STILL A LONG WAY TO GO! Still need better guides
Two draft profiles
ComputationalWorkflow v 0.5
10 mandatory, 15 recommended, 7 optional fields
FormalParameter v 0.1
3 mandatory, 1 recommended, 3 optional fields
Two draft types
ComputationalWorkflow v 0.1
FormalParameter v 0.1
Mappings to DataCite & OpenAIRE
What I would like to see in the end is how we are NOT throwing out the Linked Data foundation - we want to retain all its benefits, e.g. querying, graph stores, vocabularies, clickable URIs as identifiers - but these are things we want to keep in our toolbox rather than a goal in themselves. As such, the toolbox also has the other things developers expect: documentation, examples, libraries, simplifications rather than generalizations (removing unneeded degrees of freedom like multiple RDF formats); showing them how to do the Right Thing (tm) almost by accident. Then they can - if they want to and when the time is right - pick up some of those shiny tools at the bottom of the toolbox.
Just like in cars - they benefit a lot from reusing standards, like communication buses and agreed mounts and screws: almost interchangeable parts, so that manufacturers can mix and match when they put a car together. But you as a consumer don't care until you realize your exhaust pipe is broken, and the mechanic can now choose how to fix it rather than being restrained to a single manufacturer.
Some kind of "Why not just plain JSON?" point, perhaps. We are not after the exotic features that you will hear about in the rest of the conference, but the bog-standard tooling that, arguably, you could also build on a NoSQL database with enough customization and conventions. But rather than invent our own customization and conventions we retain the Linked Data ones - "just enough standards" - following the lead of schema.org.
Keep your eye on the target
How do we persuade Scientists?
We do not
That’s why we need to persuade the infrastructure people
What other things do we need to do
Small Picture – you don’t
Developer Friendliness Matters
It's underwear – customers are infrastructures
Marrying the FDO vision with the RO
Embed in Research Infrastructures