FAIRy stories: the FAIR Data principles in theory and in practice
The ‘FAIR Guiding Principles for scientific data management and stewardship’ [1] launched a global dialogue within research and policy communities and started a journey towards wider accessibility and reusability of data and readiness for automation (I am one of the army of authors). Over the past 5 years FAIR has become a movement, a mantra and a methodology for scientific research, and increasingly for the commercial and public sectors. FAIR is now part of NIH, European Commission and OECD policy. But just figuring out what the FAIR principles really mean and how to implement them has proved more challenging than one might have guessed. To quote the novelist Rick Riordan, “Fairness does not mean everyone gets the same. Fairness means everyone gets what they need”.
As a data infrastructure wrangler I lead and participate in projects implementing forms of FAIR in pan-national European biomedical Research Infrastructures. We apply web-based industry-led approaches like Schema.org; work with big pharma on specialised FAIRification pipelines for legacy data; promote FAIR by Design methodologies and platforms in the researcher's lab; and expand the principles of FAIR beyond data to computational workflows and digital objects. Many use Linked Data approaches.
In this talk I’ll use some of these projects to shine some light on the FAIR movement. Spoiler alert: although there are technical issues, the greatest challenges are social. FAIR is a team sport. Knowledge Graphs play a role – not just as consumers of FAIR data but as active contributors. To paraphrase another novelist, “It is a truth universally acknowledged that a Knowledge Graph must be in want of FAIR data.”
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
1. FAIRy stories: the FAIR Data
principles in theory and in practice
Carole Goble
The University of Manchester, UK
carole.goble@manchester.ac.uk
The views expressed in this talk are my own
NSF Convergence Accelerator Series Tracks A&B webinar, 19th May 2021
3. Why do we need FAIR data in Research?
“there must be loads of legacy data. We’re desperately trying to go
back and look at what we knew from SARS 10 years ago”
https://www.covid19dataportal.org/
https://www.rd-alliance.org/group/rda-covid19-rda-covid19-omics-rda-covid19-epidemiology-rda-covid19-clinical-rda-covid19-1
https://doi.org/10.15497/rda00052
4. Why do we need FAIR data in Research?
COVID Data sharing boost – mobilising people, infrastructure & initiatives
Spotlighted technical, territorial & practices
Provider: collection, upload and governance bottlenecks
User: find and access to datasets, licenses, data and metadata quality
Access to data for processing at scale, common standards
Behaviour inertia and relapse
Long term sustainability
“global pandemic is not sufficient to radically modify
scientific practices”*
* Larregue et al https://blogs.lse.ac.uk/impactofsocialsciences/2020/11/30/covid-19-where-is-the-data/
6. Why do we need FAIR data in Research?
information flows, secondary use
Figure: KnowledgeTurning, Information Flow Josh Sommer, Chordoma Foundation, 2011
Community domain enclaves
Resource fragmentation
Flow across platforms/ sovereignties
Pan-discipline drivers
Knowledge churn, loss and cost
7. 2016
A set of GUIDING PRINCIPLES to
enhance the value of all digital
resources and their reuse by PEOPLE
and by MACHINES
ALIGNING a COMMUNITY around
common data guidelines
FAIR Research Data
9. What ARE the FAIR principles?
Aspirational guardrails
Not a standard, nor metrics
A contract between data
provider and user
In the original paper
https://www.go-fair.org/fair-principles/
Relaunch a dialogue - research and policy communities.
Reboot a journey - wider accessibility and reusability of data.
11. “enhancing the ability of machines to
automatically find and use data or any digital
object, and support its reuse by individuals”
INCF Statement
12. Persistent identifiers
Globally unique, resolvable for
data and always for metadata
Structured metadata
Community defined descriptive
metadata using common
terminologies and standards
Linked Data
Vocabularies are FAIR, (meta)data
reference (meta)data, provenance
Automation-
readiness
Access protocols
Open, free and universally
implementable comms protocols
Semantic Web ->
Linked Data ->
Knowledge Graphs.
Machine-processable
metadata.
[Icons: FAIRsharing]
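The four ingredients on this slide can be illustrated with a small, informal check over a metadata record. This is only a sketch: the record, its DOI-style identifiers and the `checklist` helper are hypothetical, and the checks are a rough reading of the slide headings, not a FAIR metric or standard.

```python
# Hypothetical metadata record; all identifiers and values are placeholders.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://doi.org/10.1234/example-dataset",
    "name": "Example assay results",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "isBasedOn": {"@id": "https://doi.org/10.1234/example-source"},
}

# Informal checks loosely following the slide headings (an assumption,
# not an official FAIR assessment).
CHECKS = {
    "persistent identifier": lambda r: str(r.get("@id", "")).startswith("https://"),
    "structured metadata": lambda r: "@type" in r and "name" in r,
    "licence stated": lambda r: bool(r.get("license")),
    "links to other (meta)data": lambda r: "isBasedOn" in r,
}


def checklist(r):
    """Return a label -> bool report for one metadata record."""
    return {label: test(r) for label, test in CHECKS.items()}


if __name__ == "__main__":
    for label, passed in checklist(record).items():
        print(f"{label}: {'yes' if passed else 'no'}")
```

A report like this is meant to prompt a conversation between provider and user, in the spirit of "aspirational guardrails", not to certify anything.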
13. Open as possible, Closed as necessary
Clear licences for innovation and reuse
Sensitive data, GDPR, IPR, jumpy Deans.
Crossing sovereignty boundaries
• Data sharing becomes data visiting &
federated analysis
An industry in controlled secure access….
• Data Usage Ontology, Beacon Passports,
Trusted Research Environments etc….
Terms of access and use: FAIR ≠ OPEN
FAIR OPEN
SAFE
Privacy preservation
Regulatory rigour
14. FAIR Implicit Assumptions & Implications
Data are first class objects
Primarily aimed at data creators
and providers for benefit of
consumers.
Operating in an (Open) Data
Ecosystem.
Adoption at scale in legacy
settings.
Data sharing
16. The Life Sciences Infrastructure Zoo
Flows around a Federated & Diverse System
1466 data repositories
(100+ in EOSC-Life)
916 data format and metadata
standards*
from compounds to clinical trials
https://fairsharing.org/ accessed May 2021
Common standards & agreements
mappings of PIDs and metadata
moving metadata around
accountability and responsibility
17. FAIR players simplified
Researchers and
company
scientists who
generate and use
the data
Service providers
who manage data
and infrastructure
Local -> Global level
Public -> Commercial
Authorities who
drive policy, practice
& resources
Funders, Policy makers,
Publishers, Professional
societies, Standards
organisations, Institutions
18. Global and national initiatives
Dedicated projects
Community Orgs
Funders
Policy
Publishers
FAIR
first
stage
Dedicated Services
19. Where we are going
Where we are
[Susanna Sansone]
FAIR
first
stage
20. FAIR first stage :
Policymakers, Data service providers
How to define, measure compliance and certify FAIR data?
What is a dataset?
General repos vs Curated authoritative archives?
Principles for Data Repositories
https://www.rd-alliance.org/trust-principles-rda-community-effort
https://fairassist.org/
23. 1. A common mechanism for metadata
Respect and work with the huge legacy
resources: repositories, registries, tools …
community standards
Find, register, index, search resources
Move metadata between services
without APIs
Repositories ->Tools, Aggregators (e.g. licenses)
-> Registries (upload, auto-curation)
Registries -> Registries (across disciplines)
Contribute to Knowledge Graphs
a little bit of semantics at scale
semantic underware
invisible to users
visible to developers & services
24. Picture: Carole Goble, Turing Lecture 2018
Schema.org: Semantic Mark up for the Web
Cartel of commercial search engines
Wide web use, web infrastructure
Web pages and sitemaps
Types (830+) IceCreamShop
Properties (1300+) hasMenu
Not targeted at science - too much / too little
Dataset type – 120 properties
(Google Data Profile requires 2 properties)
No type for Protein, Gene, Taxon
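As a rough illustration of what Schema.org mark-up looks like in practice, the sketch below builds a minimal `Dataset` description and wraps it in the JSON-LD script tag that web pages typically embed. The helper names and the example values are hypothetical, and the choice of `name` and `description` as the two required properties is an assumption about the Google profile mentioned on the slide.

```python
import json


def dataset_jsonld(name, description, **extra):
    """Build a minimal schema.org Dataset description as a Python dict.

    `name` and `description` are assumed to be the two required
    properties; anything else is passed through unchanged.
    """
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    doc.update(extra)
    return doc


def as_script_tag(doc):
    """Serialise the mark-up for embedding in an HTML page."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    markup = dataset_jsonld(
        "Example protein abundance table",  # placeholder values
        "Illustrative dataset description for a landing page.",
        license="https://creativecommons.org/licenses/by/4.0/",
    )
    print(as_script_tag(markup))
```

Because the mark-up lives in the landing page itself, a harvester needs no bespoke API, which is the "move metadata between services" point made earlier in the deck.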
25. Harnessing Schema.org for Bioscience
Profile
Data model
Marginality information
Controlled vocabularies
Cardinality
Documentation
Examples
New (properties | types)
definition & consensus
deployment and use
tools & support
Opinionated conventions
Profiles & Link to domain ontologies
Add Bioscience properties & types if necessary
Examples & Usage Guidelines
Community
26. Harnessing Schema.org for Bioscience
ChemicalSubstance
definition & consensus
deployment and use
tools & support
Opinionated conventions
Profiles & Link to domain ontologies
Add Bioscience properties & types if necessary
Examples & Usage Guidelines
Community
27. Bioschemas metadata stratification
broad & shallow / deepish & narrowish
Generic
Subject
specific
MolecularEntity,
Protein,
Sample, Taxon,
ChemicalSubstance…
DataCatalog
Dataset
dataset 5 minimum, 8
recommended properties
license & provenance
https://bioschemas.org/profiles/
Crosswalks to metadata schemas *
• DCAT, DataCite, CrossRef, OpenAIRE, DDI
• DCT:issued <-> Schema:datePublished
What is a dataset?
Include community ontologies
• Type: ChemicalSubstance
• Property: biologicalRole
• ExpectedType: ChEBI ontology
* https://zenodo.org/record/4420116#.YKFOpaHTX18
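A crosswalk of this kind can be sketched as a simple lookup table. Only the `dct:issued` to `schema:datePublished` pair comes from the slide; the other three pairs and the `to_schema_org` helper are assumptions added for illustration, not the content of the published crosswalk.

```python
# Illustrative crosswalk. Only the first pair is taken from the slide;
# the remaining pairs are assumptions made for this sketch.
CROSSWALK = {
    "dct:issued": "schema:datePublished",
    "dct:title": "schema:name",
    "dct:description": "schema:description",
    "dct:license": "schema:license",
}


def to_schema_org(record):
    """Translate the keys of a flat metadata record.

    Returns (mapped, unmapped) so that fields without a known
    equivalent are reported rather than silently dropped.
    """
    mapped, unmapped = {}, {}
    for key, value in record.items():
        target = CROSSWALK.get(key)
        if target is not None:
            mapped[target] = value
        else:
            unmapped[key] = value
    return mapped, unmapped


if __name__ == "__main__":
    source = {
        "dct:title": "Example dataset",       # placeholder values
        "dct:issued": "2021-05-19",
        "dct:spatial": "Manchester",          # no entry in this sketch
    }
    mapped, unmapped = to_schema_org(source)
    print(mapped)
    print("not mapped:", unmapped)
```

Returning the unmapped fields separately keeps the loss visible, which matters when metadata is moved between several schemas in turn.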
30. Lessons: Putting FAIR into Practice
A little bit of semantics at scale -> build critical mass
Profiles
• Schema.org culture – Catch 22
• Consensus building, retention & Ontology-itis
Provider mark-up
• Developer friendly in house tools & wacky web implementations
• Adoption incentives & costs of adapting database processes
Consumer services
• Adoption incentives – Catch 22 & tipping points
• DataCatalog & Dataset popular -> Google Dataset search
Consumer-provider readiness
• Tools and training community take-up….
31. 2. Packaging Research Objects
Gather together into a “crate” files,
unbounded references, & other
crates.
FAIR content: metadata,
identifiers, provenance, citation
about the content
FAIR crates: metadata, PIDs,
provenance, citation about the
crate.
more FAIR middleware -> towards FAIR Digital Objects*
*FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units:
https://doi.org/10.3390/publications8020021
32. Why “crate up” objects? FAIR+R
Flows:
Researchers work with multiple and
different objects using multiple
infrastructures over periods of time
exchange between platforms and people
Parts:
Research has associated objects
linked together by context
metadata files with files
datasets, scripts, SOPs, articles …
held in different places
made at different times by
different people & processes
publish, report, reuse, cite, reproduce
register, deposit, archive, port
point to big, sensitive & active content
33. Aggregate files and/or any URI-addressable
content with structured metadata
Web and Linked Data Native
machine and human readable PIDs + JSON-LD +
Schema.org, search engine & developer friendly
Flex for open ended content, respect legacy
typed by a profile + add more schema.org and
domain ontologies
http://www.researchobject.org/ro-crate/
Archive file
format
FAIR Object Middleware
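To make the packaging idea concrete, the sketch below writes out the kind of JSON-LD metadata file that sits at the root of a crate, following the general shape of the RO-Crate 1.1 specification (a metadata descriptor, a root dataset, and one entity per file). The function name, file names and all values are hypothetical, and a real crate would normally carry richer metadata than this.

```python
import json


def minimal_ro_crate(name, description, date_published, license_url, files):
    """Build a minimal RO-Crate style metadata document as a dict.

    `files` maps a relative path to a human-readable label.
    """
    file_entities = [
        {"@id": path, "@type": "File", "name": label}
        for path, label in files.items()
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # descriptor: metadata about the metadata file itself
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # root dataset: the crate as a whole
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_url},
                "hasPart": [{"@id": e["@id"]} for e in file_entities],
            },
            *file_entities,
        ],
    }


if __name__ == "__main__":
    crate = minimal_ro_crate(
        "Example analysis crate",              # placeholder values
        "Results and script for an illustrative analysis.",
        "2021-05-19",
        "https://creativecommons.org/licenses/by/4.0/",
        {"results.csv": "Result table", "analysis.py": "Analysis script"},
    )
    print(json.dumps(crate, indent=2))
```

The split between the root dataset entry and the per-file entries mirrors the slide's distinction between metadata about the crate and metadata about its content.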
35. It’s FAIR metadata middleware, stupid
• smart use of wheels already invented
• get tools, services on board
• developer friendly, firm best practice
Known and Unknown unknowns
One size does not fit all
• contextual interpretation
• descriptive open-endedness, multi-interpretation
Analogous to FAIR Software
• RDA/ReSA FAIR4Research Software WG
Lessons: Putting FAIR into Practice
36. 3. Making (legacy) datasets FAIR: FAIRification
[Picture credit: Egon Willighagen]
37. Credit to: Ian Harrow, FAIR & OM projects
FAIR as enabler for the digital transformation
● Biopharma R&D productivity can be
improved by implementing the FAIR Data
Principles.
● FAIR enables powerful new AI analytics access
to data for machine learning and prediction
● Fairly AI Ready
● Challenges
○ change the culture, show business value,
achieve the ‘FAIR enough’
○ Sustain FAIR solutions and activities
Slide credit: Susanna Sansone
38. Making (legacy) datasets FAIR: FAIRification
> 100 Public-Private partnerships of
European Commission, universities SMEs
and Big Pharma translational projects
Pharma’s own datasets
40. FAIRification of legacy datasets
Practical
advice
Assessment
processes
FAIR levels of
projects / data
Selection of
datasets
Cost/Benefit
analysis
Methodology
Steps for 1 or
more datasets
Cultural change
Legal templates
Squads & BYODs
Maturity models
41. Interlinking data from different sources
The lessons of good
global and persistent
identifiers.
Mapping identifiers
and services for
mapping ids to ids and
concepts to concepts.
https://fairplus.github.io/the-fair-cookbook/content/recipes/interoperability/identifier-mapping.html
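Mapping identifiers to identifiers can be sketched as a lookup over pairwise equivalences. The identifiers and prefixes below are invented placeholders, and the sketch assumes equivalence is transitive, which real mapping sets do not always guarantee.

```python
from collections import defaultdict

# Hypothetical pairwise equivalences between identifiers from three sources.
PAIRS = [
    ("srcA:0001", "srcB:X17"),
    ("srcB:X17", "srcC:alpha"),
    ("srcA:0002", "srcC:beta"),
]


def build_index(pairs):
    """Index each identifier by the identifiers it is directly mapped to."""
    neighbours = defaultdict(set)
    for left, right in pairs:
        neighbours[left].add(right)
        neighbours[right].add(left)
    return neighbours


def equivalents(identifier, neighbours):
    """Follow mappings transitively and return every equivalent identifier."""
    seen = {identifier}
    stack = [identifier]
    while stack:
        current = stack.pop()
        for candidate in neighbours.get(current, ()):
            if candidate not in seen:
                seen.add(candidate)
                stack.append(candidate)
    return seen - {identifier}


if __name__ == "__main__":
    index = build_index(PAIRS)
    print(sorted(equivalents("srcA:0001", index)))
```

In practice the pairs would come from a curated mapping service, and each mapping would usually carry provenance so that a doubtful link can be traced and removed.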
42. FAIR by Design
At the start of a collection, built in throughout the life cycle
change management, capacity building
FAIRifying Retrospectively
Legacy datasets, build a cohort,
cost benefit and FAIR readiness over a collection of datasets
44. FA(I)R
New FAIR Variants
FAIR++
Legal > Organisational >
Semantic > Technical*
Business and change analysis.
Cost Benefit Analysis.
Scientific / Business Value
Sustainability
“…make a decision that
these data are valuable
enough to invest in the work
required for FAIRification.”
interoperability
*EOSC Interoperability Framework
45. What does FAIRifying a dataset mean?
A database? A PDF? Depositing to a public archive?
Identifier and ontology selecting, assigning,
mapping between and to existing vocabs, and knowing
about ontology services.
High-fidelity ETL: lossless movement of (meta)data
from one system to another
Lessons: Putting FAIR into Practice
46. Lessons: Putting FAIR into Practice
FAIR enough.
Repository manager
Admin monitoring
Bioscientist
Scientific analysis
“Fairness does not mean everyone
gets the same. Fairness means
everyone gets what they need”
(Rick Riordan).
Maturity and importance spectrum
It's not all worth it.
FAIR gardens + FAIR scrub
How to assess FAIR maturity
levels, not to be certified but
to make decisions.
47. FAIR ≠ FREE - an expensive, expert team sport
Mostly manual,
mostly specific
48. “It is a truth
universally
acknowledged
that a
Knowledge
Graph must be
in want of FAIR
data.
And FAIR data
is in want of
Knowledge
Graphs.”
harvesting
added value
DataCite PID Graph
Bottlenecks:
identifiers and ontologies
curating and ingest pipelines of data providers
49. 4. FAIR Data by Design at Source
Data management platform for Project Hubs
organising, cataloguing, sharing and publishing
multiple kinds of research objects in multiple
repositories for multi-partner projects.
Community developed Knowledge Hub
for guides, examples, tools, and pointers.
Assembled and written by Life Science
researchers and data stewards for their peers.
https://rdmkit.elixir-europe.org
https://fair-dom.org
50. Lessons: Putting FAIR into Practice
Data creators
• Retention not sharing, act local not global
• Advantage*: intimate knowledge, data
flirting, credits & incentives
Process change and values
• Access to infrastructure with seamless
information flows, values
• Time & resources to embed into practice
FAIR Stewardship skills
• Professionalisation & know-how
*Pasquetto, I. V., Borgman, C. L., & Wofford, M. F. (2019). Uses and Reuses of Scientific Data: The Data Creators’
Advantage. Harvard Data Science Review, 1(2). https://doi.org/10.1162/99608f92.fc14bf2d
51. Summary: FAIRy stories
Theory -> mobilised some
Practice -> marathon that takes a village
Move the story from data providers to
enabling creators & consumers prepare to
share FAIR -> Research on Research
Authorities Change Mgt
Stewardship
Service Providers
Sustained infrastructure
52. Acknowledgements
Special thanks to
• Stian Soiland-Reyes (Uni of Manchester/Uni of Amsterdam)
• Nick Juty & Ebtisam Alharbi (University of Manchester)
• Susanna Sansone (University of Oxford)
• Tony Burdett (EMBL-EBI)
• Ibrahim Emam (Imperial College)
• Egon Willighagen (Maastricht University)
• Alasdair Gray (Heriot-Watt University)
Manchester, Research Object, RDMkit, FAIRDOM, FAIRplus, Bioschemas colleagues
(about 130 people)
Icons from the noun project
(https://thenounproject.com/)