The swings and roundabouts of a decade of fun and games with Research Objects

Research Objects and their instantiation as RO-Crate: motivation, explanation, examples, history and lessons, and opportunities for scholarly communications. Delivered virtually to the 17th Italian Research Conference on Digital Libraries.


  1. 1. The swings and roundabouts of a decade of fun and games with Research Objects. Professor Carole Goble, The University of Manchester, ELIXIR-UK, Software Sustainability Institute UK. 17th Italian Research Conference on Digital Libraries (IRCDL) 2021, 18th February 2021
  2. 2. I work at supporting Data Driven Research at three scales Across the research data lifecycle Life Sciences, Biodiversity social sciences, astronomy, digital libraries, chemistry etc EU Research Infrastructures
  3. 3. Data availability Methods Source data Suppl info
  4. 4. Methods. De novo assembly and binning: Raw reads from each run were first assembled with SPAdes v.3.10.0 with option --meta. Thereafter, MetaBAT 2 (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.16 and then calculating the corresponding read depths of each individual contig with samtools v.1.5 (‘samtools view -Sbu’ followed by ‘samtools sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.7 using the lineage_wf workflow and calculated as: level of completeness − 5 × contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from INFERNAL v.1.1.2 (options -Z 1000 --hmmonly --cut_ga) using the Rfam covariance models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-SE v.2.0 using the bacterial tRNA model (option -B) and default parameters. Classification into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination).
     Assignment of MAGs to reference databases: Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue and the HGG. From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 2018, including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database. For each database, the function ‘mash sketch’ from Mash v.2.0 was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with ‘mash dist’ to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.23 to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.
     scripts, workflows, SOPs & datasets
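The quality rules quoted in the Methods excerpt above (the QS formula, the 80% rule for rRNA genes, and the MIMAG thresholds) can be written down as a small sketch. This is only an illustration of the stated arithmetic, not the authors' actual script: the `MagStats` class, the function names and the "other" label for MAGs that meet neither tier are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MagStats:
    """Per-MAG summary values (hypothetical container, percentages in 0-100)."""
    completeness: float
    contamination: float
    has_5s: bool
    has_16s: bool
    has_23s: bool
    trna_count: int


def gene_present(aligned_length: int, expected_length: int) -> bool:
    """A gene counts as present if more than 80% of its expected length is found."""
    return aligned_length > 0.8 * expected_length


def quality_score(mag: MagStats) -> float:
    """QS = completeness - 5 x contamination, as stated in the excerpt."""
    return mag.completeness - 5 * mag.contamination


def mimag_class(mag: MagStats) -> str:
    """Apply the high/medium thresholds quoted in the excerpt."""
    all_rrnas = mag.has_5s and mag.has_16s and mag.has_23s
    if (mag.completeness > 90 and mag.contamination < 5
            and all_rrnas and mag.trna_count >= 18):
        return "high"
    if mag.completeness >= 50 and mag.contamination < 10:
        return "medium"
    return "other"  # assumption: the excerpt does not name this case


example = MagStats(completeness=95.0, contamination=1.0,
                   has_5s=True, has_16s=True, has_23s=True, trna_count=20)
print(quality_score(example), mimag_class(example))
```

Even this short sketch shows how much of a method is left implicit in a prose paragraph, which is the slide's argument for shipping scripts and workflows alongside the paper.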
  5. 5. scripts, workflows, SOPs & datasets
  6. 6. Objects are the Outcomes of Research research outcomes are more than just publications and data software, models, workflows, SOPs, lab protocols…. all are first class citizens of scholarship information required to make research FAIR and Reproducible (FAIR+R) …
  7. 7. From FAIR data to FAIR Digital Objects To be FAIR each digital object type has its own metadata requirements, and may have its own repositories and registries FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units:
  8. 8. From FAIR data to FAIR Digital Objects The FAIR Guiding Principles for Data Stewardship and Management Scientific Data 3, 160018 (2016) doi:10.1038/sdata.2016.18
  9. 9. Objects are the Outcomes of Research Each object has its own metadata and repositories
  10. 10. De-contextualised Static, Fragmented Lost Semantic linking Contextualised Active, Unified Semantic linking Buried in a PDF figure Scattered Reporting and Reading
  11. 11. Data Files Methods: Scripts, CompWorkflows URLs to external data Since 2007…
  12. 12. Systems Biology Structured, interrelated objects in context Data files SOP docs Models URLs to resources Since 2008…
  13. 13. shareable, cite-able, exchangeable resource, with versioning and snapshots Research Object general framework - research outcomes related and bundled together
  14. 14. enriching resources and collections with additional information required to make research FAIR and reproducible metadata describing content and context dependencies, versions, relationships, provenance, annotations … DCAT EML MIAPPE CodeMeta SBML Bioschemas/CWL DataCite
  15. 15. integrated view over fragmented resources using PIDs bigger on the inside than the outside encapsulated content and references to external resources
  16. 16. has its own metadata, can be registered and deposited in its own right, unpackaged and accessed, activated and reproduced if appropriate encapsulated content and references to external resources
  17. 17. RO is the Unit of Knowledge Exchange Between scholarly communication and research infrastructure platforms -> Between researchers
  18. 18. From the big picture to the small picture in practice
  19. 19. standards-based metadata framework for bundling resources with context into citable reproducible packages Linked Data Bechhofer et al (2013) Why linked data is not enough for scientists; Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge. Since 2010…
  20. 20. self-describing, chiefly metadata, objects RO Metadata file Structured metadata about the RO and content files links to web resources RO Content Archive file format / packaging system BagIt, zip OCFL, Git type id description datePublished … directories license author organisation
  21. 21. self-describing, chiefly metadata, objects RO Metadata file Structured metadata about the RO and content image file links to web resources RO Content Archive file format / packaging system directory of data type, id description datePublished creator size format … " https://github/script type id description datePublished … license author organisation “Web Linked Data” approach
  22. 22. self-describing, chiefly metadata, objects RO Metadata file Structured metadata about the RO and content image file links to web resources RO Content Archive file format / packaging system directory of data type, id description datePublished creator size format … " type id description datePublished … license author organisation Workflow RO “Web Linked Data” approach
  23. 23. self-describing, chiefly metadata, objects How do we describe the metadata? PIDs + JSON-LD + descriptors How can I add additional metadata?, domain ontologies How do I define a checklist of what is expected to be in a type of RO? RO Profile Standard Web Mark-up
  24. 24. RO-Crate in a nutshell Practical lightweight approach to packaging research data entities (any object) with metadata. Aggregate files and/or any URI-addressable content, with contextual information to aid decisions about re-use: Who, What, When, Where, Why, How. Web Native. Machine readable. Human readable. Search engine friendly. Familiar. Extensible and Incremental: add additional metadata; nested and typed by their profile. Open Community effort: 38 contributors from AU, EU and US
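To make the "metadata file plus content" idea concrete, here is a minimal sketch that assembles an RO-Crate style JSON-LD metadata document using only the Python standard library. It assumes the RO-Crate 1.1 layout (a metadata descriptor pointing at a root dataset `./`); the helper name `build_crate_metadata`, the file path, the titles and the licence URL are illustrative placeholders, not taken from the talk.

```python
import json


def build_crate_metadata(name, description, date_published, license_url, files):
    """Return a dict shaped like an RO-Crate 1.1 metadata document.

    `files` is a list of (path_or_url, title) pairs: hypothetical inputs.
    """
    file_entities = [
        {"@id": path, "@type": "File", "name": title} for path, title in files
    ]
    graph = [
        # Metadata descriptor: says which entity is the root of the crate.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset: the research object itself, with its context.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
        },
        *file_entities,
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


crate = build_crate_metadata(
    name="Example crate",
    description="Illustrative bundle of one data file.",
    date_published="2021-02-18",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files=[("data/reads.csv", "Example reads table")],
)
print(json.dumps(crate, indent=2))
```

Writing the printed JSON to a file named `ro-crate-metadata.json` next to the listed content would give a directory that follows the pattern sketched on the slides; a packaging layer such as zip or BagIt is a separate choice.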
  25. 25. Lifting the Skirts to look at the underwear Released 30th October 2020
  26. 26. Community specific Combine Archive Publisher specific [Bergmann et al. 2014]. Platform specific DataONE A trend
  27. 27. Examples
  28. 28. Cultural Heritage: A data curation service for endangered languages: 500,000 files in 28,624 items and 574 collections long term preservation and accessibility of research data objects [Marco La Rosa, Peter Sefton]
  29. 29. Describe the data as it’s being created using open standards and tools source: [Marco La Rosa, Peter Sefton]
  30. 30. Scalable verified collections of references Processing big genomic & clinical data distributed over multiple locations NIH Data Commons [Chard, et al 2016] minids Retain and archive processed datasets Reference and transfer large data on demand Controlled access to sensitive data [Kesselman, Foster]
  31. 31. Data and Method Commons 13 EU Life Science Research Infrastructures Sharing data, tools and workflows in the cloud 100+ data collections method commons
  32. 32. Data and Method Commons 13 EU Life Science Research Infrastructures exchanging and preserving secure data objects Sharing data, tools and workflows in the cloud interchange, stewardship, recording dependencies -> portability & reuse/reproducibility of workflow objects
  33. 33. Profile
  34. 34. Do Research Assemble Methods, Materials Analyse Results Quality Assessment Track and Credit Disseminate Deposit & Licence Publish Share Results Manage Results Science 2.0 Repositories: Time for a Change in Scholarly Communication, Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015 Experiment Observe Simulate Describe and release the data, workflows and research as they are being created, updated and used
  35. 35. Getting embedded into the European Open Science Cloud Collaborative environments and EGI-ACE data spaces for Earth Science researchers RO-Crate as interchange format and integration with Zenodo, using Describo Pan-European natively FAIR and GDPR compliant data storage and sharing fabric Science Mesh using Cloud Services for Synchronization and Sharing (CS3) RO to manage research activities and release snapshots to Zenodo
  36. 36. exchange between genomics platform and repository. standardise and share analyses generated from genome sequencing. Scholarly Communication Drivers Exchange & Import/Export Reuse & Reproduce Report & Archive Living Objects Share & Access Figure: RDMKit,
  37. 37. Swings and Roundabouts why is it called “RO-Crate”??
  38. 38. Activation & research in a project Adoption by other projects Adoption Adoption in mainstream infrastructure
  39. 39. Research Object Version 1, 2010-2017 Sharing of data-intensive computational workflows Preservation and managing their decay • documentation of workflows • interconnections between workflows and related resources (e.g., datasets, publications, etc.) • documentation of their dependencies and their outputs • social aspects Type-cast: not just for workflows or biology!
  40. 40. Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group Director of Development for CHORUS 2012 2017
  41. 41. adoption stuck in inner groups not mainstream or outsiders a swing back to basics reboot!
  42. 42. Machine-processable Standards Low tech Graceful degradation Commodity tooling Incremental Multi-platform Technology Independent Keep it simple EXAMPLES Developer friendliness A swing back to basics: “just enough complexity / standards”, sufficient extra benefits from what already exists… …without compromising the developer entry-level experience so much that they would rather do their own thing.
  43. 43. Version 1 Linked Data Purity Construct metadata file Describe metadata content Mishmash of ontologies combined together to describe metadata file RDF Shapes Represent profiles Hard to make checklists SHACL, ShEx W3C Web Annotation Vocabulary OAI Object Exchange and Reuse
  44. 44. Indeed. Linked Data is not enough. And sometimes it’s too much. Research Infrastructures: “digital technologies (hardware, software), resources (data, services, digital libraries, standards), comms (protocols, access rights, networks), people and organisational structures”
  45. 45. Tensions Research Infrastructures sit in the middle Academic Viewpoint Infrastructure Viewpoint Green field site Theoretical purity Use latest thing Proof of concept Sophistication Narrow developer audience Strive for super generic The end Exposing the tech Pre-existing platforms Practicality Use things that work Production Simplicity & Familiarity Wide developer audience Several specific is ok! The means Hiding the tech
  46. 46. 2018: Digital Libraries & Developer friendly reboot Peter Sefton Semantic Web world vs Real World Open Repositories & Dig Lib Community DataCrate simple web stack RO rich RDF stack + fewer features easier to understand, conceptually simpler opinionated guide to current best practices software stacks widely used on the Web
  47. 47. Just Enough Linked Data, Just In Time simplifications rather than generalizations Retain benefits of Linked Data querying, graph stores, vocabularies, clickable URIs customization and conventions The stuff a developer needs documentation, examples, libraries, tools, community limited flexibility frees up developers familiarity is important Linked Data “exotics” there for when the time is right if needed by the right people
  48. 48. Developer friendly -> Tooling, Libraries. How can I use it? While we’re mostly focusing on the specification, some tools already exist for working with RO-Crates:
      ● Describo: interactive desktop application to create, update and export RO-Crates for different profiles. (~ beta)
      ● CalcyteJS: command-line tool to help create RO-Crates and an HTML-readable rendering. (~ beta)
      ● ro-crate: JavaScript/NodeJS library for RO-Crate rendering as HTML. (~ beta)
      ● ro-crate-js: utility to render HTML from RO-Crate. (~ alpha)
      ● ro-crate-ruby: Ruby library to consume/produce RO-Crates. (~ alpha)
      ● ro-crate-py: Python library to consume/produce RO-Crates. (~ planning)
      These applications use or expose RO-Crates:
      ● Workflow Hub imports and exports Workflow RO-Crates.
      ● OCFL-indexer: NodeJS application that walks the Oxford Common File Layout on the file system, validates RO-Crate Metadata Files and parses them into objects registered in Elasticsearch. (~ alpha)
      ● ONI indexer
      ● ocfl-tools
      ● ocfl-viewer
      ● Research Object Composer: a REST API for gradually building and depositing Research Objects according to a pre-defined profile. (RO-Crate support alpha)
      ● … (yours?)
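As a rough illustration of the developer-friendliness argument, a consumer can read an RO-Crate style metadata document with nothing more than a JSON parser. The sketch below deliberately avoids the libraries listed above and uses only the Python standard library; the embedded document, its file names and the helper names `find_root` and `list_parts` are hypothetical, and the layout assumed is that of RO-Crate 1.1.

```python
import json

# Hypothetical metadata document, inlined so the sketch is self-contained.
METADATA_JSON = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "Example crate",
     "hasPart": [{"@id": "data/reads.csv"},
                 {"@id": "https://example.org/workflow.cwl"}]},
    {"@id": "data/reads.csv", "@type": "File", "name": "Example reads table"},
    {"@id": "https://example.org/workflow.cwl", "@type": "File",
     "name": "Example workflow"}
  ]
}
"""


def find_root(by_id):
    """Follow the metadata descriptor's `about` link to the root dataset."""
    descriptor = by_id["ro-crate-metadata.json"]
    return by_id[descriptor["about"]["@id"]]


def list_parts(document):
    """Return (identifier, name) pairs for everything the root aggregates."""
    by_id = {entity["@id"]: entity for entity in document["@graph"]}
    parts = find_root(by_id).get("hasPart", [])
    if isinstance(parts, dict):  # JSON-LD allows a single value instead of a list
        parts = [parts]
    return [(ref["@id"], by_id.get(ref["@id"], {}).get("name")) for ref in parts]


parts = list_parts(json.loads(METADATA_JSON))
for identifier, name in parts:
    print(identifier, "-", name)
```

The graph is flat and every entity is looked up by its `@id`, so local files and web resources are handled the same way; this is the kind of "just enough Linked Data" a developer can consume without an RDF stack.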
  49. 49. People! A Community Sponsored RO2018, Amsterdam e-Science Conference Diverse set of people Variety of stakeholders Set of collective norms Open platform for communication RO2018
  50. 50. People! Leadership, Champions, Early Adopters we had to reboot the community twice… Stian Soiland-Reyes, The University of Manchester, UK; Peter Sefton, University of Technology Sydney. Infrastructure buy-in lifts from project to mainstream
  51. 51. tendency to first focus on providers… …but (developer) consumers build the ecosystem…
  52. 52. Consumers Archivist and library specialists know the importance of metadata and standards… … and for things to work 5, 10, 20 years later. End-users and Developers handle legacy … … their field, repositories, institutions, journals etc. will always be lagging behind the curve Researchers, to make their object FAIR … … do not have the resources or know-how… Specification & standards-based realisation Incremental & Extensible, Practical, Familiar Built in to the platforms they use in their day-to-day work
  53. 53. Problem Community Reference Examples Tools Best Practice Guides Tutorials Embedded Go Round and Round and Round & Maintain Momentum
  54. 54. Back to the Big Picture. Opportunities for Scholarly Communications
  55. 55. I concentrated on developers….what about end users? Big Picture User-Ware Familiar tools in the research process Infrastructure Under-Ware Familiar tech in the infrastructure Registries Repositories Tools Applications Built in On ramps Metadata automation
  56. 56. I concentrated on developers….what about end users? Infrastructure RO-Crate Commons Ecosystem Knowledge graphs Citation and tracking Reproducibility Release updates New services Built by developers New user capabilities Living and actionable publications whose content transcends platforms threaded publications FAIR Digital Objects
  57. 57. Back to FAIR Digital Objects Actionable knowledge unit Digital butterfly – digital twins Bags of references courtesy Dimitris Koureas Coordinator DiSSCo EU Research Infrastructure Specimen object image courtesy of Alex Hardisty [Hardisty et al, 2020]
  58. 58. European Open Science Cloud EOSC Interoperability Framework Specimen Data Refinery Workflows to Digitise Natural History Specimens FAIR Digital Object Framework Open Digital Specimen Workflow Infrastructure RO-Crate + 01aa75ed71a1/language-en/format-PDF/source-190308283
  59. 59. FAIR Principles for Digital Research Objects FAIR all the way down Unbounded FAIR Distributed FAIR Living FAIR Analogous to FAIR Software FAIR RO-Crate is a practical start Fig from EOSC Interoperability Framework
  60. 60. Preparing for Digital Objects… Digital library community allies! Make Research Objects normative – Promote to researchers but … – ... target Research Infrastructures to deliver – Move RO(-Crate)s across the scholarly ecosystem Developer friendliness matters – Persuasive design FAIR principles for Research Objects… – Unifying the vision with the practical Release paradigm instead of a Publish one
  61. 61. Barend Mons, Sean Bechhofer, Matthew Gamble, Raul Palma, Jun Zhao, Mark Robinson, Alan Williams, Norman Morrison, Stian Soiland-Reyes, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Iain Buchan, Michael Crusoe, Rob Finn, Stuart Owen, Finn Bacall, Bert Droebeke, Laura Rodríguez Navas, Ignacio Eguinoa, Carl Kesselman, Ian Foster, Kyle Chard, Vahan Simonyan, Ravi Madduri, Raja Mazumder, Gil Alterovitz, Denis Dean II, Durga Addepalli, Wouter Haak, Anita DeWaard, Paul Groth, Oscar Corcho, Peter Sefton, Eoghan Ó Carragáin, Frederik Coppens, Jasper Koehorst, Simone Leo, Nick Juty, LJ Garcia Castro, Karl Sebby, Alexander Kanitz, Ana Trisovic, Gavin Kennedy, Mark Graves, José María Fernández, Jose Manuel Gomez-Perez, Jason A. Clark, Salvador Capella-Gutierrez, Alasdair J. G. Gray, Kristi Holmes, Giacomo Tartari, Hervé Ménager, Paul Walk, Brandon Whitehead, Erich Bremer, Mark Wilkinson, Jen Harrow, Marco La Rosa. And many more...