
Metadata for Research Objects


Presentation given at the ISKO UK meeting "Making Metadata Work", 23rd June 2014.

Published in: Science, Education, Technology


  1. Metadata for Research Objects
     Sean Bechhofer (@seanbechhofer)
     Making Metadata Work, ISKO London, 23rd June 2014
  2. Publication
     • Publications are about argumentation: convince the reader of the validity of a position
       – Reproducible Results System: facilitates enactment and publication of reproducible research.
     • Results are reinforced by reproducibility
       – Explicit representation of method.
     • Verifiability as a key factor in scientific discovery.
     References:
     J. Mesirov. Accessible Reproducible Research. Science 327(5964), pp. 415-416, 2010. doi:10.1126/science.1179653
     Stodden et al. Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science. Computing in Science and Engineering 12(5), pp. 8-13, 2010. doi:10.1109/MCSE.2010.113
     C. Goble et al. Accelerating Scientists’ Knowledge Turns. Communications in Computer and Information Science, Volume 348, 2013, pp. 3-25. doi:10.1007/978-3-642-37186-8_1
  3. Reproducible Science
     Goble: SSI Collaborations Workshop 2014
  4. Scientific Workflows
     » Scientific workflows are at the heart of experimental science
       › Enable automation of scientific methods
       › Support experimental reproducibility
       › Encourage best practices
     » There is then a need to preserve these workflows
       › Scientific development based on method reuse and repurpose
       › Conservation is key
     » Workflow preservation is a multidimensional challenge
       › Representation of complex objects
       › Decay analysis, diagnosis, and prevention
       › Social Objects that can be inspected, reused, repurposed
     Preservation of scientific workflows in data-intensive science
  5. Preservation
     Technical:
       – Multi-step computational process
       – Repeatable and comparative
       – Explicate computation
     Social:
       – Virtual Witnessing
       – Transparent, precise, citable documentation
       – Accurate provenance logs
       – Reusable protocols, know-how, best practice
     Questions:
       – Can I review / repeat your method?
       – Can I defend my method?
       – Can I reuse / reproduce this method?
  6. Context: Semantic Web and Linked Data
     • SW: Explicit machine-readable representation of information
     • LD: A set of best practices for publishing and connecting data on the Web
       1. Use URIs to name things
       2. Use dereferenceable HTTP URIs
       3. Provide useful content on lookup using standards
       4. Include links to other stuff
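Principles 2 and 3 can be illustrated with HTTP content negotiation: a client asks the same URI for a machine-readable representation rather than an HTML page. The sketch below only builds the request and does not contact any server. The URI is a hypothetical placeholder, and the list of media types is an assumption about what a Linked Data server might offer.

```python
from urllib.request import Request

# Hypothetical URI naming a resource; not a real endpoint.
RO_URI = "http://example.org/ro/my-study/"


def build_rdf_request(uri: str) -> Request:
    """Build a GET request that prefers RDF serialisations over HTML."""
    accept = "text/turtle, application/rdf+xml;q=0.9, text/html;q=0.5"
    return Request(uri, headers={"Accept": accept}, method="GET")


request = build_rdf_request(RO_URI)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing `request` to `urllib.request.urlopen` would perform the actual lookup; whether RDF comes back depends entirely on the server.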
  7. Research Objects
     • An aggregation object that bundles together experimental resources that are essential to a computational scientific study or investigation:
       – data used and results produced in an experimental study;
       – (computational) methods employed to produce and analyse that data;
       – people involved in the investigation.
     • Plus annotation information that provides additional information about both the bundle itself and the resources of the bundle:
       – descriptions
       – provenance
  8. ROs as a Currency
     Roles: Creator, Contributor, Collaborator, Comparator, Re-User, Evaluator, Reviewer, Trainee, Trainer, Reader, Publisher, Curator, Librarian, Repository Manager
  9. Research Objects
     • Three principles underlie the approach:
       – Identity: referring to resources (and the aggregation itself)
       – Aggregation: describing the aggregation structure and its constituent parts
       – Annotation: associating information with aggregated resources.
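The three principles map naturally onto a small data structure. The following is a minimal sketch, not the RO model itself: the class names, field names and example URIs are all hypothetical and chosen only to show how identity, aggregation and annotation relate.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    target: str  # URI of the annotated resource (or of the RO itself)
    body: str    # descriptive content, kept as plain text in this sketch


@dataclass
class ResearchObject:
    uri: str                                         # identity
    aggregates: list = field(default_factory=list)   # aggregation (resource URIs)
    annotations: list = field(default_factory=list)  # annotation

    def annotate(self, target: str, body: str) -> Annotation:
        """Attach information to the RO or to one of its aggregated resources."""
        if target != self.uri and target not in self.aggregates:
            raise ValueError(f"{target} is not part of this research object")
        annotation = Annotation(target, body)
        self.annotations.append(annotation)
        return annotation


# Hypothetical example URIs.
ro = ResearchObject("http://example.org/ro/1/")
ro.aggregates.append("http://example.org/ro/1/data.csv")
ro.annotate("http://example.org/ro/1/data.csv", "Raw measurements")
ro.annotate(ro.uri, "Bundle for an example study")
print(len(ro.annotations), "annotations on", ro.uri)
```

Refusing annotations on resources outside the aggregation is a design choice of this sketch; it keeps every annotation resolvable within the bundle.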
  10. Identity
     • Mechanisms for referring to the resources that are aggregated within a Research Object
       – URIs: Web resources
       – DOIs: documents/papers/datasets
       – ORCID iDs: researchers
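DOIs and ORCID iDs can both be expressed as HTTP URIs, so all three identifier kinds can be handled uniformly. The sketch below assumes the public resolvers at `https://doi.org/` and `https://orcid.org/`, and validates an ORCID iD with its ISO 7064 MOD 11-2 check character. The function names are hypothetical; the DOI is one cited in this talk, and the ORCID iD is the sample identifier used in ORCID's own documentation.

```python
def doi_to_uri(doi: str) -> str:
    """Express a DOI as an HTTP URI via the doi.org resolver."""
    return "https://doi.org/" + doi.strip()


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def orcid_to_uri(orcid: str) -> str:
    """Validate an ORCID iD (dddd-dddd-dddd-dddc) and express it as a URI."""
    digits = orcid.replace("-", "").upper()
    if len(digits) != 16 or not digits[:15].isdigit():
        raise ValueError(f"malformed ORCID iD: {orcid}")
    if orcid_check_digit(digits[:15]) != digits[15]:
        raise ValueError(f"bad check character in ORCID iD: {orcid}")
    return "https://orcid.org/" + orcid.upper()


print(doi_to_uri("10.1109/eScience.2012.6404482"))
print(orcid_to_uri("0000-0002-1825-0097"))
```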
  11. Identifier Issues
     • HTTP URIs provide both access and identification
     • PIDs: Persistent Identifiers (e.g. DOIs) tend to resolve to human-readable landing pages
       – With embedded links to further (possibly machine-readable) resources
     • ROs seen as non-information resources with descriptive (RDF) metadata
       – Redirection/negotiation
       – Standard patterns for Linked Data resources
     • Bidirectional mappings between URIs and PIDs
     • Versioning through, e.g., Memento
     H. Van de Sompel et al. Persistent Identifiers for Scholarly Assets and the Web: The Need for an Unambiguous Mapping. 9th International Digital Curation Conference
  12. Aggregation
     • Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources
     • Uses a Resource Map to describe the aggregated resources
     • Proxies allow for statements about the resources within the aggregation
       – Capturing context and viewpoints
     • Several concrete serialisations
       – RDF/XML, Atom, RDFa
     Graceful Degradation
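To make the Resource Map idea concrete, the sketch below writes a tiny one as N-Triples using only string formatting. The ORE terms (`ResourceMap`, `Aggregation`, `describes`, `aggregates`) come from the OAI-ORE vocabulary; the helper names and all `example.org` URIs are hypothetical, and proxies are left out for brevity.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map_triples(rem_uri, aggregation_uri, resources):
    """Return (subject, predicate, object) triples for a minimal Resource Map."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    triples += [(aggregation_uri, ORE + "aggregates", r) for r in resources]
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical URIs for the map, the aggregation and two aggregated resources.
triples = resource_map_triples(
    "http://example.org/ro/1/manifest.rdf",
    "http://example.org/ro/1/",
    ["http://example.org/ro/1/workflow.t2flow", "http://example.org/ro/1/data.csv"],
)
print(to_ntriples(triples))
```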
  13. Annotation
     • Open Annotation specification is a community-developed data model for annotation of web resources
     • Developed by the W3C Open Annotation Community Group
     • Allows for “stand-off” annotations
       – Annotation as a first-class citizen
     • Developed to fit with Web Architecture
     Graceful Degradation
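A stand-off annotation is its own resource that points at a target and a body, rather than being embedded in the thing it describes. The sketch below builds one as a JSON-LD style dictionary using the `oa:Annotation`, `oa:hasTarget` and `oa:hasBody` terms from the Open Annotation vocabulary. The function name and every `example.org` URI are hypothetical.

```python
import json

OA = "http://www.w3.org/ns/oa#"


def make_annotation(annotation_uri: str, target_uri: str, body_uri: str) -> dict:
    """Describe an annotation as a separate resource linking body to target."""
    return {
        "@id": annotation_uri,
        "@type": OA + "Annotation",
        OA + "hasTarget": {"@id": target_uri},
        OA + "hasBody": {"@id": body_uri},
    }


# Hypothetical URIs: the annotation, the annotated workflow, and a description of it.
annotation = make_annotation(
    "http://example.org/ro/1/annotations/1",
    "http://example.org/ro/1/workflow.t2flow",
    "http://example.org/ro/1/workflow-description.ttl",
)
print(json.dumps(annotation, indent=2))
```

Because the annotation has its own identifier, it can itself be aggregated, attributed or annotated.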
  14. Annotation Content
     • Essential to the understanding and interpretation of the scientific outcomes captured by a Research Object, as well as to the reuse of the resources within it:
       – Provenance information about the experiments, the study or any other experimental resources
       – Evolution information about the Research Object and its resources
       – Descriptions of computational methods or processes
       – Dependency information or settings about the experiment executions
  15. Core & Extensions
     • Core model provides support for aggregation and annotation
     • Extensions provide additional vocabularies for domain-specific tasks
       – Workflow Provenance: information capturing workflow executions
       – Workflow Description: abstractions describing processes, inputs and outputs
       – Research Object Evolution: information describing change and “snapshots”
  16. RO Model
  17. Provenance
     • W3C’s PROV model allows for capture of information relating to
       – Attribution: who did it?
       – Derivation: data sources used
       – Activities: what happened (and when)
     • Significant ecosystem (generators, viewers, consumers) has grown up around PROV
       – IPAW & TAPP
     Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
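The three questions above correspond to a handful of PROV properties. The sketch below stores a few statements as plain tuples and looks them up; `wasAttributedTo`, `wasDerivedFrom`, `wasGeneratedBy` and `startedAtTime` are terms from the PROV namespace, while the `ex:` identifiers, the timestamp and the `objects` helper are invented for illustration.

```python
PROV = "http://www.w3.org/ns/prov#"

# Hypothetical identifiers for a result file, its source data, a person and a run.
STATEMENTS = [
    ("ex:results.csv", PROV + "wasAttributedTo", "ex:alice"),
    ("ex:results.csv", PROV + "wasDerivedFrom", "ex:raw-data.csv"),
    ("ex:results.csv", PROV + "wasGeneratedBy", "ex:workflow-run-1"),
    ("ex:workflow-run-1", PROV + "startedAtTime", "2014-06-23T10:00:00Z"),
]


def objects(subject: str, prov_term: str) -> list:
    """Return every object of `subject` for the given PROV property name."""
    return [o for s, p, o in STATEMENTS if s == subject and p == PROV + prov_term]


print("Who did it?      ", objects("ex:results.csv", "wasAttributedTo"))
print("Data sources used", objects("ex:results.csv", "wasDerivedFrom"))
for run in objects("ex:results.csv", "wasGeneratedBy"):
    print("What happened    ", run, "started", objects(run, "startedAtTime"))
```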
  18. Tooling
  19. ROs and OAIS
     [Figure 6. Conceptual Archival System Storage Architecture.]
     • ROs as Information Packages in OAIS
     • myExperiment as live/access repository
     • ROHUB as archival repository
  20. SCAPE: Planning and Watch
     [Diagram: Watch, Planning, Operations, Repository, and Environment & Users, connected by plan, deploy, monitor, access, ingest/harvest and execution links.]
     • SCAPE project concerned with Digital Preservation.
     • Planning and Watch infrastructure to help monitor the state of a repository and co-ordinate appropriate actions
     • Driven by policies.
  21. Wf4Ever: Monitoring and Watch
     [Diagram: the Watch, Planning, Operations, Repository, and Environment & Users components, annotated with: myExperiment and RODL; Decay, Service Deprecation, Data source monitoring; Checklists, Minimal Models.]
     • Ideas applied to workflow preservation
  22. Decay
     • Survey of 92 Taverna workflows from myExperiment
       – Volatile third-party resources
       – Missing data
       – Missing execution environments
       – Poor descriptions
     [Figure: Fig. 3. Summary of workflow decay causes. (a) An overview of the decay causes. (b) Workflow decay due to third-party resources.]
     Belhajjame et al. Why workflows break: Understanding and combating decay in Taverna workflows. e-Science 2012. doi:10.1109/eScience.2012.6404482
  23. Checklists and Validation
     • Checklists widely used to support safety, quality and consistency
     • Common in experimental science
       – Expressing minimum information required
       – Supporting “health” monitoring of workflow-centric ROs.
     • Checklists can be defined in terms of the RO model and its annotations
       – Generic checklist service then executes against that model and the given annotations
       – Provenance
  24. Minim Data Model
     [Figure: Fig. 1. An overview of the Minim model schema. Entities shown include Checklist, Requirement (attached via hasMustRequirement, hasShouldRequirement or hasMayRequirement), Rule, QueryTestRule, SoftwareEnvRule, Query, SparqlQuery, CardinalityTest (min/max cardinality), AggregationTest, AccessibilityTest, ExistsTest and Model.]
     Excerpts from the paper (truncated in the source): “...pliant” or “minimally compliant” with a checklist if it satisfies all of its MAY, SHOULD or MUST items respectively. [...] Our Minim data model (see Figure 1) provides 4 core constructs to express a quality requirement: [...]
     Zhao et al. A Checklist-Based Approach for Quality Assessment of Scientific Information. 3rd Int. Workshop on Linked Science, 2013
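The idea of grading an RO against MUST, SHOULD and MAY requirements can be sketched in a few lines. This is not the Minim implementation: annotations are reduced to a dictionary of property values, rules to a simple cardinality test, and the grading is assumed to be cumulative (a level counts only if all stricter levels also pass). All names and the example checklist are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Annotations = Dict[str, List[str]]
Test = Callable[[Annotations], bool]


def cardinality(prop: str, minimum: int = 1, maximum: Optional[int] = None) -> Test:
    """Build a test checking how many values a property has."""
    def test(annotations: Annotations) -> bool:
        count = len(annotations.get(prop, []))
        return count >= minimum and (maximum is None or count <= maximum)
    return test


# Hypothetical checklist: (level, human-readable label, test).
CHECKLIST: List[Tuple[str, str, Test]] = [
    ("MUST", "has exactly one title", cardinality("title", 1, 1)),
    ("SHOULD", "names at least one creator", cardinality("creator")),
    ("MAY", "carries a description", cardinality("description")),
]


def highest_level_satisfied(checklist, annotations: Annotations) -> Optional[str]:
    """Return the weakest requirement level still fully met, or None if a MUST fails."""
    failed = {level for level, _, test in checklist if not test(annotations)}
    if "MUST" in failed:
        return None
    if "SHOULD" in failed:
        return "MUST"
    if "MAY" in failed:
        return "SHOULD"
    return "MAY"


example = {"title": ["Workflow decay study"], "creator": ["ex:alice"]}
print(highest_level_satisfied(CHECKLIST, example))
```

In this example the MUST and SHOULD items pass and the MAY item fails, so the function reports `SHOULD` as the highest level fully satisfied.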
  25. Checklist Evaluation
  26. Checklist Evaluation
  27. RO Bundle
     • A single, transferable object encapsulating the description and resources of an RO
       – Download, transfer, publish
     • ZIP-based format (resources) plus a manifest describing aggregation and annotations (description)
       – Unpack with standard tooling
     • JSON-LD as a representation for manifest
       – Lightweight linked-data format
       – Compatible with existing JSON tooling and services
       – PROV-O and OAC for annotations
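The "ZIP plus manifest" idea can be tried with the standard library alone. The sketch below is an illustration, not the RO Bundle specification: the manifest location `.ro/manifest.json` and the `aggregates` / `annotations` field layout are assumptions, as are the function names and the example file contents.

```python
import io
import json
import zipfile

MANIFEST_PATH = ".ro/manifest.json"  # assumed location inside the archive


def write_bundle(resources: dict, annotations: list) -> bytes:
    """Pack resources (name -> bytes) and a JSON manifest into one ZIP archive."""
    manifest = {
        "aggregates": [{"uri": "/" + name} for name in sorted(resources)],
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
        for name, content in resources.items():
            bundle.writestr(name, content)
    return buffer.getvalue()


def read_manifest(data: bytes) -> dict:
    """Read the manifest back using nothing but standard ZIP tooling."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        return json.loads(bundle.read(MANIFEST_PATH))


# Hypothetical bundle contents.
bundle_bytes = write_bundle(
    {"data.csv": b"x,y\n1,2\n", "workflow.txt": b"step 1: load data\n"},
    [{"about": "/data.csv", "content": "Raw measurements"}],
)
print(json.dumps(read_manifest(bundle_bytes), indent=2))
```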
  28. Bundling via git/Zenodo/figshare
     • Scientist works with local folder structure.
       – Version management via GitHub.
       – Local tooling produces metadata description
       – Metadata about the aggregation (and its resources) provided by “hidden folder”
     • Zenodo/figshare pull snapshot from GitHub
       – Providing DOIs for the aggregations
       – Additional release cycles can prompt new DOIs
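As a rough picture of what "local tooling produces metadata description" could look like, the sketch below walks a working folder, records each file with a checksum, and writes the result into a hidden folder. The folder name `.ro`, the file name `manifest.json`, the field names and both function names are assumptions made for this example, not the behaviour of any particular tool.

```python
import hashlib
import json
from pathlib import Path

HIDDEN_DIR = ".ro"  # hypothetical name for the hidden metadata folder


def describe_folder(root: Path) -> dict:
    """List every file under root (skipping metadata and git folders) with a SHA-256."""
    entries = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or relative.parts[0] in {HIDDEN_DIR, ".git"}:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        entries.append({"path": relative.as_posix(), "sha256": digest})
    return {"aggregates": entries}


def write_description(root: Path) -> Path:
    """Write the description into the hidden folder and return its location."""
    target = root / HIDDEN_DIR / "manifest.json"
    target.parent.mkdir(exist_ok=True)
    target.write_text(json.dumps(describe_folder(root), indent=2))
    return target
```

Calling `write_description(Path("."))` in a project folder would create or refresh the description; committing the hidden folder alongside the data means a repository snapshot carries its own metadata.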
  29. Zenodo
  30. figshare
  31. ROs as RDFa
  32. RDFa
  33. Code as a Research Object
  34. COMBINE Archive
  35. GigaScience/ISA
  36. IPython
  37. Wrap Up
     • Aggregation objects bundling together experimental resources that are essential to a computational scientific study or investigation
       – Intended to support greater transparency and reproducibility
     • Annotations provide additional information about the bundle and its contents
       – Metadata is key here
     • Use of existing standards, vocabularies and infrastructure
     • Nascent tooling to support creation, management and publication
  38. Thanks!
     • All the members of the Wf4Ever team
       – iSOCO: Intelligent Software Components S.A., Spain
       – University of Manchester, School of Computer Science, Manchester, United Kingdom
       – University of Oxford, Department of Zoology, Oxford, UK
       – Poznan Supercomputing and Networking Center, Poznan, Poland
       – IAA: Instituto de Astrofísica de Andalucía, Granada, Spain
       – Leiden University Medical Centre, Centre for Human and Clinical Genetics, The Netherlands
     • Colleagues in Manchester’s Information Management Group
     • RO Advisory Board Members