EarthCube DDMA AGU
1. A Community Roadmap for Enabling
Access to Geosciences Data
Tanu Malik
Ian Foster
Computation Institute
University of Chicago and Argonne National Lab.
tanum@ci.uchicago.edu, foster@anl.gov
www.ci.anl.gov
www.ci.uchicago.edu
3. Access is Vital for EarthCube’s Success
• The goal of EarthCube is to create a sustainable
infrastructure that enables the sharing of all
geosciences data, information, and knowledge in an
open, transparent and inclusive manner.
I can't get access to *.
It is difficult for me to *.
I want to integrate data from other disciplines, but *.
Access refers to software and activities that make data and computational
resources easily, efficiently and reliably available to scientists across
disciplines.
4. Access Workshop Goals
• Encourage discussions on emergent issues:
– Use of cloud computing
– Exploiting the general principle of moving computation to data
– A technological and governance framework for cross-disciplinary
access, service architecture, brokering principles, real-time data, uniform
authentication and authorization environment, etc.
– Improving access to data in publications.
• Bring some standardization on research data life cycle issues:
– In general, data, once generated, follow a lifecycle: they are
stored, described, processed, transformed, accessed, discovered, analyzed,
and curated. In organized networks and campaigns, lifecycle stages are
often documented and standardized, though they vary significantly across
networks and campaigns. In individual initiatives, the lifecycle stages
remain ad hoc and ill-defined. [RDLM-Workshop2011]
• Obtain community consensus on a few use cases
5. Workshop Activity Outcomes
• Use Case 1: Can I access “not large” but “big data”
to conduct statistical analysis?
• Use Case 2: I have a hypothesis not tied to a
physical instrument or geophysical parameter. Can
I still access all the data, in an “interactive” fashion
to test my hypothesis?
• Use Case 3: The storm dust paper is vital to my
research. Can I access the data in the publication
and change parameters of experiments to
understand the nature of storm dust?
6. Workshop Reflections
• It's all about data!
[Figure: two side-by-side diagrams, each labeled with People, Import, Resources, Services, Data, and Export]
7. Workshop Reflections-2
• Discussing technology issues in isolation is a
recipe for disaster.
– Access is closely aligned with other subgroups
– It is important to organize in functional units
8. Workshop Reflections-3
• Challenges will continue
– Social Challenges
o Transparency
o Openness
o Establishing social ties
– Adoption Culture
o Adoption is slow
o Sustainability
o Establishing practices
– Changing Requirements/Changing Technology
o Real-time data
o Cross-disciplinary data
o High dimensionality
o Network bandwidth, computational resource, and data management constraints
10. Enabling A Data Sharing Space: The
DataSpace
• Embrace a “semi-structured” notion
– Ingest data in raw form,
– Structuring and refinement of the data and metadata.
• Open, extensible architecture that supports
– Software as a Service (SaaS) model,
– Process for vetting contributed services prior to their incorporation,
– Based on on-demand resources.
• Emphasis on usability instead of on developing technology/infrastructure
[Figure: DataSpace diagram with labels Import, Resources, Services, Data, and Export]
11. Post-Charette
• 2 EarthCube PI meetings at University of Colorado, Boulder
– A Concept group meeting,
o some representation from Community groups,
o July 10, 2012
– A Concept and Community group meeting,
o October 4-5, 2012
• Primary objective: Convergence
– Through Roadmaps
– Architecture
– On future steps
12. Highlights: Summary of Roadmaps
• Workplace to collaborate,
• Lower barriers for participation,
• Openness and extensibility,
• Feedback and reproducibility,
• Discovery of materials held by long-tail
scientists,
• Education and reward system for scientists,
• Cross-domain teams and broad collaboration
• A new community paradigm.
15. Acknowledgements
• Don Middleton, NCAR
• Dave Fulker, OPeNDAP
• Robert Gibb, New Zealand Landcare Research
• Amarnath Gupta, UCS
• Robert Jacob, ANL
• Jeff Heard, U. of North Carolina
• Chris Jenkins, JPL
• Doug Lindholm, U. of Colorado
• Craig Mattocks, U. Miami
• Joseph Baker, Virginia Tech
• Beth Plale, Indiana Univ.
• Anne Wilson, U of Colorado
• Stephen M. Richard, AZGS
• Chris Lynnes, NASA/ESIP Federation
• Sameer Sirugeri, Microsoft
• Karsten Steinhauser, U. of Minnesota
• Zhangfan Xing, JPL
• John Williams, NCAR
• Ruth Duerr, NSIDC
• Shared, standard, reusable software interfaces
– For disparate data types, disparate storage, varying protocols;
– Deliver data in user-requested format and translation between standards.
• Link various kinds of data
– Integration of high resolution topography scans & geodetic data;
– Integration of geologic data in deep time;
– Geo-located and non-geo-located datasets;
– Observation and simulation datasets for comparison.
• Real-time access to data and facilities
– Capabilities within Cloud and Grid, such as shared storage and data spaces;
– In low bandwidth settings;
– Simulation and modeling capabilities within HPC and Science Portals.
Access refers to software and activities that make data and computational resources easily, efficiently and reliably available to scientists.
Access Paradigms: the SaaS model and the brokering approach. The SaaS model increases usage and adoption by making access to data and resources easy and convenient. The brokering approach implements mediation and distribution capabilities in a transparent way. Discuss these paradigms in the context of the needs of the publishers of big data and the needs of the long-tail geoscientist, along with issues relating to access control, confidentiality, and the role of governance bodies for emerging access paradigms.
Structural Data Integration for Access: issues relating to data a, data models, and standards for data integration. Discuss novel data types needed by current science cases and their abstraction to data models and knowledge-based models based on space-time integration.
Scalable Resource Access: scalable access to resources, such as HPC systems and cloud-based systems (parallel storage systems, parallel analysis systems such as map-reduce [8], Hadoop, SciDB [19]), especially at marginal cost. This includes storing and manipulating data even when the structure of the data is not fully known to the system, and associating the cloud with a set of services for recognizing the structure of a wide variety of file types used in geoscience applications, extracting structure from the data, and traversing files to extract metadata.
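To make the brokering approach above concrete, here is a minimal sketch, assuming two hypothetical archives that publish temperature records in different native shapes. The names `Broker`, `archive_a`, `archive_b`, and the common record fields are illustrative assumptions, not part of any EarthCube design. The broker distributes one query to every registered source and mediates each native record into a common form, so the client never sees the differences.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]  # assumed common schema: variable, value, units


class Broker:
    """Hypothetical broker: distributes a query and mediates the results."""

    def __init__(self) -> None:
        self._sources: Dict[
            str, Tuple[Callable[[], Iterable[Any]], Callable[[Any], Record]]
        ] = {}

    def register(self, name: str,
                 fetch: Callable[[], Iterable[Any]],
                 mediate: Callable[[Any], Record]) -> None:
        # Each source supplies its own fetcher and a mediator to the common schema.
        self._sources[name] = (fetch, mediate)

    def query(self, variable: str) -> List[Record]:
        results: List[Record] = []
        for name, (fetch, mediate) in self._sources.items():
            for raw in fetch():
                record = mediate(raw)
                if record["variable"] == variable:
                    results.append({**record, "source": name})
        return results


# Hypothetical source A: dicts holding temperatures in kelvin.
def archive_a() -> List[Dict[str, Any]]:
    return [{"var": "temp", "kelvin": 290.0}]


def mediate_a(raw: Dict[str, Any]) -> Record:
    return {"variable": raw["var"], "value": raw["kelvin"] - 273.15, "units": "degC"}


# Hypothetical source B: (name, value, units) tuples.
def archive_b() -> List[Tuple[str, float, str]]:
    return [("temp", 17.0, "degC"), ("precip", 3.2, "mm")]


def mediate_b(raw: Tuple[str, float, str]) -> Record:
    name, value, units = raw
    return {"variable": name, "value": value, "units": units}


broker = Broker()
broker.register("archive_a", archive_a, mediate_a)
broker.register("archive_b", archive_b, mediate_b)
results = broker.query("temp")
```

The client issues a single `query("temp")` call; adding a third archive only requires registering another fetcher and mediator.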
However, in cases where researchers are interested in studying a phenomenon, can an EarthCube framework provide adequate semantics to express a search query, a generic model for data access of events, and interactively discover ‘events’ within data and perform ‘first look’ analytics, while keeping provenance and history of all analyses?
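One way to picture the question above is a small sketch, assuming an 'event' is simply a contiguous run of values above a threshold in a one-dimensional series. The `Event` type, the `find_events` function, and the provenance record layout are hypothetical and only illustrate discovering events, taking a first look (the peak value), and keeping a history of each analysis.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class Event:
    start: int   # index of the first sample above the threshold
    end: int     # index of the last sample above the threshold (inclusive)
    peak: float  # 'first look' statistic: maximum value within the event


def find_events(series: Sequence[float], threshold: float,
                provenance: List[Dict[str, Any]]) -> List[Event]:
    """Find contiguous runs above `threshold` and log the analysis."""
    events: List[Event] = []
    start = None
    for i, value in enumerate(series):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            events.append(Event(start, i - 1, max(series[start:i])))
            start = None
    if start is not None:  # the series ended while still inside an event
        events.append(Event(start, len(series) - 1, max(series[start:])))
    # Record what was run, with which parameter, and what it produced.
    provenance.append({"op": "find_events", "threshold": threshold,
                       "n_input": len(series), "n_events": len(events)})
    return events


history: List[Dict[str, Any]] = []
series = [0.1, 0.9, 1.4, 0.2, 0.0, 1.1, 1.3]
events = find_events(series, 0.8, history)
```

Re-running with a different threshold appends a second entry to `history`, so every exploratory pass stays traceable.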
Earlier, resources were at the center, and data was massaged so that the resources and services could access it. But now the data is going to be central and services will feed into it and so the
The Sher Dataspace embodies a “semi-structured” notion compared, on the one hand, with rigidly structured systems like, say, relational database systems, where a data schema needs to be specified first before data can be stored and, on the other hand, with, say, filesystems, which are unstructured and do not support any notion of a schema or content-based metadata. In Sher, data can be ingested as a file (or a heterogeneous package, e.g. a folder) with minimal metadata. Services are provided for capturing this metadata as well as the package structure. Further services are provided for ongoing structuring and refinement of the data and metadata. Examples include user-specified annotations; extraction of information for well-known filetypes (e.g., netCDF); extraction of metadata for proprietary file types using software libraries (e.g., NMR data); structuring of data and associated information, e.g. associating a set of flat files with a database along with the set of data cleaning routines and load scripts that were used to create the data, etc. Thus, the Dataspace concept supports the model of data being transformed incrementally from a relatively unstructured state with minimal metadata, to a highly structured form with rich metadata, using an array of structuring and refinement services.
A key enabling characteristic of Sher is its open, extensible architecture that supports the Software as a Service (SaaS) model, thereby removing the burden of maintaining software and software environments from the client [52]. Using this SaaS model, Sher facilitates creation of third-party services that can be contributed into the system, i.e., a SherStore, similar to the Apple AppStore, including the notion of vetting contributed services prior to their incorporation.
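The incremental model described above can be sketched roughly as follows. This is not the Sher API: `DataSpace`, `Item`, `contribute`, and `sniff_format` are hypothetical names, and the `vetted` flag is a stand-in for whatever vetting process a real system would apply. The sketch ingests raw bytes with minimal metadata, then lets a contributed service enrich that metadata later.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Item:
    name: str
    payload: bytes
    metadata: Dict[str, Any] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # refinement services applied


Refiner = Callable[[Item], Dict[str, Any]]


class DataSpace:
    """Hypothetical semi-structured store: ingest first, structure later."""

    def __init__(self) -> None:
        self.items: Dict[str, Item] = {}
        self.services: Dict[str, Refiner] = {}

    def ingest(self, name: str, payload: bytes) -> Item:
        # Raw ingest: no schema is required, only minimal metadata is captured.
        item = Item(name, payload, {"size_bytes": len(payload)})
        self.items[name] = item
        return item

    def contribute(self, service_name: str, refiner: Refiner, vetted: bool) -> None:
        # Assumption: vetting happens elsewhere and is reduced to a flag here.
        if not vetted:
            raise ValueError(f"service {service_name!r} has not been vetted")
        self.services[service_name] = refiner

    def refine(self, name: str, service_name: str) -> Item:
        item = self.items[name]
        item.metadata.update(self.services[service_name](item))
        item.history.append(service_name)
        return item


def sniff_format(item: Item) -> Dict[str, Any]:
    # Classic netCDF files begin with the bytes "CDF"; anything else stays unknown.
    if item.payload[:3] == b"CDF":
        return {"format": "netCDF (classic)"}
    return {"format": "unknown"}


space = DataSpace()
space.ingest("run1.nc", b"CDF\x01" + b"\x00" * 8)
space.contribute("sniff_format", sniff_format, vetted=True)
item = space.refine("run1.nc", "sniff_format")
```

Each call to `refine` adds metadata and records which service produced it, which mirrors the move from a relatively unstructured state toward richer structure.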