
How Cyverse.org enables scalable data discoverability and re-use

Cyverse.org designs, builds, and operates an innovative, integrated life sciences cyberinfrastructure. It provides data management and analysis capabilities with point and click, cloud, API, and command-line interfaces that engage users of any computing proficiency and is based on an extensible platform that integrates local and national-scale HPC, storage, and cloud resources. Cyverse directly supports thousands of users who store and access over 2PB of research data, use millions of compute hours annually, and participate in the platform's improvement, plus a secondary user community from partner projects that have built atop it. Cyverse is organized around "Data Store" and "App Catalog" services, each of which enables users to upload digital research assets that can be kept private, shared, or made public. Recently, Cyverse has been transitioning from passively enabling digital sharing towards active facilitation. It is partnering with repositories like NCBI SRA to enable direct submission from Cyverse applications, adopting commonly-used ontologies, enabling import/export of virtual machine images, developing metadata-driven persistent landing pages for data sets, and providing DOI (and other identifier) services. These new features are expected to further catalyze growth of an interoperable, interconnected network of shared research infrastructure across the biological sciences.


  1. Transforming Science Through Data-driven Discovery. How Cyverse.org enables scalable data discoverability and re-use. Matt Vaughn, co-PI, @mattdotvaughn, vaughn@tacc.utexas.edu
  2. History and Context. ~$100M direct NSF investment over 10 years. Currently working to sustain its successes beyond 2018. iPlant 2008: Empowering a New Plant Biology. iPlant 2013: Cyberinfrastructure for Life Science. CyVerse 2016: Transforming Science Through Data-Driven Discovery. Plant Science Cyberinfrastructure Collaborative: a "new type of organization" that is "community-driven", uniting "biologists, computer and information scientists and experts from other disciplines working in an integrated team" to provide "computational and cyberinfrastructure capabilities and expertise that are capable of handling large and heterogeneous plant biology data sets"
  3. What is Cyberinfrastructure? • Data storage and retrieval • Software (system & user) • Computing capability • Human expertise and support. Organized into systems that solve problems of size and scope that would not otherwise be solvable
  4. Platform Overview. Diagram labels: Ready to use Platforms; Foundational Capabilities; Established CI Components; Extensible Services; Ease of Use
  5. Adoption and Outputs • Over 40K registered users (15-20% active) • Millions of computing hours on XSEDE, campus HPC, Cyverse systems, and commercial cloud • 2+ PB user data stored in CyVerse Data Store • Hundreds of publications, courses, and discoveries • Spin-off technologies • Jetstream: NSF production cloud • Syndicate: Software-defined storage system • Agave API: Multitenant science PaaS • Communities such as iAnimal, iMicrobe, iPlant.UK • 3rd party software resources using it as a platform
  6. Finding and re-using Data (1). Diagram labels: Federation; Metadata; iRODS (2+PB); ElasticSearch; Tucson Resources; Austin Resources; Catalog Servers; CSHL Resource; iPlant.UK Resources; Data Store APIs; Agave API; AWS S3; Public FTP; SFTP. At the heart of all Cyverse applications is a data-centric architecture, designed to be scaled and extended
  7. Finding and re-using Data (2) • Browser-based file manager • Upload from local or URI • Download • Add/Edit comments and tags • AVU metadata + structured templates • Share with collaborators or any Cyverse user. Caption: The Cyverse Discovery Environment Data Window
  8. Finding and re-using Data (3) • Browser-based file manager • Upload from local or URI • Download • Add/Edit comments and tags • AVU metadata + structured templates • Share with collaborators or any Cyverse user. Google Drive, for big data. Caption: The Cyverse Discovery Environment Data Window
  9. Finding and re-using Software (1) • Extendable App Catalog • Provide Dockerfile + GUI specification • Develop VM image • Deploy application web service. Caption: Info view for a Cyverse Discovery Environment application
  10. Finding and re-using Software (2) • Extendable App Catalog • Provide Dockerfile + GUI specification • Develop VM image • Deploy application web service • Require links to documentation, example files and usage, appropriate software and domain ontologies. Caption: Public or shared Atmosphere VM images tagged with “GWAS”
  11. Finding and re-using Software (3) • Extendable App Catalog • Provide Dockerfile + GUI specification • Develop VM image • Deploy application web service • Require links to documentation, example files and usage, appropriate software and domain ontologies • Give credit to app author and software author. Caption: Application and Data catalogs available to 3rd parties
  12. Cyverse Data Commons (1). Data Commons Landing Page (1.0): Persistent URL for each data set. No authentication required. Fast browsing and retrieval. NCBI SRA Submission Workflow in DE: Cyverse is the analysis home for a lot of genomics data. To get it off our systems, we need to help get it into the SRA!
  13. Cyverse Data Commons (2). Actively facilitating publication and discovery of data stored with CyVerse. Data workflow: Candidate Research Data @ Data Store → Identify, organize, rename files and folders → Prepare a DataCite metadata document → Submit to Cyverse Curation Team → Data snapshot made public. DOI issued. VM workflow: Candidate VM image → Document contents & capabilities → Prepare a DataCite metadata document → Submit to Cyverse Curation Team → Public image released. DOI issued.
  14. Summary • Cyverse is a model for providing cyberinfrastructure to diverse bioscience user communities • State of the art has shifted at least twice since we started work • Had to overcome initial reticence to “give data” to Cyverse • Still hard to get developers and providers to maintain after contributing • Cost recovery model - We have started using the term ‘subsidized’ rather than free but it might be too late. • Natural synergy between our organization and ODEN objectives
  15. Transforming Science Through Data-driven Discovery. CyVerse Executive Team: Parker Antin, Nirav Merchant, Eric Lyons, Matt Vaughn (@mattdotvaughn, vaughn@tacc.utexas.edu), Doreen Ware, Dave Micklos. CyVerse is supported by the National Science Foundation under Grant No. DBI-0735191 and DBI-1265383.
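
The "AVU metadata + structured templates" item on slides 7 and 8 can be pictured with a small sketch. AVU stands for attribute-value-unit, the triple form iRODS uses for metadata. The code below is only an in-memory illustration: the class, function names, template fields, and file paths are all hypothetical, and it does not call the CyVerse Data Store or iRODS APIs. It shows how a template might flag missing attributes and how tagged files could be found by attribute and value.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class AVU:
    """One attribute-value-unit triple attached to a file or folder."""
    attribute: str
    value: str
    unit: str = ""


def missing_attributes(avus: List[AVU], template: Set[str]) -> Set[str]:
    """Return template attributes that have no non-empty value in avus."""
    present = {a.attribute for a in avus if a.value.strip()}
    return template - present


def find_paths(catalog: Dict[str, List[AVU]], attribute: str, value: str) -> List[str]:
    """Return sorted paths whose metadata contains the attribute/value pair."""
    return sorted(
        path for path, avus in catalog.items()
        if any(a.attribute == attribute and a.value == value for a in avus)
    )


# Hypothetical template and catalog, for illustration only.
SAMPLE_TEMPLATE = {"organism", "tissue", "read_length"}

catalog = {
    "/example/run1.fastq": [
        AVU("organism", "Arabidopsis thaliana"),
        AVU("tissue", "leaf"),
        AVU("read_length", "100", "bp"),
    ],
    "/example/run2.fastq": [
        AVU("organism", "Zea mays"),
    ],
}

if __name__ == "__main__":
    for path, avus in catalog.items():
        print(path, "missing:", sorted(missing_attributes(avus, SAMPLE_TEMPLATE)))
    print(find_paths(catalog, "organism", "Zea mays"))
```

A structured template in this sense is just a named set of expected attributes; checking it at upload or sharing time is one plausible way to keep shared data discoverable.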

Editor's notes

  • (Brief) History and Context
    In the mid-2000s, realization inside the NSF that biology had some unique CI challenges not being met
    Plant Genome was already spending on full-genome characterization projects (Arabidopsis 2010, etc).
    Big data was on horizon - NGS just emergent
    BIO-specific CI. Chose plant sciences due to strong communities and sharing culture. 
    Funded iPlant in 2008
    Project spent its first 18 months assessing the immediate and future needs of plant science, then began developing CI
    Renewed in 2013, with broadened mandate to cover BIO in general excepting human disease
    Rebranded in 2016 as part of a strategy to operate sustainably after initial program is over.
  • What is Cyberinfrastructure?
    Before diving into specifics, define Cyberinfrastructure
    This is remarkably similar to the definition of a Commons
    So, our charge was:
    Blend data storage + computing capability, reproducible analysis, and human expertise
  • Platform Overview
    Vertically integrated set of offerings that serve a variety of users (technical skill, science use case, geographic location, etc)
    Data Storage is centralized, sharing is easy. Tied to ability to analyze in situ. 
    Ease of use <-> Ease of Re-use
    Everything below the consumer-facing layer: LEGO building blocks
    At the bottom: Federation is baked in. We own almost no hardware! This is key. Hard to sustain!  
  • Adoption and Outputs (END 6:00)
    So, what if you build it and they don’t come? Luckily, they did.
    On average, we serve as many users as other major CI investments like leadership class clusters or the XSEDE project. But different users!
    Home to lots of training and consulting (~25% effort)
    Cyverse has spun out at least three successful open data ecosystem products
  • Finding and re-using Data 1
    EARLY DESIGN DECISION: Availability of a scalable “Data Store”
    OPTIONAL: You don’t have to keep all your data there, but we hope to add sufficient value that you do. 
    Tech Stack: There was nothing ready to go. Combines iRODS + ElasticSearch + Agave APIs
    Currently 2+ PB of user files. At UA this is purchased as needed. At TACC, sliced from our Corral storage offering. CSHL and iPlant.UK federating in.
    Agave APIs give us access to other storage protocols like S3, SFTP, FTP, Azure, etc.
  • Finding and re-using Data 2
    Why don’t you just give us Google, Dropbox, Box? 
    Data Store APIs let us implement Data Window GUI.
    Here’s an example from Cyverse’s DE workbench
    Comprehensive, easy Data Management, but petascale
    Aside: Provenance under the hood, but we don’t expose via UI yet
  • Finding and re-using Data 3
    Google Drive for Scientific Big Data
    Can do local caching as well but hard to do native support well
    This has been our story to date on data. More in a minute
  • Finding and re-using Software (1)
    Reagents (Data) and Protocols (Apps) both must be sharable and reusable
    Software -> Application Catalog
    Each front-end GUI has its own concept and implementation but share common infrastructure or are interoperable
    Here’s the DE, our flagship workbench application
    Deploying apps to these catalogs involves
    Docker or VM image
    GUI specifications (written in some DSL or metadata form)
    About half of applications in Cyverse are community contributed
  • Finding and re-using Software (2)
    Here’s ATMOSPHERE image catalog
    Mandate provision of help docs and example data/usage
  • Finding and re-using Software (3)
    Here’s an example of Cyverse App and Data available in a 3rd PARTY APPLICATION
    Give credit and attribution to App contributor as well as primary software author (if different)
  • Turning back to data: Cyverse Data Commons (1) (12:00)
    To date, Cyverse strategy around data has been
    "Bring it in, use and discover within the platform, we won’t lock you in"
    This was not selfish - adopters needed a clear path and we wanted to be sure our CI was externally reliable
    We’ve been working to broaden this approach as our technology has matured under a banner called “Cyverse Data Commons"
    We hold a lot of primary data. Some of it has a natural home, like NCBI. We have taken responsibility to help that happen.
    In other cases, it makes sense to publish in place
    No natural repository
    Data is too large to move
    There is an expectation that re-users will perform extensive re-analysis on it
    Accomplish this now with “Community Data” and deep-linking
    Improving offerings over course of 2016
  • Cyverse Data Commons (2)
    Here are two example workflows being implemented
    Both result in a persistent, resolvable identifier
    Note: The VM workflow is already implemented in our sister project Jetstream. Images can be exported and are being published at IU Scholarworks
    Uses DataCite schema.
    Indexed by public search engines
    Feeds into our ElasticSearch-based metadata service to allow easy search and retrieve
    Search API will be publicly accessible later this year
  • Bullet points
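
Both Data Commons workflows in the notes above include the step "Prepare a DataCite metadata document" before submission to the curation team. A minimal sketch of a pre-submission completeness check follows. The field names are an assumption loosely modelled on the DataCite schema's mandatory properties and should be verified against the schema version in use. The identifier is left out because, per the slides, the DOI is issued after curation. The record contents and function names are hypothetical.

```python
from typing import Dict, List

# Assumption: these mirror the DataCite schema's mandatory properties
# (minus the identifier, which is issued after curation).
REQUIRED_FIELDS = ("creators", "titles", "publisher", "publicationYear", "resourceType")


def missing_fields(record: Dict[str, object]) -> List[str]:
    """List required fields that are absent or empty in the record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]


def ready_for_curation(record: Dict[str, object]) -> bool:
    """A record is ready to submit once no required field is missing."""
    return not missing_fields(record)


# Hypothetical draft record for a data set, for illustration only.
draft = {
    "creators": ["Example, Researcher"],
    "titles": ["Example sequencing data set"],
    "publisher": "CyVerse Data Commons",
    "publicationYear": "",
    "resourceType": "Dataset",
}

if __name__ == "__main__":
    print(missing_fields(draft))
    print(ready_for_curation(draft))
```

A check of this kind could run before the hand-off to curators, so that incomplete records are caught by the submitter rather than during review.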
