Cyverse.org designs, builds, and operates an innovative, integrated life sciences cyberinfrastructure. It provides data management and analysis capabilities with point and click, cloud, API, and command-line interfaces that engage users of any computing proficiency and is based on an extensible platform that integrates local and national-scale HPC, storage, and cloud resources. Cyverse directly supports thousands of users who store and access over 2PB of research data, use millions of compute hours annually, and participate in the platform's improvement, plus a secondary user community from partner projects that have built atop it. Cyverse is organized around "Data Store" and "App Catalog" services, each of which enables users to upload digital research assets that can be kept private, shared, or made public. Recently, Cyverse has been transitioning from passively enabling digital sharing towards active facilitation. It is partnering with repositories like NCBI SRA to enable direct submission from Cyverse applications, adopting commonly-used ontologies, enabling import/export of virtual machine images, developing metadata-driven persistent landing pages for data sets, and providing DOI (and other identifier) services. These new features are expected to further catalyze growth of an interoperable, interconnected network of shared research infrastructure across the biological sciences.
How Cyverse.org enables scalable data discoverability and re-use
1. Transforming Science Through Data-driven Discovery
How Cyverse.org enables scalable data discoverability and re-use
Matt Vaughn, co-PI
@mattdotvaughn
vaughn@tacc.utexas.edu
2. History and Context
~ $100m direct NSF investment over 10 years
Currently working to sustain its successes beyond 2018
iPlant 2008: Empowering a New Plant Biology
iPlant 2013: Cyberinfrastructure for Life Science
CyVerse 2016: Transforming Science Through Data-Driven Discovery
Plant Science Cyberinfrastructure Collaborative
A "new type of organization" that is "community-driven" uniting "biologists, computer and information scientists and experts from other disciplines working in an integrated team" to provide "computational and cyberinfrastructure capabilities and expertise that are capable of handling large and heterogeneous plant biology data sets"
3. What is Cyberinfrastructure?
•Data storage and retrieval
•Software (system & user)
•Computing capability
•Human expertise and support
Organized into systems that solve problems of size and scope that would not otherwise be solvable
4. Platform Overview
Ready to use Platforms
Foundational Capabilities
Established CI Components
Extensible Services
Ease of Use
5. Adoption and Outputs
• Over 40K registered users (15-20% active)
• Millions of computing hours on XSEDE, campus HPC, Cyverse systems, and commercial cloud
• 2+ PB user data stored in CyVerse Data Store
• Hundreds of publications, courses, and discoveries
• Spin-off technologies
  • Jetstream: NSF production cloud
  • Syndicate: Software-defined storage system
  • Agave API: Multitenant science PaaS
• Communities such as iAnimal, iMicrobe, iPlant.UK
• 3rd party software resources using it as a platform
6. Federation
Metadata
Finding and re-using Data (1)
iRODS (2+PB)
ElasticSearch
Tucson Resources
Austin Resources
Catalog Servers
CSHL Resource
iPlant.UK Resources
Data Store APIs
Agave API
AWS S3
Public FTP
SFTP
At the heart of all Cyverse applications is a data-centric architecture, designed to be scaled and extended
7. Finding and re-using Data (2)
• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured templates
• Share with collaborators or any Cyverse user
The Cyverse Discovery Environment Data Window
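The AVU metadata listed above follows the iRODS attribute-value-unit model: each entry is a triple attached to a file or folder. A minimal sketch in Python of how such triples might be checked against a structured template. The `AVU` class, the `missing_attributes` helper, the template contents, and the sample values are all assumptions for illustration, not the Cyverse or iRODS API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVU:
    """One attribute-value-unit triple, as in the iRODS metadata model."""
    attribute: str
    value: str
    unit: str = ""  # the unit is optional and is often left empty


def missing_attributes(avus, required):
    """Return the required attribute names that no AVU supplies, sorted."""
    present = {avu.attribute for avu in avus}
    return sorted(set(required) - present)


# Hypothetical template: attribute names a data set is expected to carry.
TEMPLATE = {"organism", "read_length", "sequencing_platform"}

# Hypothetical metadata attached to one file.
avus = [
    AVU("organism", "Zea mays"),
    AVU("read_length", "150", "bp"),
]

print(missing_attributes(avus, TEMPLATE))
```

Here the check reports `sequencing_platform` as the one template attribute still missing from the file's metadata.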
8. Finding and re-using Data (3)
Google Drive, for big data
The Cyverse Discovery Environment Data Window
9. Finding and re-using Software (1)
• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service
Info view for a Cyverse Discovery Environment application
10. Finding and re-using Software (2)
• Require links to documentation, example files and usage, appropriate software and domain ontologies
Public or shared Atmosphere VM images tagged with “GWAS”
11. Finding and re-using Software (3)
• Give credit to app author and software author
Application and Data catalogs available to 3rd parties
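The catalog requirements on these slides (links to documentation, example files, ontology terms, and credit to both the app author and the software author) amount to a checklist that could be enforced when an app is submitted. A hypothetical sketch in Python; the field names and the `validate_entry` helper are assumptions for illustration and do not reflect the actual Cyverse app specification format:

```python
# Hypothetical field names standing in for the requirements on the slides.
REQUIRED_FIELDS = (
    "documentation_url",
    "example_files",
    "ontology_terms",
    "app_author",
    "software_author",
)


def validate_entry(entry):
    """Return the required fields that are absent or empty in a catalog entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


# A made-up submission with an empty ontology list and no software author.
entry = {
    "name": "example-aligner",
    "documentation_url": "https://example.org/docs",
    "example_files": ["reads.fastq"],
    "ontology_terms": [],
    "app_author": "A. Contributor",
}

print(validate_entry(entry))
```

Treating an empty list the same as a missing field means a submission cannot satisfy the ontology requirement with a placeholder.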
12. Cyverse Data Commons (1)
Data Commons Landing Page (1.0)
Persistent URL for each data set. No authentication required. Fast browsing and retrieval.
NCBI SRA Submission Workflow in DE
Cyverse is the analysis home for a lot of genomics data. To get it off our systems, we need to help get it into the SRA!
13. Cyverse Data Commons (2)
Actively facilitating publication and discovery of data stored with CyVerse
Candidate Research Data @ Data Store → Identify, organize, rename files and folders → Prepare a DataCite metadata document → Submit to Cyverse Curation Team → Data snapshot made public. DOI issued.
Candidate VM image → Document contents & capabilities → Prepare a DataCite metadata document → Submit to Cyverse Curation Team → Public image released. DOI issued.
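Both workflows on this slide include preparing a DataCite metadata document before submission. As a rough sketch, the Python snippet below checks a draft record for the properties the DataCite schema treats as mandatory besides the identifier itself (creators, titles, publisher, publication year, resource type), since the DOI is issued only at the final step. The dictionary layout, the key names, the `missing_mandatory` helper, and the sample values are assumptions for illustration, not the format the Cyverse curation team uses:

```python
# Mandatory DataCite properties a submitter supplies before a DOI exists.
# The key names are simplified placeholders, not the official serialization.
SUBMITTER_FIELDS = (
    "creators",
    "titles",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_mandatory(record):
    """Return the mandatory properties that are absent or empty in a draft."""
    return [name for name in SUBMITTER_FIELDS if not record.get(name)]


# Made-up draft record for a data set awaiting curation.
draft = {
    "creators": ["Researcher, Example"],
    "titles": ["Example RNA-seq data set"],
    "publication_year": 2016,
    "resource_type": "Dataset",
}

print(missing_mandatory(draft))

# Filling in the publisher completes the draft.
draft["publisher"] = "Example Publisher"
print(missing_mandatory(draft))
```

A check like this could run before the "Submit to Cyverse Curation Team" step so that incomplete documents are caught early.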
14. Summary
• Cyverse is a model for providing cyberinfrastructure to diverse bioscience user communities
• State of the art has shifted at least twice since we started work
• Had to overcome initial reticence to “give data” to Cyverse
• Still hard to get developers and providers to maintain after contributing
• Cost recovery model - We have started using the term ‘subsidized’ rather than free but it might be too late.
• Natural synergy between our organization and ODEN objectives
15. Transforming Science Through Data-driven Discovery
Parker Antin
Nirav Merchant
Eric Lyons
Matt Vaughn
@mattdotvaughn
vaughn@tacc.utexas.edu
Doreen Ware
Dave Micklos
CyVerse is supported by the National Science Foundation under Grant No. DBI-0735191 and DBI-1265383.
CyVerse Executive Team
Editor's notes
(Brief) History and Context
In the mid-2000s, realization inside the NSF that biology had some unique CI challenges not being met
Plant Genome was already spending on full-genome characterization projects (Arabidopsis 2010, etc).
Big data was on horizon - NGS just emergent
BIO-specific CI. Chose plant sciences due to strong communities and sharing culture.
Funded iPlant in 2008
The project spent its first 18 months assessing the immediate and future needs of plant science, then began developing CI
Renewed in 2013, with broadened mandate to cover BIO in general excepting human disease
Rebranded in 2016 as part of a strategy to operate sustainably after initial program is over.
What is Cyberinfrastructure?
Before diving in to specifics, define Cyberinfrastructure
This is remarkably similar to the definition of a Commons
So, our charge was:
Blend data storage + computing capability, reproducible analysis, and human expertise
Platform Overview
Vertically integrated set of offerings that serve a variety of users (technical skill, science use case, geographic location, etc)
Data Storage is centralized, sharing is easy. Tied to ability to analyze in situ.
Ease of use <-> Ease of Re-use
Everything below the consumer-facing layer: LEGO building blocks
At the bottom: Federation is baked in. We own almost no hardware! This is key. Hard to sustain!
Adoption and Outputs (END 6:00)
So, what if you build it and they don’t come? Luckily, they did.
On average, we serve as many users as other major CI investments like leadership class clusters or the XSEDE project. But different users!
Home to lots of training and consulting (~25% effort)
Cyverse has spun out at least three successful open data ecosystem products
Finding and re-using Data 1
EARLY DESIGN DECISION: Availability of a scalable “Data Store”
OPTIONAL: You don’t have to keep all your data there, but we hope to add sufficient value that you do.
Tech Stack: There was nothing ready to go. Combines iRODS + ElasticSearch + Agave APIs
Currently 2+ PB of user files. At UA this is purchased as needed. At TACC, it is sliced from our Corral storage offering. CSHL and iPlant.UK are federating in.
Agave APIs give us access to other storage protocols like S3, SFTP, FTP, Azure, etc.
Finding and re-using Data 2
Why don’t you just give us Google, Dropbox, Box?
Data Store APIs let us implement Data Window GUI.
Here’s an example from Cyverse’s DE workbench
Comprehensive, easy Data Management, but petascale
Aside: Provenance under the hood, but we don’t expose via UI yet
Finding and re-using Data 3
Google Drive for Scientific Big Data
Can do local caching as well but hard to do native support well
This has been our story to date on data; more in a minute
Finding and re-using Software (1)
Reagents (Data) and Protocols (Apps) both must be sharable and reusable
Software -> Application Catalog
Each front-end GUI has its own concept and implementation, but they share common infrastructure or are interoperable
Here’s the DE, our flagship workbench application
Deploying apps to these catalogs involves
Docker or VM image
GUI specifications (written in some DSL or metadata form)
About half of applications in Cyverse are community contributed
Finding and re-using Software (2)
Here’s ATMOSPHERE image catalog
Mandate provision of help docs and example data/usage
Finding and re-using Software (3)
Here’s an example of Cyverse App and Data available in a 3rd PARTY APPLICATION
Give credit and attribution to App contributor as well as primary software author (if different)
Turning back to data
Cyverse Data Commons (1) (12:00)
To date, Cyverse strategy around data has been
"Bring it in, use and discover within the platform, we won’t lock you in"
This was not selfish - adopters needed a clear path and we wanted to be sure our CI was externally reliable
We’ve been working to broaden this approach as our technology has matured under a banner called “Cyverse Data Commons"
We hold a lot of primary data. Some of it has a natural home, like NCBI. We have taken responsibility for helping it get there.
In other cases, it makes sense to publish in place
No natural repository
Data is too large to move
There is an expectation that re-users will perform extensive re-analysis on it
Accomplish this now with “Community Data” and deep-linking
Improving offerings over course of 2016
…
Cyverse Data Commons (2)
Here are two example workflows being implemented
Both result in a persistent, resolvable identifier
Note: The VM workflow is already implemented in our sister project Jetstream. Images can be exported and are being published at IU Scholarworks
Uses DataCite schema.
Indexed by public search engines
Feeds into our ElasticSearch-based metadata service to allow easy search and retrieve
Search API will be publicly accessible later this year