How Cyverse.org enables scalable data discoverability and re-use

Transforming Science Through Data-driven Discovery
Matt Vaughn, co-PI
@mattdotvaughn
vaughn@tacc.utexas.edu

History and Context
~$100M direct NSF investment over 10 years
Currently working to sustain its successes beyond 2018
iPlant 2008: Empowering a New Plant Biology
iPlant 2013: Cyberinfrastructure for Life Science
CyVerse 2016: Transforming Science Through Data-Driven Discovery
Plant Science Cyberinfrastructure Collaborative
A "new type of organization" that is "community-driven" uniting "biologists, computer and information scientists and experts from other disciplines working in an integrated team" to provide "computational and cyberinfrastructure capabilities and expertise that are capable of handling large and heterogeneous plant biology data sets"

What is Cyberinfrastructure?
•Data storage and retrieval
•Software (system & user)
•Computing capability
•Human expertise and support
Organized into systems that solve problems of size and scope that would not otherwise be solvable

Platform Overview
[Figure: platform layers (Ready-to-use Platforms, Foundational Capabilities, Established CI Components, Extensible Services) arranged along an Ease of Use axis]

Adoption and Outputs
• Over 40K registered users (15-20% active)
• Millions of computing hours on XSEDE, campus HPC, CyVerse systems, and commercial cloud
• 2+ PB user data stored in CyVerse Data Store
• Hundreds of publications, courses, and discoveries
• Spin-off technologies
  • Jetstream: NSF production cloud
  • Syndicate: Software-defined storage system
  • Agave API: Multitenant science PaaS
• Communities such as iAnimal, iMicrobe, iPlant.UK
• 3rd party software resources using it as a platform

Finding and re-using Data (1)
At the heart of all CyVerse applications is a data-centric architecture, designed to be scaled and extended
[Figure: Data Store architecture. Federation: iRODS (2+ PB) catalog servers with Tucson, Austin, CSHL, and iPlant.UK resources. Metadata: ElasticSearch. Access: Data Store APIs, Agave API, AWS S3, public FTP, SFTP]

• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured templates
• Share with collaborators or any CyVerse user
The CyVerse Discovery Environment Data Window

Google Drive, for big data
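
"AVU metadata" refers to the attribute-value-unit triples that iRODS attaches to files and folders. The sketch below shows one way such triples and a structured template could be modeled. The `AVU` class, the `check_template` helper, and the attribute names in `TEMPLATE` are hypothetical illustrations, not the Discovery Environment's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVU:
    """One attribute-value-unit triple (the metadata shape used by iRODS)."""
    attribute: str
    value: str
    unit: str = ""  # the unit is optional and is often left empty


def check_template(avus, required_attributes):
    """Return, sorted, the required attributes that have no non-empty value."""
    present = {a.attribute for a in avus if a.value.strip()}
    return sorted(set(required_attributes) - present)


# Hypothetical template: these attribute names are assumptions for illustration.
TEMPLATE = ["organism", "read_length", "sequencing_platform"]

file_metadata = [
    AVU("organism", "Zea mays"),
    AVU("read_length", "150", "bp"),
]

missing = check_template(file_metadata, TEMPLATE)
print(missing)
```

A template check of this kind lets a file manager flag incomplete metadata before a data set is shared, while free-form AVUs remain possible alongside it.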

Finding and re-using Software (1)
• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service
Info view for a CyVerse Discovery Environment application

• Require links to documentation, example files and usage, appropriate software and domain ontologies
Public or shared Atmosphere VM images tagged with “GWAS”

• Give credit to app author and software author
Application and Data catalogs available to 3rd parties
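
The catalog requirements listed on these slides (documentation links, example files, ontology terms, and credit for both the app author and the software author) can be pictured as a pre-publication check. The following is a minimal sketch under stated assumptions: `CatalogEntry`, its field names, and `publication_problems` are hypothetical and do not reflect the real CyVerse catalog format.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical record for an app or VM image submitted to a catalog."""
    name: str
    app_author: str = ""        # person who integrated the app
    software_author: str = ""   # person or group who wrote the underlying tool
    documentation_url: str = ""
    example_files: list = field(default_factory=list)
    ontology_terms: list = field(default_factory=list)


def publication_problems(entry):
    """List the requirements from the slides that the entry does not yet meet."""
    problems = []
    if not entry.documentation_url:
        problems.append("missing documentation link")
    if not entry.example_files:
        problems.append("missing example files")
    if not entry.ontology_terms:
        problems.append("missing software/domain ontology terms")
    if not entry.app_author:
        problems.append("missing app author credit")
    if not entry.software_author:
        problems.append("missing software author credit")
    return problems


# Placeholder values only; none of these describe a real catalog entry.
entry = CatalogEntry(
    name="example-gwas-tool",
    app_author="A. Integrator",
    documentation_url="https://example.org/docs",
    example_files=["example/input.vcf"],
)
print(publication_problems(entry))
```

Keeping the two author fields separate mirrors the slide's point that the integrator and the original software author both receive credit.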

Cyverse Data Commons (1)
Data Commons Landing Page (1.0)
Persistent URL for each data set. No authentication required. Fast browsing and retrieval.
NCBI SRA Submission Workflow in DE
CyVerse is the analysis home for a lot of genomics data. To get it off our systems, we need to help get it into the SRA!

Cyverse Data Commons (2)
Actively facilitating publication and discovery of data stored with CyVerse
Data sets: Candidate Research Data @ Data Store → Identify, organize, rename files and folders → Prepare a DataCite metadata document → Submit to CyVerse Curation Team → Data snapshot made public. DOI issued.
VM images: Candidate VM image → Document contents & capabilities → Prepare a DataCite metadata document → Submit to CyVerse Curation Team → Public image released. DOI issued.
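
The "Prepare a DataCite metadata document" step can be illustrated with a small completeness check. This is a sketch, not the Curation Team's actual tooling: the key names loosely follow the mandatory properties of the DataCite Metadata Schema (creators, titles, publisher, publication year, general resource type), the exact serialization is an assumption, and the record values are placeholders. The identifier is left out because, per the workflow above, the DOI is issued only at the final step.

```python
# Properties assumed to be required before submission (identifier/DOI comes later).
REQUIRED = ("creators", "titles", "publisher", "publicationYear", "resourceTypeGeneral")


def missing_properties(record):
    """Return the required properties that are absent or empty, in REQUIRED order."""
    return [key for key in REQUIRED if not record.get(key)]


# Placeholder draft record; every value here is illustrative.
draft = {
    "creators": [{"name": "Researcher, Example"}],
    "titles": [{"title": "Example data set"}],
    "publisher": "Example Repository",
    "publicationYear": 2017,
    # "resourceTypeGeneral" is deliberately omitted to show the check failing
}

print(missing_properties(draft))

draft["resourceTypeGeneral"] = "Dataset"
print(missing_properties(draft))
```

Running the check before submission catches incomplete records early, so curators spend their time on content rather than on missing fields.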

Summary
• CyVerse is a model for providing cyberinfrastructure to diverse bioscience user communities
• State of the art has shifted at least twice since we started work
• Had to overcome initial reticence to “give data” to CyVerse
• Still hard to get developers and providers to maintain what they have contributed
• Cost recovery model: we have started using the term ‘subsidized’ rather than ‘free’, but it might be too late
• Natural synergy between our organization and ODEN objectives

CyVerse Executive Team
Parker Antin
Nirav Merchant
Eric Lyons
Matt Vaughn (@mattdotvaughn, vaughn@tacc.utexas.edu)
Doreen Ware
Dave Micklos
CyVerse is supported by the National Science Foundation under Grant Nos. DBI-0735191 and DBI-1265383.
