"In this talk, hear about two high-performance research services developed and operated by the Computation Institute at the University of Chicago and running on AWS. Globus.org, a high-performance, reliable, robust file transfer service, has over 10,000 registered users who have moved over 25 petabytes of data using the service. The Globus service is operated entirely on AWS, leveraging Amazon EC2, Amazon EBS, Amazon S3, Amazon SES, Amazon SNS, etc. Globus Genomics is an end-to-end next-gen sequencing analysis service with state-of-the-art research data management capabilities. Globus Genomics uses Amazon EC2 for scaling out analysis, Amazon EBS for persistent storage, and Amazon S3 for archival storage. Attend this session to learn how to move data quickly at any scale as well as how to use genomic analysis tools and pipelines for next-generation sequencers using Globus on AWS.
"
4. Most labs have limited resources
[Chart: Heidorn, NSF grants in 2007. Axis ticks: $1,000 to $1,000,000; 2000 to 8000. Annotation: < $350,000: 80% of awards, 50% of grant $$]
6. Automation is required to
apply more sophisticated
methods to far more data
Outsourcing is needed to
achieve economies of scale
in the use of automated
methods
7. Building a discovery cloud
• Identify time-consuming activities amenable to
automation and outsourcing
• Implement as high-quality, low-touch SaaS
• Leverage IaaS for reliability,
economies of scale
[Diagram: Software as a service; Platform as a service; Infrastructure as a service]
• Extract common elements as
research automation platform
Bonus question: Sustainability
8. We aspire (initially) to create a
great user experience for
research data management
What would a “dropbox for
science” look like?
10. It should be trivial to Collect, Move, Sync, Share, Analyze,
Annotate, Publish, Search, Backup, & Archive BIG DATA
… but in reality it’s often very challenging
[Diagram: Staging Store, Ingest Store, Registry, Community Store (x2), Analysis Store, Archive, Mirror; error callouts: "Expired credentials", "Permission denied", "Quota exceeded", "Network failed. Retry."]
14.
1) User A selects file(s) to share; selects user/group, sets share permissions
2) Globus Online tracks shared files; no need to move files to cloud storage!
3) User B logs in to Globus Online and accesses shared file
[Diagram: Data Source]
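The three steps above amount to recording an access grant against data that stays where it is. A minimal Python sketch of that idea follows, assuming a hypothetical in-memory `ShareRegistry`; the class, method, and field names are illustrative and are not taken from the Globus Online API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Share:
    """One grant: who may do what to a path on the owner's own storage."""
    owner: str
    path: str               # location on the data source; the file is never copied
    grantee: str            # a user or group name
    permissions: frozenset


@dataclass
class ShareRegistry:
    """Hypothetical stand-in for the service's record of shared files."""
    shares: list = field(default_factory=list)

    def share(self, owner, path, grantee, permissions=("read",)):
        # Step 1: User A selects a file, a user/group, and permissions.
        grant = Share(owner, path, grantee, frozenset(permissions))
        # Step 2: only the grant is tracked; no data moves to cloud storage.
        self.shares.append(grant)
        return grant

    def can_access(self, user, path, permission="read", groups=()):
        # Step 3: User B logs in and is checked against the recorded grants.
        identities = {user, *groups}
        return any(
            s.path == path
            and s.grantee in identities
            and permission in s.permissions
            for s in self.shares
        )
```

Because the registry stores only grants, revoking a share in this sketch would mean deleting a record, not deleting or moving data.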
15. Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
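"Reliability via transfer retries" can be illustrated with a retry loop that backs off between attempts. The slides do not describe the service's actual retry policy, so the sketch below is only one common approach (exponential backoff); `transfer_with_retries` and `TransientTransferError` are hypothetical names.

```python
import time


class TransientTransferError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a network drop)."""


def transfer_with_retries(transfer, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call transfer() until it succeeds or max_attempts is reached.

    The delay doubles after each failed attempt (exponential backoff).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transfer()
        except TransientTransferError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting.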
17. Early adoption is encouraging
>12,000 registered users; >150 daily
>27 PB moved; >1B files
10x (or better) performance vs. scp
99.9% availability
Entirely hosted on Amazon
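To put two of these figures in perspective, the short calculation below converts 99.9% availability into allowed downtime per year and derives a rough mean file size from the totals. It assumes a 365-day year and decimal units (1 PB = 10^9 MB); since the slide gives lower bounds (">27 PB", ">1B files"), the mean is only indicative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(availability):
    """Hours per year a service may be down at the given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def mean_file_size_mb(total_petabytes, file_count):
    """Average file size in MB, using decimal units (1 PB = 10**9 MB)."""
    return total_petabytes * 10**9 / file_count


print(round(allowed_downtime_hours(0.999), 2))  # 8.76
print(mean_file_size_mb(27, 10**9))             # 27.0
```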
18. Amazon web services used
• Amazon EC2 for hosting Globus services
• Elastic Load Balancing to use multiple
Availability Zones for reliability and uptime
• Amazon S3 to store historical state
• Amazon RDS PostgreSQL for active state
26. The identity challenge in science
• Research communities often need to
– Assign identities to their users
– Manage user profiles
– Organize users into groups for authorization
• Obstacles to high-quality implementations
– Complexity of associated security protocols
– Creation of identity silos
– Multiple credentials for users
– Reliability, availability, scalability, security
27. Nexus provides four key capabilities
• Identity provisioning
– Create, manage Globus identities
• Identity hub
– Link with other identities; use
to authenticate to services
• Group hub
– User-managed groups; groups can
be used for authorization
• Profile management
– User-managed attributes;
can use in group admission
Key points:
1) Outsource identity, group, profile management
2) REST API for flexible integration
3) Intuitive, customizable Web interfaces
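The identity hub and group hub ideas can be sketched together: several credentials resolve to one Globus identity, and a service authorizes by group membership. The `IdentityHub` class below is a hypothetical in-memory model for illustration only; it does not reflect the Nexus REST API.

```python
class IdentityHub:
    """Hypothetical in-memory model of linked identities and user-managed groups."""

    def __init__(self):
        self._linked = {}   # external identity -> Globus identity
        self._groups = {}   # group name -> set of Globus identities

    def link(self, globus_id, external_id):
        # Identity hub: several credentials map onto one Globus identity.
        self._linked[external_id] = globus_id

    def resolve(self, identity):
        # An unlinked identity resolves to itself.
        return self._linked.get(identity, identity)

    def add_member(self, group, identity):
        # Group hub: membership is stored against the resolved identity.
        self._groups.setdefault(group, set()).add(self.resolve(identity))

    def is_authorized(self, identity, group):
        # A service authorizes by group, whichever credential was used to log in.
        return self.resolve(identity) in self._groups.get(group, set())
```

In this model a user added to a group under one linked credential is authorized under any of the others, which is the point of avoiding identity silos.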
28. Branded sites
XSEDE
Open Science Grid
University of Chicago
DOE kBase
Indiana University
University of Exeter
Globus Online
NERSC
NIH BIRN
34. Dataset Services
Sharing Service
Transfer Service
Globus Nexus
(Identity, Group, Profile)
Globus Toolkit
Globus Connect
Globus Online
APIs
35. We are adding capabilities
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts
metadata, catalogs, converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically
extracted metadata
• Computation
– Associate computational procedures, orchestrate application,
catalog results, record provenance
36. Next Gen Sequencing Analysis for Everyone –
No IT Required
Ravi K Madduri, The University of Chicago and Argonne National Laboratory
November 14, 2013
38. Outline
• Globus Vision
• Challenges in Sequencing Analysis
– Big Data Management
– Analysis at Scale
– Reproducibility
• Proposed Approach Using Globus Genomics
• Example Collaborations
• Q&A
39. Globus Vision
Goal: Accelerate discovery and innovation worldwide
by providing research IT as a service
Leverage software-as-a-service to:
– provide millions of researchers with unprecedented access to
powerful tools for managing Big Data
– reduce research IT costs dramatically via economies of scale
“Civilization advances by extending the number of important
operations which we can perform without thinking of them”
(Alfred North Whitehead, 1911)
40. Challenges in Sequencing Analysis
Data Movement and Access Challenges
• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
– Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
– Local clusters, cloud, grid
How do we analyze this Sequence Data?
Manual Data Analysis
• Manually move the data to the compute node
• Install all the tools required for the analysis
– BWA, Picard, GATK, Filtering Scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
– Error prone, difficult to keep track, messy
• Difficult to maintain and transfer the knowledge
[Diagram: Public Data, Sequencing Centers, Seq Center, Storage, Research Lab, Local Cluster/Cloud; inputs: Fastq, Ref Genome; manual steps: Install, Modify, (Re)Run Script; tools and stages: Picard, Alignment, GATK, Variant Calling]
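One way to see why hand-edited shell scripts become hard to maintain is to contrast them with a pipeline declared as data. The sketch below is an assumption-laden illustration, not the Globus Genomics or Galaxy implementation: the stage names come from the slide, but `run_pipeline`, `align`, and `call_variants` are hypothetical placeholders that invoke no real tool.

```python
def run_pipeline(steps, data):
    """Run (name, function) steps in order, feeding each output to the next.

    Returns the final output and a log of completed step names, so a failed
    run shows exactly where it stopped.
    """
    completed = []
    for name, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            raise RuntimeError(f"step {name!r} failed after {completed}") from exc
        completed.append(name)
    return data, completed


# Placeholder stages: each only tags its input; no real tool is invoked.
def align(reads):
    return f"aligned({reads})"


def call_variants(alignment):
    return f"variants({alignment})"


PIPELINE = [("alignment", align), ("variant calling", call_variants)]
```

Changing the analysis then means editing the `PIPELINE` list, not rewriting a script, and the log of completed steps records where a failed run stopped.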
41. Globus Genomics
• Globus provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools
• Analytical tools are automatically run on the scalable compute resources when possible
[Diagram: Public Data, Sequencing Centers, Seq Center, Research Lab, Storage, and Local Cluster/Cloud connected to Globus Genomics (Galaxy-based workflow management system, Galaxy Data Libraries) on Amazon EC2; panel labels: Data Management, Data Analysis]
45. Globus Genomics
• Computational profiles for
various analysis tools
• Resources can be provisioned
on demand with Amazon Web
Services cloud-based
infrastructure
• GlusterFS as a shared file
system between head nodes
and compute nodes
• Provisioned I/O on Amazon EBS
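A computational profile can be thought of as the resources a tool needs, which then drives what gets provisioned on demand. The sketch below shows one plausible selection rule (cheapest node type that fits); every profile, node type, and price in it is invented for illustration and does not describe real Globus Genomics settings or Amazon EC2 instance types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    cores: int
    memory_gb: int


@dataclass(frozen=True)
class NodeType:
    name: str
    cores: int
    memory_gb: int
    hourly_cost: float


# All figures below are invented for illustration.
PROFILES = {
    "filtering": Profile(cores=2, memory_gb=4),
    "alignment": Profile(cores=8, memory_gb=16),
    "variant calling": Profile(cores=16, memory_gb=48),
}
NODE_TYPES = [
    NodeType("small", 4, 16, 0.25),
    NodeType("medium", 8, 32, 0.50),
    NodeType("large", 16, 64, 1.00),
]


def pick_node(tool):
    """Return the cheapest node type that satisfies the tool's profile."""
    need = PROFILES[tool]
    fits = [n for n in NODE_TYPES
            if n.cores >= need.cores and n.memory_gb >= need.memory_gb]
    if not fits:
        raise ValueError(f"no node type fits {tool!r}")
    return min(fits, key=lambda n: n.hourly_cost)
```

For example, `pick_node("alignment")` skips the hypothetical "small" type because it has too few cores and returns "medium".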
46. Coming soon!
• Integration with Globus Catalog
– Better data discovery and metadata management
• Integration with Globus Sharing
– Easy and secure method to share large datasets with collaborators
• Integration with Amazon Glacier for data archiving
• Support for high throughput computational
modalities through Apache Mesos
– MapReduce and MPI clusters
• Dynamic Storage Strategies using Amazon S3 or
LVM-based shared file system
48. Our vision for a 21st century
discovery infrastructure
Provide more capability for
more people at lower cost by
building a “Discovery Cloud”
Delivering “Science as a service”
50. For more information
• More information on Globus Genomics and to
sign up: www.globus.org/genomics
• More information on Globus:
www.globusonline.org
• Follow us on Twitter: @ianfoster, @madduri,
@globusgenomics, @globusonline