This presentation is by Ian Foster, director of the Computation Institute at The University of Chicago. It was given at the Great Plains Network Annual Meeting, on May 29, 2013.
For more information on Globus Online, visit globusonline.org.
"What would a Dropbox for science look like?" asks Foster. "It should be trivial to collect, move, sync, share, analyze, annotate, publish, search, backup, and archive Big Data. But in reality it's often very challenging."
Globus Online, a software-as-a-service (SaaS) offering for data management, solves these problems. This slideshow explains how Globus Online does that for universities and laboratories around the world.
4. computationinstitute.org
www.globusonline.org
[Diagram: a research data workflow with the components Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, and Mirror. Error callouts on the diagram: "Quota exceeded!", "Expired credentials!", "Network failed. Retry.", "Permission denied!"]
It should be trivial to Collect, Move, Sync, Share, Analyze,
Annotate, Publish, Search, Backup, & Archive BIG DATA
… but in reality it’s often very challenging
Data Source
(1) User A selects file(s) to share; selects user/group, sets share permissions
(2) Globus Online tracks shared files; no need to move files to cloud storage!
(3) User B logs in to Globus Online and accesses shared file
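The three sharing steps on this slide can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the Globus Online implementation or its API: the names `Share`, `ShareRegistry`, `define_group`, `share`, and `can_access` are hypothetical, as are the example user, group, and path names. The point it shows is that the service records who may access which path on the data source, while the files themselves stay where they are.

```python
from dataclasses import dataclass, field


@dataclass
class Share:
    """A record that a path on a data source is shared (hypothetical model)."""
    owner: str
    data_source: str
    path: str
    # Maps a user or group name to the set of permissions granted to it.
    grants: dict = field(default_factory=dict)


class ShareRegistry:
    """Tracks shares by reference only; file contents are never copied here."""

    def __init__(self):
        self._shares = []
        self._groups = {}

    def define_group(self, group, members):
        """Register a group and its member users."""
        self._groups[group] = set(members)

    def share(self, owner, data_source, path, principal, permissions):
        """Step 1: the owner picks a path, a user or group, and permissions."""
        entry = Share(owner, data_source, path, {principal: set(permissions)})
        # Step 2: only the share record is stored, not the file.
        self._shares.append(entry)
        return entry

    def can_access(self, user, data_source, path, permission="read"):
        """Step 3: check whether a logged-in user may access a shared path."""
        # A user matches grants made to them directly or to any of their groups.
        principals = {user} | {
            group for group, members in self._groups.items() if user in members
        }
        for entry in self._shares:
            if (entry.data_source, entry.path) != (data_source, path):
                continue
            if user == entry.owner:
                return True
            for principal in principals:
                if permission in entry.grants.get(principal, set()):
                    return True
        return False


registry = ShareRegistry()
registry.define_group("beamline-team", {"user_b"})
registry.share("user_a", "lab-storage", "/data/run42.h5", "beamline-team", {"read"})
print(registry.can_access("user_b", "lab-storage", "/data/run42.h5"))
```

In this sketch User B gains read access through group membership, and a user outside the group is refused, which mirrors the "selects user/group, sets share permissions" step.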
We are also adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus
(Identity, Group, Profile)
Globus Online APIs
Globus Connect
We are also adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Dataset Services
Globus Nexus
(Identity, Group, Profile)
Globus Online APIs
Globus Connect
Expanding Globus Online services
• Ingest and publication
– Imagine a Dropbox that not only replicates, but
also extracts metadata, catalogs, converts
• Cataloging
– Virtual views of data based on user-defined
and/or automatically extracted metadata
• Computation
– Associate computational procedures,
orchestrate application, catalog results, record
provenance
Builds on catalog as a service
Approach
• Hosted user-defined catalogs
• Based on tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs
/query/
• Retrieve subjects
/tags/
• Create, delete, retrieve tags
/tagdef/
• Create, delete, retrieve tag definitions
Builds on USC Tagfiler project (C. Kesselman et al.)
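To make the tag model concrete, here is a minimal in-memory sketch of a catalog built on <subject, name, value> triples. It is an illustration under assumptions, not the catalog service or the Tagfiler API: the slide names the /query/, /tags/, and /tagdef/ endpoints but not their request formats, so the class `TagCatalog` and its methods are hypothetical stand-ins for those three operations. The sketch also assumes that "optional schema constraints" means a tag name may, but need not, have a definition that restricts its value type.

```python
class TagCatalog:
    """Hypothetical in-memory catalog of <subject, name, value> tags."""

    def __init__(self):
        self._tagdefs = {}  # tag name -> required Python type
        self._tags = {}     # subject -> {tag name: value}

    # Operations analogous to /tagdef/: create, delete, retrieve definitions.
    def create_tagdef(self, name, value_type):
        self._tagdefs[name] = value_type

    def delete_tagdef(self, name):
        self._tagdefs.pop(name, None)

    def get_tagdef(self, name):
        return self._tagdefs.get(name)

    # Operations analogous to /tags/: create, delete, retrieve tags.
    def set_tag(self, subject, name, value):
        # The schema constraint is optional: it applies only if a tagdef exists.
        required = self._tagdefs.get(name)
        if required is not None and not isinstance(value, required):
            raise TypeError(f"tag {name!r} requires {required.__name__}")
        self._tags.setdefault(subject, {})[name] = value

    def delete_tag(self, subject, name):
        self._tags.get(subject, {}).pop(name, None)

    def get_tags(self, subject):
        return dict(self._tags.get(subject, {}))

    # Operation analogous to /query/: retrieve subjects matching all criteria.
    def query(self, criteria):
        return sorted(
            subject
            for subject, tags in self._tags.items()
            if all(tags.get(name) == value for name, value in criteria.items())
        )


catalog = TagCatalog()
catalog.create_tagdef("temperature_k", float)
catalog.set_tag("run-001.h5", "instrument", "APS")
catalog.set_tag("run-001.h5", "temperature_k", 293.0)
catalog.set_tag("run-002.h5", "instrument", "APS")
print(catalog.query({"instrument": "APS"}))
```

A query over tags gives the kind of "virtual view" of data that the cataloging bullet describes: subjects are selected by their metadata rather than by where they are stored.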
Provider Plans offer…
• Provider endpoints with sharing
• Multiple GridFTP servers per endpoint
• Branded web sites
• Alternate identity provider
• Usage reporting
• MSS optimizations
• Operations monitoring and management
• Input into and access to product roadmap
Starting at $20k per year
Thanks to great colleagues
and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle
Chard, Raj Kettimuthu, Ravi Madduri, Tanu
Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler,
and others at USC/ISI
• Birali Runesha and others at UChicago
Research Computing Center
Here are some of the areas where we have active projects. Focus on areas of particular interest to I2/ESnet, namely HEP, climate change, genomics (up and coming).
Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
So how would such a Dropbox for science be used? Let’s look at a very typical scientific data workflow. Data is generated by some instrument (a sequencer at JGI or a light source like APS/ALS). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user. So the data is typically moved from a staging area to some type of ingest store. Etcetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data… and not just the 2 GB of PowerPoints, or the 100 GB of family photos and videos, but the petabytes and exabytes of data that will soon be the norm for many.
Started with the seemingly simple/mundane task of transferring files …etc.