Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs greatly reduce the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and hundreds of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
1. Crescat scientia; vita excolatur
A Global Research Data Platform:
How Globus Services Enable Scientific Discovery
Ian Foster
The University of Chicago
Argonne National Laboratory
foster@uchicago.edu, @ianfoster
3. Three lessons learned operating Globus
A carefully designed research data platform can advance scientific
discovery by reducing data friction
Public cloud can be used to facilitate the development and delivery
of the persistent services required to implement such a platform
A hybrid free/subscription-based support model can enable
sustainable operation of such services
6. Data friction is a frequent, and often fatal, impediment
to efficient communication and collaboration
We aim to eliminate friction in:
access
movement
sharing
discovery
analysis
publication
1919 Motor Transport Corps convoy
Washington, D.C., to San Francisco
56 days, average speed of 9 km/h
8. Friction: Many data locations
Need: Easy access to data regardless of location
Solution: Easily deployed Globus Connect agent
Enables: Access data anywhere
39,000 active endpoints
9. Friction: Many access methods
Need: Uniform interface, consistent UX across storage systems
Solution: Modular, high-performance Data Storage Interface
Enables: High-speed, high-functionality data access
POSIX file systems
Parallel file systems
Object stores
Etc.
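The idea of one uniform interface over several kinds of storage can be sketched as follows. This is a minimal illustration, not the Globus Data Storage Interface itself: the class names `StorageBackend`, `PosixBackend`, and `InMemoryObjectStore`, and the helper `copy_all`, are all hypothetical.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Hypothetical uniform interface that every storage backend implements."""

    @abstractmethod
    def read(self, key: str) -> bytes:
        """Return the bytes stored under key."""

    @abstractmethod
    def write(self, key: str, data: bytes) -> None:
        """Store data under key, replacing any existing content."""

    @abstractmethod
    def list_keys(self) -> list:
        """Return all keys in sorted order."""


class PosixBackend(StorageBackend):
    """Backend that maps keys to files in one directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        (self.root / key).write_bytes(data)

    def list_keys(self) -> list:
        return sorted(p.name for p in self.root.iterdir() if p.is_file())


class InMemoryObjectStore(StorageBackend):
    """Stand-in for an object store: a flat mapping from keys to bytes."""

    def __init__(self) -> None:
        self._objects = {}

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def list_keys(self) -> list:
        return sorted(self._objects)


def copy_all(src: StorageBackend, dst: StorageBackend) -> int:
    """Copy every object from src to dst using only the shared interface."""
    count = 0
    for key in src.list_keys():
        dst.write(key, src.read(key))
        count += 1
    return count
```

Because `copy_all` touches only the abstract interface, supporting another kind of storage means adding one more backend class, with no change to the calling code.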
10. Friction: Slow data movement
Need: Rapid end-to-end data movement regardless of file size(s), network, storage system, etc.
Solution: Extensible architecture with optimizations for many workloads, networks, storage systems
Enables: Multi-GB/s transfers
HPC-HPC
HPC-Cloud
Cloud-Cloud
arXiv:2009.03190
A=ALCF, N=NERSC, O=OLCF
11. Friction: Faults
Need: Robust execution despite transient failures
Solution: Redundant deployment, checksums …
Globus and XSEDE, 2011-2022
81 PB transferred on XSEDE
1,456,199 successful transfers
12,715 TB+ transfers
397 TB largest transfer
7,500 endpoints involved in XSEDE transfers
554 endpoints w/ TB+ XSEDE transfers
1 in 17 transfers overcame a transient fault
1 in 180 transfers overcame a checksum error
41,210 people logged into XSEDE apps w/ Globus Auth
5,995 people transferred data to/from XSEDE
952 people completed TB+ transfers
50 people/month 1st transfer on XSEDE
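How a transfer can overcome both transient faults and checksum errors can be sketched with a retry loop that verifies each attempt. This is a generic illustration under stated assumptions, not the Globus implementation: `TransientError`, `reliable_transfer`, and the `fetch` callable are hypothetical names, and SHA-256 stands in for whatever checksum a real service uses.

```python
import hashlib
import time
from typing import Callable, Tuple


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying, e.g. a dropped connection."""


def sha256(data: bytes) -> str:
    """Hex digest used to verify that data arrived intact."""
    return hashlib.sha256(data).hexdigest()


def reliable_transfer(
    fetch: Callable[[], bytes],
    expected_sha256: str,
    max_attempts: int = 5,
    base_delay: float = 0.0,
) -> Tuple[bytes, int]:
    """Call fetch() until it returns data whose checksum matches.

    Returns (data, attempts_used). Raises RuntimeError if every attempt
    either raised TransientError or produced a checksum mismatch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch()
        except TransientError:
            data = None  # transient fault: fall through and retry
        if data is not None and sha256(data) == expected_sha256:
            return data, attempt
        if attempt < max_attempts:
            # Exponential backoff between attempts (no wait if base_delay is 0).
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"transfer failed after {max_attempts} attempts")
```

A caller supplies the expected checksum computed at the source; an attempt that raises or returns corrupted bytes is simply repeated, so the user sees one successful transfer.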
12. Challenge: Replicate 7.4 petabytes of climate data to ANL, ORNL
Solution: Use Globus to copy data over ESnet
4 to 6 GB/s
1.5 GB/s
https://dashboard.globus.org/esgf As of March 10, 2022
Note: Dashboard above assumed 8.8 PB
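To put the rates on this slide in perspective, the arithmetic below estimates how long 7.4 PB would take at a sustained 1.5, 4, or 6 GB/s. It is a back-of-the-envelope sketch that assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and a constant rate; the slide does not say which rate applied to which site or period.

```python
SECONDS_PER_DAY = 86_400
PB = 1e15  # bytes, decimal petabyte (assumption)
GB = 1e9   # bytes, decimal gigabyte (assumption)


def transfer_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    for rate_gb_s in (1.5, 4.0, 6.0):
        days = transfer_days(7.4 * PB, rate_gb_s * GB)
        print(f"{rate_gb_s} GB/s: {days:.1f} days")
```

Under these assumptions the replication takes roughly 57 days at 1.5 GB/s, 21 days at 4 GB/s, and 14 days at 6 GB/s.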
14. Friction: Manual operations
Need: Interactive and programmatic control
Solution: Web interfaces, APIs, command line interfaces define the Globus platform
Enables: Fire-and-forget transfers, portals
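# Assumes tc is an authenticated globus_sdk.TransferClient, and that
# host_id, source_path, and user_id have been set by the caller
import uuid
from globus_sdk import TransferData
# (1) Create shared endpoint:
# Create directory to be shared
share_path = '/~/' + str(uuid.uuid4()) + '/'
tc.operation_mkdir(host_id, path=share_path)
# Create shared endpoint on directory
shared_ep_data = {
    'DATA_TYPE': 'shared_endpoint',
    'host_endpoint': host_id,
    'host_path': share_path,
    'display_name': 'Shared data'  # illustrative name, not from the slide
}
r = tc.create_shared_endpoint(shared_ep_data)
share_id = r['id']  # ID of the newly created shared endpoint
# (2) Copy data into the shared endpoint
tc.endpoint_autoactivate(share_id)
tdata = TransferData(tc, host_id, share_id)
tdata.add_item(source_path, '/', recursive=True)
r = tc.submit_transfer(tdata)
tc.task_wait(r['task_id'], timeout=1000)
# (3) Set access control on shared endpoint
# rule_data was not shown on the slide; this read-only rule is an assumption
rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'identity',
    'principal': user_id,
    'path': '/',
    'permissions': 'r'
}
tc.add_endpoint_acl_rule(share_id, rule_data)
# (4) Ultimately, delete the shared endpoint
tc.delete_endpoint(share_id)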
15. Dark Energy Science Collaboration (DESC)
Data Challenge 2:
• Simulate 300 square degrees of sky over 5 years
• 5 TB simulated data
• ~90M core hours at ALCF and NERSC
Globus-based portal makes data accessible to DESC participants
16. Friction: Complex data pipelines
Solution: Globus Automate for definition, execution, sharing of flows
Example: Integrate data from many institutional repositories
Auth, Search, Transfer, Compute
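The notion of a flow as a named sequence of service steps can be sketched generically. The code below is not the Globus Automate API: `run_flow` and the four stub steps are hypothetical, and each stub only mimics the role of the Auth, Search, Transfer, and Compute services on a plain dictionary.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Step = Callable[[State], State]


def run_flow(steps: List[Tuple[str, Step]], state: State) -> Tuple[State, List[str]]:
    """Run steps in order, passing each step's output state to the next.

    Returns the final state and the names of the steps that completed.
    """
    completed: List[str] = []
    for name, step in steps:
        state = step(state)
        completed.append(name)
    return state, completed


# Hypothetical stand-ins for the four services named on the slide.
def authenticate(state: State) -> State:
    return {**state, "token": "placeholder-token"}


def search(state: State) -> State:
    hits = [path for path in state["catalog"] if state["query"] in path]
    return {**state, "hits": hits}


def transfer(state: State) -> State:
    return {**state, "staged": list(state["hits"])}


def compute(state: State) -> State:
    return {**state, "result": len(state["staged"])}


FLOW = [
    ("auth", authenticate),
    ("search", search),
    ("transfer", transfer),
    ("compute", compute),
]
```

Since the flow is plain data (an ordered list of named steps), it can be stored, shared, and rerun with different inputs, which is the property the slide highlights.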
19. Friction: Nowhere to put things!
Need: On-demand access to storage for caching, staging, distribution
Solution: Data stores implement Globus APIs for creating, managing, and accessing virtual stores (“collections”) and metadata catalogs
Examples: Argonne Petrel (6 PB) and Eagle (100 PB)
20. Friction: Repeated authentication
Need: Automated operations on resources with different authentication & authorization requirements
Solution: Protocols and APIs for identity linking, authentication, credential refresh, delegation
Numbers: 300,000 registered users; 1,500 identity providers
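Credential refresh is what lets an automated operation keep running without asking the user to log in again. The sketch below shows the general pattern only and is not Globus Auth code: `Token`, `RefreshingToken`, and the `refresh` callable are hypothetical, and the 60-second safety margin is an arbitrary choice.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # seconds since the epoch


class RefreshingToken:
    """Hand out a valid access token, refreshing it shortly before expiry."""

    def __init__(
        self,
        refresh: Callable[[], Token],
        clock: Callable[[], float] = time.time,
        margin: float = 60.0,
    ) -> None:
        self._refresh = refresh   # obtains a new token, e.g. from an auth server
        self._clock = clock       # injectable so the logic can be tested
        self._margin = margin     # refresh this many seconds before expiry
        self._token: Optional[Token] = None

    def get(self) -> str:
        """Return a token value that is valid for at least `margin` more seconds."""
        expired = (
            self._token is None
            or self._clock() >= self._token.expires_at - self._margin
        )
        if expired:
            self._token = self._refresh()
        return self._token.value
```

A long-running job calls `get()` before each request; the refresh happens only when the cached token is missing or about to expire.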
24. Three lessons learned operating Globus
A carefully designed research data platform can advance scientific
discovery by reducing data friction
Public cloud can be used to facilitate the development and delivery
of the persistent services required to implement such a platform
A hybrid free/subscription-based support model can enable
sustainable operation of such services
26. Challenge: Operate an ultra-reliable, scalable, secure service, with a small team
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
CloudFront
DynamoDB
Elastic Block Store (EBS)
Kinesis (Streams)
Redshift
RDS (MySQL, Oracle, Postgres)
Step Functions
Lambda
Route 53
Simple Email Service (SES)
Simple Notification Service (SNS)
Simple Queue Service (SQS)
CloudTrail
CloudWatch Logs
Identity & Access Management (IAM)
Key Management Service (KMS)
Elastic Load Balancing
Solution: Leverage a commercial cloud platform that is
designed and operated for that purpose
27. Cloud as platform: Lessons learned
Commercial cloud can be leveraged for science:
• As an elastic source of computing and storage capacity – sure
• As a cheap source of computing and storage capacity – maybe/not?
• As an immensely powerful platform for delivering scalable, reliable, and democratizing digital services – absolutely!
28. Three lessons learned operating Globus
A carefully designed research data platform can advance scientific
discovery by reducing data friction
Public cloud can be used to facilitate the development and delivery
of the persistent services required to implement such a platform
A hybrid free/subscription-based support model can enable
sustainable operation of such services
29. All software must be sustainable software
● A research tool is sustainable if its users can count on it being
maintained as fit for purpose over time
● Sustainability requires that, in the aggregate, resources equal or
exceed costs:
● Generally desirable that adding each new user:
○ is inexpensive (“low incremental costs”)
○ increases total resources (“positive returns to scale”)
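The three conditions on this slide can be written compactly. The notation is an assumption introduced here for illustration, not taken from the talk: let $R(n)$ be the total resources and $C(n)$ the total costs when the tool has $n$ users.

```latex
\underbrace{R(n) \ge C(n)}_{\text{sustainability}}
\qquad
\underbrace{C(n+1) - C(n) \ \text{is small}}_{\text{low incremental costs}}
\qquad
\underbrace{R(n+1) - R(n) > 0}_{\text{positive returns to scale}}
```

Read together, the last two conditions say that each added user should bring in more than it costs to serve, so the gap $R(n) - C(n)$ does not shrink as adoption grows.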
30. Globus achieves sustainability via a hybrid
free/subscription model
• Free tier enables broad use regardless of ability to pay
• Premium services available to subscribers incentivize subscriptions,
with subscription levels set based on institutional research budget as
approximation of benefit received and ability to pay
• Notes:
• As a not-for-profit service operated by the University of Chicago, goal is not
profit maximization but maximum value delivered to science
• Extensive cloud-based automation provides for low incremental costs
• Growing subscriber base provides for positive returns to scale
• Challenges
• Balancing goals of broad adoption vs. sustainability: free rider problem
31. Developing vs. sustaining the Globus platform
Developing:
• Creating the cloud-hosted Globus service
• New platform services: e.g., Flows for automation,
funcX for managed compute
• New features: e.g., cloud storage connectors, client-side
striping, personal health information
Sustaining:
• Maintenance and operations; user engagement for
support, training, security reviews, etc.
• Subscription contracts and agreements (e.g., BAAs)
• Feature enhancements; scaling to support more users,
bigger data; new storage adapters, Automate action
providers
180 subscribers
Universities, federal agencies,
national projects, research
institutes, health care systems,
commercial (4%),
international (20%)
34. Federated Research Data Repository
A national research data platform
• Ingest, curation, preservation
• Discovery, citation, sharing
Uses Globus Services:
• Auth for authentication
• Transfer to repository service
• Search for integrated metadata
catalog (includes metadata from
70 other repositories)
36. Rate (color) vs. distance and size for 4M transfers (35B files, 280 PB) thru 2017
1,921 transfers from Argonne Advanced Photon Source server are highlighted
37. Thanks to the team! … and our funders …
… and our subscribers
38. Globus services enabling scientific discovery
Unified Data Access; Data Transfer and Sharing; Platform-as-a-Service
Reliable Automation; Publication & Discovery; Remote Execution (future)
Globus is a powerful research data services platform,
with hybrid SaaS architecture and hybrid free/subscription-based sustainability,
and with a footprint that encompasses 1600+ institutions and 300,000+ users
39. Implications for a National Science Data Fabric?
Globus reaches 300,000+ users at 1600+ institutions
These institutions can source and sink data at increasingly high rates, ever
more easily (thanks to networks, Science DMZs, and Globus)
Perhaps their most urgent need is for
programmatically accessible storage
for data caching, staging, and distribution
For more information:
● https://globus.org
● https://docs.globus.org/modern-research-data-portal/
● https://docs.globus.org/globus-automation-services/
Ian Foster
foster@uchicago.edu
@ianfoster
labs.globus.org