Crescat scientia; vita excolatur
A Global Research Data Platform:
How Globus Services Enable Scientific Discovery
Ian Foster
The University of Chicago
Argonne National Laboratory
foster@uchicago.edu, @ianfoster
Globus platform
2010-
Operated by UChicago for researchers worldwide
Made possible by the support of 180+ subscribers
Three lessons learned operating Globus
A carefully designed research data platform can advance scientific
discovery by reducing data friction
Public cloud can be used to facilitate the development and delivery
of the persistent services required to implement such a platform
A hybrid free/subscription-based support model can enable
sustainable operation of such services
Data friction is a frequent, and often fatal, impediment
to efficient communication and collaboration
We aim to eliminate friction in:
access
movement
sharing
discovery
analysis
publication
1919 Motor Transport Corps convoy
Washington, D.C., to San Francisco
56 days, average speed of 9 km/h
The Globus hybrid “SaaS” model
Friction: Many data locations
Need: Easy access to data regardless of location
Solution: Easily deployed Globus Connect agent
Enables: Access data anywhere
39,000 active endpoints
Friction: Many access methods
Need: Uniform interface, consistent UX across storage systems
Solution: Modular, high-performance Data Storage Interface
Enables: High-speed, high-functionality data access
POSIX file systems
Parallel file systems
Object stores
Etc.
Friction: Slow data movement
Need: Rapid end-to-end data movement regardless of file size(s), network, storage system, etc.
Solution: Extensible architecture with optimizations for many workloads, networks, storage systems
Enables: Multi-GB/s transfers
HPC-HPC
HPC-Cloud
Cloud-Cloud
arXiv:2009.03190
A=ALCF, N=NERSC, O=OLCF
81 PB transferred on XSEDE
1,456,199 successful transfers
12,715 TB+ transfers
397 TB largest transfer
7,500 endpoints involved in XSEDE transfers
554 endpoints w/ TB+ XSEDE transfers
1 in 17 transfers overcame a transient fault
1 in 180 transfers overcame a checksum error
41,210 people logged into XSEDE apps w/ Globus Auth
5,995 people transferred data to/from XSEDE
952 people completed TB+ transfers
50 people/month 1st transfer on XSEDE
Friction: Faults
Need: Robust execution despite transient failures
Solution: Redundant deployment, checksums …
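A minimal sketch of the checksum-and-retry idea named above, in plain Python. It is not Globus code: the function names, the SHA-256 choice, and the retry limit are assumptions made for illustration only.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(src, dst, max_attempts=3, copy=shutil.copyfile):
    """Copy src to dst, retrying until the destination checksum matches.

    Returns the number of attempts used. The `copy` parameter is a
    hypothetical hook so that a faulty transport can be simulated.
    """
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        try:
            copy(src, dst)
        except OSError:
            continue  # treat I/O errors as transient and try again
        if sha256_of(dst) == expected:
            return attempt  # integrity confirmed
        # checksum mismatch: fall through and retry the copy
    raise RuntimeError(f"could not copy {src} intact in {max_attempts} attempts")
```

The caller sees one successful copy even when an attempt fails or corrupts data, which is the behavior the "transfers overcame a transient fault / checksum error" counts describe at the level of whole transfers.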
Globus and XSEDE, 2011-2022
https://dashboard.globus.org/esgf As of March 10, 2022
4 to 6 GB/s
1.5 GB/s
Challenge: Replicate 7.4 petabytes of climate data to ANL, ORNL
Solution: Use Globus to copy data over ESnet
Note: Dashboard above assumed 8.8 PB
Figure annotations: Misconfigured GPFS @ LLNL; ALCF maintenance
Figure series: LLNL→ALCF, LLNL→OLCF, ALCF→OLCF, OLCF→ALCF
17,347,671 directories and 28,907,532 files
February 12 to May 4, 2022
Sustained rate 1.45 GB/s (LLNL rate limit)
Peak 7.5 GB/s (OLCF→ALCF)
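A rough consistency check on the figures above, as a short Python sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and treats the 7.4 PB as a single stream leaving LLNL; neither assumption is stated on the slides.

```python
from datetime import date

PB = 10**15  # assumed decimal petabyte
GB = 10**9   # assumed decimal gigabyte

total_bytes = 7.4 * PB
sustained_rate = 1.45 * GB            # bytes/s, the LLNL rate limit
start, end = date(2022, 2, 12), date(2022, 5, 4)

# Time needed if the sustained rate were held without interruption
ideal_days = total_bytes / sustained_rate / 86_400

# Calendar time actually spanned, and the mean rate that implies
elapsed_days = (end - start).days
mean_rate = total_bytes / (elapsed_days * 86_400)

print(f"{ideal_days:.1f} days at the sustained rate")
print(f"{elapsed_days} days elapsed, mean {mean_rate / GB:.2f} GB/s")
# prints:
# 59.1 days at the sustained rate
# 81 days elapsed, mean 1.06 GB/s
```

Under these assumptions, the gap between roughly 59 ideal days and 81 elapsed days is consistent with interruptions such as the misconfiguration and maintenance events marked on the figure.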
Friction: Manual operations
Need: Interactive and programmatic control
Solution: Web interfaces, APIs, command line interfaces define the Globus platform
Enables: Fire-and-forget transfers, portals
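# Assumes: tc is an authorized globus_sdk.TransferClient, and host_id,
# source_path, and user_id (hypothetical name) are defined by the caller.
import uuid
from globus_sdk import TransferData

# (1) Create shared endpoint:
# Create directory to be shared
share_path = '/~/' + str(uuid.uuid4()) + '/'
tc.operation_mkdir(host_id, path=share_path)
# Create shared endpoint on directory
# (other fields, such as a display name, may be required; the slide omits them)
shared_ep_data = {
    'DATA_TYPE': 'shared_endpoint',
    'host_endpoint': host_id,
    'host_path': share_path
}
r = tc.create_shared_endpoint(shared_ep_data)
share_id = r['id']  # ID of the newly created shared endpoint
# (2) Copy data into the shared endpoint
tc.endpoint_autoactivate(share_id)
tdata = TransferData(tc, host_id, share_id)
tdata.add_item(source_path, '/', recursive=True)
r = tc.submit_transfer(tdata)
tc.task_wait(r['task_id'], timeout=1000)
# (3) Set access control on shared endpoint
# (example rule: read-only access for one identity)
rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'identity',
    'principal': user_id,
    'path': '/',
    'permissions': 'r'
}
tc.add_endpoint_acl_rule(share_id, rule_data)
# (4) Ultimately, delete the shared endpoint
tc.delete_endpoint(share_id)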
Dark Energy Science
Collaboration (DESC)
Data Challenge 2:
• Simulate 300 sq degrees
of sky over 5 years
• 5 TB simulated data
• ~90M core hours
at ALCF and NERSC
Globus-based portal makes data accessible to
DESC participants
Friction: Complex data pipelines
Solution: Globus Automate for definition, execution, sharing of flows
Example: Integrate data from many institutional repositories
Auth Search Transfer Compute
Globus-provided flows
User-built flows
arXiv:2204.05128v1
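To make the notion of a flow concrete, here is a toy sketch in plain Python: a flow is an ordered list of named steps that pass a state dictionary along. Everything in it (run_flow, the step functions, the /staging/ path) is hypothetical and is not the Globus Automate API; real flows invoke Globus services such as Search, Transfer, and Compute.

```python
def run_flow(steps, state):
    """Run (name, action) steps in order, threading a state dict through."""
    log = []
    for name, action in steps:
        state = action(dict(state))  # each step works on a copy of the state
        log.append(name)
    return state, log


def search(state):
    # Stand-in for a metadata search: select the CSV entries
    state["selected"] = [f for f in state["catalog"] if f.endswith(".csv")]
    return state


def transfer(state):
    # Stand-in for a transfer: record where each file would be staged
    state["staged"] = ["/staging/" + f for f in state["selected"]]
    return state


def compute(state):
    # Stand-in for an analysis step: count the staged files
    state["result"] = len(state["staged"])
    return state


flow = [("search", search), ("transfer", transfer), ("compute", compute)]
final_state, log = run_flow(flow, {"catalog": ["a.csv", "b.txt", "c.csv"]})
print(log, final_state["result"])
```

Keeping the definition (the list of steps) separate from the runner is what allows a flow to be shared and re-executed with different inputs.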
Friction: Nowhere to put things!
Need: On-demand access to storage for caching, staging, distribution
Solution: Data stores implement Globus APIs for creating, managing,
and accessing virtual stores (“collections”) and metadata catalogs
Examples:
Argonne Petrel (6 PB)
and Eagle (100 PB)
Friction: Repeated authentication
Need: Automated operations on resources with different authentication & authorization requirements
Solution: Protocols and APIs for identity linking, authentication, credential refresh, delegation
Numbers: 300,000 registered users; 1,500 identity providers
As of 4/2022
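Credential refresh is what lets automated operations run without repeated logins. The sketch below shows the general pattern in plain Python: cache an access token and fetch a new one shortly before it expires. The class and its parameters are hypothetical and do not reproduce Globus Auth or the Globus SDK.

```python
import time


class RefreshingToken:
    """Cache an access token and refresh it shortly before expiry.

    fetch_token is a caller-supplied function returning
    (token, lifetime_seconds); clock and margin are injectable for testing.
    """

    def __init__(self, fetch_token, clock=time.time, margin=60):
        self._fetch = fetch_token
        self._clock = clock
        self._margin = margin
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh when there is no token yet or it is within the safety margin
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token
```

A long-running transfer or flow would call get() before each request, so the user authenticates once and later refreshes happen without interaction.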
Challenge: Operate an ultra-reliable, scalable, secure
service with a small team
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
CloudFront
DynamoDB
Elastic Block Store (EBS)
Kinesis (Streams)
Redshift
RDS (MySQL, Oracle, Postgres)
Step Functions
Lambda
Route 53
Simple Email Service (SES)
Simple Notification Service (SNS)
Simple Queue Service (SQS)
CloudTrail
CloudWatch Logs
Identity & Access Management (IAM)
Key Management Service (KMS)
Elastic Load Balancing
Solution: Leverage a commercial cloud platform that is
designed and operated for that purpose
Cloud as platform: Lessons learned
Commercial cloud can be leveraged for science:
• As an elastic source of computing and storage capacity – sure
• As a cheap source of computing and storage capacity –
maybe/not?
• As an immensely powerful platform for delivering scalable,
reliable, and democratizing digital services – absolutely!
All software must be sustainable software
● A research tool is sustainable if its users can count on it being
maintained as fit for purpose over time
● Sustainability requires that, in the aggregate, resources equal or
exceed costs:
● Generally desirable that adding each new user:
○ is inexpensive (“low incremental costs”)
○ increases total resources (“positive returns to scale”)
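One possible way to write the sustainability condition above as a formula. The symbols are assumptions introduced here, not taken from the slides: $n$ users, resources $r_i$ contributed by user $i$, a fixed operating cost $F$, and an incremental cost $c_i$ of serving user $i$.

```latex
% Aggregate sustainability: total resources cover total costs
\[
  \sum_{i=1}^{n} r_i \;\ge\; F + \sum_{i=1}^{n} c_i
\]
% Adding user n+1 is "inexpensive" when c_{n+1} is small, and the
% surplus grows (positive returns to scale) when
\[
  r_{n+1} > c_{n+1}
\]
```

In this reading a user who contributes nothing directly is still affordable as long as the aggregate inequality holds, which is the balance a hybrid free/subscription model has to maintain.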
Globus achieves sustainability via a hybrid
free/subscription model
• Free tier enables broad use regardless of ability to pay
• Premium services available to subscribers incentivize subscriptions,
with subscription levels set based on institutional research budget as
approximation of benefit received and ability to pay
• Notes:
• As a not-for-profit service operated by the University of Chicago, goal is not
profit maximization but maximum value delivered to science
• Extensive cloud-based automation provides for low incremental costs
• Growing subscriber base provides for positive returns to scale
• Challenges
• Balancing goals of broad adoption vs. sustainability: free rider problem
Developing vs. sustaining the Globus platform
Developing:
• Creating the cloud-hosted Globus service
• New platform services: e.g., Flows for automation,
funcX for managed compute
• New features: e.g., cloud storage connectors, client-side
striping, personal health information
Sustaining:
• Maintenance and operations; user engagement for
support, training, security reviews, etc.
• Subscription contracts and agreements (e.g., BAAs)
• Feature enhancements; scaling to support more users,
bigger data; new storage adapters, Automate action
providers
180 subscribers
Universities, federal agencies, national projects, research institutes, health care systems, commercial (4%), international (20%)
Example premium service: management console
National cyberinfrastructure adoption
Federated Research Data Repository
A national research data platform
• Ingest, curation, preservation
• Discovery, citation, sharing
Uses Globus Services:
• Auth for authentication
• Transfer to repository service
• Search for integrated metadata
catalog (includes metadata from
70 other repositories)
Geospatial search
Rate (color) vs. distance and size for 4M transfers (35B files, 280 PB) thru 2017
1,921 transfers from Argonne Advanced Photon Source server are highlighted
Thanks to the team! … and our funders …
… and our subscribers
Globus services enabling scientific discovery
Unified Data Access; Data Transfer and Sharing; Platform-as-a-Service
Reliable Automation; Publication & Discovery; Remote Execution (future)
Globus is a powerful research data services platform,
with hybrid SaaS architecture and hybrid free/subscription-based sustainability,
and with a footprint that encompasses 1600+ institutions and 300,000+ users
Implications for a National Science Data Fabric?
Globus reaches 300,000+ users at 1600+ institutions
These institutions can source and sink data at increasingly high rates, ever
more easily (thanks to networks, Science DMZs, and Globus)
Perhaps their most urgent need is for
programmatically accessible storage
for data caching, staging, and distribution
For more information:
● https://globus.org
● https://docs.globus.org/modern-research-data-portal/
● https://docs.globus.org/globus-automation-services/
Ian Foster
foster@uchicago.edu
@ianfoster
labs.globus.org

Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

A Global Research Data Platform: How Globus Services Enable Scientific Discovery

  • 1. Crescat scientia; vita excolatur A Global Research Data Platform: How Globus Services Enable Scientific Discovery Ian Foster The University of Chicago Argonne National Laboratory foster@uchicago.edu, @ianfoster
  • 2. Globus platform 2010- Operated by UChicago for researchers worldwide Made possible by the support of 180+ subscribers
  • 3. Three lessons learned operating Globus A carefully designed research data platform can advance scientific discovery by reducing data friction Public cloud can be used to facilitate the development and delivery of the persistent services required to implement such a platform A hybrid free/subscription-based support model can enable sustainable operation of such services
  • 4. Three lessons learned operating Globus A carefully designed research data platform can advance scientific discovery by reducing data friction Public cloud can be used to facilitate the development and delivery of the persistent services required to implement such a platform A hybrid free/subscription-based support model can enable sustainable operation of such services
  • 5.
  • 6. Data friction is a frequent, and often fatal, impediment to efficient communication and collaboration We aim to eliminate friction in: access movement sharing discovery analysis publication 1919 Motor Transport Corps convoy Washington, D.C., to San Francisco 56 days, average speed of 9 km/h
  • 7. The Globus hybrid “SaaS” model
  • 8. Friction: Many data locations Need: Easy access to data regardless of location Solution: Easily deployed Globus Connect agent Enables: Access data anywhere 39,000 active endpoints
  • 9. Friction: Many access methods Need: Uniform interface, consistent UX across storage systems Solution: Modular, high-performance Data Storage Interface Enables: High-speed, high-functionality data access POSIX file systems Parallel file systems Object stores Etc.
  • 10. Friction: Slow data movement Need: Rapid end-to-end data movement regardless of file size(s), network, storage system, etc. Solution: Extensible architecture with optimizations for many workloads, networks, storage systems Enables: Multi-GB/s transfers HPC-HPC HPC-Cloud Cloud-Cloud ArXiv:2009.03190 A=ALCF N=NERSC O=OLCF
  • 11. 81 PB transferred on XSEDE 1,456,199 successful transfers 12,715 TB+ transfers 397 TB largest transfer 7,500 endpoints involved in XSEDE transfers 554 endpoints w/ TB+ XSEDE transfers 1 in 17 transfers overcame a transient fault 1 in 180 transfers overcame a checksum error 41,210 people logged into XSEDE apps w/ Globus Auth 5,995 people transferred data to/from XSEDE 952 people completed TB+ transfers 50 people/month 1st transfer on XSEDE Friction: Faults Need: Robust execution despite transient failures Solution: Redundant deployment, checksums … Globus and XSEDE, 2011-2022
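Slide 11 names checksums and recovery from transient failures as the answer to faults. The following is a minimal local-file sketch of that general idea (verify a copy against a checksum and retry on failure), not the Globus implementation; the function names, the retry count, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(src: Path, dst: Path,
                           max_attempts: int = 3,
                           backoff_s: float = 0.0) -> int:
    """Copy src to dst, retrying on I/O errors or checksum mismatch.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        try:
            shutil.copyfile(src, dst)
            if sha256_of(dst) == expected:
                return attempt  # copy verified
        except OSError:
            pass  # treated here as a transient fault; fall through and retry
        time.sleep(backoff_s * attempt)  # simple linear backoff (hypothetical policy)
    raise RuntimeError(f"copy failed after {max_attempts} attempts")


def demo() -> int:
    """Copy a small temporary file and report how many attempts were needed."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source.dat"
        dst = Path(tmp) / "dest.dat"
        src.write_bytes(b"example payload" * 1000)
        return copy_with_verification(src, dst)
```

Calling `demo()` copies a small temporary file and returns the attempt count. A managed service would apply the same verify-then-retry loop per file across a network rather than on a local disk.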
  • 12. https://dashboard.globus.org/esgf As of March 10, 2022 4 to 6 GB/s 1.5 GB/s Challenge: Replicate 7.4 petabytes of climate data to ANL, ORNL Solution: Use Globus to copy data over ESnet Note: Dashboard above assumed 8.8 PB
  • 13. Misconfigured GPFS @ LLNL ALCF maintenance LLNL→ALCF LLNL→OLCF ALCF→OLCF OLCF→ALCF 17,347,671 directories and 28,907,532 files February 12 to May 4, 2022 Sustained rate 1.45 GB/s (LLNL rate limit) Peak 7.5 GB/s (OLCF→ALCF)
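As a rough check on the figures in slides 12 and 13, the arithmetic below estimates how long one copy of 7.4 PB would take at the quoted sustained and peak rates, and compares that with the length of the February 12 to May 4, 2022 window. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and a single uninterrupted stream. Neither assumption is stated in the slides, and the real campaign involved several source and destination pairs.

```python
from datetime import date

PB = 10**15  # assumption: decimal petabyte
GB = 10**9   # assumption: decimal gigabyte


def transfer_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / 86_400


# One full copy of the 7.4 PB dataset at the sustained and peak rates
days_at_sustained = transfer_days(7.4 * PB, 1.45 * GB)
days_at_peak = transfer_days(7.4 * PB, 7.5 * GB)

# Length of the replication window given on slide 13
window_days = (date(2022, 5, 4) - date(2022, 2, 12)).days
```

Under these assumptions a single copy takes about 59 days at 1.45 GB/s and about 11 days at 7.5 GB/s, while the window spans 81 days.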
  • 14. Friction: Manual operations Need: Interactive and programmatic control Solution: Web interfaces, APIs, command line interfaces define the Globus platform Enables: Fire-and-forget transfers, portals
        # Assumes tc is an authenticated globus_sdk.TransferClient and that
        # host_id, source_path, and rule_data are supplied by the caller
        import uuid
        from globus_sdk import TransferData

        # (1) Create shared endpoint:
        # Create directory to be shared
        share_path = '/~/' + str(uuid.uuid4()) + '/'
        tc.operation_mkdir(host_id, path=share_path)
        # Create shared endpoint on directory
        shared_ep_data = {
            'DATA_TYPE': 'shared_endpoint',
            'host_endpoint': host_id,
            'host_path': share_path
        }
        r = tc.create_shared_endpoint(shared_ep_data)
        share_id = r['id']  # ID of the new shared endpoint, used in steps 2-4

        # (2) Copy data into the shared endpoint
        tc.endpoint_autoactivate(share_id)
        tdata = TransferData(tc, host_id, share_id)
        tdata.add_item(source_path, '/', recursive=True)
        r = tc.submit_transfer(tdata)
        tc.task_wait(r['task_id'], timeout=1000)

        # (3) Set access control on shared endpoint
        tc.add_endpoint_acl_rule(share_id, rule_data)

        # (4) Ultimately, delete the shared endpoint
        tc.delete_endpoint(share_id)
  • 15. Dark Energy Science Collaboration (DESC) Data Challenge 2: • Simulate 300 sq. degrees of sky over 5 years • 5 TB simulated data • ~90M core hours at ALCF and NERSC Globus-based portal makes data accessible to DESC participants
  • 16. Friction: Complex data pipelines Solution: Globus Automate for definition, execution, sharing of flows Example: Integrate data from many institutional repositories Auth Search Transfer Compute
  • 19. Friction: Nowhere to put things! Need: On-demand access to storage for caching, staging, distribution Solution: Data stores implement Globus APIs for creating, managing, and accessing virtual stores (“collections”) and metadata catalogs Examples: Argonne Petrel (6 PB) and Eagle (100 PB)
  • 20. Friction: Repeated authentication Need: Automated operations on resources with different authentication & authorization requirements Solution: Protocols and APIs for identity linking, authentication, credential refresh, delegation Numbers: 300,000 registered users; 1,500 identity providers
  • 21. Friction: Repeated authentication Need: Automated operations on resources with different authentication & authorization requirements Solution: Protocols and APIs for identity linking, authentication, credential refresh, delegation Numbers: 300,000 registered users; 1,500 identity providers
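The credential refresh mentioned in slides 20 and 21 can be pictured as a small cache that renews a token shortly before it expires, so that automated operations never stop to re-authenticate. The sketch below is a generic illustration, not the Globus Auth API: the class name, the `refresh` callback signature, and the 60-second safety margin are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Token:
    value: str
    expires_at: float  # absolute expiry time in seconds


class RefreshingTokenStore:
    """Cache one token per resource and refresh it shortly before expiry."""

    def __init__(self,
                 refresh: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.time,
                 skew_s: float = 60.0) -> None:
        self._refresh = refresh  # returns (token value, lifetime in seconds)
        self._clock = clock      # injectable so the logic can be tested
        self._skew_s = skew_s    # refresh this long before actual expiry
        self._tokens: Dict[str, Token] = {}

    def get(self, resource: str) -> str:
        """Return a valid token for resource, refreshing if missing or near expiry."""
        tok = self._tokens.get(resource)
        if tok is None or tok.expires_at - self._skew_s <= self._clock():
            value, lifetime = self._refresh(resource)
            tok = Token(value, self._clock() + lifetime)
            self._tokens[resource] = tok
        return tok.value


def demo():
    """Drive the store with a fake clock and a fake refresh function."""
    now = [0.0]
    calls = []

    def fake_refresh(resource: str) -> Tuple[str, float]:
        calls.append(resource)
        return f"{resource}-token-{len(calls)}", 600.0

    store = RefreshingTokenStore(fake_refresh, clock=lambda: now[0])
    first = store.get("transfer")   # no token yet: refresh is called
    second = store.get("transfer")  # still valid: served from the cache
    now[0] = 590.0                  # inside the 60 s margin before expiry
    third = store.get("transfer")   # refreshed again
    return first, second, third, len(calls)
```

In `demo()`, the first two lookups share one token and the third triggers a second refresh, so the caller never handles expiry itself.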
  • 22.
  • 24. Three lessons learned operating Globus A carefully designed research data platform can advance scientific discovery by reducing data friction Public cloud can be used to facilitate the development and delivery of the persistent services required to implement such a platform A hybrid free/subscription-based support model can enable sustainable operation of such services
  • 25. Globus platform 2010- Operated by UChicago for researchers worldwide Made possible by the support of 180+ subscribers
  • 26. Challenge: Operate an ultra-reliable, scalable, secure service with a small team Elastic Compute Cloud (EC2) Simple Storage Service (S3) CloudFront DynamoDB Elastic Block Store (EBS) Kinesis (Streams) Redshift RDS (MySQL, Oracle, Postgres) Step Functions Lambda Route 53 Simple Email Service (SES) Simple Notification Service (SNS) Simple Queue Service (SQS) CloudTrail CloudWatch Logs Identity & Access Management (IAM) Key Management Service (KMS) Elastic Load Balancing Solution: Leverage a commercial cloud platform that is designed and operated for that purpose
  • 27. Cloud as platform: Lessons learned Commercial cloud can be leveraged for science: • As an elastic source of computing and storage capacity – sure • As a cheap source of computing and storage capacity – maybe/not? As an immensely powerful platform for delivering scalable, reliable, and democratizing digital services – absolutely!
  • 28. Three lessons learned operating Globus A carefully designed research data platform can advance scientific discovery by reducing data friction Public cloud can be used to facilitate the development and delivery of the persistent services required to implement such a platform A hybrid free/subscription-based support model can enable sustainable operation of such services
  • 29. All software must be sustainable software ● A research tool is sustainable if its users can count on it being maintained as fit for purpose over time ● Sustainability requires that, in the aggregate, resources equal or exceed costs: ● Generally desirable that adding each new user: ○ is inexpensive (“low incremental costs”) ○ increases total resources (“positive returns to scale”)
  • 30. Globus achieves sustainability via a hybrid free/subscription model • Free tier enables broad use regardless of ability to pay • Premium services available to subscribers incentivize subscriptions, with subscription levels set based on institutional research budget as an approximation of benefit received and ability to pay • Notes: • As a not-for-profit service operated by the University of Chicago, the goal is not profit maximization but maximum value delivered to science • Extensive cloud-based automation provides for low incremental costs • A growing subscriber base provides for positive returns to scale • Challenges • Balancing goals of broad adoption vs. sustainability: free rider problem
  • 31. Developing vs. sustaining the Globus platform Developing: • Creating the cloud-hosted Globus service • New platform services: e.g., Flows for automation, funcX for managed compute • New features: e.g., cloud storage connectors, client-side striping, personal health information Sustaining: • Maintenance and operations; user engagement for support, training, security reviews, etc. • Subscription contracts and agreements (e.g., BAAs) • Feature enhancements; scaling to support more users, bigger data; new storage adapters, Automate action providers 180 subscribers Universities, federal agencies, national projects, research institutes, health care systems, commercial (4%), international (20%)
  • 32. Example premium service: management console
  • 34. Federated Research Data Repository A national research data platform • Ingest, curation, preservation • Discovery, citation, sharing Uses Globus Services: • Auth for authentication • Transfer to repository service • Search for integrated metadata catalog (includes metadata from 70 other repositories)
  • 36. Rate (color) vs. distance and size for 4M transfers (35B files, 280 PB) thru 2017 1,921 transfers from Argonne Advanced Photon Source server are highlighted
  • 37. Thanks to the team! … and our funders … … and our subscribers
  • 38. Globus services enabling scientific discovery Unified Data Access Data Transfer and Sharing Platform-as-a-Service Reliable Automation Publication & Discovery Remote Execution (future) Globus is a powerful research data services platform, with hybrid SaaS architecture and hybrid free/subscription-based sustainability, and with a footprint that encompasses 1600+ institutions and 300,000+ users
  • 39. Implications for a National Science Data Fabric? Globus reaches 300,000+ users at 1600+ institutions These institutions can source and sink data at increasingly high rates, ever more easily (thanks to networks, Science DMZs, and Globus) Perhaps their most urgent need is for programmatically accessible storage for data caching, staging, and distribution For more information: ● https://globus.org ● https://docs.globus.org/modern-research-data-portal/ ● https://docs.globus.org/globus-automation-services/ Ian Foster foster@uchicago.edu @ianfoster labs.globus.org