Accelerating data-intensive science by outsourcing the mundane
Ian Foster
The data deluge. MACHO et al.: 1 TB; Palomar: 3 TB; 2MASS: 10 TB; GALEX: 30 TB; Sloan: 40 TB; Pan-STARRS: 40,000 TB; 100,000 TB. Genomic sequencing: output doubles every 9 months; >300 public centers. 1,330 molecular biology databases in Nucleic Acids Research (96 in Jan 2001). Climate Model Intercomparison Project (CMIP) of the IPCC: 36 TB in 2004; 2,300 TB in 2012.
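To put the growth rates on this slide on a common footing, here is a minimal sketch that converts them into doubling times and yearly growth factors. It assumes steady exponential growth and reads the 36 TB (2004) and 2,300 TB (2012) figures as CMIP archive sizes; the function names are illustrative only.

```python
import math


def doubling_time_months(start_tb, end_tb, years):
    """Months per doubling, assuming steady exponential growth."""
    doublings = math.log2(end_tb / start_tb)
    return years * 12 / doublings


def annual_growth_factor(doubling_months):
    """Growth factor over 12 months for a given doubling time."""
    return 2 ** (12 / doubling_months)


# CMIP figures as read from the slide: 36 TB in 2004, 2,300 TB in 2012.
cmip_doubling = doubling_time_months(36, 2300, 2012 - 2004)

# Genomic sequencing output is quoted as doubling every 9 months.
sequencing_growth = annual_growth_factor(9)

print(f"CMIP doubling time: {cmip_doubling:.1f} months")
print(f"Sequencing growth per year: {sequencing_growth:.2f}x")
```

Under these assumptions the CMIP archive doubles roughly every 16 months, while sequencing output grows about 2.5x per year.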
Big science has achieved big successes. OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010. LIGO: 1 PB data in last science run, distributed worldwide. ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs. Robust production solutions; substantial teams and expense; sustained, multi-year effort; application-specific solutions, built on common technology. All build on NSF OCI (& DOE)-supported Globus Toolkit software.
But small science is struggling: more data, more complex data; ad-hoc solutions; inadequate software, hardware; data plan mandates.
Medium-scale science struggles too! Dark Energy Survey receives 100,000 files each night in Illinois. They transmit files to Texas for analysis … then move results back to Illinois. Process must be reliable, routine, and efficient. The cyberinfrastructure team is not large. (Image: Blanco 4m on Cerro Tololo. Image credit: Roger Smith/NOAO/AURA/NSF.)
The challenge of staying competitive    "Well, in our country," said Alice … "you'd generally get to somewhere else — if you run very fast for a long time, as we've been doing.”   "A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"
Current approaches are unsustainable. Small laboratories (PI, postdoc, technician, grad students): estimate 5,000 across US university community; average ill-spent/unmet need of 0.5 FTE/lab? Medium-scale projects (multiple PIs, a few software engineers): estimate 500 across US university community; average ill-spent/unmet need of 3 FTE/project? Total 4,000 FTE: at ~$100K/FTE => $400M/yr, plus computers, storage, opportunity costs, …
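The total on this slide follows directly from the per-group estimates. A short sketch of the arithmetic, using only the slide's own figures (the variable names are illustrative):

```python
# Estimates quoted on the slide.
SMALL_LABS = 5_000            # small laboratories across US universities
SMALL_LAB_NEED_FTE = 0.5      # ill-spent/unmet need per lab
MEDIUM_PROJECTS = 500         # medium-scale projects
MEDIUM_PROJECT_NEED_FTE = 3   # ill-spent/unmet need per project
COST_PER_FTE_USD = 100_000    # approximate yearly cost of one FTE

small_fte = SMALL_LABS * SMALL_LAB_NEED_FTE
medium_fte = MEDIUM_PROJECTS * MEDIUM_PROJECT_NEED_FTE
total_fte = small_fte + medium_fte
total_cost_usd = total_fte * COST_PER_FTE_USD

print(f"Small labs: {small_fte:,.0f} FTE")
print(f"Medium projects: {medium_fte:,.0f} FTE")
print(f"Total: {total_fte:,.0f} FTE, about ${total_cost_usd / 1e6:,.0f}M per year")
```

Small laboratories contribute 2,500 FTE and medium-scale projects 1,500 FTE, which gives the 4,000 FTE and roughly $400M per year stated above.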
And don’t forget administrative costs. "42% of the time spent by an average PI on a federally funded research project was reported to be expended on administrative tasks related to that project rather than on research." (Federal Demonstration Partnership faculty burden survey, 2007)
You can run a company from a coffee shop
Because businesses outsource their IT. Software as a Service (SaaS): web presence; email (hosted Exchange); calendar; telephony (hosted VOIP); human resources and payroll; accounting; customer relationship mgmt.
And often their large-scale computing too. Software as a Service (SaaS): web presence; email (hosted Exchange); calendar; telephony (hosted VOIP); human resources and payroll; accounting; customer relationship mgmt. Infrastructure as a Service (IaaS): data analytics; content distribution.
Let’s rethink how we provide research IT. Accelerate discovery and innovation worldwide by providing research IT as a service. Leverage software-as-a-service to provide millions of researchers with unprecedented access to powerful tools; enable a massive shortening of cycle times in time-consuming research processes; and reduce research IT costs dramatically via economies of scale.
Time-consuming tasks in science: run experiments; collect data; manage data; move data; acquire computers; analyze data; run simulations; compare experiment with simulation; search the literature; publish papers; find, configure, install relevant software; find, access, analyze relevant data; order supplies; write proposals; write reports; …
Data movement can be surprisingly difficult. Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, … "It took 2 weeks and much help from many people to move 10 TB between California and Tennessee." (2007 BES report)
Globus Online’s SaaS/Web 2.0 architecture. Command line interface: ls alcf#dtn:/ and scp alcf#dtn:/myfile nersc#dtn:/myfile. HTTP REST interface: POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>. Web interface. OpenID, OAuth, Shibboleth. (Operate) Fire-and-forget data movement; automatic fault recovery; high performance; no client software install; across multiple security domains. (Hosted on) GridFTP servers; FTP servers; other protocols: HTTP, WebDAV, SRM, …; Globus Connect on local computers.
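The slide names a REST endpoint that accepts a transfer document but does not show the document itself. The sketch below only illustrates the idea: it turns the endpoint#name:/path locations from the command-line example into a request body. The field names (source_endpoint, destination_endpoint, items) and both helper functions are assumptions for illustration, not the service's actual schema, and nothing is sent over the network.

```python
import json

# URL as given on the slide; a real request would also need authentication.
API_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def split_location(location):
    """Split 'endpoint:/path' (e.g. 'alcf#dtn:/myfile') into its two parts."""
    endpoint, sep, path = location.partition(":")
    if not sep or not path.startswith("/"):
        raise ValueError(f"expected 'endpoint:/path', got {location!r}")
    return endpoint, path


def build_transfer_document(source, destination):
    """Build a hypothetical transfer document; field names are assumed."""
    src_endpoint, src_path = split_location(source)
    dst_endpoint, dst_path = split_location(destination)
    return {
        "source_endpoint": src_endpoint,
        "destination_endpoint": dst_endpoint,
        "items": [{"source_path": src_path, "destination_path": dst_path}],
    }


doc = build_transfer_document("alcf#dtn:/myfile", "nersc#dtn:/myfile")
print(f"POST {API_URL}")
print(json.dumps(doc, indent=2))
```

The point of the design is that the client only describes what should move; retries, monitoring, and tuning stay with the hosted service, which is what makes the transfer fire-and-forget.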
Example application: UC sequencing facility. Mac using Globus Connect; delivery of data to customer; iBi File Server (mount drive); iBi general-purpose compute cluster; sequencing-specific compute cluster; sequencing instrument.
Statistics and user feedback. Launched November 2010; >1700 users registered; >500 TB user data moved; >30 million user files moved; >150 endpoints registered. Widely used on TeraGrid/XSEDE; other centers & facilities; internationally. >20x faster than SCP; faster than hand-tuned. “Last time I needed to fetch 100,000 files from NERSC, a graduate student babysat the process for a month.” “I expected to spend four weeks writing code to manage my data transfers; with Globus Online, I was up and running in five minutes.” “Transferred 28 MB in 20 minutes instead of 61 hours. Makes these global climate simulations manageable.”
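As a quick check on the last quote, the elapsed-time ratio can be computed directly from the two durations it gives. This is a sketch of the arithmetic only; the function name is illustrative.

```python
def speedup(before_minutes, after_minutes):
    """Ratio of elapsed times: how many times faster the second run was."""
    return before_minutes / after_minutes


# "20 minutes instead of 61 hours", as quoted on the slide.
climate_speedup = speedup(61 * 60, 20)
print(f"Speedup for the quoted climate transfer: {climate_speedup:.0f}x")
```

For that one transfer the ratio comes out at 183x, well above the ">20x faster than SCP" figure listed on the same slide.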
Moving 586 Terabytes in two weeks
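For scale, the sketch below converts this figure into an average sustained rate and compares it with the 10 TB in two weeks quoted from the 2007 BES report. It assumes decimal terabytes (10^12 bytes) and exactly 14 days in both cases; the names are illustrative.

```python
TB = 10 ** 12                   # decimal terabyte, an assumption
TWO_WEEKS_S = 14 * 24 * 3600    # seconds in 14 days


def average_rate_bytes_per_s(bytes_moved, seconds):
    """Average sustained transfer rate over the whole period."""
    return bytes_moved / seconds


rate_586 = average_rate_bytes_per_s(586 * TB, TWO_WEEKS_S)
rate_10 = average_rate_bytes_per_s(10 * TB, TWO_WEEKS_S)

print(f"586 TB in two weeks: {rate_586 / 1e6:.0f} MB/s ({rate_586 * 8 / 1e9:.2f} Gbit/s)")
print(f"10 TB in two weeks: {rate_10 / 1e6:.1f} MB/s ({rate_10 * 8 / 1e6:.0f} Mbit/s)")
print(f"Ratio: {rate_586 / rate_10:.1f}x")
```

Under these assumptions, 586 TB in two weeks averages about 484 MB/s (just under 4 Gbit/s), against roughly 8 MB/s for the 2007 transfer.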
Monitoring provides deep visibility

A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Rpi talk foster september 2011

  • 1. Accelerating data-intensive science by outsourcing the mundane. Ian Foster
  • 2.
  • 3. The data deluge. Astronomy: MACHO et al.: 1 TB; Palomar: 3 TB; 2MASS: 10 TB; GALEX: 30 TB; Sloan: 40 TB; Pan-STARRS: 40,000 TB; LSST (projected): 100,000 TB. Genomic sequencing output doubles every 9 months; >300 public centers. 1,330 molecular biology databases (Nucleic Acids Research; 96 in Jan 2001). Climate model intercomparison project (CMIP) of the IPCC: 36 TB in 2004; 2,300 TB in 2012.
  • 4. Big science has achieved big successes. OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010. LIGO: 1 PB data in last science run, distributed worldwide. ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs. Robust production solutions; substantial teams and expense; sustained, multi-year effort; application-specific solutions, built on common technology. All build on NSF OCI (& DOE)-supported Globus Toolkit software.
  • 5. But small science is struggling. More data, more complex data; ad-hoc solutions; inadequate software, hardware; data plan mandates.
  • 6. Medium-scale science struggles too! Image: Blanco 4m on Cerro Tololo (credit: Roger Smith/NOAO/AURA/NSF). Dark Energy Survey receives 100,000 files each night in Illinois. They transmit files to Texas for analysis, then move results back to Illinois. Process must be reliable, routine, and efficient. The cyberinfrastructure team is not large.
  • 7. The challenge of staying competitive "Well, in our country," said Alice … "you'd generally get to somewhere else — if you run very fast for a long time, as we've been doing.” "A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"
  • 8. Current approaches are unsustainable. Small laboratories (PI, postdoc, technician, grad students): estimate 5,000 across US university community; average ill-spent/unmet need of 0.5 FTE/lab? Medium-scale projects (multiple PIs, a few software engineers): estimate 500 across US university community; average ill-spent/unmet need of 3 FTE/project? Total 4,000 FTE: at ~$100K/FTE => $400M/yr, plus computers, storage, opportunity costs, …
  • 9. And don’t forget administrative costs. 42% of the time spent by an average PI on a federally funded research project was reported to be expended on administrative tasks related to that project rather than on research (Federal Demonstration Partnership faculty burden survey, 2007).
  • 10. You can run a company from a coffee shop
  • 11. Because businesses outsource their IT. Software as a Service (SaaS): web presence; email (hosted Exchange); calendar; telephony (hosted VOIP); human resources and payroll; accounting; customer relationship mgmt.
  • 12. And often their large-scale computing too. Software as a Service (SaaS): web presence; email (hosted Exchange); calendar; telephony (hosted VOIP); human resources and payroll; accounting; customer relationship mgmt. Infrastructure as a Service (IaaS): data analytics; content distribution.
  • 13. Let’s rethink how we provide research IT. Accelerate discovery and innovation worldwide by providing research IT as a service. Leverage software-as-a-service to provide millions of researchers with unprecedented access to powerful tools; enable a massive shortening of cycle times in time-consuming research processes; and reduce research IT costs dramatically via economies of scale.
  • 14.
  • 16. Find, configure, install relevant software
  • 17. Find, access, analyze relevant data
  • 21.
  • 23. Find, configure, install relevant software
  • 24. Find, access, analyze relevant data
  • 28.
  • 29. Data movement can be surprisingly difficult. Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, … It took 2 weeks and much help from many people to move 10 TB between California and Tennessee. (2007 BES report)
  • 30. Globus Online’s SaaS/Web 2.0 architecture. Command line interface: ls alcf#dtn:/ and scp alcf#dtn:/myfile nersc#dtn:/myfile. HTTP REST interface: POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>. Web interface. OpenID, OAuth, Shibboleth. (Operate) Fire-and-forget data movement; automatic fault recovery; high performance; no client software install; across multiple security domains. (Hosted on) GridFTP servers; FTP servers; other protocols: HTTP, WebDAV, SRM, …; Globus Connect on local computers.
  • 31. Example application: UC sequencing facility Mac using Globus Connect Delivery of data to customer iBi File Server Mount drive iBi general-purpose compute cluster Sequencing-specific compute cluster Sequencing instrument
  • 32. Statistics and user feedback. Launched November 2010; >1,700 users registered; >500 TB user data moved; >30 million user files moved; >150 endpoints registered. Widely used on TeraGrid/XSEDE, at other centers & facilities, and internationally. >20x faster than SCP; faster than hand-tuned. “Last time I needed to fetch 100,000 files from NERSC, a graduate student babysat the process for a month.” “I expected to spend four weeks writing code to manage my data transfers; with Globus Online, I was up and running in five minutes.” “Transferred 28 MB in 20 minutes instead of 61 hours. Makes these global climate simulations manageable.”
  • 33. Moving 586 Terabytes in two weeks
  • 35. 20 Terabytes in less than one day; 20 Gigabytes in more than two days. (Chart axis: Kilobyte, Megabyte, Gigabyte, Terabyte.)
  • 36. Common research data management steps. Dark Energy Survey; Galaxy genomics; LIGO observatory; SBGrid structural biology consortium; NCAR climate data applications; land use change; economics.
  • 37. We have choices of where to compute. Campus systems: first target for many researchers. XSEDE supercomputers: 220,000 cores, peer-reviewed awards, optimized for scientific computing. Open Science Grid: 60,000 cores; high throughput. Commercial cloud providers: instant access for small tasks; expensive for big projects. Users insist that they need everything connected.
  • 38. Towards “research IT as a service”
  • 39. Research data management as a service. GO-User: credentials and other profile information. GO-Transfer: data movement. GO-Team: group membership. GO-Collaborate: connect to collaborative tools (Jira, Confluence, …). GO-Store: access to campus, cloud, XSEDE storage. GO-Catalog: on-demand metadata catalogs. GO-Compute: access to computers. GO-Galaxy: share, create, run workflows. Legend: Today; Prototype; Fall.
  • 40. SaaS services in action: The XSEDE vision XUAS
  • 41. Data analysis as a service: Early steps. Securely and reliably: assemble code, find computers, deploy code, run program, access data, store data, record workflow, reuse workflow. We have built such systems for biological, environmental, and economics researchers. Diagram labels: VM image, App code, Workflow, Galaxy, Condor, Data store; step markers [1, 2], [3, 4], [5, 6], [7, 8].
  • 42. SaaS economics: A quick tutorial. Lower per-user cost (x10?) via aggregation onto common infrastructure: $400M/yr → $40M/yr? Initial “cost trough” due to fixed costs. Per-user revenue permits positive return to scale. Further reduce per-user cost over time. (Chart: $ against time, starting at 0.) x10 reduction in per-user cost: $50K → $5K/yr per lab; $300K → $30K/yr per project.
  • 43. A national cyberinfrastructure strategy? To provide more capability for more people at less cost … Create infrastructure: robust and universal; economies of scale; positive returns to scale. Via the creative use of aggregation (“cloud”) and federation (“grid”). Diagram: small and medium laboratories (L) and projects (P); services: research data management; collaboration, computation; research administration.
  • 44. Acknowledgments. Colleagues at UChicago and Argonne: Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu Malik, and others listed at www.globusonline.org/about/goteam/. Carl Kesselman and other colleagues at other institutions. Participants in the recent ICiS workshop on “Human-Computer Symbiosis: 50 Years On”. NSF OCI and MPS; DOE ASCR; and NIH for support.
  • 45. For more information: www.globusonline.org; @globusonline on Twitter. Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011. Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Communications of the ACM, 2011.
  • 46. Thank you! foster@uchicago.edu www.globusonline.org @globusonline
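
The cost estimate on slides 8 and 42 can be reproduced as a back-of-envelope calculation. The following is a minimal sketch that uses only the rough figures quoted on those slides; the function and variable names are illustrative and are not part of the talk.

```python
# Back-of-envelope reproduction of the estimates on slides 8 and 42.
# Every input is one of the slide's own rough guesses, not measured data.

COST_PER_FTE = 100_000  # dollars per FTE-year (slide 8: ~$100K/FTE)

# Number of groups and the ill-spent/unmet need per group, from slide 8.
groups = {
    "small lab": {"count": 5_000, "fte_each": 0.5},
    "medium project": {"count": 500, "fte_each": 3.0},
}


def total_fte(groups):
    """Sum the ill-spent/unmet FTE over all labs and projects."""
    return sum(g["count"] * g["fte_each"] for g in groups.values())


def annual_cost(groups, cost_per_fte=COST_PER_FTE):
    """Total yearly cost in dollars of that ill-spent/unmet effort."""
    return total_fte(groups) * cost_per_fte


def per_group_cost(fte_each, cost_per_fte=COST_PER_FTE, reduction=1.0):
    """Yearly cost for one lab or project, divided by a cost-reduction factor."""
    return fte_each * cost_per_fte / reduction
```

With the slide's inputs, `total_fte(groups)` gives 4,000 FTE and `annual_cost(groups)` gives $400M per year. Passing `reduction=10` to `per_group_cost` corresponds to the hoped-for x10 reduction on slide 42, which the slide itself marks with a question mark.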

Editor's notes

  1. New capabilities represent a tremendous opportunity for science. The challenge that I want to speak to is how we leverage these capabilities without computers and computation overwhelming the research community in terms of both human and financial resources. The solution, I will suggest, is to get computation out of the lab: to outsource it to third-party providers. I will explain how this task can be achieved.
  2. The need to deal with and benefit from large quantities of data is not a new concept: it has been noted in many policy reports, particularly in the US and UK, over the past several years.
  3. But the data deluge is now upon us. I use a few examples to highlight developments. Genome sequencing machines are doubling in output every nine months, which leaves the rather stately 18-month Moore’s Law doubling of computer performance in the shade. Astronomy, which only entered the digital era around 2000, projects 100,000 TB of data from LSST by the end of the decade. [2MASS completed 2001.] Simulation. And not just volume, but also complexity. Trends: scale, complexity, distributed generation, … Source for genomic data: http://www.sciencemag.org/content/331/6018/728.short (“Output from next-generation sequencing (NGS) has grown from 10 Mb per day to 40 Gb per day on a single sequencer, and there are now 10 to 20 major sequencing labs worldwide that have each deployed more than 10 sequencers”). Source for mol bio dbs: http://nar.oxfordjournals.org/content/39/suppl_1/D1.full.pdf+html. Source for climate change image: http://serc.carleton.edu/details/images/17685.html
  4. Not just small labs: medium science too, e.g., the Dark Energy Survey.
  5. For many researchers, projects, and institutions, large data volumes are not an opportunity but a fundamental challenge to their competitiveness as researchers. How can they keep up?
  6. 200 universities * 250 faculty per university = 5,000. Summary: Big projects can build sophisticated solutions to IT problems. Small labs and collaborations have problems with both. They need solutions, not toolkits; ideally outsourced solutions.
  7. Need date
  8. Of course, people also make effective use of IaaS, but only for more specialized tasks
  9. More specifically, the opportunity is to apply a very modern technology (software as a service, or SaaS) to address a very modern problem, namely the enormous challenges inherent in translating revolutionary 21st century technologies into scientific advances. Midway’s SaaS approach will address these challenges, and both make powerful tools far more widely available and reduce the cycle time associated with research and discovery. Achieve economies of scale. Reduce cost per researcher dramatically. Achieve positive returns to scale (PRTS). Most academic solutions do NOT have PRTS. Most industrial solutions DO have PRTS.
  10. So let’s look at that list again. My colleagues and I started an effort a little while ago aimed at applying SaaS to one of these tasks …
  11. Example: a small lab generates data at the Texas Advanced Computing Center or the Advanced Photon Source and needs to move it back to their lab. Or: a group needs to move data from an experimental facility (e.g., a sequencing center or the Dark Energy Survey) to a computing facility for analysis.
  12. Data movement is conceptually simple, but can be surprisingly difficult
  13. Why? Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, identify, diagnose, and correct network misconfigurations, …
  14. Reliable file transfer: easy “fire and forget” file transfers; automatic fault recovery; high performance; across multiple security domains. No IT required: no client software installation; new features automatically available; consolidated support and troubleshooting; works with existing GridFTP servers; Globus Connect solves the “last mile problem”.
  15. I’ll talk about integration with the Galaxy workflow system later …
  16. Reduce costs. Improve performance. Enable new science.
  17. What else do we need?
  18. Add university logos?
  19. Slide 33: Is the task of creating reusable workflows part of these 6 steps? Is publication and discovery of workflows/derived data products part of this as well? Is reproducible research part of it as well?
  20. Researchers vote with their dollars
  21. Before: lots of little labs; big science; XSEDE. After: lots of empowered SMLs, entrepreneurship in science, reproducible/reusable research, etc.
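
Note 14 lists “fire and forget” transfers with automatic fault recovery. The general idea can be sketched as a retry loop with exponential backoff. This is an illustrative sketch only, not a description of how Globus Online is implemented; `transfer_with_retry`, `copy_file`, and all parameters are hypothetical names introduced here.

```python
import time
from typing import Callable, Iterable, List, Tuple


def transfer_with_retry(
    copy_file: Callable[[str], None],   # hypothetical: moves one file, raises OSError on failure
    paths: Iterable[str],
    max_attempts: int = 3,
    base_delay: float = 1.0,            # seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str]]:
    """Attempt each file up to max_attempts times; return (done, failed)."""
    done: List[str] = []
    failed: List[str] = []
    for path in paths:
        for attempt in range(1, max_attempts + 1):
            try:
                copy_file(path)
                done.append(path)
                break
            except OSError:
                if attempt == max_attempts:
                    # Give up on this file, but keep going with the rest.
                    failed.append(path)
                else:
                    # Back off exponentially: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return done, failed
```

The caller submits the whole list once and inspects `failed` afterwards, instead of babysitting each file. A hosted service could run a loop of this kind on the user's behalf, which is the point of the “fire and forget” model.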