Ian Foster
Argonne National Laboratory and University of Chicago
foster@anl.gov
ianfoster.org
Taming Big Data!
Discovery is an iterative process
[Figure: cycle of pose question, design experiment, collect data, analyze data, identify patterns, hypothesize explanation, test hypothesis, publish results]
Janet Rowley, 1972
Discovery in the big data era:
Resource-intensive, expensive, slow
[Figure: cycle of pose question, design experiment, collect data, analyze data, identify patterns, hypothesize explanation, test hypothesis, publish results]
Three big data challenges
Channel massive flows
Automate management
Build discovery engines
Channel massive data flows
Data must move to be useful. We may optimize,
but we can never entirely eliminate distance.
• Sources: experimental facilities, sensors, computations
• Sinks: analysis computers, display systems
• Stores: impedance matchers & time shifters
• Pipes: IO systems and networks connect other elements
“We must think of data as a flowing river over time, not a static
snapshot. Make copies, share, and do magic” – S. Madhavan
Store
Transfer is challenging at many levels
Speed and reliability
• GridFTP protocol
• Globus implementation
Scheduling and modeling
• SEAL and STEAL algorithms
• RAMSES project
[Figure: source data store, wide area network, destination data store]
File transfer is an end-to-end problem
[Figure: end-to-end path from the source data transfer node (application, OS, FS stack, HBA/HCA, TCP, IP, NIC) through LAN switch, router, wide area network, router, and LAN switch to the destination data transfer node (same stack), with a storage array and a Lustre file system (OSS, OST, MDS, MDT) as back-end storage]
+ diverse environments
+ diverse workloads
+ contention
File transfer is an end-to-end problem
GridFTP protocol and implementations:
Fast, reliable, secure 3rd-party data transfer
Extends the legacy FTP protocol to enhance performance, reliability, and security.
Globus GridFTP provides a widely used open-source implementation.
Modular, pluggable architecture (different protocols, I/O interfaces).
Many optimizations: e.g., concurrency, parallelism, pipelining.
[Figure: data transfer nodes at Site A and Site B, each running two GridFTP server processes over a parallel file system, joined by six TCP connections (concurrency = 2, parallelism = 3)]
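One way to read the figure's settings: concurrency is the number of files in flight at once, and parallelism is the number of TCP streams used per file, so the six connections shown equal 2 × 3. The sketch below only illustrates that bookkeeping. The function names and the round-robin block assignment are assumptions made for illustration, not the Globus GridFTP implementation.

```python
from __future__ import annotations


def total_tcp_connections(concurrency: int, parallelism: int) -> int:
    """Data connections when `concurrency` files are in flight,
    each split over `parallelism` TCP streams."""
    if concurrency < 1 or parallelism < 1:
        raise ValueError("concurrency and parallelism must be >= 1")
    return concurrency * parallelism


def assign_blocks(file_size: int, block_size: int, parallelism: int) -> list[list[int]]:
    """Round-robin assignment of block offsets to streams (illustrative only)."""
    streams: list[list[int]] = [[] for _ in range(parallelism)]
    for index, offset in enumerate(range(0, file_size, block_size)):
        streams[index % parallelism].append(offset)
    return streams


if __name__ == "__main__":
    # Settings from the figure: concurrency = 2, parallelism = 3.
    print(total_tcp_connections(concurrency=2, parallelism=3))
    print(assign_blocks(file_size=10, block_size=2, parallelism=3))
```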
85 Gbps sustained disk-to-disk over a 100 Gbps network, Ottawa to New Orleans
Raj Kettimuthu and team, Argonne, Nov 2014
Higgs discovery “only possible because
of the extraordinary achievements of …
grid computing”—Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of
scientists, 100Ks of CPUs, Bs of tasks
One Advanced Photon Source data node: 125 destinations
Same node (1 Gbps link)
Transfer scheduling and optimization
• Science data traffic is extremely bursty
• User experience can be improved by scheduling to minimize slowdown
• Traffic can be categorized: interactive or batch
• Increased concurrency tends to increase aggregate throughput, to a point
[Figure: concurrency over 24 hours. Kettimuthu et al., 2015]
[Figure: throughput vs. concurrency & parallelism. Kettimuthu et al., 2014]
A load-aware, adaptive algorithm:
(1) Data-driven model of throughput
[Figure: four endpoints, EP1 to EP4]
Collect many <s, d, cs, cd, v, a> data
E.g., <EP1, EP3, 3, 3, 20GB, 29sec>
Estimate throughput(s, d, cs, cd, v)
Adjust with estimate of external load
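A minimal sketch of such a data-driven model follows. The slide does not define the tuple fields. Reading s and d as source and destination endpoints, cs and cd as the concurrency at each, v as the volume moved, and a as the elapsed time is an interpretation of the example tuple. The estimator here (mean observed throughput per configuration, ignoring v) and the `external_load` scaling are placeholder assumptions, not the authors' model.

```python
from collections import defaultdict
from statistics import mean

# One record per completed transfer: (s, d, cs, cd, v_gb, a_sec).
# The first record is the slide's example; the others are made up.
history = [
    ("EP1", "EP3", 3, 3, 20.0, 29.0),
    ("EP1", "EP3", 3, 3, 10.0, 16.0),
    ("EP1", "EP3", 1, 1, 20.0, 80.0),
]


def build_model(records):
    """Average the observed throughput (GB/s) for each (s, d, cs, cd)."""
    samples = defaultdict(list)
    for s, d, cs, cd, v_gb, a_sec in records:
        samples[(s, d, cs, cd)].append(v_gb / a_sec)
    return {key: mean(values) for key, values in samples.items()}


def estimate_throughput(model, s, d, cs, cd, external_load=0.0):
    """Look up the mean observed throughput and scale it by an assumed
    external load fraction in [0, 1) (a hypothetical adjustment)."""
    return model[(s, d, cs, cd)] * (1.0 - external_load)


if __name__ == "__main__":
    model = build_model(history)
    print(estimate_throughput(model, "EP1", "EP3", 3, 3))
    print(estimate_throughput(model, "EP1", "EP3", 3, 3, external_load=0.5))
```

A fuller model would also condition on the volume v, as the slide's throughput(s, d, cs, cd, v) signature suggests.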
A load-aware, adaptive algorithm:
(2) Concurrency-constrained scheduling
Define transfer priority:
Schedule transfers if neither source nor destination is saturated, using the model to decide concurrency
If source or destination is saturated, interrupt active transfer(s) to service waiting requests, if doing so can reduce overall average slowdown
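The admission step of concurrency-constrained scheduling can be sketched as below, under stated assumptions. Each endpoint has a fixed concurrency capacity that defines "saturated", priority is a plain number (the slide's priority formula is not reproduced here), and slowdown is taken as (wait time + run time) / run time. Preemption of active transfers is left out. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str
    src: str
    dst: str
    concurrency: int  # streams a throughput model might suggest
    priority: float   # higher runs first (assumed ordering)


def slowdown(wait_s: float, run_s: float) -> float:
    """Assumed definition: (waiting time + run time) / run time."""
    return (wait_s + run_s) / run_s


def admit(waiting, in_use, capacity):
    """Start waiting transfers, highest priority first, as long as neither
    the source nor the destination would exceed its capacity."""
    started = []
    used = dict(in_use)
    for t in sorted(waiting, key=lambda t: t.priority, reverse=True):
        src_ok = used.get(t.src, 0) + t.concurrency <= capacity[t.src]
        dst_ok = used.get(t.dst, 0) + t.concurrency <= capacity[t.dst]
        if src_ok and dst_ok:
            used[t.src] = used.get(t.src, 0) + t.concurrency
            used[t.dst] = used.get(t.dst, 0) + t.concurrency
            started.append(t.name)
    return started, used


if __name__ == "__main__":
    capacity = {"EP1": 4, "EP2": 4, "EP3": 4}
    waiting = [
        Transfer("a", "EP1", "EP2", 3, 1.0),
        Transfer("b", "EP1", "EP3", 2, 5.0),
        Transfer("c", "EP2", "EP3", 2, 2.0),
    ]
    print(admit(waiting, {}, capacity))
```

In this example transfer "a" stays queued because starting it would push EP1 past its capacity. That is the situation in which the slide's algorithm would consider interrupting an active transfer.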
Gagan Agarwal (1*), Prasanna Balaprakash (2), Ian Foster (2*), Raj Kettimuthu (2),
Sven Leyffer (2), Vitali Morozov (2), Todd Munson (2), Nagi Rao (3*),
Saday Sadayappan (1), Brad Settlemyer (3), Brian Tierney (4*), Don Towsley (5*),
Venkat Vishwanath (2), Yao Zhang (2)
(1) Ohio State University (2) Argonne National Laboratory
(3) Oak Ridge National Laboratory (4) ESnet (5) UMass Amherst (* Co-PIs)
Advanced Scientific Computing Research
Program manager: Rich Carlson
How to create more accurate, useful, and
portable models of distributed systems?
Simple analytical model:
T = α + β·l
[startup cost + sustained bandwidth]
Experiment + regression to estimate α, β
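The "experiment + regression" step can be illustrated with ordinary least squares, as in this sketch. It assumes l is the amount of data moved and T the measured transfer time, so that α is the startup cost and β the time per unit of data (the inverse of sustained bandwidth). The measurements are made up for illustration.

```python
def fit_alpha_beta(lengths, times):
    """Ordinary least squares fit of T = alpha + beta * l."""
    n = len(lengths)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (length, time) pairs")
    mean_l = sum(lengths) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in lengths)
    if sxx == 0:
        raise ValueError("lengths must not all be equal")
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(lengths, times))
    beta = sxy / sxx
    alpha = mean_t - beta * mean_l
    return alpha, beta


if __name__ == "__main__":
    # Made-up measurements: l in GB, T in seconds.
    lengths = [1.0, 2.0, 4.0, 8.0]
    times = [2.5, 3.0, 4.0, 6.0]
    alpha, beta = fit_alpha_beta(lengths, times)
    print(f"startup cost alpha = {alpha:.3f} s, beta = {beta:.3f} s/GB")
```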
First-principles modeling to better capture details of system & application components
Data-driven modeling to learn unknown details of system & application components
Model composition
Model, data comparison
Differential regression for combining data from different sources
Example of use: Predict performance on a connection of length L that is not realizable on physical infrastructure
E.g., IB-RDMA or HTCP throughput on a 900-mile connection
1) Make multiple measurements of performance on path lengths $d$:
– $M_S(d)$: OPNET simulation
– $M_E(d)$: ANUE-emulated path
– $M_U(d_i)$: Real network (USN)
2) Compute measurement regressions on $d$: $\dot{M}_A(\cdot)$, $A \in \{S, E, U\}$
3) Compute differential regressions: $\Delta\dot{M}_{A,B}(\cdot) = \dot{M}_A(\cdot) - \dot{M}_B(\cdot)$, $A, B \in \{S, E, U\}$
4) Apply differential regression to obtain estimates, $C \in \{S, E\}$:
$\mathcal{M}_U(d) = M_C(d) - \Delta\dot{M}_{C,U}(d)$
Here $M_C(d)$ is the simulated/emulated measurement and $\Delta\dot{M}_{C,U}(d)$ is the point regression estimate.
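The four steps can be sketched as follows. The slide does not say which regression family is used, so this example assumes a straight-line fit in d, and all measurement values are synthetic. It estimates real-network performance at a 900-mile path from a simulated measurement at that length and regressions fitted on shorter real paths.

```python
def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns a callable regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda d: intercept + slope * d


def differential_estimate(d, m_c_at_d, fit_c, fit_u):
    """Step 4: M_U(d) estimated as M_C(d) - (fit_C(d) - fit_U(d))."""
    return m_c_at_d - (fit_c(d) - fit_u(d))


if __name__ == "__main__":
    # Synthetic throughput measurements (Gbps) against path length (miles).
    sim_d = [100, 300, 500, 900]
    sim_m = [10 - 0.005 * d for d in sim_d]    # simulation, includes 900 miles
    real_d = [100, 300, 500]
    real_m = [9 - 0.006 * d for d in real_d]   # real network, shorter paths only

    fit_s = linear_fit(sim_d, sim_m)           # step 2, A = S
    fit_u = linear_fit(real_d, real_m)         # step 2, A = U
    print(differential_estimate(900, sim_m[-1], fit_s, fit_u))
```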
End-to-end profile composition
[Figure: three profiles (source LAN, WAN, destination LAN), each shown with the configuration of the corresponding devices (host and edge devices, WAN devices, host and edge devices), joined by composition operations]
Three big data challenges
Channel massive flows
Automate management
Build discovery engines
[Figure: data moving among staging, ingest, analysis, community, archive, and mirror stores and a registry, interrupted by errors: "Quota exceeded", "Expired credentials", "Network failed. Retry.", "Permission denied"]
It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA
… but in reality it’s often very challenging
One researcher’s perspective on data management challenges
Tripit exemplifies process automation
Me
Book flights
Book hotel
Record flights
Suggest hotel
Record hotel
Get weather
Prepare maps
Share info
Monitor prices
Monitor flight
Other services
How the “business cloud” works
Platform services:
Database, analytics, application, deployment, workflow, queuing
Auto-scaling, Domain Name Service, content distribution
Elastic MapReduce, streaming data analytics
Email, messaging, transcoding. Many more.
Infrastructure services:
Computing, storage, networking
Elastic capacity
Multiple availability zones
Process automation for science
Run experiment
Collect data
Move data
Check data
Annotate data
Share data
Find similar data
Link to literature
Analyze data
Publish data
Automate and outsource: the Discovery cloud
[Figure: data from a next-gen genome sequencer, a telescope, and simulation, with staging, ingest, analysis, community repository, archive, mirror, and registry]
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting
Globus research data management services
www.globus.org
Reliable, secure, high-performance file transfer and synchronization
• “Fire-and-forget” transfers
• Automatic fault recovery
• Seamless security integration
• Powerful GUI and APIs
1. User initiates transfer request (data source to data destination)
2. Globus moves and syncs files
3. Globus notifies user
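"Fire-and-forget" with automatic fault recovery amounts to retrying a transfer on the user's behalf until it succeeds. The sketch below shows that general pattern only. It does not use or represent the Globus API: `attempt_transfer` is a hypothetical callable that raises `OSError` on a transient fault, and the exponential backoff schedule is an assumption.

```python
import time


def run_with_retries(attempt_transfer, max_attempts=5, base_delay_s=1.0):
    """Call `attempt_transfer()` until it succeeds or attempts run out.

    Returns the number of attempts used. Re-raises the last OSError if
    every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_transfer()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            time.sleep(base_delay_s * 2 ** (attempt - 1))


if __name__ == "__main__":
    failures_left = [2]  # simulate two transient network faults

    def flaky_transfer():
        if failures_left[0] > 0:
            failures_left[0] -= 1
            raise OSError("Network failed. Retry.")

    print(run_with_retries(flaky_transfer, base_delay_s=0.0))
```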
Easily share large data with any user or group
No cloud storage required
1. User A selects file(s) to share, selects user or group, and sets permissions
2. Globus tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus and accesses shared file
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect Personal” install
• 5-minute Globus Connect Server install
High-speed transfers to/from AWS cloud,
via Globus transfer service
• UChicago to AWS S3 (US region): Sustained 2 Gbps
– 2 GridFTP servers, GPFS file system at UChicago
– Multi-part upload via 16 concurrent HTTP connections
• AWS to AWS (same region): Sustained 5 Gbps
go#s3
Globus transfer & sharing; identity & group
management, data discovery & publication
25,000 users, 75 PB and 3B files transferred, 8,000 endpoints
[Figure, built up over three slides: Globus endpoints; Globus Connect; Globus Toolkit; transfer service; sharing service; identity, group, profile management services; then publication and discovery; then Globus APIs]
The Globus Galaxies platform: Science as a service
Globus Galaxies platform: Tool and workflow execution, publication, discovery, sharing; identity management; data management; task scheduling
Infrastructure services: EC2, EBS, S3, SNS, Spot, Route 53, CloudFormation
Ematter materials science, FACE-IT, PDACS
Three big data challenges
Channel massive flows
Automate management
Build discovery engines
Discovery engines: Integrate simulation, experiment, and informatics
[Figure: problem specification; modeling and simulation; experimental design; high-throughput experiments; analysis & visualization (shown twice); informatics analysis tools; integrated databases]
metagenomics.anl.gov
A discovery engine for metagenomics
kbase.us
DOE Systems Biology Knowledge Base (KBase)
Source: Rick Stevens
A discovery engine for the study of disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Figure: sample; material composition (La 60%, Sr 40%); experimental scattering; simulated structure; simulated scattering]
• Detect errors (secs to mins)
• Select experiments (mins to hours)
• Simulations driven by experiments (mins to days)
• Contribute to knowledge base
Knowledge base: past experiments; simulations; literature; expert knowledge
Knowledge-driven decision making
Evolutionary optimization
Immediate assessment of alignment quality in near-field high-energy diffraction microscopy
[Figure: before and after views of the analysis workflow, with feedback to experiment]
• Detector produces dataset: 360 files, 4 GB total
• 1: Median calc, MedianImage.c, 75s (90% I/O); uses Swift/K
• 2: Peak search, ImageProcessing.c, 15s per file; uses Swift/K
• Reduced dataset: 360 files, 5 MB total
• 3: Convert bin L to N: 2 min for all files; converts files to network-endian format
• 3: Generate parameters, FOP.c, 50 tasks, 25s/task, ¼ CPU hours; uses Swift/K
• 4: Analysis pass, FitOrientation.c, 60s/task (PC), 1667 CPU hours; 60s/task (BG/Q), 1667 CPU hours; uses Swift/T
• Orthros (all data in NFS); Blue Gene/Q; GO Transfer; ssh
• Up to 2.2 M CPU hours per week!
• Globus Catalog: scientific metadata, workflow progress
• Workflow control: script, Bash, manual
This is a single workflow
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
Integrate data movement, management, workflow, and computation to accelerate data-driven applications
New data, computational capabilities, and methods create opportunities and challenges
Integrate statistics/machine learning to assess many models and calibrate them against “all” relevant data
New computer facilities enable on-demand computing and high-speed analysis of large quantities of data
Three big data challenges
Channel massive flows
– New protocols and management algorithms
Automate management
– The Discovery Cloud
Build discovery engines
– MG-RAST, KBase, Materials
U.S. Department of Energy

Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

Taming Big Data!

  • 1. Ian Foster Argonne National Laboratory and University of Chicago foster@anl.gov ianfoster.org Taming Big Data!
  • 4. Three big data challenges: Channel massive flows; Automate management; Build discovery engines
  • 5. Three big data challenges: Channel massive flows; Automate management; Build discovery engines
  • 6. Channel massive data flows. Data must move to be useful. We may optimize, but we can never entirely eliminate distance. • Sources: experimental facilities, sensors, computations • Sinks: analysis computers, display systems • Stores: impedance matchers & time shifters • Pipes: IO systems and networks connect other elements “We must think of data as a flowing river over time, not a static snapshot. Make copies, share, and do magic” – S. Madhavan
  • 7. Transfer is challenging at many levels. Speed and reliability: • GridFTP protocol • Globus implementation. Scheduling and modeling: • SEAL and STEAL algorithms • RAMSES project
  • 9. File transfer is an end-to-end problem. [Figure: source and destination data transfer nodes, each with Application, OS, FS Stack, HBA/HCA, TCP, IP, and NIC layers, connected through LAN switches and routers across a wide area network; the storage side shows a storage array and a Lustre file system (OSS, MDS, OST, MDT).] + diverse environments + diverse workloads + contention
  • 10. GridFTP protocol and implementations: Fast, reliable, secure 3rd-party data transfer. Extend legacy FTP protocol to enhance performance, reliability, security. Globus GridFTP provides a widely used open source implementation. Modular, pluggable architecture (different protocols, I/O interfaces). Many optimizations: e.g., concurrency, parallelism, pipelining. [Figure: Data Transfer Nodes at Site A and Site B, four GridFTP server processes, a parallel file system, and six TCP connections; Concurrency = 2, Parallelism = 3.]
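
The figure on slide 10 pairs Concurrency = 2 with Parallelism = 3 and shows six TCP connections. One way to read this, sketched below, is that concurrency counts the files in flight at once and parallelism counts the TCP streams used per file. The helper name `plan_connections`, the file names, and the round-robin assignment are illustrative assumptions, not part of Globus GridFTP.

```python
def plan_connections(files, concurrency, parallelism):
    """Hypothetical helper: spread files over concurrent server processes.

    Each of `concurrency` processes handles one file at a time and moves it
    over `parallelism` TCP streams, so the number of open TCP connections
    is concurrency * parallelism.
    """
    if concurrency < 1 or parallelism < 1:
        raise ValueError("concurrency and parallelism must be >= 1")
    # Round-robin assignment of files to processes (an assumption).
    assignment = {p: [] for p in range(concurrency)}
    for i, name in enumerate(files):
        assignment[i % concurrency].append(name)
    return concurrency * parallelism, assignment


total, plan = plan_connections(["a.h5", "b.h5", "c.h5"], concurrency=2, parallelism=3)
print(total)  # 6
print(plan)   # {0: ['a.h5', 'c.h5'], 1: ['b.h5']}
```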
  • 11. 85 Gbps sustained disk-to-disk over 100 Gbps network, Ottawa to New Orleans. Raj Kettimuthu and team, Argonne, Nov 2014
  • 12. Higgs discovery “only possible because of the extraordinary achievements of … grid computing” – Rolf Heuer, CERN DG. 10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks
  • 13. One Advanced Photon Source data node: 125 destinations
  • 17. Transfer scheduling and optimization • Science data traffic is extremely bursty • User experience can be improved by scheduling to minimize slowdown • Traffic can be categorized: interactive or batch • Increased concurrency tends to increase aggregate throughput, to a point. [Figures: Concurrency over 24 hours, Kettimuthu et al., 2015; Throughput vs. concurrency and parallelism, Kettimuthu et al., 2014]
  • 18. A load-aware, adaptive algorithm: (1) Data-driven model of throughput. Collect many <s, d, cs, cd, v, a> data, e.g., <EP1, EP3, 3, 3, 20GB, 29sec>. Estimate throughput(s, d, cs, cd, v). Adjust with estimate of external load. [Figure: endpoints EP1, EP2, EP3, EP4]
  • 19. A load-aware, adaptive algorithm: (2) Concurrency-constrained scheduling. Define transfer priority. Schedule transfers if neither source nor destination is saturated, using the model to decide concurrency. If source or destination is saturated, interrupt active transfer(s) to service waiting requests, if doing so can reduce overall average slowdown.
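
The two algorithm slides describe collecting <s, d, cs, cd, v, a> records and using the resulting throughput estimates to decide concurrency. The sketch below is a deliberately simplified illustration of that idea: it averages observed throughput per (s, d, cs, cd), ignores the volume term and the external-load adjustment mentioned on slide 18, and is not the authors' algorithm. The function names and the second history record are invented for the example.

```python
from collections import defaultdict


def build_model(records):
    """Average observed throughput (GB/s) per (s, d, cs, cd) key.

    Each record is (s, d, cs, cd, v, a): source, destination, source and
    destination concurrency, volume in GB, and elapsed time in seconds.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for s, d, cs, cd, v, a in records:
        entry = sums[(s, d, cs, cd)]
        entry[0] += v / a  # throughput of this transfer
        entry[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def best_concurrency(model, s, d):
    """Return the (cs, cd) pair with the highest estimated throughput, or None."""
    candidates = {
        (cs, cd): t for (s2, d2, cs, cd), t in model.items() if (s2, d2) == (s, d)
    }
    return max(candidates, key=candidates.get) if candidates else None


history = [
    ("EP1", "EP3", 3, 3, 20.0, 29.0),  # the example tuple from slide 18
    ("EP1", "EP3", 1, 1, 20.0, 80.0),  # invented data point for illustration
]
model = build_model(history)
print(round(model[("EP1", "EP3", 3, 3)], 2))  # 0.69
print(best_concurrency(model, "EP1", "EP3"))  # (3, 3)
```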
  • 22. Gagan Agarwal (1, *), Prasanna Balaprakash (2), Ian Foster (2, *), Raj Kettimuthu (2), Sven Leyffer (2), Vitali Morozov (2), Todd Munson (2), Nagi Rao (3, *), Saday Sadayappan (1), Brad Settlemyer (3), Brian Tierney (4, *), Don Towsley (5, *), Venkat Vishwanath (2), Yao Zhang (2). Affiliations: (1) Ohio State University; (2) Argonne National Laboratory; (3) Oak Ridge National Laboratory; (4) ESnet; (5) UMass Amherst. (* Co-PIs) Advanced Scientific Computing Research. Program manager: Rich Carlson
  • 23. How to create more accurate, useful, and portable models of distributed systems? Simple analytical model: $T = \alpha + \beta l$ [startup cost + sustained bandwidth]. Experiment + regression to estimate $\alpha$, $\beta$. First-principles modeling to better capture details of system & application components. Data-driven modeling to learn unknown details of system & application components. Model composition. Model, data comparison.
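
The "experiment + regression" step on slide 23 can be illustrated with an ordinary least-squares fit of T = α + β·l. This is a minimal sketch, assuming l is the amount of data moved (so β is time per unit of data and α is the startup cost). The function name and the measurements are made up for the example.

```python
def fit_alpha_beta(lengths, times):
    """Ordinary least-squares fit of T = alpha + beta * l."""
    n = len(lengths)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (l, T) pairs of equal length")
    mean_l = sum(lengths) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in lengths)
    if sxx == 0:
        raise ValueError("all lengths are identical; beta is undetermined")
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(lengths, times))
    beta = sxy / sxx
    alpha = mean_t - beta * mean_l
    return alpha, beta


# Invented measurements generated from T = 0.5 + 0.1 * l (GB, seconds).
lengths = [1.0, 2.0, 4.0, 8.0]
times = [0.6, 0.7, 0.9, 1.3]
alpha, beta = fit_alpha_beta(lengths, times)
print(round(alpha, 3), round(beta, 3))  # 0.5 0.1
```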
  • 24. Differential regression for combining data from different sources. Example of use: predict performance on connection length L not realizable on physical infrastructure, e.g., IB-RDMA or HTCP throughput on 900-mile connection. 1) Make multiple measurements of performance on path lengths $d$: $M_S(d)$: OPNET simulation; $M_E(d)$: ANUE-emulated path; $M_U(d_i)$: real network (USN). 2) Compute measurement regressions on $d$: $\dot{M}_A(\cdot)$, $A \in \{S, E, U\}$. 3) Compute differential regressions: $\Delta\dot{M}_{A,B}(\cdot) = \dot{M}_A(\cdot) - \dot{M}_B(\cdot)$, $A, B \in \{S, E, U\}$. 4) Apply differential regression to obtain estimates, $C \in \{S, E\}$: $\mathcal{M}_U(d) = M_C(d) - \Delta\dot{M}_{C,U}(d)$, where $M_C(d)$ is the simulated/emulated measurement and $\Delta\dot{M}_{C,U}(d)$ is the point regression estimate.
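
The four steps on slide 24 can be traced in a few lines of code. The sketch below assumes straight-line regressions, which the slide does not specify, and uses invented throughput numbers. It estimates real-network performance at 900 miles from a simulated measurement at that distance, corrected by the difference between the simulation and real-network regressions.

```python
def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns a callable d -> value."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda d: intercept + slope * d


def estimate_real(d, measured_c, fit_c, fit_u):
    """Step 4: M_U(d) is estimated as M_C(d) minus the differential regression."""
    delta = fit_c(d) - fit_u(d)  # step 3: differential regression at d
    return measured_c - delta


# Step 1: invented measurements (Gbps) at path lengths in miles.
sim_d, sim_m = [100, 300, 500, 900], [9.6, 8.8, 8.0, 6.4]  # simulation, M_S
real_d, real_m = [100, 300, 500], [8.5, 7.5, 6.5]          # real network, M_U

# Step 2: measurement regressions on d.
fit_s = linear_fit(sim_d, sim_m)
fit_u = linear_fit(real_d, real_m)

# Steps 3 and 4: estimate the real network at 900 miles, which was not measured.
estimate = estimate_real(900, measured_c=6.4, fit_c=fit_s, fit_u=fit_u)
print(round(estimate, 3))  # 4.5
```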
  • 25. End-to-end profile composition. Source LAN profile (configuration for host and edge devices), WAN profile (configuration for WAN devices), and destination LAN profile (configuration for host and edge devices), combined through composition operations.
  • 26. Three big data challenges: Channel massive flows; Automate management; Build discovery engines
  • 27. It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA … but in reality it’s often very challenging. [Figure: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, and Mirror, with errors “Quota exceeded!”, “Expired credentials!”, “Network failed. Retry.”, “Permission denied!”]
  • 28. One researcher’s perspective on data management challenges
  • 30. Tripit exemplifies process automation Me Book flights Book hotel Record flights Suggest hotel Record hotel Get weather Prepare maps Share info Monitor prices Monitor flight Other services
  • 31. How the “business cloud” works Platform services Database, analytics, application, deployment, workflow, queuing Auto-scaling, Domain Name Service, content distribution Elastic MapReduce, streaming data analytics Email, messaging, transcoding. Many more. Infrastructure services Computing, storage, networking Elastic capacity Multiple availability zones
  • 32. Process automation for science Run experiment Collect data Move data Check data Annotate data Share data Find similar data Link to literature Analyze data Publish data Automate and outsource: the Discovery cloud
  • 33. Analysis Staging Ingest Community Repository Archive Mirror Registry Next-gen genome sequencer Telescope In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting Globus research data management services www.globus.org Simulation
  • 34. Reliable, secure, high-performance file transfer and synchronization “Fire-and-forget” transfers Automatic fault recovery Seamless security integration Powerful GUI and APIs Data Source Data Destination User initiates transfer request 1 Globus moves and syncs files 2 Globus notifies user 3
  • 35. Data Source User A selects file(s) to share, selects user or group, and sets permissions 1 Globus tracks shared files; no need to move files to cloud storage! 2 User B logs in to Globus and accesses shared file 3 Easily share large data with any user or group No cloud storage required
  • 36. Extreme ease of use • InCommon, OAuth, OpenID, X.509, … • Credential management • Group definition and management • Transfer management and optimization • Reliability via transfer retries • Web interface, REST API, command line • One-click “Globus Connect Personal” install • 5-minute Globus Connect Server install
  • 39. High-speed transfers to/from AWS cloud, via Globus transfer service • UChicago–AWS S3 (US region): Sustained 2 Gbps – 2 GridFTP servers, GPFS file system at UChicago – Multi-part upload via 16 concurrent HTTP connections • AWS–AWS (same region): Sustained 5 Gbps. go#s3
  • 40. Globus transfer & sharing; identity & group management, data discovery & publication 25,000 users, 75 PB and 3B files transferred, 8,000 endpoints Globus endpoints
  • 41. Identity, group, profile management services … Sharing service Transfer service Globus Toolkit GlobusConnect X
  • 42. Identity, group, profile management services Sharing service Transfer service Globus Toolkit GlobusConnect Publication and discovery X
  • 44. Identity, group, profile management services Sharing service Transfer service Globus Toolkit GlobusAPIs GlobusConnect Publication and discovery X
  • 45. The Globus Galaxies platform: Science as a service. Globus Galaxies platform: tool and workflow execution, publication, discovery, sharing; identity management; data management; task scheduling. Infrastructure services: EC2, EBS, S3, SNS, Spot, Route 53, Cloud Formation. Ematter materials science; FACE-IT; PDACS
  • 46. Three big data challenges: Channel massive flows; Automate management; Build discovery engines
  • 47. Discovery engines: Integrate simulation, experiment, and informatics Informatics Analysis Tools High-throughput Experiments Problem Specification Modeling and Simulation Analysis & Visualization Experimental Design Analysis & Visualization Integrated Databases
  • 50. DOE Systems Biology Knowledge Base (KBase) Source: Rick Stevens
  • 52. A discovery engine for the study of disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering. Detect errors (secs–mins). Knowledge base: past experiments; simulations; literature; expert knowledge. Select experiments (mins–hours). Contribute to knowledge base. Simulations driven by experiments (mins–days). Knowledge-driven decision making. Evolutionary optimization.
  • 53. Immediate assessment of alignment quality in near-field high-energy diffraction microscopy. This is a single workflow. Workflow steps: Detector; Dataset (360 files, 4 GB total); 1: Median calc, 75s (90% I/O), MedianImage.c, uses Swift/K; 2: Peak search, 15s per file, ImageProcessing.c, uses Swift/K; Reduced dataset (360 files, 5 MB total); 3: Convert bin L to N, 2 min for all files (convert files to network-endian format); 3: Generate parameters, FOP.c, 50 tasks, 25s/task, ¼ CPU hours, uses Swift/K; 4: Analysis pass, FitOrientation.c, 60s/task (PC), 1667 CPU hours; 60s/task (BG/Q), 1667 CPU hours; uses Swift/T; feedback to experiment. Other figure labels: Blue Gene/Q; Orthros (all data in NFS); GO Transfer; ssh; Up to 2.2 M CPU hours per week!; Globus Catalog (scientific metadata, workflow progress); Workflow control (script, Bash, manual). Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
  • 54. Integrate data movement, management, workflow, and computation to accelerate data-driven applications. New data, computational capabilities, and methods create opportunities and challenges. Integrate statistics/machine learning to assess many models and calibrate them against 'all' relevant data. New computer facilities enable on-demand computing and high-speed analysis of large quantities of data.
  • 55. Three big data challenges: Channel massive flows – New protocols and management algorithms; Automate management – The Discovery Cloud; Build discovery engines – MG-RAST, KBase, Materials
  • 56. U.S. Department of Energy