My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate science, thousands of researchers work daily within virtual computing systems with global scope. But we now face a far greater challenge: exploding data volumes and powerful simulation tools mean that many more researchers (ultimately, perhaps most) will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
4. Three big data challenges
Channel massive flows
Automate management
Build discovery engines
6. Channel massive data flows
Data must move to be useful. We may optimize, but we can never entirely eliminate distance.
• Sources: experimental facilities, sensors, computations
• Sinks: analysis computers, display systems
• Stores: impedance matchers and time shifters
• Pipes: I/O systems and networks that connect the other elements
“We must think of data as a flowing river over time, not a static snapshot. Make copies, share, and do magic” – S. Madhavan
7. Transfer is challenging at many levels
Speed and reliability
• GridFTP protocol
• Globus implementation
Scheduling and modeling
• SEAL and STEAL algorithms
• RAMSES project
10. GridFTP protocol and implementations:
Fast, reliable, secure 3rd-party data transfer
Extends the legacy FTP protocol to enhance performance, reliability, and security.
Globus GridFTP provides a widely used open source implementation.
Modular, pluggable architecture (different protocols, I/O interfaces).
Many optimizations: e.g., concurrency, parallelism, pipelining.
[Diagram: Data Transfer Nodes at Site A and Site B, each backed by a parallel file system. Concurrency = 2: two GridFTP server processes per node. Parallelism = 3: three TCP connections per server process, six in total.]
11. 85 Gbps sustained disk-to-disk over 100 Gbps network, Ottawa to New Orleans
Raj Kettimuthu and team, Argonne, Nov 2014
12. Higgs discovery “only possible because of the extraordinary achievements of … grid computing” – Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, billions of tasks
17. Transfer scheduling and optimization
• Science data traffic is extremely bursty
• User experience can be improved by scheduling to minimize slowdown (defined in the sketch below)
• Traffic can be categorized as interactive or batch
• Increased concurrency tends to increase aggregate throughput, to a point
[Figures: concurrency over 24 hours (Kettimuthu et al., 2015); throughput vs. concurrency and parallelism (Kettimuthu et al., 2014)]
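The slides do not define slowdown; here is a minimal sketch assuming the standard definition from the scheduling literature (time in system relative to unloaded service time):

```python
def slowdown(wait_time: float, service_time: float) -> float:
    """Slowdown of a transfer: total time in the system relative to the
    time the transfer would take on an unloaded system.

    A slowdown of 1.0 means no queueing delay; large values mean a
    short transfer waited behind long ones.
    """
    return (wait_time + service_time) / service_time

# Example: a 10-second transfer that waited 30 seconds has slowdown 4.0.
assert slowdown(30.0, 10.0) == 4.0
```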
18. A load-aware, adaptive algorithm:
(1) Data-driven model of throughput
[Diagram: transfers among endpoints EP1, EP2, EP3, EP4]
Collect many ⟨s, d, cs, cd, v, a⟩ records, i.e., source, destination, concurrency at source, concurrency at destination, volume, and actual duration. E.g., ⟨EP1, EP3, 3, 3, 20 GB, 29 sec⟩.
Estimate throughput(s, d, cs, cd, v) from these records.
Adjust with an estimate of external load. (A sketch of such a model follows.)
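A minimal sketch of such a data-driven model. The normalization by concurrency and the diminishing-returns cap are my assumptions for illustration; the production models behind this work are more sophisticated:

```python
from dataclasses import dataclass

@dataclass
class TransferRecord:
    src: str          # source endpoint (s)
    dst: str          # destination endpoint (d)
    cs: int           # concurrency at source
    cd: int           # concurrency at destination
    volume: float     # bytes transferred (v)
    duration: float   # observed seconds (a)

def fit_throughput_model(records):
    """Fit per-(src, dst) average throughput per unit of concurrency.

    Toy model: throughput scales with min(cs, cd) up to a cap, since
    the slides note that concurrency helps only 'to a point'.
    """
    by_pair = {}
    for r in records:
        tput = r.volume / r.duration              # observed bytes/sec
        per_conn = tput / min(r.cs, r.cd)         # normalize by concurrency
        by_pair.setdefault((r.src, r.dst), []).append(per_conn)
    return {k: sum(v) / len(v) for k, v in by_pair.items()}

def estimate_throughput(model, src, dst, cs, cd, cap=8):
    """Predict throughput for a proposed transfer; concurrency beyond
    `cap` connections is assumed to add nothing (diminishing returns)."""
    return model[(src, dst)] * min(cs, cd, cap)

records = [TransferRecord("EP1", "EP3", 3, 3, 20e9, 29.0)]  # slide's example
model = fit_throughput_model(records)
print(estimate_throughput(model, "EP1", "EP3", 3, 3))  # ~690 MB/s
```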
19. A load-aware, adaptive algorithm:
(2) Concurrency-constrained scheduling
Define transfer priority. Schedule transfers when neither source nor destination is saturated, using the model to choose concurrency. If the source or destination is saturated, interrupt active transfer(s) to service waiting requests when doing so can reduce the overall average slowdown. (A sketch of this decision rule follows.)
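A minimal sketch of the decision rule, under my own simplifying assumptions (a single per-endpoint saturation threshold, and preemption only when it lowers predicted mean slowdown); the published algorithms are more involved:

```python
from dataclasses import dataclass

@dataclass
class Transfer:
    src: str
    dst: str
    priority: float = 0.0

def schedule_step(waiting, active, load, limit, mean_slowdown):
    """One scheduling pass over the waiting queue.

    waiting, active: lists of Transfer
    load: dict endpoint -> number of active transfers touching it
    limit: per-endpoint saturation threshold (assumed a single scalar)
    mean_slowdown: callable(list of Transfer) -> predicted mean slowdown
    """
    for req in sorted(waiting, key=lambda t: t.priority, reverse=True):
        if load.get(req.src, 0) < limit and load.get(req.dst, 0) < limit:
            # Neither endpoint saturated: start the transfer. A throughput
            # model (previous slide) would choose its concurrency level.
            start(req, waiting, active, load)
        else:
            # Saturated: preempt an active transfer only if the swap
            # reduces the predicted mean slowdown.
            for victim in list(active):
                candidate = [t for t in active if t is not victim] + [req]
                if mean_slowdown(candidate) < mean_slowdown(active):
                    stop(victim, waiting, active, load)  # resume it later
                    start(req, waiting, active, load)
                    break

def start(t, waiting, active, load):
    waiting.remove(t)
    active.append(t)
    load[t.src] = load.get(t.src, 0) + 1
    load[t.dst] = load.get(t.dst, 0) + 1

def stop(t, waiting, active, load):
    active.remove(t)
    waiting.append(t)
    load[t.src] -= 1
    load[t.dst] -= 1
```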
22. RAMSES project team: Gagan Agarwal (1*), Prasanna Balaprakash (2), Ian Foster (2*), Raj Kettimuthu (2), Sven Leyffer (2), Vitali Morozov (2), Todd Munson (2), Nagi Rao (3*), Saday Sadayappan (1), Brad Settlemyer (3), Brian Tierney (4*), Don Towsley (5*), Venkat Vishwanath (2), Yao Zhang (2)
1 Ohio State University, 2 Argonne National Laboratory, 3 Oak Ridge National Laboratory, 4 ESnet, 5 UMass Amherst (* Co-PIs)
Advanced Scientific Computing Research. Program manager: Rich Carlson
23. How to create more accurate, useful, and portable models of distributed systems?
Simple analytical model: T = α + β·l, where α is the startup cost and β the reciprocal of sustained bandwidth, for transfer length l. Experiment plus regression estimate α and β. (A fitting sketch follows.)
First-principles modeling to better capture details of system and application components.
Data-driven modeling to learn unknown details of system and application components.
The two are connected by model composition and by comparison of models with data.
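A minimal fitting sketch for the analytical model; the measurements below are synthetic, illustrative numbers, not from the talk:

```python
import numpy as np

# Illustrative measurements: transfer length l (bytes) vs. time T (s).
l = np.array([1e6, 1e7, 1e8, 1e9])
T = np.array([0.52, 0.61, 1.35, 8.9])

# Least-squares fit of T = alpha + beta * l.
beta, alpha = np.polyfit(l, T, 1)

print(f"startup cost alpha ~ {alpha:.3f} s")
print(f"sustained bandwidth ~ {1 / beta / 1e6:.0f} MB/s")
```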
24. Differential regression for combining data from different sources
Example of use: predict performance on a connection length L not realizable on physical infrastructure, e.g., IB-RDMA or HTCP throughput on a 900-mile connection.
1) Make multiple measurements of performance on path lengths d:
– M_S(d): OPNET simulation
– M_E(d): ANUE-emulated path
– M_U(d_i): real network (USN)
2) Compute measurement regressions on d: Ṁ_A(·), A ∈ {S, E, U}
3) Compute differential regressions: ΔṀ_{A,B}(·) = Ṁ_A(·) − Ṁ_B(·), A, B ∈ {S, E, U}
4) Apply a differential regression to a simulated/emulated point measurement M_C(d), C ∈ {S, E}, to obtain the regression estimate for the real network:
M̂_U(d) = M_C(d) − ΔṀ_{C,U}(d)
(A numerical sketch follows.)
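A minimal numerical sketch of steps 2 through 4 with synthetic data; all numbers and the choice of linear regression are my assumptions:

```python
import numpy as np

# Path lengths (miles) at which each facility was measured (synthetic).
d_sim = np.array([100, 300, 500, 700, 900])   # simulation (S)
d_real = np.array([100, 250, 400])            # real network (U), limited reach

# Synthetic throughput measurements (Gb/s): simulation is optimistic.
m_sim = 10.0 - 0.004 * d_sim
m_real = 9.0 - 0.005 * d_real

# Step 2: regressions M_dot_A(d) for each source.
fit_sim = np.poly1d(np.polyfit(d_sim, m_sim, 1))
fit_real = np.poly1d(np.polyfit(d_real, m_real, 1))

# Step 3: differential regression Delta_{S,U}(d) = M_dot_S(d) - M_dot_U(d).
delta = lambda d: fit_sim(d) - fit_real(d)

# Step 4: estimate real-network throughput at d = 900 miles (beyond the
# real testbed's reach) from the simulated point measurement there.
d_target = 900
m_sim_point = 10.0 - 0.004 * d_target         # "measured" in simulation
m_real_est = m_sim_point - delta(d_target)
print(f"estimated real throughput at {d_target} miles: {m_real_est:.2f} Gb/s")
```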
30. TripIt exemplifies process automation
[Diagram: around “me,” TripIt and other services automate travel. I book flights and a hotel; the services record the flights, suggest and record the hotel, get weather, prepare maps, share info, monitor prices, and monitor the flight.]
31. How the “business cloud” works
Platform services: database, analytics, application, deployment, workflow, queuing; auto-scaling, Domain Name Service, content distribution; Elastic MapReduce, streaming data analytics; email, messaging, transcoding; many more.
Infrastructure services: computing, storage, networking; elastic capacity; multiple availability zones.
32. Process automation for science
Run experiment → Collect data → Move data → Check data → Annotate data → Share data → Find similar data → Link to literature → Analyze data → Publish data
Automate and outsource: the Discovery Cloud
34. Reliable, secure, high-performance file transfer and synchronization
• “Fire-and-forget” transfers
• Automatic fault recovery
• Seamless security integration
• Powerful GUI and APIs
[Diagram: (1) user initiates a transfer request; (2) Globus moves and syncs files from data source to data destination; (3) Globus notifies the user.]
(A sketch using the Globus Python SDK follows.)
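A minimal sketch of submitting such a transfer with the present-day Globus Python SDK (globus_sdk). The endpoint UUIDs, paths, and token handling are placeholders, and the SDK itself postdates parts of this talk:

```python
import globus_sdk

# Placeholders: supply your own endpoint UUIDs and an access token
# obtained through a Globus OAuth2 flow.
SRC_ENDPOINT = "<source-endpoint-uuid>"
DST_ENDPOINT = "<destination-endpoint-uuid>"
TRANSFER_TOKEN = "<transfer-access-token>"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

# "Fire-and-forget": once submitted, Globus handles retries and fault
# recovery, and notifies the user on completion.
tdata = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT,
    label="example transfer",
    sync_level="checksum")        # sync: only move files that differ
tdata.add_item("/data/run42/", "/ingest/run42/", recursive=True)

task = tc.submit_transfer(tdata)
print("submitted task:", task["task_id"])
```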
35. Easily share large data with any user or group; no cloud storage required
[Diagram: (1) User A selects file(s) to share, selects a user or group, and sets permissions; (2) Globus tracks shared files, with no need to move files to cloud storage; (3) User B logs in to Globus and accesses the shared file.]
(A sketch of setting such a permission via the SDK follows.)
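Sharing is expressed as access rules on a shared endpoint. A minimal sketch with globus_sdk, with placeholder IDs (my example, not from the slides):

```python
import globus_sdk

SHARED_ENDPOINT = "<shared-endpoint-uuid>"   # placeholder
USER_B_IDENTITY = "<user-b-identity-uuid>"   # placeholder

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("<transfer-access-token>"))

# Grant User B read access to one folder; the data never leaves the
# source system, Globus only records the permission.
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": USER_B_IDENTITY,
    "path": "/shared/run42/",
    "permissions": "r",
}
tc.add_endpoint_acl_rule(SHARED_ENDPOINT, rule)
```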
36. Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect Personal” install
• 5-minute Globus Connect Server install
52. A discovery engine for the study of disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Diagram: a closed loop. A sample of given material composition (e.g., La 60%, Sr 40%) yields experimental scattering; a simulated structure for that composition yields simulated scattering. Comparing the two detects errors (seconds to minutes) and, via knowledge-driven decision making and evolutionary optimization, selects experiments (minutes to hours); simulations driven by experiments (minutes to days) contribute to a knowledge base of past experiments, simulations, literature, and expert knowledge.]
(A toy sketch of the optimization loop follows.)
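A toy sketch of the evolutionary loop; everything here, from the mutation scheme to the stand-in scoring function, is an illustrative assumption (a real system would simulate scattering and compare it with experiment):

```python
import random

def score(structure, experimental):
    """Stand-in objective: squared error between a candidate's predicted
    scattering and the experimental scattering. Lower is better."""
    return sum((s - e) ** 2 for s, e in zip(structure, experimental))

def mutate(structure, scale=0.1):
    """Perturb candidate structure parameters with Gaussian noise."""
    return [x + random.gauss(0, scale) for x in structure]

def evolve(experimental, pop_size=20, generations=50, n_params=8):
    # Initial population of candidate structure parameterizations.
    population = [[random.random() for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: score(s, experimental))
        survivors = population[: pop_size // 2]                   # select
        population = survivors + [mutate(s) for s in survivors]   # vary
    return min(population, key=lambda s: score(s, experimental))

target = [0.6, 0.4, 0.1, 0.9, 0.3, 0.7, 0.2, 0.5]  # pretend experiment
best = evolve(target)
print("best candidate:", [round(x, 2) for x in best])
```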
53. Immediate assessment of alignment quality in near-field high-energy diffraction microscopy
[Workflow diagram: a single workflow spanning the detector, the Orthros cluster (all data in NFS), and a Blue Gene/Q, consuming up to 2.2 M CPU hours per week. A Bash script controls the workflow, with some manual steps over ssh; Globus Catalog holds scientific metadata and workflow progress; GO (Globus) Transfer moves data to the Blue Gene/Q. Before/after images show the alignment improvement.]
• Input: detector dataset, 360 files, 4 GB total
• 1: Median calc (MedianImage.c, Swift/K): 75 s, 90% I/O
• 2: Peak search (ImageProcessing.c, Swift/K): 15 s per file; yields a reduced dataset of 360 files, 5 MB total
• 3: Convert bin L to N: 2 min for all files, converting files to network-endian format
• 3: Generate parameters (FOP.c, Swift/K): 50 tasks, 25 s/task, ¼ CPU hour
• 4: Analysis pass (FitOrientation.c, Swift/T): 60 s/task on PC or BG/Q, 1,667 CPU hours; feedback to experiment
(A rough Python analogue of the fan-out pattern follows.)
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
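The original workflow is expressed in Swift/K and Swift/T; the following rough Python analogue, with synthetic stand-in data and logic, only illustrates the shape of the first two stages (whole-dataset reduction, then per-file fan-out):

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import median

# Placeholder data: 360 "frames" of synthetic pixel values standing in
# for the real 4 GB detector dataset.
FRAMES = [[(i * 7 + j) % 255 for j in range(1000)] for i in range(360)]

def peak_search(args):
    """Stand-in for the ImageProcessing.c step: keep pixels that rise
    well above the background level."""
    frame, background = args
    return [p for p in frame if p > background + 50]

def main():
    # Stage 1: median calculation across frames (MedianImage.c analogue).
    background = median(p for frame in FRAMES for p in frame)

    # Stage 2: per-frame peak search, fanned out across cores; Swift/K
    # expresses the same map pattern declaratively.
    with ProcessPoolExecutor() as pool:
        reduced = list(pool.map(peak_search,
                                [(f, background) for f in FRAMES]))
    print(f"reduced dataset: {sum(len(r) for r in reduced)} peaks")

if __name__ == "__main__":
    main()
```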
54. Integrate data movement, management, workflow, and computation to accelerate data-driven applications
New data, computational capabilities, and methods create opportunities and challenges.
Integrate statistics/machine learning to assess many models and calibrate them against “all” relevant data.
New computer facilities enable on-demand computing and high-speed analysis of large quantities of data.
55. Three big data challenges
Channel massive flows
– New protocols and management algorithms
Automate management
– The Discovery Cloud
Build discovery engines
– MG-RAST, kBase, Materials