2. Conclusions
In 2019, we’re all “cool with the cloud”
Premature optimization is still terrible
Make it work, make it fast, make it cheap
Experimentation and engineering are very
different practices
Great policy makes great systems
This continues to be an amazing time to be an
infrastructure / data nerd in health care / life
science
6. Geek Cred: My First Petabyte, 2008
My first Petabyte: NASA, 2008
10. Genomic Data Production in Context
Data Explosion
I did research computing at
Broad from 2014 - 2017
11. Geek Cred: My First Exabyte, 2014
Note that this exabyte is
empty. Broad’s data is
nowhere near Exascale
12. Cloud Definitions
Public cloud: AWS, Azure, GCS, plus a bunch of wannabes
Private cloud: Cloud services on gear you own, which may
be hosted at a nice data center somewhere
Fog computing: On premises equipment used for cloud
stuff. It’s fog because that’s a cloud that’s close to earth.
Get it?
Hybrid cloud: Bursting to a public cloud for extra capacity.
Multi cloud: Azure for business, AWS for burst / scalability,
Google for that one weird trick.
Enterprise cloud: IT trying desperately to align with a cloud
strategy by changing the labels on the Powerpoint.
“On premises,” or “legacy,” carrot
cake still has a place, even in homes
with a cake-as-a-service strategy.
Hype-o-meter Impact-o-meter
14. Comparing the Big Three
AWS
Uncontested heavyweight champion in terms of scale, maturity of services, and adoption.
Services based on the market.
Default offerings may not be a good fit for odd-shaped research computing problems.
Market dominance means little incentive to provide discounts or customization.
15. Comparing the Big Three
Google
Focused on value-add platforms.
Enthusiastic partner and sponsor in areas of interest to $GOOG.
Potential conflicts of interest in areas of interest to $GOOG.
Like something out of Greek mythology, consumes ecosystem partners whole.
16. Comparing the Big Three
Microsoft (Azure)
Your CIO already has a regular meeting with the Microsoft enterprise sales rep.
Microsoft is already a qualified vendor in your purchasing systems.
Decades of experience with regulatory compliance and governance.
Already provides your identity, authorization, and (probably) office productivity.
Strategic purchases in HPC / ML / AI.
18. Specific Advice on The Big Three
Public cloud is an agility play, not a cost play.
AWS, GCS, and Azure have very similar capabilities and
pricing, even at scale.
Pick one and get good at it.
Don’t be afraid of running experiments.
Avoid second-tier cloud providers unless there is an
unambiguous business or capability reason to use
them.
Track spending, even when it’s “free.”
20. The Cloud Is a Big Place
Global IaaS Providers
Domain Specific PaaS
Your CIO is not thinking of
HPC or research computing
when articulating their
cloud strategy.
21. The Cloud Is a Big Place
Global IaaS Providers
Analytics Framework
Domain Specific PaaS
Analysis platforms deserve
their own slide deck.
22. Metaphor: Pizza as a Service
Homemade: On-Premises (legacy!)
Take and Bake: Infrastructure as a Service (IaaS)
Delivery: Platform as a Service (PaaS)
Restaurant: Software as a Service (SaaS)
Each column lists the same stack: Cheese, Tomato Sauce, Pizza Dough, Fire, Oven, Electricity / Gas, Drinks, Table
Legend: You Manage / Vendor Manages
Credit: Everybody on the Internet.
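The metaphor can be written down as a small responsibility matrix. What follows is a minimal Python sketch, not the slide's own content: the split between "you manage" and "vendor manages" in each column follows the commonly circulated version of this metaphor and is an assumption here, because the slide's color coding did not survive in this transcript. The names `STACK`, `VENDOR_MANAGES`, and `you_manage` are illustrative.

```python
# Layers of the "pizza stack", as listed on the slide.
STACK = [
    "Cheese", "Tomato Sauce", "Pizza Dough", "Fire",
    "Oven", "Electricity / Gas", "Drinks", "Table",
]

# ASSUMPTION: which layers the vendor manages in each model. This follows the
# widely shared version of the metaphor, not anything recoverable from the slide.
VENDOR_MANAGES = {
    "Homemade (On-Premises)": set(),
    "Take and Bake (IaaS)": {"Cheese", "Tomato Sauce", "Pizza Dough"},
    "Delivery (PaaS)": set(STACK) - {"Drinks", "Table"},
    "Restaurant (SaaS)": set(STACK),
}


def you_manage(model: str) -> list[str]:
    """Return the layers left for you to manage, in stack order."""
    vendor = VENDOR_MANAGES[model]
    return [layer for layer in STACK if layer not in vendor]


if __name__ == "__main__":
    for model in VENDOR_MANAGES:
        print(f"{model}: you manage {you_manage(model) or 'nothing'}")
```

Reading the matrix left to right shows the trade being made: each step toward SaaS hands another layer to the vendor and removes it from your list of things to operate.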
24. The Cloud Is a Big Place
Broad Firecloud
Data Platform
Global IaaS Providers
Analytics Framework
Domain Specific PaaS
Data platforms are where
it’s at right now.
25. One common thread: “Why Not Do Both?”
UC Health System Data Warehouse
• Shared data warehouse
• AND local instances at hospitals
NIH:
• World class dedicated HPC / networks
• AND negotiated discounts with public
cloud providers
GenePattern Networks:
• Free autoscaling environment on AWS
• AND support workstation / local HPC
26. The Policies you Need
Appropriate usage
Human readable: Expectations of privacy and standards of
behavior.
Data Classification
Governance: Defines the major categories of data (corporate
sensitive, clinical, …) and standards for handling of each.
Written Information Security Policy (WISP)
Technical: Defines how systems will be configured to protect
sensitive data and operations.
Vendor Qualification
Business SOP to establish practices around vendor access and
management.
Real world policy impact: Because bicycle lanes are “traffic lanes,”
the argument about snow plowing is simple, which saves lives.
27. Practical advice on Cloud Systems
Make it work
– Use dedicated instances (full price) until you’re sure the software works
– Don’t overthink it: Increase RAM and local disk to overcome crashing
– Tear down / rebuild the entire infrastructure from time to time, even in dev.
– All systems (yes, even cloud systems) have limits. Stop whining and learn them.
– Any time you increase throughput by an order of magnitude, your system will break.
Then make it fast
– Profiling tools are your friend, automation is not.
– Benchmark on real data. Imputed and synthetic data just echo your own assumptions back to you.
Then make it cheap
– Now you get to turn on spot instances.
– This is the first time I ever want to hear about Glacier or Infrequently Accessed tiers of data
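For the "make it fast" step, here is a minimal sketch of profiling before optimizing, using Python's standard-library `cProfile` and `pstats`. The workload `slow_sum_of_squares` and the helper `profile_call` are hypothetical stand-ins; in practice the profiled function would be a real analysis step run on real data, as the slide advises.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Hypothetical stand-in for a real analysis step."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    result, report = profile_call(slow_sum_of_squares, 100_000)
    print(result)
    print(report)
```

The report names the functions where time is actually spent, which is the evidence to collect before changing instance types or rewriting code.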
28. Practice does not make perfect.
Practice makes permanent.
Attributed to Yo-Yo Ma
Engineering is different than experimentation
Diagram labels: Application Repo, Infrastructure Repo, Build, Test, Production
• Development can rely on production
• Production cannot rely on development
• Reference datasets are a prod resource.
• No manual intervention in either test or prod.
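The dependency rules above lend themselves to an automated check. This is a minimal sketch, assuming each component is tagged with an environment and lists what it depends on; the `Component` type, the `find_violations` helper, and the component names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    env: str  # "dev", "test", or "prod"
    depends_on: tuple[str, ...] = ()


def find_violations(components: list[Component]) -> list[tuple[str, str]]:
    """Return (prod_component, dependency) pairs where prod relies on non-prod.

    A dependency that is not registered at all is also reported, since it
    cannot be shown to be a production resource.
    """
    env_of = {c.name: c.env for c in components}
    violations = []
    for component in components:
        if component.env != "prod":
            continue  # development may rely on anything, including production
        for dep in component.depends_on:
            if env_of.get(dep) != "prod":
                violations.append((component.name, dep))
    return violations


if __name__ == "__main__":
    components = [
        Component("reference-genome", "prod"),
        Component("variant-pipeline", "prod", ("reference-genome", "scratch-notebook")),
        Component("scratch-notebook", "dev", ("reference-genome",)),
    ]
    print(find_violations(components))
```

In this example the reference dataset is tagged as a production resource, so the development notebook may read it, while the production pipeline's reliance on the notebook is flagged.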
30. Many Experiments, Few Projects
Feasibility: INBOX, Active
Development: INBOX, Active
Operations: INBOX, Active
“When there is too much to do, there is a strong tendency to engage in local reprioritization, meaning that
each person in the process looks at the pile she is facing, determines which items are the most important, and
then works on those tasks first.
Local reprioritization creates variability. If a task happens to be prioritized by everyone, it gets done quickly.
But that means another task has been moved to the bottom of several “to do” lists and it might take weeks or
months to get done.”
No ability to predict turnaround times.
31. FAIR Data (within the enterprise)
Findable
• NoSQL database of metadata and checksums
• It’s plenty for a good long time.
Accessible
• Federated identity management
• Architecture of S3 buckets and production “roles”
Interoperable
• Data standards, ontologies, strong policy framework,
including electronic consents for human subjects data
Reusable
• “It’s much easier to go FAR than to go FAIR”
Catered
Lunch
Sense of well-being and
contentment arising from
realistic expectations
Data Lake
Open Bar
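The "Findable" bullet on this slide can start very small. Below is a minimal sketch of a metadata-and-checksum catalog, using an in-memory dict as a stand-in for the NoSQL database. The record fields and the function names `sha256_of` and `register` are assumptions for illustration, not the author's schema.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(catalog: dict, path: Path, **metadata) -> str:
    """Add one record keyed by checksum and return the checksum."""
    checksum = sha256_of(path)
    catalog[checksum] = {
        "path": str(path),
        "size_bytes": path.stat().st_size,
        **metadata,  # free-form fields, e.g. project or data classification
    }
    return checksum


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_bytes(b"ACGT\n")
        catalog = {}
        key = register(catalog, sample, project="demo", data_class="public")
        print(key, catalog[key])
```

Keying records by checksum means the same file registered twice lands on the same record, and any copy can later be verified against the catalog.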
32. The Clinical Data Ecosystem
There is an incredible wealth of data available to support both clinical care and research.
Unfortunately, it is carved up and isolated in technical and social silos.
There are both good and bad reasons for this segmentation, and it is holding us back.
Data sources (diagram labels):
• Patient Journals
• Consumer products
• Longitudinal Data from other providers …
• Electronic Medical Records
• Diagnostic Imaging
• Clinical Notes
• Hospital Telemetry
• Primary Lab Data
Callouts (diagram annotations):
• Incredible opportunities here, and rapidly developing data silos
• Possibility of a self-normal (N of 1) over time
• Natural language processing has strong potential
• Innovations in the basics of clinical observation
• Pressure to avoid incidental findings prevent bias
33. A Personal Story
I use a commercial service that combines
labwork with wearable data
They provide insights and coaching
I have, personally, found this
transformational in how I approach my
health.
38. Conclusions
In 2019, we’re all “cool with the cloud”
Premature optimization is still terrible
Make it work, make it fast, make it cheap
Strong distinction between experimentation and
engineering
Great policy makes great platforms
This continues to be an amazing time to be an
infrastructure / data nerd in health care / life
science
39. The future is already here – it’s just
not very well distributed
William Gibson