Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
1. Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)
Robert Grossman
Institute for Genomics & Systems Biology, Center for Research Informatics, Computation Institute, Department of Medicine, University of Chicago & Open Data Group
September 11, 2012
2. The OSDC & Bionimbus Teams
• Open Science Data Cloud (OSDC) Team
  – Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
  – Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
• Bionimbus Team
  – Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
  – Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.
3. Let's Step Back 20 Years
• 1992-96: Petabyte Access & Storage Solutions (PASS) Project for the SSC.
• It developed & benchmarked federated relational, OO DB, object store, & column-oriented data warehouse solutions at the TB scale.
4. A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct and cost about $4.75 billion.
Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
7. One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is therefore about 1000 PB, or 1 EB (see the arithmetic sketch below).
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing alone would cost about $1B.
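The sizing above is simple arithmetic on the slide's own round numbers; a minimal sketch of the calculation follows, assuming the roughly 10x compression implied by the "about 100 PB" figure.

```python
# Back-of-envelope sizing for one million genomes, using the round numbers on
# this slide. The ~10x compression ratio is inferred from "about 100 PB"; it is
# not stated directly.

GENOMES = 1_000_000
TB_PER_PATIENT = 1            # ~1 TB per patient (tumor + normal tissue)
COMPRESSION_RATIO = 10        # assumed, to match the ~100 PB figure
COST_PER_GENOME_USD = 1_000

raw_pb = GENOMES * TB_PER_PATIENT / 1_000     # 1 PB = 1,000 TB
raw_eb = raw_pb / 1_000                       # 1 EB = 1,000 PB
compressed_pb = raw_pb / COMPRESSION_RATIO
sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD

print(f"raw data: {raw_pb:,.0f} PB (~{raw_eb:.0f} EB)")        # 1,000 PB (~1 EB)
print(f"compressed: {compressed_pb:,.0f} PB")                   # 100 PB
print(f"sequencing cost: ${sequencing_cost_usd / 1e9:.0f}B")    # $1B
```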
8. Big-data-driven discovery on 1,000,000 genomes and 1 EB of data.
[Diagram: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and preventive health care.]
9. ER+ / TNBC
With genomics, we can stratify diseases and treat each stratum differently.
Source: White Lab, University of Chicago.
10. Clonal Evolution of Tumors
Tumors evolve temporally and spatially.
Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
11. Combinations of Rare Alleles
[Figure: allele frequency (very rare ~0.001, rare ~0.01, uncommon ~0.1, common) plotted against penetrance (low, modest, intermediate, high). Rare alleles causing Mendelian disease sit at high penetrance; there are a few examples of high-penetrance common variants influencing common disease; low-frequency variants have intermediate penetrance; most common variants implicated in common disease by GWA have modest effects; rare variants of small effect are very hard to identify by genetic means.]
Source: Mark McCarthy
12. TCGA Analysis of Lung Cancer
• 178 cases of SQCC (lung cancer)
• Matched tumor & normal
• Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor
Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
13. Some Examples of Big Data Science

Discipline        | Duration   | Size            | # Devices
HEP - LHC         | 10 years   | 15 PB/year*     | One
Astronomy - LSST  | 10 years   | 12 PB/year**    | One
Genomics - NGS    | 2-4 years  | 0.5 TB/genome   | 1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
17. Another way: opencompute.org
Think of data as big if you measure it in MW; for example, Facebook's Prineville Data Center is 30 MW.
18. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time, but over more data.
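One way to read this definition is as a weak-scaling requirement. The notation below is my own shorthand, not a formula from the slides: T(D, R) is the wall-clock time for the computation over data of size D on R racks.

```latex
% Weak-scaling reading of "big-data scalable" (an interpretation of the slide,
% not the author's formula): growing the data and the racks/processors by the
% same factor k should leave the wall-clock time roughly unchanged.
T(kD,\, kR) \approx T(D,\, R) \qquad \text{for } k = 1, 2, 3, \dots
```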
19. Commercial Cloud Service Provider (CSP)
[Diagram: a 15 MW commercial cloud data center, with]
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Automatic provisioning and infrastructure management
• 100,000 servers, 1 PB of DRAM, 100's of PB of disk
• ~1 Tbps egress bandwidth
• Data center network
• 25 operators for 15 MW (see the ratio sketch below)
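The figures in this diagram imply a couple of rough ratios. The sketch below just divides the slide's round numbers, so treat the results as order-of-magnitude only; the per-server wattage is simply facility power over server count, so it includes cooling and network overhead.

```python
# Ratios implied by the commercial CSP figures on this slide:
# a 15 MW facility, ~100,000 servers, ~25 operators.

facility_watts = 15_000_000
servers = 100_000
operators = 25

watts_per_server = facility_watts / servers      # ~150 W per server (incl. overhead)
servers_per_operator = servers / operators       # ~4,000 servers per operator

print(f"~{watts_per_server:.0f} W per server")
print(f"~{servers_per_operator:,.0f} servers per operator")
```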
20. What are some of the important differences between commercial and research-focused CSPs?
21. Science CSP vs. Commercial CSP

               | Science CSP                                  | Commercial CSP
POV            | Democratize access to data. Integrate data   | As long as you pay the bill;
               | to make discoveries. Long-term archive.      | as long as the business model holds.
Data & Storage | Data-intensive Science Clouds: computing     | Internet-style scale-out and
               | & HP storage                                 | object-based storage
Flows          | Large data flows in and out                  | Lots of small web flows
Streams        | Streaming processing required                | NA
Accounting     | Essential                                    | Essential
Lock in        | Moving environments between CSPs essential   | Lock in is good
22. Part 3. The Open Cloud Consortium's Open Science Data Cloud
23. The Open Cloud Consortium
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
www.opencloudconsortium.org
24. Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of the cyber infrastructure for big data science.
25. Different Styles of OSDC Racks
• Design 1: put cores over spindles. Higher cost, but easy to compute over all the data.
• Design 2: separate (some of the) storage from the compute.
2012 OSDC rack design (draft):
• 950 TB / rack storage
• 600 cores / rack compute.
26. Open Science Data Cloud (OSDC)
[Diagram: a 1-5 MW Science Cloud data center, with]
• Accounting and billing
• Monitoring, compliance, & security
• Customer Facing Portal (Tukey)
• Science Cloud SW & Services
• Automatic provisioning and infrastructure management
• 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• ~100 Gbps bandwidth
• Data center network
• 5-12 operators to operate a 1-5 MW Science Cloud
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
27. OSDC Philosophy
• We try to automate as much as possible (we automate the setup & operations of a rack).
• We try to write as little software as possible.
• Each project is a bit different, but in general:
• We assign (permanent) IDs to data managed by the OSDC and manage the associated metadata.
• We assign and enforce permissions for users & groups of users, and for files/objects, collections of files/objects, and collections of collections.
• We support RESTful interfaces (see the sketch after this list).
• We do accounting for storage and core-hours.
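A minimal sketch of what a RESTful interface in the style described above might look like. The endpoint paths, field names, and authentication here are hypothetical illustrations, not the actual OSDC/Bionimbus API; the sketch only mirrors the model on this slide: permanent IDs, metadata, and permissions on objects and collections.

```python
# Hypothetical sketch of a RESTful data-management interaction in the style the
# slide describes. None of these endpoints or field names are the real OSDC API.

import requests

BASE = "https://osdc.example.org/api"          # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

# Register an object; the service assigns a permanent ID and stores metadata.
resp = requests.post(
    f"{BASE}/objects",
    headers=HEADERS,
    json={"name": "sample_123.bam", "collection": "lung-sqcc",
          "size_bytes": 512_000_000_000},
)
obj = resp.json()

# Grant a group read access. In the slide's model, permissions apply to
# files/objects, collections, and collections of collections alike.
requests.put(
    f"{BASE}/objects/{obj['id']}/permissions",
    headers=HEADERS,
    json={"group": "white-lab", "access": "read"},
)
```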
28. Some of Our Biggest Mistakes
• Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
• Trying to support donated equipment without adequate staff.
• Being too optimistic about when big data software would be ready for prime time.
• Some problems with big data software don't show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.
29. Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
30.
Number         | 1000's                         | 100's                              | 10's
               | Individual scientists &        | Community-based science via        | Very large projects
               | small projects                 | Science as a Service               |
Data Size      | Small                          | Medium to Large                    | Very Large
Infrastructure | Public infrastructure          | Shared community infrastructure    | Dedicated infrastructure
31. Part 4. Bionimbus
Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
39. Bionimbus Community Genomic Cloud
[Diagram: a researcher with a personal "dropbox" + compute, alongside a cloud for public data (1K Genomes, PubMed, etc.).]
40. Bionimbus Private Genomic Cloud
[Diagram: a researcher with a personal "dropbox" & compute, a cloud for public data (1K Genomes, PubMed, etc.), and a cloud for controlled data (TCGA, dbGaP).]
41. Bionimbus Private Biomedical Cloud
[Diagram: a researcher with a personal "dropbox" plus compute, a cloud for public data (1K Genomes, PubMed, etc.), a cloud for controlled data (TCGA, dbGaP), and a cloud for PHI data backed by a clinical research data warehouse, accessed via scatter/gather queries.]
42. [Workflow diagram]
Step 1. Get a Bionimbus ID (BID) from the internal BID generator; assign project, private/community, public cloud, etc.
Step 2. Send the sample to be sequenced (external sequencers / sequencing partner).
Step 3a. Return raw reads.
Step 3b. Return variant calls, CNV, annotation…
Step 4. Secure data routing to the appropriate cloud based upon the BID: the Bionimbus Private Cloud, the Bionimbus Community Cloud (UC), the Bionimbus Private dbGaP Cloud, Amazon, or cloud XY (see the routing sketch below).
Step 5. Cloud-based analysis using IGSB and 3rd-party tools and applications.
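A minimal sketch of the Step 4 routing decision. The BID fields and destination names below are hypothetical; the slide only says that routing is driven by the project, visibility, and access restrictions recorded when the BID is assigned in Step 1.

```python
# Hypothetical sketch of "secure data routing to appropriate cloud based upon
# BID" (Step 4). Field names and destination labels are illustrative only.

from dataclasses import dataclass

@dataclass
class Bid:
    bid: str
    project: str
    visibility: str      # "private", "community", or "public"
    controlled: bool     # e.g. dbGaP-controlled access

def route(sample: Bid) -> str:
    """Choose the destination cloud for the returned reads and variant calls."""
    if sample.controlled:
        return "bionimbus-private-dbgap-cloud"
    if sample.visibility == "private":
        return "bionimbus-private-cloud"
    if sample.visibility == "community":
        return "bionimbus-community-cloud"
    return "public-cloud"    # e.g. an Amazon-hosted public data set

print(route(Bid("BID-00042", "example-project", "community", controlled=False)))
```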
43. [Architecture diagram]
• web2py-based Front End
• Utility Cloud Services (Eucalyptus, OpenStack)
• Database Services (PostgreSQL)
• Analysis Pipelines & Services, Re-analysis Services
• Intercloud Services (IDs, etc.)
• Data Ingestion Services (UDT, data replication)
• Data Cloud Services (Hadoop, Sector/Sphere)
46. Enrich with:
• Relational databases: summary-level clinical data (10-100 TB)
• NoSQL & scientific databases: variation (VCF) files (1-10 PB) (genomic variation)
• NoSQL, DFS, file overlays?: sequence (BAM) files (100-1000 PB) (sequence data in binary form) (see the sketch below)
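A minimal sketch of the tier split on this slide, mapping each data kind to the storage layer and rough aggregate size given above. The function and its labels are illustrative assumptions, not an OSDC interface.

```python
# Illustrative mapping of data kinds to the storage tiers on this slide.

STORAGE_TIERS = {
    "clinical_summary": "relational database (10-100 TB aggregate)",
    "variation_vcf":    "NoSQL / scientific database (1-10 PB aggregate)",
    "sequence_bam":     "NoSQL, DFS, or file overlays (100-1000 PB aggregate)",
}

def storage_tier(data_kind: str) -> str:
    return STORAGE_TIERS[data_kind]

for kind in ("clinical_summary", "variation_vcf", "sequence_bam"):
    print(f"{kind}: {storage_tier(kind)}")
```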
47. Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
48. For more information
• You can find some more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
• I recently wrote a popular book about computing called The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.
Center for Research Informatics
49. Sources for images
• The image of the hard disk is from Norlando Pobre, Creative Commons.
• The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
• The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732