Big Data, The Community and The Commons (May 12, 2014)
1. Big Data, The Community and The Commons
Robert Grossman
University of Chicago
Open Cloud Consortium
May 12, 2014
TCGA Third Annual Scientific Symposium
2. Outline
1. Big data and the problems it creates
2. Computing over big data
3. Science clouds
4. Data analysis at scale
5. Genomic clouds
6. How might we organize?
4. 1. What is the same and what is different about big biomedical data vs big science data and vs big commercial data?
2. What instrument should we use to make discoveries over big biomedical data?
3. Do we need new types of mathematical and statistical models for big biomedical data?
4. How do we organize large biomedical datasets to maximize the discoveries we make and their impact on health care?
5. One Million Genome Challenge
• Sequencing a million genomes would likely change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B (a back-of-envelope check follows below).
• Think of this as one hundred studies with 10,000 patients each over three years.
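A rough sketch of that arithmetic in Python; the 1 TB per patient, the $1000 per genome, and the roughly 10x compression ratio are the slide's assumptions, not independent estimates:

```python
# Back-of-envelope arithmetic for the One Million Genome Challenge.
patients = 1_000_000
tb_per_patient = 1.0        # TB per patient (tumor + normal), per the slide
cost_per_genome = 1_000     # USD per genome sequenced, per the slide
compression = 10            # assumed ~10x, implied by 1000 PB -> ~100 PB

raw_tb = patients * tb_per_patient
raw_pb = raw_tb / 1_000                  # 1 PB = 1,000 TB
raw_eb = raw_pb / 1_000                  # 1 EB = 1,000 PB
compressed_pb = raw_pb / compression
cost_usd = patients * cost_per_genome

print(f"raw data:        {raw_pb:,.0f} PB ({raw_eb:.0f} EB)")   # 1,000 PB (1 EB)
print(f"compressed data: {compressed_pb:,.0f} PB")              # ~100 PB
print(f"sequencing cost: ${cost_usd / 1e9:.0f}B")               # $1B
```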
6. Part 1: Biomedical computing is being disrupted by big data
Source: Michael S. Lawrence, Petar Stojanov, Paz Polak, et al., Mutational heterogeneity in cancer and the search for new cancer-associated genes, Nature 499, pages 214-218, 2013.
7. Standard Model of Biomedical Computing
(Diagram) Public data repositories → network download → private local storage & compute.
Other elements: local data ($1K), community software, and software & sweat and tears ($100K).
8. We have a problem …
Image: A large-scale sequencing center at the Broad Institute of MIT and Harvard.
• Growth of data
• New types of data
• It takes over three weeks to download the TCGA data at 10 Gbps (see the sketch below).
• Analyzing the data is more expensive than producing it.
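The three-week figure is a plain bandwidth calculation; a quick sketch of it follows (the 10 Gbps rate and the three-week duration are the slide's; the sketch assumes the link is fully utilized, with no protocol or disk overhead):

```python
# How much data can a fully utilized 10 Gbps link move in three weeks?
gbps = 10
seconds = 3 * 7 * 24 * 3600             # three weeks
bytes_moved = gbps * 1e9 / 8 * seconds  # bits/s -> bytes/s, times duration

print(f"{bytes_moved / 1e15:.1f} PB")   # ~2.3 PB, so "over three weeks" implies
                                        # a dataset of a couple of petabytes
```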
9. The Smart Phone is Becoming a Home for Medical & Environmental Sensors
Source: LifeWatch V from LifeWatch AG, www.lifewatchv.com.
10. Source: Interior of one of Google's data centers, www.google.com/about/datacenters/
11. Computational Advertising and Marketing
• Computational advertising finds the "best match" between a given user in a given context and a suitable advertisement.*
• In 2011, over $100 billion was spent on online advertising (eMarketer estimates).
• Contexts include:
  – Web search context (SERP advertising)
  – Web page content context (content match advertising and banners)
  – Social media context
  – Mobile context
*Andrei Broder and Vanja Josifovski, Introduction to Computational Advertising.
12. Why Do We Care?
• A modern advertising analytic platform will:
  – Build behavioral profiles on 100 million plus individuals
  – Use full statistical models (not rules) for targeting
  – Re-analyze all of the data each night
  – Serve 10,000's of ads per second using statistical models
  – Respond in under 100 ms (with analytics in under 10 ms)
  – Use real-time geolocation data
  – Do analytics at machine speed
  – Be driven by an analyst with only modest training
• This more than pays for the data centers we just saw.
13. New Model of Biomedical Computing
(Diagram) Public data repositories; private storage & compute at medical research centers; community software; compute & storage; community resources.
14. The Tragedy of the Commons
Source: Garrett Hardin, The Tragedy of the Commons, Science, Volume 162, Number 3859, pages 1243-1248, 13 December 1968.
Individuals, when they act independently following their self-interest, can deplete a common resource, contrary to the whole group's long-term best interests.
15. 1. Create community commons of biomedical data.
2. Use cloud computing and automation to manage the commons and to compute over it.
3. Interoperate the data commons.
17. Part 2: Data Center Scale Computing (aka "Cloud Computing")
Source: Interior of one of Google's data centers, www.google.com/about/datacenters/
18. What instrument do we use to make big data discoveries in cancer genomics and big data biology? How do we build a "datascope"?
19. Source: Luiz André Barroso, Jimmy Clidaras and Urs Hölzle, The Datacenter as a Computer, Morgan & Claypool Publishers, Second Edition, 2013, www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024
20. Self Service, Scale
• Software stack that scales to a data center
• Forget cloud computing; focus on data centers & the software they run.
• Use of automation
21. Setting up and operating large-scale, efficient, secure and compliant racks of computing infrastructure is out of reach for most labs, but essential for the community.
This is not a cloud…
22. Commercial Cloud Service Provider (CSP)
• 15 MW data center
• 100,000 servers
• 1 PB DRAM
• 100's of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer-facing portal
• Data center network
• ~1 Tbps egress bandwidth
• 25 operators for a 15 MW commercial cloud
24. Open Science Data Cloud (Home of Bionimbus)
• 6 PB (OpenStack, GlusterFS, Hadoop)
• Infrastructure automation & management (Yates)
• Compliance & security (OCM)
• Accounting & billing
• Customer-facing portal (Tukey)
• Data center network
• ~10-100 Gbps bandwidth
• 6 engineers to operate a 0.5 MW science cloud
• Science cloud software & services
25. Why not just use (only) Amazon Web Services (AWS)?
Commercial cloud service providers (CSPs):
• Scale / capacity
• Simplicity of a credit card
• Wide variety of offerings
vs. community science and biomedical clouds:
• Lower cost (at medium scale)
• We can build specialized infrastructure for science.
• We can build specialized infrastructure for security & compliance.
• The data is too important to trust exclusively to a commercial provider.
It is still essential to interoperate with CSPs whenever possible, as permitted by compliance and security policies.
26. Cost of a medium private/community cloud vs large public cloud.
Source: Allison P. Heath, Matthew Greenway, Raymond Powell, Jonathan Spring, Rafael Suarez, David Hanley, Chai Bandlamudi, Megan McNerney, Kevin White and Robert L. Grossman, Bionimbus: A Cloud for Managing, Analyzing and Sharing Large Genomics Datasets, Journal of the American Medical Informatics Association, 2014.
(Chart: cost in thousands of dollars versus PB of storage.)
27. Reliability over Commodity Computing Components Is Difficult, as Is Data Locality
• Hadoop enables reliable computation over thousands of low-cost, unreliable computing nodes.
• Hadoop efficiently computes over the data instead of moving the data.
• The programming model of Hadoop (MapReduce) is, in practice, more efficient in terms of software development than the message-passing model traditionally used by high performance computing, but it usually does not fully utilize the underlying hardware (a toy sketch of the map/shuffle/reduce pattern follows below).
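Hadoop itself is a Java framework running over HDFS, but the MapReduce programming model it exposes is easy to illustrate. Below is a minimal, single-process sketch of the map/shuffle/reduce pattern, counting k-mers across a handful of reads; the function names and toy data are illustrative only and are not part of any Hadoop API:

```python
# Toy map/shuffle/reduce: count k-mer occurrences across short reads.
# Hadoop runs this same pattern over thousands of nodes, moving the
# computation to where the data lives instead of moving the data.
from collections import defaultdict
from itertools import chain

def map_phase(read, k=4):
    """Emit (k-mer, 1) pairs for one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Group emitted values by key (Hadoop does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts for one key."""
    return key, sum(values)

reads = ["ACGTACGT", "CGTACGTT", "TTTTACGT"]
mapped = chain.from_iterable(map_phase(r) for r in reads)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["ACGT"])   # -> 4
```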
28. Latency is Difficult
Source: Jeffrey Dean and Luiz André Barroso, The Tail at Scale, Communications of the ACM, Volume 56, Number 2, pages 74-80.
29. Call an algorithm and computing infrastructure "big-data scalable" if adding a rack of data (and the corresponding processors) does not increase the time required to complete the computation, but instead enables the computation to run on a rack more of data.
Most of our community's algorithms today fail this test.
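One way to state the definition more precisely (this formalization is an interpretation of the slide; it is essentially the weak-scaling condition from parallel computing): let $T(D, R)$ be the time to complete the computation on data of size $D$ using $R$ racks. Then the algorithm and infrastructure are big-data scalable when

$$ T(kD,\; kR) \;\approx\; T(D, R) \qquad \text{for } k = 1, 2, 3, \ldots $$

that is, adding a rack of data together with a rack of processors leaves the completion time roughly unchanged.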
30. 2005-2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
2010-2020: Data center scale science. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
2015-2025: ???
32. Genomics as a Big Data Science

Discipline        Duration    Size            # Devices
HEP (LHC)         10 years    15 PB/year*     One
Astronomy (LSST)  10 years    12 PB/year**    One
Genomics (NGS)    2-4 years   0.4 TB/genome   1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
33. vs
Source: A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
• Ten years and $10B; no business value; big data culture; little compliance
vs.
• Five years and $1M; business value; culture of small data; compliance
Common computing, storage & transport infrastructure.
34. UDT Tuning: Buffers and CPUs

Configuration change                                          Observed transfer rate   Time to transfer 1 TB (minutes)
UDT and Linux defaults                                        1.6 Gbps                 85
Setting buffer sizes to 64 MB                                 3.3 Gbps                 41
Improved CPU on sending side with processor affinity          3.7 Gbps                 36
Improved CPU on receiving side with processor affinity        4.6 Gbps                 29
Improved CPU on both sides with processor affinity            6.3 Gbps                 21
Turn off CPU frequency scaling and set to max clock speed     6.7 Gbps                 20
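The last column of the table follows directly from the observed rates; a small sketch that reproduces it (pure arithmetic, assuming the quoted rate is sustained for the whole transfer):

```python
# Minutes needed to move 1 TB at each observed UDT transfer rate.
TB = 1e12   # bytes

def minutes_to_transfer_1tb(gbps):
    bytes_per_second = gbps * 1e9 / 8
    return TB / bytes_per_second / 60

for rate in (1.6, 3.3, 3.7, 4.6, 6.3, 6.7):   # Gbps, from the table above
    print(f"{rate:>3} Gbps -> {minutes_to_transfer_1tb(rate):3.0f} min")
# -> about 83, 40, 36, 29, 21 and 20 minutes, matching the table to within rounding
```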
35. Science vs Commercial Clouds
Perspective. Science CSP: democratize access to data, integrate data to make discoveries, long-term archive. Commercial CSP: as long as you pay the bill, and as long as we keep our business model.
Data. Science CSP: data intensive computing. Commercial CSP: Internet-style scale out.
Flows. Science CSP: large data flows in and out. Commercial CSP: lots of small web flows.
Lock-in. Science CSP: data and compute portability essential. Commercial CSP: lock-in is good.
36. A Key Question
• Is biomedical computing at the scale of a data center important enough for the research community to do, or do we only outsource to commercial cloud service providers (certainly we will interoperate with commercial cloud service providers)?
38. Part 4: Analyzing Data at the Scale of a Data Center
Source: Jon Kleinberg, Cornell University, www.cs.cornell.edu/home/kleinber/networks-book/
39. (Timeline figure) Experimental science, simulation science, data science (big data biology, medicine): 1609 (30x), 1670 (250x), 1976 (10x-100x), 2004 (10x-100x).
40. Is More Different? Do New Phenomena Emerge at Scale in Biomedical Data?
41. (Figure) Complex models over small data that are highly manual, versus simpler models over large data that are highly automated. (Axes: small data to medium data; GB, TB, PB; W, KW, MW.)
42. Several active voxels were discovered in a cluster located within the salmon's brain cavity (Figure 1, see above). The size of this cluster was 81 mm3 with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain, further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
45. Building Models over Big Data
• We know about the "unreasonable effectiveness of ensemble models." Building ensembles of models over computer clusters works well …
• … but how do machine learning algorithms scale to data center scale science?
• Ensembles of random trees built from templates appear to work better than traditional ensembles of classifiers.
• The challenge is often decomposing large heterogeneous datasets into homogeneous components that can be modeled (a toy sketch follows below).
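A minimal sketch of that decompose-then-model idea, using scikit-learn on synthetic stand-in data; it illustrates only the general pattern of partitioning a heterogeneous dataset and fitting an ensemble per component, not the template-based random tree method mentioned above:

```python
# Toy decompose-then-model sketch: cluster, then fit one tree ensemble per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 20))           # stand-in for a large, heterogeneous feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in labels

# 1. Decompose the data into components (here, simple k-means on the features).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# 2. Fit one ensemble of trees per component.
models = {}
for c in range(kmeans.n_clusters):
    member = kmeans.labels_ == c
    models[c] = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[member], y[member])

# 3. Score new examples by routing each one to its component's model.
X_new = rng.normal(size=(10, 20))
assigned = kmeans.predict(X_new)
predictions = np.array([models[c].predict(x.reshape(1, -1))[0]
                        for c, x in zip(assigned, X_new)])
print(predictions)
```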
46. New Questions
• How would research be impacted if we could analyze all of the data each evening?
• How would health care be impacted if we could analyze all of the data each evening?
47. 2005-2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
2010-2020: Data center scale science. Interoperability and preservation/peering/portability of large biomedical datasets. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
2015-2025: New modeling techniques. The discovery of new & emergent behavior at scale. Examples: What are the foundations? Is more different?
48. Part 5: Genomic Clouds
Cancer Genome Collaboratory (Toronto); Cancer Genomics Hub (Santa Cruz); Bionimbus Protected Data Cloud (Chicago); (Cambridge); (Mountain View)
50. Analyzing Data From The Cancer Genome Atlas (TCGA)
Current practice:
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CG Hub (takes days to weeks).
6. Begin analysis.
With the Bionimbus Protected Data Cloud (PDC):
1. Apply to dbGaP for access to data.
2. Use your existing NIH grant credentials to log in to the PDC, select the data that you want to analyze, and select the pipelines that you want to use.
3. Begin analysis.
51. 1. What scale is required for biomedical clouds?
2. What is the design for biomedical clouds?
3. What tools and applications do users need to make discoveries in large amounts of biomedical data?
4. How do different biomedical clouds interoperate?
53. Cyber Condo Model
• Research institutions today have access to high performance networks: 10G & 100G.
• They couldn't afford access to these networks from commercial providers.
• Over a decade ago, they got together to buy and light fiber.
• This changed how we do scientific research.
• Cyber condos interoperate with commercial ISPs.
54. Science Cloud Condos
• Build science cloud condos to provide a sustainable home for large commons of research data (and an infrastructure to compute over it).
• "Tier 1" science clouds need to establish peering relationships, and to interoperate with CSPs.
• The Open Cloud Consortium is planning to develop a science cloud condo.
55. Some Biomedical Data Commons Guidelines for the Next Five Years
• There is a societal benefit when biomedical data is also available in data commons operated by the research community (vs sold exclusively as data products by commercial entities or only offered for download by the USG).
• Large data commons providers should peer.
• Data commons providers should develop standards for interoperating.
• Standards should not be developed ahead of open source reference implementations.
• We need a period of experimentation as we develop the best technology and practices.
• The details are hard (consent, scalable APIs, open vs controlled access, sustainability, security, etc.).
56. Source: David R. Blair, Christopher S. Lyttle, Jonathan M. Mortensen, Charles F. Bearden, Anders Boeck Jensen, Hossein Khiabanian, Rachel Melamed, Raul Rabadan, Elmer V. Bernstam, Søren Brunak, Lars Juhl Jensen, Dan Nicolae, Nigam H. Shah, Robert L. Grossman, Nancy J. Cox, Kevin P. White, Andrey Rzhetsky, A Non-Degenerate Code of Deleterious Variants in Mendelian Loci Contributes to Complex Disease Risk, Cell, September 2013.
7. Recap
57. (Figure) Scaling up: 1000 patients; 10,000 patients (10 PB); 100,000 patients (100 PB); 1,000,000 patients (1,000 PB).
• What are the key common services & APIs?
• How do the biomedical commons clouds interoperate?
• What is the governance structure?
• What is the sustainability model?
58. 2005-2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
2010-2020: Data center scale science. Interoperability and preservation/peering/portability of large biomedical datasets. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
2015-2025: New modeling techniques. The discovery of new & emergent behavior at scale. Examples: What are the foundations? Is more different?
59. Thanks To My Colleagues & Collaborators
• Kevin White
• Nancy Cox
• Andrey Rzhetsky
• Lincoln Stein
• Barbara Stranger
60. Thanks to My Lab
Allison Heath, Ray Powell, Jonathan Spring, Renuka Ayra, David Hanley, Maria Patterson, Matt Greenway, Rafael Suarez, Stuti Agrawal, Sean Sullivan, Zhenyu Zhang
61. Thanks to the White Lab
Jason Grundstad, Chaitanya Bandlamidi, Jason Pitt, Megan McNerney
63. For more information
• www.opensciencedatacloud.org
• For more information and background, see Robert L. Grossman and Kevin P. White, A Vision for a Biomedical Cloud, Journal of Internal Medicine, Volume 271, Number 2, pages 122-130, 2012.
• You can find some more information on my blog: rgrossman.com.
• My email address is robert.grossman at uchicago dot edu.
Center for Research Informatics
64. Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The Bionimbus Protected Data Cloud is supported in part by NIH/NCI through NIH/SAIC Contract 13XS021 / HHSN261200800001E.
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• The OSDC is supported by a 5-year (2010-2016) PIRE award (OISE 1129076) to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, NIH or other funders of this research.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.