Big Data, The Community and The Commons
Robert Grossman
University of Chicago
Open Cloud Consortium
May 12, 2014
TCGA Third Annual Scientific Symposium
  
Outline
1. Big data and the problems it creates
2. Computing over big data
3. Science clouds
4. Data analysis at scale
5. Genomic clouds
6. How might we organize?
  
Four questions and one challenge
1. What is the same and what is different about big biomedical data vs big science data and vs big commercial data?
2. What instrument should we use to make discoveries over big biomedical data?
3. Do we need new types of mathematical and statistical models for big biomedical data?
4. How do we organize large biomedical datasets to maximize the discoveries we make and their impact on health care?
  
One Million Genome Challenge
• Sequencing a million genomes would likely change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
• Think of this as one hundred studies with 10,000 patients each over three years.
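The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 PB = 1,000 TB, 1 EB = 1,000 PB), and the 10x compression ratio is only inferred from the step from 1000 PB to 100 PB; the variable names are illustrative.

```python
# Back-of-envelope check of the One Million Genome Challenge figures.
# Decimal units are assumed: 1 PB = 1,000 TB and 1 EB = 1,000 PB.
GENOMES = 1_000_000
TB_PER_PATIENT = 1            # tumor plus normal tissue, from the slide
COST_PER_GENOME_USD = 1_000   # from the slide
COMPRESSION_RATIO = 10        # assumption inferred from 1000 PB -> 100 PB
PATIENTS_PER_STUDY = 10_000   # from the slide

raw_pb = GENOMES * TB_PER_PATIENT / 1_000
raw_eb = raw_pb / 1_000
compressed_pb = raw_pb / COMPRESSION_RATIO
sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD
studies = GENOMES // PATIENTS_PER_STUDY

print(f"raw data:        {raw_pb:,.0f} PB ({raw_eb:.0f} EB)")
print(f"compressed:      {compressed_pb:,.0f} PB")
print(f"sequencing cost: ${sequencing_cost_usd:,}")
print(f"studies:         {studies}")
```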
  
Part 1:
Biomedical computing is being disrupted by big data
  
Source: Michael S. Lawrence, Petar Stojanov, Paz Polak, et al., Mutational heterogeneity in cancer and the search for new cancer-associated genes, Nature 499, pages 214-218, 2013.
  
Standard Model of Biomedical Computing
Public data repositories
Private local storage & compute
Network download
Local data ($1K)
Community software
Software & sweat and tears ($100K)
  
We have a problem …
Image: A large-scale sequencing center at the Broad Institute of MIT and Harvard.
  
Growth of data
New types of data
It takes over three weeks to download the TCGA data at 10 Gbps
Analyzing the data is more expensive than producing it
  
The Smart Phone is Becoming a Home for Medical & Environmental Sensors
Source: LifeWatch V from LifeWatch AG, www.lifewatchv.com.
Source: Interior of one of Google’s Data Centers, www.google.com/about/datacenters/
  
Computational Advertising and Marketing
• Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement*.
• In 2011, over $100 billion was spent in online advertising (eMarketer estimates)
• Contexts include:
– Web search context (SERP advertising)
– Web page content context (content match advertising and banners)
– Social media context
– Mobile context
*Andrei Broder and Vanja Josifovski, Introduction to Computational Advertising.
  
Why Do We Care?
• A modern advertising analytic platform:
– Will build behavioral profiles on 100 million plus individuals
– Use full statistical models (not rules) for targeting
– Re-analyze all of the data each night
– Serve 10,000’s of ads per second using statistical models
– Respond < 100 ms (with analytics < 10 ms)
– Use real time geolocation data
– Do analytics at machine speed
– Be driven by an analyst with only modest training
• More than pay for the data centers we just saw.
  
New Model of Biomedical Computing
Public data repositories
Private storage & compute at medical research centers
Community software
Compute & storage
Community resources
  
The Tragedy of the Commons
Source: Garrett Hardin, The Tragedy of the Commons, Science, Volume 162, Number 3859, pages 1243-1248, 13 December 1968.
Individuals, when they act independently following their self-interests, can deplete a common resource, contrary to a whole group's long-term best interests.
  
1. Create community commons of biomedical data.
2. Use cloud computing and automation to manage the commons and to compute over it.
3. Interoperate the data commons.
  
2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
2010 - 2020: ???
2015 - 2025: ???
  
Part 2
Data Center Scale Computing (aka “Cloud Computing”)
Source: Interior of one of Google’s Data Centers, www.google.com/about/datacenters/
  
What instrument do we use to make big data discoveries in cancer genomics and big data biology?
How do we build a “datascope”?
Source: Luiz André Barroso, Jimmy Clidaras and Urs Hölzle, The Datacenter as a Computer, Morgan & Claypool Publishers, Second Edition, 2013, www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024
  
Self Service
Scale
Software stack that scales to a data center
Forget cloud computing. Focus on data centers & the software they run.
Use of automation
Setting up and operating large scale efficient, secure and compliant racks of computing infrastructure is out of reach for most labs, but essential for the community.
This is not a cloud…
  
Commercial Cloud Service Provider (CSP)
15 MW Data Center
100,000 servers
1 PB DRAM
100’s of PB of disk
Automatic provisioning and infrastructure management
Monitoring, network security and forensics
Accounting and billing
Customer Facing Portal
Data center network
~1 Tbps egress bandwidth
25 operators for 15 MW Commercial Cloud
opencompute.org
www.openstack.org
hadoop.apache.org
…
  
Open Science Data Cloud (Home of Bionimbus)
6 PB (OpenStack, GlusterFS, Hadoop)
Infrastructure automation & management (Yates)
Compliance & security (OCM)
Accounting & billing
Customer Facing Portal (Tukey)
Data center network
~10-100 Gbps bandwidth
6 engineers to operate 0.5 MW Science Cloud
Science Cloud SW & Services
  
Why not just use (only) Amazon Web Services (AWS)?
Commercial Cloud Service Providers (CSP)
• Scale / capacity
• Simplicity of a credit card
• Wide variety of offerings.
vs.
Community science and biomedical clouds
• Lower cost (at medium scale)
• We can build specialized infrastructure for science.
• We can build specialized infrastructure for security & compliance.
• The data is too important to trust exclusively with a commercial provider.
It is still essential to interoperate with CSPs whenever compliance and security policies allow.
  
Cost of a medium private/community cloud vs large public cloud.
Source: Allison P. Heath, Matthew Greenway, Raymond Powell, Jonathan Spring, Rafael Suarez, David Hanley, Chai Bandlamudi, Megan McNerney, Kevin White and Robert L Grossman, Bionimbus: A Cloud for Managing, Analyzing and Sharing Large Genomics Datasets, Journal of the American Medical Informatics Association, 2014.
[Figure axes: PB; Cost (thousands of $)]
  
Reliability over Commodity Computing Components Is Difficult as is Data Locality
• Hadoop enables reliable computation over thousands of low-cost, unreliable computing nodes.
• Hadoop efficiently computes over the data instead of moving the data.
• The programming model of Hadoop (MapReduce) is in practice more efficient in terms of software development than the programming traditionally used by high performance computing (message passing), but usually does not fully utilize the underlying hardware.
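To make the MapReduce programming model concrete, here is a minimal single-process sketch in plain Python that counts k-mers in sequencing reads. It only illustrates the map, shuffle and reduce steps; it does not use Hadoop's API, and the function names and toy reads are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def mapreduce(reads, k=3):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


reads = ["ACGTAC", "GTACGT"]  # hypothetical toy input
counts = mapreduce(reads)
print(counts)
```

In a real cluster the map calls run on the nodes that already hold the reads and the framework performs the shuffle over the network, which is where the data-locality benefit described above comes from.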
  
The Tail at Scale, Jeffrey Dean, Luiz André Barroso, Communications of the ACM, Volume 56 Number 2, Pages 74-80
Latency is Difficult
Call an algorithm and computing infrastructure “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but enables the computation to run on a rack more of data. Most of our community’s algorithms today fail this test.
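One way to apply this test to measured run times is sketched below. The 10% tolerance and the sample measurements are assumptions made for illustration; the definition itself does not give a threshold.

```python
def is_big_data_scalable(runs, tolerance=0.10):
    """Return True if run time stays roughly flat as racks are added.

    runs: list of (racks, seconds) measurements, where every added rack
    brings its own data and processors.
    tolerance: allowed fractional slowdown relative to the smallest run.
    This threshold is an assumption, not part of the definition.
    """
    ordered = sorted(runs)
    baseline_seconds = ordered[0][1]
    limit = baseline_seconds * (1 + tolerance)
    return all(seconds <= limit for _, seconds in ordered)


# Hypothetical measurements, for illustration only.
flat_runs = [(1, 100.0), (2, 102.0), (4, 105.0)]
growing_runs = [(1, 100.0), (2, 180.0), (4, 390.0)]

print(is_big_data_scalable(flat_runs))
print(is_big_data_scalable(growing_runs))
```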
  
2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
2010 - 2020: Data center scale science. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
2015 - 2025: ???
  
Part 3
Science Clouds
Source: Lincoln Stein
  
Genomics as a Big Data Science
Discipline | Duration | Size | # Devices
HEP - LHC | 10 years | 15 PB/year* | One
Astronomy - LSST | 10 years | 12 PB/year** | One
Genomics - NGS | 2-4 years | 0.4 TB/genome | 1000’s
*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
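To put the table's units on a common footing, the sketch below estimates how many genomes per year would produce as much data as the LHC figure. It assumes decimal units, and the count of 1,000 sequencers is a hypothetical stand-in for the table's "1000's" of devices.

```python
# Figures taken from the table; decimal units (1 PB = 1,000 TB) are assumed.
LHC_PB_PER_YEAR = 15
TB_PER_GENOME = 0.4
TB_PER_PB = 1_000
SEQUENCERS = 1_000  # hypothetical; the table only says "1000's" of devices

genomes_to_match_lhc = LHC_PB_PER_YEAR * TB_PER_PB / TB_PER_GENOME
genomes_per_sequencer = genomes_to_match_lhc / SEQUENCERS

print(f"genomes per year to match the LHC: {genomes_to_match_lhc:,.0f}")
print(f"per sequencer across {SEQUENCERS:,} devices: {genomes_per_sequencer:.1f}")
```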
  
Source: A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
• Ten years and $10B
• No business value
• Big data culture
• Little compliance
vs
• Five years and $1M
• Business value
• Culture of small data
• Compliance
Common computing, storage & transport infrastructure
  
UDT Tuning: Buffers and CPUs
Configuration Change | Observed Transfer Rate | Time to Transfer 1 TB (minutes)
UDT and Linux defaults | 1.6 Gbps | 85
Setting buffer sizes to 64 MB | 3.3 Gbps | 41
Improved CPU on sending side with processor affinity | 3.7 Gbps | 36
Improved CPU on receiving side with processor affinity | 4.6 Gbps | 29
Improved CPU on both sides with processor affinity on both sides | 6.3 Gbps | 21
Turn off CPU frequency scaling and set to max clock speed | 6.7 Gbps | 20
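The transfer times in the table follow from the observed rates. The sketch below recomputes them under the assumption of decimal units (1 TB = 8e12 bits, 1 Gbps = 1e9 bits per second) and no protocol overhead; the function name is illustrative.

```python
def transfer_minutes(terabytes, gbps):
    """Ideal time in minutes to move `terabytes` at a sustained `gbps`.

    Assumes decimal units (1 TB = 8e12 bits, 1 Gbps = 1e9 bits/s) and no
    protocol overhead, so real transfers will be somewhat slower.
    """
    bits = terabytes * 8e12
    seconds = bits / (gbps * 1e9)
    return seconds / 60


# (observed rate in Gbps, reported minutes) pairs from the UDT tuning table.
table = [(1.6, 85), (3.3, 41), (3.7, 36), (4.6, 29), (6.3, 21), (6.7, 20)]

for rate, reported in table:
    computed = transfer_minutes(1, rate)
    print(f"{rate} Gbps: computed {computed:.1f} min, reported {reported} min")
```

The computed values land within about two minutes of the reported ones, which suggests the reported rates were rounded.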
Science vs Commercial Clouds
 | Science CSP | Commercial CSP
Perspective | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as we keep our business model.
Data | Data intensive computing | Internet style scale out
Flows | Large data flows in and out | Lots of small web flows
Lock in | Data and compute portability essential | Lock in is good
  
A Key Question
• Is biomedical computing at the scale of a data center important enough for the research community to do, or do we only outsource to commercial cloud service providers (certainly we will interoperate with commercial cloud service providers)?
Open Science Data Cloud
  
Part 4:
Analyzing Data at the Scale of a Data Center
Source: Jon Kleinberg, Cornell University, www.cs.cornell.edu/home/kleinber/networks-book/
experimental science
simulation science
data science (big data biology, medicine)
1609: 30x
1670: 250x
1976: 10x-100x
2004: 10x-100x
  
Is More Different? Do New Phenomena Emerge at Scale in Biomedical Data?
Complex models over small data that are highly manual.
Simpler models over large data that are highly automated.
Small data    Medium data
GB    TB    PB
W    KW    MW
  
Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
Environmental Factors and Cancer
  
Building Models over Big Data
• We know about the “unreasonable effectiveness of ensemble models.” Building ensembles of models over computer clusters works well …
• … but, how do machine learning algorithms scale to data center scale science?
• Ensembles of random trees built from templates appear to work better than traditional ensembles of classifiers
• The challenge is often decomposing large heterogeneous datasets into homogeneous components that can be modeled.
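A minimal sketch of building an ensemble over partitioned data follows. Each shard stands in for the data held on one cluster node, one simple model is trained per shard, and predictions are combined by majority vote. The models here are plain decision stumps, not random trees built from templates, and the synthetic data, shard count and function names are all assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(points):
    """Fit a one-feature threshold classifier (a decision stump).

    points: list of (x, label) pairs with label in {0, 1}. Each observed x
    is tried as a threshold for the rule "predict 1 when x > threshold",
    and the threshold with the fewest training errors is kept.
    """
    best_threshold, best_errors = None, len(points) + 1
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def ensemble_predict(thresholds, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > threshold) for threshold in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(0)


def make_point():
    """Hypothetical synthetic data: label is 1 when x > 0.5, with 10% noise."""
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    return x, label


data = [make_point() for _ in range(2000)]
shards = [data[i::5] for i in range(5)]  # an odd shard count avoids tied votes
thresholds = [train_stump(shard) for shard in shards]

grid = [i / 100 for i in range(101)]
accuracy = sum(ensemble_predict(thresholds, x) == int(x > 0.5) for x in grid) / len(grid)
print(f"ensemble accuracy on the noise-free grid: {accuracy:.2f}")
```

Because each stump only sees its own shard, training needs no data movement between nodes; only the small fitted models are gathered for voting.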
  
New Questions
• How would research be impacted if we could analyze all of the data each evening?
• How would health care be impacted if we could analyze all of the data each evening?
  
2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
2010 - 2020: Data center scale science. Interoperability and preservation/peering/portability of large biomedical datasets. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
2015 - 2025: New modeling techniques. The discovery of new & emergent behavior at scale. Examples: What are the foundations? Is more different?
  
Part 5:
Genomic Clouds
Cancer Genome Collaboratory (Toronto)
Cancer Genomics Hub (Santa Cruz)
Bionimbus Protected Data Cloud (Chicago)
(Cambridge)
(Mountain View)
bionimbus.opensciencedatacloud.org
  
Analyzing Data From The Cancer Genome Atlas (TCGA)
Current Practice
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate secure compliant computing environment to manage 10 – 100+ TB of data.
3. Get environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CG-Hub (takes days to weeks).
6. Begin analysis.
With Bionimbus Protected Data Cloud (PDC)
1. Apply to dbGaP for access to data.
2. Use your existing NIH grant credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3. Begin analysis.
  
1. What scale is required for biomedical clouds?
2. What is the design for biomedical clouds?
3. What tools and applications do users need to make discoveries in large amounts of biomedical data?
4. How do different biomedical clouds interoperate?
  
6. How Do We Organize?
Cyber Condo Model
• Research institutions today have access to high performance networks – 10G & 100G.
• They couldn’t afford access to these networks from commercial providers.
• Over a decade ago, they got together to buy and light fiber.
• This changed how we do scientific research.
• Cyber condos interoperate with commercial ISPs
  
Science Cloud Condos
• Build Science Cloud condo.
• To provide a sustainable home for large commons of research data (and an infrastructure to compute over it).
• “Tier 1” Science Clouds need to establish peering relationships.
• And to interoperate with CSPs
• The Open Cloud Consortium is planning to develop a science cloud condo.
  
Some Biomedical Data Commons Guidelines for the Next Five Years
• There is a societal benefit when biomedical data is also available in data commons operated by the research community (vs sold exclusively as data products by commercial entities or only offered for download by the USG).
• Large data commons providers should peer.
• Data commons providers should develop standards for interoperating.
• Standards should not be developed ahead of open source reference implementations.
• We need a period of experimentation as we develop the best technology and practices.
• The details are hard (consent, scalable APIs, open vs controlled access, sustainability, security, etc.)
  
Source: David R. Blair, Christopher S. Lyttle, Jonathan M. Mortensen, Charles F. Bearden, Anders Boeck Jensen, Hossein Khiabanian, Rachel Melamed, Raul Rabadan, Elmer V. Bernstam, Søren Brunak, Lars Juhl Jensen, Dan Nicolae, Nigam H. Shah, Robert L. Grossman, Nancy J. Cox, Kevin P. White, Andrey Rzhetsky, A Non-Degenerate Code of Deleterious Variants in Mendelian Loci Contributes to Complex Disease Risk, Cell, September, 2013
  	
  
7. Recap
100,000 patients, 100 PB
1,000,000 patients, 1,000 PB
10,000 patients, 10 PB
1000 patients
• What are the key common services & APIs?
• How do the biomedical commons clouds interoperate?
• What is the governance structure?
• What is the sustainability model?
  
2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
2010 - 2020: Data center scale science. Interoperability and preservation/peering/portability of large biomedical datasets. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
2015 - 2025: New modeling techniques. The discovery of new & emergent behavior at scale. Examples: What are the foundations? Is more different?
  
Thanks To My Colleagues & Collaborators
• Kevin White
• Nancy Cox
• Andrey Rzhetsky
• Lincoln Stein
• Barbara Stranger
Thanks to My Lab
Allison Heath
Ray Powell
Jonathan Spring
Renuka Ayra
David Hanley
Maria Patterson
Matt Greenway
Rafael Suarez
Stuti Agrawal
Sean Sullivan
Zhenyu Zhang
Thanks to the White Lab
Jason Grundstad
Chaitanya Bandlamudi
Jason Pitt
Megan McNerney
  
Questions?
For more information
• www.opensciencedatacloud.org
• For more information and background, see Robert L. Grossman and Kevin P. White, A Vision for a Biomedical Cloud, Journal of Internal Medicine, Volume 271, Number 2, pages 122-130, 2012.
• You can find some more information on my blog: rgrossman.com.
• My email address is robert.grossman at uchicago dot edu.
  
	
  
Center for Research Informatics
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.
Additional funding for the OSDC has been provided by the following sponsors:
• The Bionimbus Protected Data Cloud is supported in part by NIH/NCI through NIH/SAIC Contract 13XS021 / HHSN261200800001E.
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• The OSDC is supported by a 5-year (2010-2016) PIRE award (OISE – 1129076) to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, NIH or other funders of this research.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
  

More Related Content

What's hot

How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Robert Grossman
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
 
Multipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationMultipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationKan Yuenyong
 
20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - AmsterdamAllen Day, PhD
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - WageningenAllen Day, PhD
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckPistoia Alliance
 
Büyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi GörmekBüyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi Görmekideaport
 
So Long Computer Overlords
Big Data, The Community and The Commons (May 12, 2014)

  • 1. Big Data, The Community and The Commons. Robert Grossman, University of Chicago, Open Cloud Consortium. May 12, 2014. TCGA Third Annual Scientific Symposium.
  • 2. Outline 1. Big data and the problems it creates 2. Computing over big data 3. Science clouds 4. Data analysis at scale 5. Genomic clouds 6. How might we organize?
  • 3. Four questions and one challenge
  • 4. 1. What is the same and what is different about big biomedical data vs big science data and vs big commercial data? 2. What instrument should we use to make discoveries over big biomedical data? 3. Do we need new types of mathematical and statistical models for big biomedical data? 4. How do we organize large biomedical datasets to maximize the discoveries we make and their impact on health care?
  • 5. One Million Genome Challenge • Sequencing a million genomes would likely change the way we understand genomic variation. • The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue). • One million genomes is about 1000 PB or 1 EB. • With compression, it may be about 100 PB. • At $1000/genome, the sequencing would cost about $1B. • Think of this as one hundred studies with 10,000 patients each over three years.
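The arithmetic on the One Million Genome Challenge slide can be checked with a short sketch. The inputs are the slide's own figures; the decimal units (1 PB = 1000 TB), the 10x compression ratio (inferred from "1000 PB" shrinking to "about 100 PB"), and all function and constant names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the One Million Genome Challenge figures.
# Inputs are taken from the slide; names and decimal units are illustrative assumptions.
GENOMES = 1_000_000
TB_PER_PATIENT = 1            # tumor plus normal tissue samples, per the slide
COST_PER_GENOME_USD = 1_000
COMPRESSION_RATIO = 10        # assumption inferred from 1000 PB -> about 100 PB
PATIENTS_PER_STUDY = 10_000


def raw_storage_pb(genomes: int = GENOMES, tb_each: float = TB_PER_PATIENT) -> float:
    """Uncompressed storage in petabytes (1 PB = 1000 TB)."""
    return genomes * tb_each / 1_000


def compressed_storage_pb(ratio: float = COMPRESSION_RATIO) -> float:
    """Storage in petabytes after applying the assumed compression ratio."""
    return raw_storage_pb() / ratio


def sequencing_cost_usd(genomes: int = GENOMES, unit_cost: int = COST_PER_GENOME_USD) -> int:
    """Total sequencing cost in US dollars."""
    return genomes * unit_cost


def number_of_studies(genomes: int = GENOMES, per_study: int = PATIENTS_PER_STUDY) -> int:
    """How many studies of the given size add up to the full cohort."""
    return genomes // per_study


if __name__ == "__main__":
    print(f"raw storage:        {raw_storage_pb():,.0f} PB")
    print(f"compressed storage: {compressed_storage_pb():,.0f} PB")
    print(f"sequencing cost:    ${sequencing_cost_usd():,}")
    print(f"studies of 10,000:  {number_of_studies()}")
```

Changing `TB_PER_PATIENT` or `COMPRESSION_RATIO` shows how sensitive the 100 PB estimate is to those two inputs.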
  • 6. Part 1: Biomedical computing is being disrupted by big data. Source: Michael S. Lawrence, Petar Stojanov, Paz Polak, et al., Mutational heterogeneity in cancer and the search for new cancer-associated genes, Nature 499, pages 214-218, 2013.
  • 7. Standard Model of Biomedical Computing: Public data repositories; Private local storage & compute; Network download; Local data ($1K); Community software; Software & sweat and tears ($100K)
  • 8. We have a problem … Image: A large-scale sequencing center at the Broad Institute of MIT and Harvard. Growth of data. New types of data. It takes over three weeks to download the TCGA data at 10 Gbps. Analyzing the data is more expensive than producing it.
  • 9. The Smart Phone is Becoming a Home for Medical & Environmental Sensors. Source: LifeWatch V from LifeWatch AG, www.lifewatchv.com.
  • 10. Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
  • 11. Computational Advertising and Marketing • Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement*. • In 2011, over $100 billion was spent in online advertising (eMarketer estimates). • Contexts include: – Web search context (SERP advertising) – Web page content context (content match advertising and banners) – Social media context – Mobile context. *Andrei Broder and Vanja Josifovski, Introduction to Computational Advertising.
  • 12. Why Do We Care? • A modern advertising analytic platform will: – Build behavioral profiles on 100 million plus individuals – Use full statistical models (not rules) for targeting – Re-analyze all of the data each night – Serve 10,000s of ads per second using statistical models – Respond in < 100 ms (with analytics < 10 ms) – Use real-time geolocation data – Do analytics at machine speed – Be driven by an analyst with only modest training • More than pay for the data centers we just saw.
  • 13. New Model of Biomedical Computing: Public data repositories; Private storage & compute at medical research centers; Community software; Compute & storage; Community resources
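The download-time claim on this slide can be sanity-checked with a small calculation. The slide gives only the duration (over three weeks) and the link speed (10 Gbps); it does not state the size of the TCGA data or the achieved link efficiency, so the sketch below treats both as parameters. The function names and the decimal unit convention (1 PB = 10^15 bytes) are assumptions for illustration.

```python
# Rough transfer-time arithmetic for moving a large dataset over a network link.
# Assumes decimal units (1 PB = 1e15 bytes, 1 Gbps = 1e9 bits per second).
SECONDS_PER_DAY = 86_400


def transfer_days(size_pb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_pb petabytes over a link_gbps link.

    efficiency is the fraction of the nominal line rate actually achieved
    (1.0 means a fully saturated link, which is rarely sustained in practice).
    """
    size_bits = size_pb * 1e15 * 8
    rate_bits_per_second = link_gbps * 1e9 * efficiency
    return size_bits / rate_bits_per_second / SECONDS_PER_DAY


def volume_pb(days: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Petabytes moved in the given number of days at the given link rate."""
    bits = days * SECONDS_PER_DAY * link_gbps * 1e9 * efficiency
    return bits / 8 / 1e15


if __name__ == "__main__":
    # Three weeks at a fully saturated 10 Gbps link moves about 2.27 PB.
    print(f"{volume_pb(21, 10):.2f} PB in 21 days at 10 Gbps")
    # A hypothetical 1 PB dataset at 50% link efficiency.
    print(f"{transfer_days(1, 10, efficiency=0.5):.1f} days for 1 PB at 50% efficiency")
```

Because sustained throughput is usually well below the nominal line rate, a dataset smaller than the full-rate figure can still take three weeks or more to download.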
  • 14. The Tragedy of the Commons. Source: Garrett Hardin, The Tragedy of the Commons, Science, Volume 162, Number 3859, pages 1243-1248, 13 December 1968. Individuals, when they act independently following their self-interests, can deplete a common resource, contrary to a whole group's long-term best interests.
  • 15. 1. Create community commons of biomedical data. 2. Use cloud computing and automation to manage the commons and to compute over it. 3. Interoperate the data commons.
  • 16. 2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc. 2010 - 2020: ??? 2015 - 2025: ???
  • 17. Part 2: Data Center Scale Computing (aka “Cloud Computing”). Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
  • 18. What instrument do we use to make big data discoveries in cancer genomics and big data biology? How do we build a “datascope?”
  • 19. Source: Luiz André Barroso, Jimmy Clidaras and Urs Hölzle, The Datacenter as a Computer, Morgan & Claypool Publishers, Second Edition, 2013, www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024
  • 20. Self service; Scale; Software stack that scales to a data center; Use of automation. Forget cloud computing. Focus on data centers & the software they run.
  • 21. Setting up and operating large-scale, efficient, secure and compliant racks of computing infrastructure is out of reach for most labs, but essential for the community.
      This is not a cloud…
  • 22. Commercial Cloud Service Provider (CSP)
      15 MW data center: 100,000 servers, 1 PB DRAM, 100's of PB of disk
      Automatic provisioning and infrastructure management
      Monitoring, network security and forensics
      Accounting and billing
      Customer facing portal
      Data center network: ~1 Tbps egress bandwidth
      25 operators for 15 MW Commercial Cloud
  • 24. Open Science Data Cloud (Home of Bionimbus)
      6 PB (OpenStack, GlusterFS, Hadoop)
      Infrastructure automation & management (Yates)
      Compliance & security (OCM)
      Accounting & billing
      Customer facing portal (Tukey)
      Science Cloud SW & Services
      Data center network: ~10-100 Gbps bandwidth
      6 engineers to operate 0.5 MW Science Cloud
  • 25. Why not just use (only) Amazon Web Services (AWS)?
      Commercial Cloud Service Providers (CSP)
        • Scale / capacity
        • Simplicity of a credit card
        • Wide variety of offerings
      vs.
      Community science and biomedical clouds
        • Lower cost (at medium scale)
        • We can build specialized infrastructure for science.
        • We can build specialized infrastructure for security & compliance.
        • The data is too important to trust exclusively with a commercial provider.
      It is still essential to interoperate with CSPs whenever permitted by compliance and security policies.
  • 26. Cost of a medium private/community cloud vs large public cloud.
      [Figure: cost (thousands of $) plotted against storage (PB)]
      Source: Allison P. Heath, Matthew Greenway, Raymond Powell, Jonathan Spring, Rafael Suarez, David Hanley, Chai Bandlamudi, Megan McNerney, Kevin White and Robert L Grossman, Bionimbus: A Cloud for Managing, Analyzing and Sharing Large Genomics Datasets, Journal of the American Medical Informatics Association, 2014.
  • 27. Reliability over Commodity Computing Components Is Difficult, as Is Data Locality
      • Hadoop enables reliable computation over thousands of low-cost, unreliable computing nodes.
      • Hadoop efficiently computes over the data instead of moving the data.
      • The programming model of Hadoop (MapReduce) is in practice more efficient in terms of software development than the programming model traditionally used by high performance computing (message passing), but usually does not fully utilize the underlying hardware.
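The MapReduce programming model mentioned on this slide can be illustrated with a minimal single-process sketch. It assumes a toy k-mer counting task that is not taken from the slides, and the function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical; this is not Hadoop's API, only the map, shuffle and reduce pattern that Hadoop distributes across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one k-mer."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle and reduce in one process (illustration only)."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical toy input: two short reads.
reads = ["ACGTAC", "GTACGT"]
counts = run_job(reads)
```

Because each map call sees only one read and each reduce call sees only one key, the same functions could in principle run next to the data on many nodes, which is the data-locality point the slide makes.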
  • 28. Latency is Difficult
      Source: Jeffrey Dean and Luiz André Barroso, The Tail at Scale, Communications of the ACM, Volume 56, Number 2, pages 74-80.
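One way to see why latency is difficult at scale is a back-of-the-envelope calculation. The sketch below assumes each server is independently slow with a fixed probability (1% here, an illustrative figure) and that a request must wait for every server it fans out to; the function name `prob_slow_request` is hypothetical.

```python
def prob_slow_request(fanout, per_server_slow_prob=0.01):
    """Probability that at least one of `fanout` servers responds slowly.

    ASSUMPTION: servers are slow independently of one another, each with
    probability `per_server_slow_prob`, and the request is only as fast
    as its slowest server.
    """
    return 1.0 - (1.0 - per_server_slow_prob) ** fanout


# Compare a single-server request with requests that fan out widely.
for fanout in (1, 10, 100, 1000):
    print(f"fan-out {fanout:>4}: {prob_slow_request(fanout):.3f}")
```

Under these assumptions, a rare per-server delay becomes the common case once a request touches on the order of a hundred servers.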
  • 29. Call an algorithm and computing infrastructure “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but enables the computation to run on a rack more of data. Most of our community's algorithms today fail this test.
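The definition above can be turned into a simple check over measured run times. This is a sketch under stated assumptions: the timings are hypothetical, data volume is assumed to grow in proportion to the number of racks, and the 10% tolerance for measurement noise is an arbitrary choice, not something the slide specifies.

```python
def is_big_data_scalable(timings, tolerance=0.10):
    """Check whether run time stays flat as racks (and data) are added.

    timings:   dict mapping rack count -> seconds to finish, where each
               added rack also brings a rack's worth of additional data.
    tolerance: ASSUMPTION, allowed fractional slowdown relative to the
               smallest configuration before the test is failed.
    """
    racks = sorted(timings)
    baseline = timings[racks[0]]
    return all(timings[r] <= baseline * (1.0 + tolerance) for r in racks)


# Hypothetical measurements: time stays flat in the first case and
# grows with every added rack in the second.
flat = {1: 100.0, 2: 104.0, 4: 108.0}
growing = {1: 100.0, 2: 150.0, 4: 260.0}
print(is_big_data_scalable(flat), is_big_data_scalable(growing))
```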
  • 30. 2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
      2010 - 2020: Data center scale science. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
      2015 - 2025: ???
  • 31. Part  3   Science  Clouds   Source:  Lincoln  Stein  
  • 32. Genomics as a Big Data Science
      Discipline          Duration     Size             # Devices
      HEP - LHC           10 years     15 PB/year*      One
      Astronomy - LSST    10 years     12 PB/year**     One
      Genomics - NGS      2-4 years    0.4 TB/genome    1000's
      *At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
      **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
  • 33. • Ten years and $10B
      • No business value
      • Big data culture
      • Little compliance
      vs
      • Five years and $1M
      • Business value
      • Culture of small data
      • Compliance
      Common computing, storage & transport infrastructure
      Source: A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
  • 34. UDT Tuning: Buffers and CPUs
      Configuration change                                                Observed transfer rate   Time to transfer 1 TB (minutes)
      UDT and Linux defaults                                              1.6 Gbps                 85
      Setting buffer sizes to 64 MB                                       3.3 Gbps                 41
      Improved CPU on sending side with processor affinity                3.7 Gbps                 36
      Improved CPU on receiving side with processor affinity              4.6 Gbps                 29
      Improved CPU on both sides with processor affinity on both sides    6.3 Gbps                 21
      Turn off CPU frequency scaling and set to max clock speed           6.7 Gbps                 20
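The relationship between the two numeric columns of the table can be checked with a short calculation. The sketch below assumes 1 TB means 10^12 bytes and ignores protocol overhead, so its results only approximate the measured times on the slide; `minutes_to_transfer` is a hypothetical helper name.

```python
def minutes_to_transfer(terabytes, gbps):
    """Ideal time in minutes to move `terabytes` at a sustained `gbps`.

    ASSUMPTIONS: 1 TB = 10**12 bytes, 1 Gbps = 10**9 bits per second,
    and no protocol or start-up overhead.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbps * 1e9)
    return seconds / 60.0


# Transfer rates taken from the table above, in Gbps.
observed_rates = [1.6, 3.3, 3.7, 4.6, 6.3, 6.7]
for rate in observed_rates:
    print(f"{rate} Gbps: about {minutes_to_transfer(1, rate):.0f} minutes per TB")
```

The same helper gives a quick estimate of how long a larger download would take at a given sustained rate, for example `minutes_to_transfer(100, 6.7) / 60` for the hours needed to move 100 TB.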
  • 35. Science vs Commercial Clouds
      Perspective
        Science CSP: Democratize access to data. Integrate data to make discoveries. Long term archive.
        Commercial CSP: As long as you pay the bill; as long as we keep our business model.
      Data
        Science CSP: Data intensive computing
        Commercial CSP: Internet style scale out
      Flows
        Science CSP: Large data flows in and out
        Commercial CSP: Lots of small web flows
      Lock in
        Science CSP: Data and compute portability essential
        Commercial CSP: Lock in is good
  • 36. A Key Question
      • Is biomedical computing at the scale of a data center important enough for the research community to do itself, or do we only outsource to commercial cloud service providers? (Certainly we will interoperate with commercial cloud service providers.)
  • 37. Open  Science  Data  Cloud  
  • 38. Part 4:
      Analyzing Data at the Scale of a Data Center
      Source: Jon Kleinberg, Cornell University, www.cs.cornell.edu/home/kleinber/networks-book/
  • 39. experimental science; simulation science; data science (big data biology, medicine)
      1609: 30x; 1670: 250x; 1976: 10x-100x; 2004: 10x-100x
  • 40. Is  More  Different?    Do  New  Phenomena   Emerge  at  Scale  in  Biomedical  Data?  
  • 41. Complex  models   over  small  data  that   are  highly  manual.   Simpler  models   over  large  data  that   are  highly   automated.   Small  data   Medium  data   GB   TB   PB   W   KW   MW  
  • 42. Several active voxels were discovered in a cluster located within the salmon's brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
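One common reading of a result like this is as a caution about multiple comparisons: with thousands of voxels tested, some will look significant by chance. The sketch below uses the search volume of 8064 voxels from the quotation, but the per-voxel threshold of 0.001 and the independence of voxels are assumptions made for illustration, and the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing threshold `alpha` by chance alone."""
    return n_tests * alpha


def prob_any_false_positive(n_tests, alpha):
    """Chance of at least one false positive, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


N_VOXELS = 8064          # search volume quoted on the slide
ALPHA_PER_VOXEL = 0.001  # ASSUMPTION: uncorrected per-voxel threshold

print(expected_false_positives(N_VOXELS, ALPHA_PER_VOXEL))
print(prob_any_false_positive(N_VOXELS, ALPHA_PER_VOXEL))
print(bonferroni_threshold(N_VOXELS))
```

Under these assumptions, several voxels would be expected to cross the threshold even when there is no signal at all, which is why a corrected threshold is much stricter than the per-voxel one.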
  • 45. Building Models over Big Data
      • We know about the “unreasonable effectiveness of ensemble models.” Building ensembles of models over computer clusters works well …
      • … but, how do machine learning algorithms scale to data center scale science?
      • Ensembles of random trees built from templates appear to work better than traditional ensembles of classifiers.
      • The challenge is often decomposing large heterogeneous datasets into homogeneous components that can be modeled.
  • 46. New Questions
      • How would research be impacted if we could analyze all of the data each evening?
      • How would health care be impacted if we could analyze all of the data each evening?
  • 47. 2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
      2010 - 2020: Data center scale science. Interoperability and preservation/peering/portability of large biomedical datasets. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
      2015 - 2025: New modeling techniques. The discovery of new & emergent behavior at scale. Examples: What are the foundations? Is more different?
  • 48. Part  5:   Genomic  Clouds   Cancer  Genome   Collaboratory  (Toronto)   Cancer  Genomics  Hub   (Santa  Cruz)   Bionimbus  Protected   Data  Cloud  (Chicago)   (Cambridge)   (Mountain  View)  
  • 50. Analyzing Data From The Cancer Genome Atlas (TCGA)
      Current Practice
        1. Apply to dbGaP for access to data.
        2. Hire staff, set up and operate a secure, compliant computing environment to manage 10-100+ TB of data.
        3. Get environment approved by your research center.
        4. Set up analysis pipelines.
        5. Download data from CG-Hub (takes days to weeks).
        6. Begin analysis.
      With Bionimbus Protected Data Cloud (PDC)
        1. Apply to dbGaP for access to data.
        2. Use your existing NIH grant credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
        3. Begin analysis.
  • 51. 1. What scale is required for biomedical clouds?
      2. What is the design for biomedical clouds?
      3. What tools and applications do users need to make discoveries in large amounts of biomedical data?
      4. How do different biomedical clouds interoperate?
  • 52. 6.  How  Do  We  Organize?  
  • 53. Cyber Condo Model
      • Research institutions today have access to high performance networks – 10G & 100G.
      • They couldn't afford access to these networks from commercial providers.
      • Over a decade ago, they got together to buy and light fiber.
      • This changed how we do scientific research.
      • Cyber condos interoperate with commercial ISPs.
  • 54. Science Cloud Condos
      • Build a science cloud condo to provide a sustainable home for large commons of research data (and an infrastructure to compute over it).
      • “Tier 1” science clouds need to establish peering relationships and to interoperate with CSPs.
      • The Open Cloud Consortium is planning to develop a science cloud condo.
  • 55. Some Biomedical Data Commons Guidelines for the Next Five Years
      • There is a societal benefit when biomedical data is also available in data commons operated by the research community (vs sold exclusively as data products by commercial entities or only offered for download by the USG).
      • Large data commons providers should peer.
      • Data commons providers should develop standards for interoperating.
      • Standards should not be developed ahead of open source reference implementations.
      • We need a period of experimentation as we develop the best technology and practices.
      • The details are hard (consent, scalable APIs, open vs controlled access, sustainability, security, etc.)
  • 56. 7. Recap
      Source: David R. Blair, Christopher S. Lyttle, Jonathan M. Mortensen, Charles F. Bearden, Anders Boeck Jensen, Hossein Khiabanian, Rachel Melamed, Raul Rabadan, Elmer V. Bernstam, Søren Brunak, Lars Juhl Jensen, Dan Nicolae, Nigam H. Shah, Robert L. Grossman, Nancy J. Cox, Kevin P. White, Andrey Rzhetsky, A Non-Degenerate Code of Deleterious Variants in Mendelian Loci Contributes to Complex Disease Risk, Cell, September, 2013
  • 57. 1000 patients
      10,000 patients: 10 PB
      100,000 patients: 100 PB
      1,000,000 patients: 1,000 PB
      • What are the key common services & APIs?
      • How do the biomedical commons clouds interoperate?
      • What is the governance structure?
      • What is the sustainability model?
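The cohort sizes on this slide follow from the per-patient figure given earlier in the talk (about 1 TB of genomic data per patient, and roughly a tenfold reduction with compression for the million-genome case). A minimal sizing sketch under those figures, using decimal units (1 PB = 1000 TB) and a hypothetical helper name `cohort_petabytes`:

```python
TB_PER_PATIENT = 1.0  # from the One Million Genome Challenge slide (approximate)


def cohort_petabytes(patients, tb_per_patient=TB_PER_PATIENT, compression=1.0):
    """Approximate storage in PB for a cohort, assuming 1 PB = 1000 TB.

    compression: ASSUMPTION, factor by which the data shrinks when
                 compressed (1.0 means uncompressed).
    """
    return patients * tb_per_patient / 1000.0 / compression


# Cohort sizes listed on the slide.
for patients in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{patients:>9,} patients: about {cohort_petabytes(patients):,.0f} PB")
```

Changing `tb_per_patient` or `compression` shows how sensitive the required scale of a biomedical commons is to those two estimates.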
  • 58. 2005 - 2015: Bioinformatics tools & their integration. Examples: Galaxy, GenomeSpace, workflow systems, portals, etc.
      2010 - 2020: Data center scale science. Interoperability and preservation/peering/portability of large biomedical datasets. Examples: Bionimbus/OSDC, CG Hub, Cancer Collaboratory, GenomeBridge, etc.
      2015 - 2025: New modeling techniques. The discovery of new & emergent behavior at scale. Examples: What are the foundations? Is more different?
  • 59. Thanks  To  My  Colleagues  &  Collaborators   •  Kevin  White   •  Nancy  Cox   •  Andrey  Rzhetsky   •  Lincoln  Stein   •  Barbara  Stranger  
  • 60. Thanks to My Lab
      Allison Heath, Ray Powell, Jonathan Spring, Renuka Ayra, David Hanley, Maria Patterson, Matt Greenway, Rafael Suarez, Stuti Agrawal, Sean Sullivan, Zhenyu Zhang
  • 61. Thanks to the White Lab
      Jason Grundstad, Chaitanya Bandlamudi, Jason Pitt, Megan McNerney
  • 63. For more information
      • www.opensciencedatacloud.org
      • For more information and background, see Robert L. Grossman and Kevin P. White, A Vision for a Biomedical Cloud, Journal of Internal Medicine, Volume 271, Number 2, pages 122-130, 2012.
      • You can find some more information on my blog: rgrossman.com.
      • My email address is robert.grossman at uchicago dot edu.
      Center for Research Informatics
  • 64. Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.
      Additional funding for the OSDC has been provided by the following sponsors:
      • The Bionimbus Protected Data Cloud is supported in part by NIH/NCI through NIH/SAIC Contract 13XS021 / HHSN261200800001E.
      • The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
      • Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
      • The OSDC is supported by a 5-year (2010-2016) PIRE award (OISE-1129076) to train scientists to use the OSDC and to further develop the underlying technology.
      • OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
      • The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher.
      • Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, NIH or other funders of this research.
      The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.