Big Data & Analytics:
Five Trends and Five Research Challenges

Robert Grossman
University of Chicago & Open Data Group
September 18, 2015
  
Part 1
What is Big Data?
  
Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations.
Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.
  
•  Volume
•  Velocity
•  Variety
•  Veracity
•  Value

•  Megabytes
•  Gigabytes
•  Terabytes
•  Petabytes
•  Exabytes
•  Zettabytes
  
The Name Changes

1830  statistics
1980  computationally intensive statistics
1993  data mining & knowledge discovery in databases
1997  business analytics
2004  predictive analytics
2011  big data, data science & data analytics

Source: Google Trends, www.google.com/trends
  
What is Big Data?
(Operations POV)

A marketing term introduced by O’Reilly:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.
  
	
  
What is Big Data?
(POV: New Types of Data that IT Cannot Manage)

Period   New types of data                           Term Used
1990s    Clicks on the Internet, POS transactions    Data mining
2000s    Unstructured data, graph data               Predictive Analytics
2010s    Mobile data, IoT data                       Big Data
  
What Is Small Data?
•  100 million movie ratings
•  480 thousand customers
•  17,000 movies
•  From 1998 to 2005
•  Less than 2 GB data.
•  Fits into memory, but very sophisticated models required to win.
  
What are the origins of big data?
  
Basic Choice with Hardware: Scale Up or Out

Scale up: More memory, more processors, more disk ($K) → Specialized hardware (e.g. connects) ($100K) → Specialized devices ($M)
Scale out: One machine → Cluster (racks) ($100K) → Cyber Pod ($M) → Distributed cyber pods ($10M+)
  
Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
  
Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market).
  
	
  
The Google Data Stack
•  The Google File System (2003)
•  MapReduce: Simplified Data Processing… (2004)
•  BigTable: A Distributed Storage System… (2006)

Source: Terence Kawaja, http://www.slideshare.net/tkawaja
  
•  The leaders in big data analytics measure data in Megawatts.
– As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
– Facebook’s new Prineville data center is 30 MW.
  
	
  
What is Big Data?
(My computer is a data center POV)
  
Part 2
What is Analytics?

Source: Aaron Parecki, Everywhere I’ve Been, aaronparecki.com.
  
What is Analytics?

Short Definition
•  Using data to make decisions.

Longer Definition
•  Using data to take actions and make decisions using models that are statistically valid and empirically derived.
  
	
  
Definition of Statistics from ASA web page:
•  Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …

Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.
  
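[Figure: timeline of names, data sources and algorithms. Names: 1984 Computationally Intensive Statistics; 1993 Data Mining & KDD; 2004 Predictive Analytics; 2011 Big Data & Data Science. Data sources: Direct marketing, POS, Internet, Devices/IoT. Algorithms: ID3 & C4.5, PageRank, Spanner TX algorithm.]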
  
1.  Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively).
2.  Compute where to put additional armor to maximize the chance that planes return.
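
The slide leaves the computation open. One well-known reading of the problem (an assumption here; the slide names no method) is that the counts come only from planes that returned, so a section showing few holes per unit area on survivors is one where hits tended to be fatal. The sketch below ranks sections that way. The hole counts, the section areas and the function names are all hypothetical.

```python
# Section order follows the problem statement: j = 1..4.
SECTIONS = ["tail", "wing", "fuselage", "other"]

def holes_per_area(b, areas):
    """Average holes per unit area in each section, over returned planes."""
    n = len(b)
    totals = [sum(plane[j] for plane in b) for j in range(len(SECTIONS))]
    return {s: totals[j] / (n * areas[j]) for j, s in enumerate(SECTIONS)}

def armor_priority(b, areas):
    """Rank sections from lowest to highest observed hole density.

    Assumption: hits land uniformly over the airframe, so a low density
    on survivors suggests that hits there kept planes from returning.
    """
    density = holes_per_area(b, areas)
    return sorted(density, key=density.get)

# Made-up data: b[i][j] = holes in returned plane i, section j.
b = [[3, 5, 4, 1], [2, 6, 5, 0], [4, 4, 6, 1]]
areas = [10.0, 30.0, 40.0, 20.0]  # made-up relative surface areas

ranking = armor_priority(b, areas)
print(ranking)
```

The first entry of `ranking` is the section this reading would armor first. Ranking by raw hole counts instead would point at the sections where planes can best absorb damage.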
  
Part 3.
Data Science
  
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
  
Some fields have one billion-dollar (or more) instrument that generates big data.

A genomics sequencing facility might have 3-5 next generation sequencing instruments that cost $250,000 or more each.

Some fields have hundreds or thousands of million-dollar instruments that in aggregate produce big data.

Some fields have millions of hundred-dollar sensors that in aggregate produce big data.
  
[Figure: Venn diagram. Math & Statistics, Computer Science and Disciplinary Science, with Data Science at the intersection.]
  
Understanding Salmon
(A Cautionary Tale)

Source: Salmo salar (Atlantic Salmon), wikipedia.org
  	
  
Methods

Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.

Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.

Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.
  	
  
	
  
Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
  	
  
	
  
The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
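
To make the point concrete, the sketch below works through the arithmetic for the salmon study's search volume of 8064 voxels. The per-voxel threshold of 0.001 and the family-wise target of 0.05 are assumptions for illustration; the slides do not state which thresholds were applied. The Bonferroni correction shown is one standard remedy, not necessarily the one the study's authors used.

```python
n_tests = 8064          # search volume in voxels, from the slide
alpha_per_test = 0.001  # assumed uncorrected per-voxel threshold
family_alpha = 0.05     # assumed family-wise error rate target

# Expected number of voxels that pass by chance if every null is true.
expected_false_positives = n_tests * alpha_per_test

# Probability of at least one false positive, assuming independent tests.
p_any_false_positive = 1 - (1 - alpha_per_test) ** n_tests

# Bonferroni: test each voxel at family_alpha / n_tests instead.
bonferroni_threshold = family_alpha / n_tests

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one):          {p_any_false_positive:.4f}")
print(f"Bonferroni threshold:     {bonferroni_threshold:.2e}")
```

Under these assumptions, pure noise alone would be expected to yield several "significant" voxels, which is the same order of magnitude as the 16 reported.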
  
Part 4.
What Instrument Do We Use to Make Discoveries in Data Science?

How do we build a “datascope?”
  
[Figure: instruments of discovery by kind of science. Experimental science: 1609, 30x; 1670, 250x. Simulation science: 1976, 10x-100x. Data science: 2004, 10x-100x (“Cyberpod”).]
  
Could we continuously re-analyze the world’s cancer data?
  
Complex statistical models over small data that are highly manual and updated infrequently.

Simpler statistical models over large data that are highly automated and updated frequently.

[Figure: scale labels. memory, databases, datapods; GB, TB, PB; W, KW, MW; cyber pods.]
  
Part 5
Five Trends

Source: Google Trends, for term “data commons”, www.google.com/trends.

Trend 1
Data Commons

Source: NEXRAD, NOAA, www.noaa.org
  
The Standard Model of Biomedical Computing No Longer Works

[Figure: public data repositories → network download → private local storage & compute. Labels: Local data ($1K); Community software; Software, sweat and tears ($100K).]
  
Data Commons

Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.

Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
  
Open Science Data Cloud (Open Cloud Consortium, 2012)
NCI Data Commons (UChicago, Nov 2015)
Bionimbus Protected Data Cloud (UChicago, 2013)
NOAA Data Commons (Open Cloud Consortium, Oct 2015)
  
Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.
  	
  
Hospitals, medical research centers and doctors
Data commons containing genomic and clinical data.
Patients
Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.
  
Trend 2
Analytics of Things, People and Places

Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/

People and things generating streaming data that are relevant for research.

Places that generate data
Source: Jane Macfarlane, Here, a Division of Nokia.
  
Trend 3
Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis

Source: M. Bostock, http://bl.ocks.org/mbostock/4063318

Portable Format for Analytics (PFA)
Predictive Model Markup Language (PMML)
Grammar of Graphics
d3.js
  
Trend 4
More Policies That Make Data Available and Analytics Repeatable

Executive Order 13642 (May 9, 2013)
Making Open and Machine Readable the Default for Government Information (“Open Data Policy”)

OMB Guidance
President’s Executive Order
  
Trend 5
Translational Data Science

How do we translate data driven discoveries into actions that impact society?
  	
  
[Figure: Translational Informatics. Fields: Imaging Informatics, Clinical Informatics, Bioinformatics, Public Health Informatics. Stages: Basic Research, Applied Research, Practice (dx, treatment and prevention). Scales: Molecular & cellular processes; Tissues & organs; Individuals (patients); Groups & populations. Also labeled: Quality & outcomes.]
New algorithms, new statistical models (data science)
Applications to genomics, analysis of EMR, etc.
Software stacks for data intensive computing (data engineering)

Data driven discoveries
Data driven diagnosis
Data driven therapeutics
  
Develop a software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)

Translational Data Science

Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.
  
Part 6
Five Challenges
  
Challenge 1. Is More Different?

Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.

Do New Phenomena Emerge at Scale in Data?
  
Challenge 2. One Million Genomes
•  Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine.
•  The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
•  One million genomes is about 1000 PB or 1 EB
•  With compression, it may be about 100 PB
•  At $1000/genome, the sequencing would cost about $1B
•  Think of this as one hundred studies with 10,000 patients each over three years.
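
The figures in this list follow from simple arithmetic, worked through below as a check. Decimal units are assumed (1 PB = 1000 TB, 1 EB = 1000 PB), and the 10x compression ratio is inferred from the slide's "about 100 PB" rather than stated on it.

```python
TB_PER_PB = 1000
PB_PER_EB = 1000

genomes = 1_000_000
tb_per_patient = 1            # from the slide: tumor plus normal tissue
cost_per_genome_usd = 1000    # from the slide

# Raw storage for one million patients.
raw_pb = genomes * tb_per_patient / TB_PER_PB
raw_eb = raw_pb / PB_PER_EB

# Assumed ratio, inferred from "about 100 PB" with compression.
compression_ratio = 10
compressed_pb = raw_pb / compression_ratio

# Sequencing cost at the quoted per-genome price.
total_cost_usd = genomes * cost_per_genome_usd

# One hundred studies of equal size.
studies = 100
patients_per_study = genomes // studies

print(raw_pb, raw_eb, compressed_pb, total_cost_usd, patients_per_study)
```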
  
Challenge 3. Datapods
•  Databases have fundamentally changed the way we manage and analyze scientific data.
•  NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate.
•  If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod.
•  Call this a “datapod.”
•  It could support open source data commons and allow them to peer.
  
Challenge 4. A Billion Predictive Models
•  Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models
•  Applications
– George Church’s challenge: individual predictive models for each human genome (6.5 billion humans).
– 1 million cancer genomes x 1,000 models / genome.
– Urban science – instrumenting cities.
– Consumer marketing - large advertisers will see 1-3 billion different consumers
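
Segmented modeling means fitting a separate model per segment instead of one global model. Below is a minimal sketch, assuming the simplest possible per-segment model (the mean of the observed outcome) and hypothetical segment keys. The slide does not say what kind of model is intended or how segments would be defined, and `fit_segmented` and `predict` are illustrative names only.

```python
from collections import defaultdict
from statistics import fmean

def fit_segmented(records):
    """Fit one trivial model (the mean outcome) per segment key."""
    by_segment = defaultdict(list)
    for segment, outcome in records:
        by_segment[segment].append(outcome)
    return {seg: fmean(vals) for seg, vals in by_segment.items()}

def predict(models, segment, default=0.0):
    """Score with the segment's own model; fall back for unseen segments."""
    return models.get(segment, default)

# Hypothetical (segment, outcome) pairs.
records = [("genome-A", 1.0), ("genome-A", 3.0), ("genome-B", 10.0)]
models = fit_segmented(records)

# Model count from the slide: 1 million cancer genomes x 1,000 models each.
total_models = 1_000_000 * 1_000

print(len(models), predict(models, "genome-A"), total_models)
```

At a billion segments the dictionary of models would no longer fit on one machine, which is where the automation and infrastructure questions raised by the slide come in.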
  	
  
Challenge 5. HDSI
•  Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert.
•  Think of Human Data Science Interaction (HDSI) as how humans interact with the software supporting data science analysis at the scale of datapods, with billions of models and trillions of hypotheses.
•  How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?
  
Questions?
rgrossman.com
@bobgrossman

For More Information
cdis.uchicago.edu
www.opendatagroup.com
rgrossman.com
  

Keynote on 2015 Yale Day of Data

  • 1. Big Data & Analytics: Five Trends and Five Research Challenges. Robert Grossman, University of Chicago & Open Data Group. September 18, 2015
  • 2. Part 1: What is Big Data? Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations. Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.
  • 3. • Volume • Velocity • Variety • Veracity • Value • Megabytes • Gigabytes • Terabytes • Petabytes • Exabytes • Zettabytes
  • 4. The Name Changes: 1830 statistics; 1980 computationally intensive statistics; 1993 data mining & knowledge discovery in databases; 1997 business analytics; 2004 predictive analytics; 2011 big data, data science & data analytics. Source: Google Trends, www.google.com/trends
  • 5. What is Big Data? (Operations POV) A marketing term introduced by O’Reilly: Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it. Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.
  • 6. What is Big Data? (POV: New Types of Data that IT Cannot Manage) Period / New types of data / Term used. 1990’s / Clicks on the Internet, POS transactions / Data mining. 2000’s / Unstructured data, graph data / Predictive Analytics. 2010’s / Mobile data, IoT data / Big Data.
  • 7. What Is Small Data? • 100 million movie ratings • 480 thousand customers • 17,000 movies • From 1998 to 2005 • Less than 2 GB data. • Fits into memory, but very sophisticated models required to win.
  • 8. What are the origins of big data?
  • 9. Basic Choice with Hardware: Scale Up or Out. Scale up: more memory, more processors, more disk ($K); specialized hardware (e.g. connects) ($100K); specialized devices ($M). Scale out: one machine; cluster (racks) ($100K); cyber pod ($M); distributed cyber pods ($10M+).
  • 10. Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market). Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
  • 11. The Google Data Stack • The Google File System (2003) • MapReduce: Simplified Data Processing… (2004) • BigTable: A Distributed Storage System… (2006)
  • 12. Source: Terence Kawaja, http://www.slideshare.net/tkawaja
  • 13. What is Big Data? (My computer is a data center POV) • The leaders in big data analytics measure data in Megawatts. – As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW. – Facebook’s new Prineville data center is 30 MW.
  • 14. Part 2: What is Analytics? Source: Aaron Parecki, Everywhere I’ve Been, aaronparecki.com.
  • 15. What is Analytics? Short Definition • Using data to make decisions. Longer Definition • Using data to take actions and make decisions using models that are statistically valid and empirically derived. Definition of Statistics from ASA web page: • Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty … Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.
  • 16. [Timeline figure] 1984: Computationally Intensive Statistics; 1993: Data Mining & KDD; 2004: Predictive Analytics; 2011: Big Data & Data Science. Other labels on the figure: Direct marketing, POS, Internet, Devices/IoT; ID3 & C4.5, PageRank, Spanner TX algorithm.
  • 17.
  • 18. 1. Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively). 2. Compute where to put additional armor to maximize the chance that planes return.
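
The problem on slide 18 can be made concrete with a small sketch. The slide does not give a solution, so the following is only one simplified reading, under two loudly stated assumptions that are not in the deck: (a) hits land uniformly per unit of surface area across all planes, and (b) only planes that returned were inspected. Under those assumptions, a section's share of observed holes divided by its share of surface area is roughly proportional to the chance of returning after a hit there, so the section with the lowest ratio is the candidate for extra armor. The hole counts, area shares, and the function name `relative_survival` are all hypothetical.

```python
# Hypothetical illustration only: counts and area shares are invented.
SECTIONS = ["tail", "wing", "fuselage", "other"]  # j = 1, 2, 3, 4 on the slide


def relative_survival(holes_by_plane, area_share):
    """Score each section; a lower score suggests hits there are less survivable.

    holes_by_plane[i][j] plays the role of b_ij: holes counted in section j
    of returned plane A_i. area_share[j] is the assumed fraction of the
    airframe surface belonging to section j.
    """
    n_sections = len(area_share)
    # Total holes per section, summed over all returned planes.
    totals = [sum(plane[j] for plane in holes_by_plane) for j in range(n_sections)]
    grand_total = sum(totals)
    # Share of all observed holes that fall in each section.
    observed_share = [t / grand_total for t in totals]
    # Under the uniform-hit assumption, observed share / area share is
    # roughly proportional to P(return | hit in that section).
    return [obs / area for obs, area in zip(observed_share, area_share)]


returned_planes = [[2, 3, 1, 4], [1, 4, 0, 3], [3, 2, 1, 2]]  # invented b_ij
area_share = [0.15, 0.30, 0.25, 0.30]  # invented surface fractions

scores = relative_survival(returned_planes, area_share)
most_vulnerable = SECTIONS[scores.index(min(scores))]
print(most_vulnerable)  # prints "fuselage" for this invented data
```

The point of the sketch is the direction of the inference: sections with few holes on returning planes are where hits tend to be fatal, so that is where armor helps, not where the holes are most visible.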
  • 19. Part 3. Data Science
  • 20. Some fields have a (one) billion dollar (or more) instrument that generates big data. Picture: CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
  • 21. Some fields have hundreds or thousands of million dollar instruments that in aggregate produce big data. A genomics sequencing facility might have 3-5 next generation sequencing instruments that cost $250,000 or more each.
  • 22. Some fields have millions of hundred dollar sensors that in aggregate produce big data.
  • 23. [Venn diagram] Math & Statistics; Computer Science; Disciplinary Science; intersection: Data Science
  • 24. Understanding Salmon (A Cautionary Tale). Source: Salmo salar (Atlantic Salmon), wikipedia.org
  • 25. Methods. Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning. Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing. Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.
  • 26. Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
  • 27. The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
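
A rough calculation shows why the multiple-testing warning on slide 27 matters for the 8064-voxel search volume on slide 26. This is a minimal sketch, not the analysis from the salmon study: it assumes every voxel is tested independently at an uncorrected per-voxel threshold of 0.001 (the slides quote only a cluster-level p-value, so this threshold is an assumption), and it uses a plain Bonferroni correction at a family-wise level of 0.05 as one simple choice of correction.

```python
# Assumptions (not from the slides): independent per-voxel tests,
# an uncorrected per-voxel threshold, and a Bonferroni correction.
n_voxels = 8064          # search volume quoted on slide 26
alpha = 0.001            # assumed uncorrected per-voxel threshold
family_alpha = 0.05      # assumed target family-wise error rate

# Expected number of voxels that pass the threshold by chance alone.
expected_fp = n_voxels * alpha

# Probability that at least one voxel passes by chance alone.
p_any_fp = 1 - (1 - alpha) ** n_voxels

# Bonferroni: divide the family-wise level by the number of tests.
bonferroni_alpha = family_alpha / n_voxels

print(f"expected false positives: {expected_fp:.3f}")
print(f"P(at least one false positive): {p_any_fp:.4f}")
print(f"Bonferroni per-voxel threshold: {bonferroni_alpha:.2e}")
```

Under these assumptions, several "active" voxels are expected from noise alone, which is the cautionary point of the salmon example. Real fMRI voxels are spatially correlated, so the independence assumption is a simplification.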
  • 28. Part 4. What Instrument Do We Use to Make Discoveries in Data Science? How do we build a “datascope?”
  • 29. [Figure labels] experimental science: 1609 (30x), 1670 (250x); simulation science: 1976 (10x-100x); data science
  • 30. [Figure labels] experimental science: 1609 (30x), 1670 (250x); simulation science: 1976 (10x-100x); data science: 2004 (10x-100x), “Cyberpod”
  • 31. Could we continuously re-analyze the world’s cancer data?
  • 32. Complex statistical models over small data that are highly manual and updated infrequently. Simpler statistical models over large data that are highly automated and updated frequently. [Figure labels: memory, databases, datapods, cyber pods; GB, TB, PB; W, KW, MW]
  • 33. Part 5: Five Trends. Source: Google Trends, for term “data commons”, www.google.com/trends.
  • 34. Trend 1: Data Commons. Source: NEXRAD, NOAA, www.noaa.org
  • 35. The Standard Model of Biomedical Computing No Longer Works. [Diagram labels] Public data repositories; Network download; Private local storage & compute; Local data ($1K); Community software; Software, sweat and tears ($100K)
  • 36. Data Commons. Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community. Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
  • 37. Open Science Data Cloud (Open Cloud Consortium, 2012); Bionimbus Protected Data Cloud (UChicago, 2013); NOAA Data Commons (Open Cloud Consortium, Oct 2015); NCI Data Commons (UChicago, Nov 2015)
  • 38.
  • 39.
  • 40. Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.
  • 41. [Diagram labels] Patients; Hospitals, medical research centers and doctors; Data commons containing genomic and clinical data. Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.
  • 42. Trend 2: Analytics of Things, People and Places. Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/
  • 43. People and things generating streaming data that are relevant for research.
  • 44. Places that generate data. Source: Jane Macfarlane, Here, a Division of Nokia.
  • 45. Trend 3: Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis. Source: M. Bostock, http://bl.ocks.org/mbostock/4063318
  • 46. Portable Format for Analytics (PFA); Predictive Model Markup Language (PMML); Grammar of Graphics; d3.js
  • 47. Trend 4: More Policies That Make Data Available and Analytics Repeatable
  • 48. Executive Order 13642 (May 9, 2013): Making Open and Machine Readable the Default for Government Information (“Open Data Policy”). [Figure labels: OMB Guidance; President’s Ex Order]
  • 49. Trend 5: Translational Data Science. How do we translate data driven discoveries into actions that impact society?
  • 50. Imaging Informatics Clinical Informatics Bioinformatics Public Health Informatics Basic Research Applied Research Practice (dx, treatment and prevention) Molecular & cellular processes Tissues & organs Individuals (patients) Groups & populations Quality & outcomes Translational Informatics
  • 51. Translational Data Science. [Diagram labels] New algorithms, new statistical models (data science); Applications to genomics, analysis of EMR, etc.; Software stacks for data intensive computing (data engineering); Data driven discoveries; Data driven diagnosis; Data driven therapeutics. Develop software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)
  • 52. Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.
  • 53. Part 6: Five Challenges
  • 54. Challenge 1. Is More Different? Do New Phenomena Emerge at Scale in Data? Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
  • 55. Challenge 2. One Million Genomes • Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine. • The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue). • One million genomes is about 1000 PB or 1 EB • With compression, it may be about 100 PB • At $1000/genome, the sequencing would cost about $1B • Think of this as one hundred studies with 10,000 patients each over three years.
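
The back-of-the-envelope figures on slide 55 can be checked with a few lines of arithmetic. This sketch uses decimal units (1 PB = 1000 TB, 1 EB = 1000 PB) and assumes a 10x compression ratio, which is only implied by the slide's drop from 1000 PB to about 100 PB; the variable names are illustrative.

```python
# Inputs taken from slide 55, except where marked as assumptions.
TB_PER_PATIENT = 1            # about 1 TB per patient (tumor + normal)
N_GENOMES = 1_000_000
COST_PER_GENOME_USD = 1000
PATIENTS_PER_STUDY = 10_000

TB_PER_PB = 1000              # decimal units assumed
PB_PER_EB = 1000
COMPRESSION_RATIO = 10        # assumption implied by 1000 PB -> ~100 PB

total_tb = TB_PER_PATIENT * N_GENOMES
total_pb = total_tb // TB_PER_PB
total_eb = total_pb // PB_PER_EB
compressed_pb = total_pb // COMPRESSION_RATIO
sequencing_cost_usd = COST_PER_GENOME_USD * N_GENOMES
n_studies = N_GENOMES // PATIENTS_PER_STUDY

print(f"raw: {total_pb} PB = {total_eb} EB")
print(f"compressed: about {compressed_pb} PB")
print(f"sequencing cost: ${sequencing_cost_usd:,}")
print(f"equivalent to {n_studies} studies of {PATIENTS_PER_STUDY:,} patients")
```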
  • 56. Challenge 3. Datapods • Databases have fundamentally changed the way we manage and analyze scientific data. • NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate. • If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod. • Call this a “datapod.” • It could support open source data commons and allow them to peer.
  • 57. Challenge 4. A Billion Predictive Models • Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models • Applications – George Church’s challenge: individual predictive models for each human genome (6.5 billion humans). – 1 million cancer genomes x 1,000 models / genome. – Urban science: instrumenting cities. – Consumer marketing: large advertisers will see 1-3 billion different consumers
  • 58. Challenge 5. HDSI • Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert. • Think of Human Data Science Interaction (HDSI) as how humans interact with the software supporting data science analysis at the scale of datapods, with billions of models and trillions of hypotheses. • How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?
  • 59. Questions? rgrossman.com @bobgrossman
  • 60. For More Information: cdis.uchicago.edu, www.opendatagroup.com, rgrossman.com