SlideShare una empresa de Scribd logo
1 de 60
Descargar para leer sin conexión
The	
  Emerging	
  Discipline	
  of	
  
Data	
  Science	
  
Principles	
  and	
  Techniques	
  
For	
  
Data-­‐Intensive	
  Analysis	
  
	
  
What	
  is	
  Big	
  Data	
  Analy9cs?	
  
Is	
  this	
  a	
  new	
  paradigm?	
  
What	
  is	
  the	
  role	
  of	
  data?	
  
What	
  could	
  possibly	
  go	
  wrong?	
  
What	
  is	
  Data	
  Science?	
  
Big	
  Data	
  is	
  Hot!	
  
Big	
  Data	
  Is	
  Important	
  
Hot	
  
•  Market	
  
–  Results,	
  products,	
  jobs	
  
•  Poten9al	
  
–  4th	
  Paradigm	
  
–  Accelerates	
  discovery	
  [urgent]	
  
–  BeLer:	
  cost,	
  speed,	
  specificity	
  
–  Change	
  80%	
  of	
  processes	
  [Gartner]	
  
•  Government	
  Policy	
  (45+)	
  
–  White	
  House;	
  most	
  US	
  Govt	
  agencies	
  
•  Adop9on:	
  Most	
  Human	
  Endeavors	
  
–  All	
  academic	
  disciplines	
  
–  Computa9onal	
  X	
  	
  
Cool	
  
•  Low	
  effec9ve	
  adop9on	
  [EMC]	
  
–  	
  60%	
  opera9onal	
  
–  20%	
  significant	
  change	
  
–  <	
  1%	
  effec9ve	
  
•  Results	
  not	
  opera9onal	
  
•  In	
  its	
  infancy	
  þ	
  lacking	
  
–  Understanding	
  
–  Concepts,	
  tools,	
  techniques	
  
(methods)	
  
•  21st	
  Century	
  Sta9s9cs	
  	
  
–  Theory:	
  principles,	
  guidelines	
  
Healthcare	
  Poten9al:	
  BeLer	
  Health;	
  Faster,	
  Cheaper	
  Remedies	
  
What	
  could	
  go	
  Wrong?	
  
When	
  are	
  Correla9ons	
  Spurious?	
  
Or	
  Just	
  Wrong?	
  E.g.	
  Google	
  Flu	
  Trends	
  
Allegedly	
  Real-­‐9me,	
  Reliable	
  Predic9ons	
  
High	
  100	
  out	
  of	
  108	
  weeks	
  
Future	
  of	
  Life:	
  Ins9tute	
  to	
  
“mi;gate	
  existen;al	
  risks	
  facing	
  humanity”	
  
US	
  Legal	
  Community	
  Pursuing	
  
Algorithmic	
  Accountability	
  
Do	
  We	
  Know	
  /	
  Can	
  We	
  Prove?	
  
•  DIA	
  Result:	
  correct,	
  complete,	
  efficient?	
  
•  What	
  machines	
  /	
  algorithms	
  /	
  Machine	
  
Learning	
  /	
  Black	
  Boxes	
  /	
  DIA	
  do?	
  
•  Emergent	
  Data-­‐Driven	
  Society	
  with	
  High	
  
– Reward:	
  Cancer	
  cures,	
  drug	
  discovery,	
  personalized	
  
medicine,	
  …	
  
– Risk:	
  errors	
  in	
  any	
  of	
  the	
  above	
  	
  
The	
  search	
  for	
  
truth	
  
evidence-­‐based	
  causality	
  
evidence-­‐based	
  correla9ons	
  
Model	
  /	
  
Theory	
  
Data	
  
Hypotheses	
  
Analysis	
  
Long	
  Illustrious	
  Histories	
  
Data	
  Analysis	
  
•  Mathema9cs	
  
•  Babylon	
  (17th-­‐12th	
  C	
  BCE)	
  
•  India	
  (12th	
  C	
  BCE)	
  
•  Mathema9cal	
  analysis	
  (17th	
  C,	
  
Scien9fic	
  Revolu9on)	
  
•  Sta9s9cs	
  (5th	
  C	
  BCE,	
  18th	
  C)	
  
	
  
~4,000	
  years	
  
Scien1fic	
  Method	
  
•  Empiricism	
  
–  Aristotle	
  (384-­‐322	
  BCE)	
  
–  Ptolemy	
  (1st	
  C)	
  
–  Bacons	
  (13th,	
  16th	
  C)	
  
~2,000	
  years	
  
•  Scien9fic	
  Discovery	
  Paradigms	
  
1.  Theory	
  
2.  Experimenta9on	
  
3.  Simula9on	
  
4.  eScience	
  /	
  Big	
  Data	
  
~	
  1,000	
  years	
  
Fourth	
  Paradigm	
  
Modern	
  Compu1ng	
  
•  Hardware:	
  40s-­‐50s	
  
•  FORTRAN:	
  50s	
  	
  
•  Spreadsheets:	
  70s	
  
•  Databases:	
  70s-­‐80s	
  
•  World	
  Wide	
  Web:	
  90s	
  
~	
  60	
  years	
  
Data-­‐Intensive	
  Analysis	
  of	
  Everything	
  
•  eScience	
  (~2000)	
  
•  Big	
  Data	
  (~2007)	
  
–  Par9cle	
  physics,	
  drug	
  discovery,	
  …	
  
~	
  15	
  years	
  
Paradigms	
  
–  Long	
  developments	
  
–  Significant	
  shiss	
  
•  Conceptual	
  
•  Theore9cal	
  
•  Procedural	
  
Biopsy
Sequence
Compare
TargetTest
Treat
Monitor
Precision Oncology
In silicoIn vivo In vitro
Normal skin cell
Sequencing Machines
Chromosomes
Normal cell
Cancer cell
Treated cell
Scans
Patient
Biomarkers
Original cancer cell
Source: Marty Tenebaum, Cancer Commons
Accelerating Scientific Discovery
Experiment Model
Correlations/
Hypotheses
Probabilistic
Results
What:
Correlation
Why:
Causation
Accelerating Scientific Discovery
Experiment Model
Correlations/
Hypotheses
Probabilistic
Results
What:
Correlation
Why:
Causation
WatsonBaylor
Scientists
Profound	
  Changes:	
  Paradigm	
  Shis	
  [Kuhn]	
  
•  New	
  reasoning	
  /	
  problem	
  solving	
  model	
  
–  Data	
   	
   	
   	
   	
   	
  è Data-­‐Intensive	
  (Big	
  Data	
  –	
  4	
  Vs)	
  
–  Why	
   	
   	
   	
   	
   	
  è What	
  
–  Strategic	
  (theory-­‐based)	
   	
  è Tac9cal	
  (evidence-­‐based)	
  
–  Theory-­‐driven	
  (top-­‐down)	
  è Data-­‐driven	
  (boLom-­‐up)	
  
–  Hypothesis	
  tes9ng	
   	
   	
  è Hypothesis	
  genera9on	
  
•  Enabling	
  Paradigm	
  Shiss	
  in	
  most	
  disciplines	
  
–  Science	
   	
   	
   	
   	
  è	
  	
  	
  	
  eScience	
  
–  Accelera9ng	
  (scien9fic	
  /	
  engineering)	
  discovery	
  
–  Most	
  domains	
  
•  Personalized	
  medicine 	
   	
  •	
  Urban	
  Planning	
  
•  Drug	
  interac9ons 	
   	
   	
  •	
  Social	
  and	
  Economic	
  Planning	
  
•  Beyond	
  Data-­‐Driven:	
  Symbiosis	
  
–  What	
  +	
  Why	
  
–  Human	
  intelligence	
  +	
  machine	
  intelligence	
  
THE	
  BIG	
  PICTURE:	
  MY	
  PERSPECTIVE	
  
Big	
  Data	
  and	
  Data-­‐Intensive	
  Analysis	
  
DIA	
  Pipelines	
  /	
  Ecosystem	
  
•  Q:	
  What	
  Big	
  Data	
  technologies	
  do	
  you	
  see	
  becoming	
  
very	
  popular	
  within	
  the	
  next	
  five	
  years?	
  	
  
•  A:	
  I	
  don’t	
  like	
  to	
  say	
  that	
  there’s	
  a	
  specific	
  technology,	
  …	
  there	
  
are	
  pipelines	
  that	
  you	
  would	
  build	
  that	
  have	
  pieces	
  to	
  them.	
  
How	
  do	
  you	
  process	
  the	
  data,	
  how	
  do	
  you	
  represent	
  it,	
  how	
  
do	
  you	
  store	
  it,	
  what	
  inferen9al	
  problem	
  are	
  you	
  trying	
  to	
  
solve.	
  There’s	
  a	
  whole	
  toolbox	
  or	
  ecosystem	
  that	
  you	
  have	
  
to	
  understand	
  if	
  you	
  are	
  going	
  to	
  be	
  working	
  in	
  the	
  field.	
  
Michael	
  Jordan,	
  Pehong	
  Chen	
  Dis;nguished	
  Professor	
  at	
  the	
  University	
  of	
  California,	
  Berkeley	
  	
  
Data-­‐Intensive	
  Analysis	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
Analy9cal	
  Models	
   Analy9cal	
  Methods	
  
Analy9cal	
  
Results	
  
Data	
  Science	
  
Data-­‐Intensive	
  Analysis	
  
.	
  
.	
  
.	
  
Data-­‐Intensive	
  Analysis	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
Analy9cal	
  Models	
   Analy9cal	
  Methods	
   Analy9cal	
  
Results	
  
Data	
  Science	
  
Data-­‐Intensive	
  Analysis	
  
Data-­‐Intensive	
  Analysis	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
Analy9cal	
  Models	
   Analy9cal	
  Methods	
   Analy9cal	
  
Results	
  
Data	
  Science	
  
Data	
  Management	
  for	
  Data-­‐Intensive	
  Analysis	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
External	
  
Internal	
  
.	
  
.	
  
.	
  
Data	
  Sources	
  
Global	
  Data	
  
Catalogue	
  &	
  
Grid	
  Access	
  
Shared	
  
Data	
  Repository	
  
Rela9onships	
  En99es	
  
Shared	
  Repository	
  Catalogue	
  
Raw	
  Data	
  Acquisi9on	
  &	
  
Cura9on	
  
Analy9cal	
  Data	
  Acquisi9on	
   Data-­‐Intensive	
  Analysis	
  
Data	
  Science	
  
DATA-­‐INTENSIVE	
  ANALYSIS	
  (DIA)	
  
DIA	
  PROCESS	
  (WORKFLOW	
  /	
  PIPELINE)	
  
DIA	
  USE	
  CASE	
  RANGE	
  
Research	
  Method:	
  Examine	
  Complex,	
  Large-­‐Scale	
  Use	
  Cases	
  that	
  push	
  limits	
  
Data	
  Analysis	
  èData-­‐Intensive	
  Analysis	
  
•  Common	
  defini9on–	
  far	
  too	
  simplis;c	
  :	
  extract	
  
knowledge	
  from	
  data	
  
•  DIA:	
  the	
  ac;vity	
  of	
  using	
  data	
  to	
  inves;gate	
  
phenomena,	
  to	
  acquire	
  new	
  knowledge,	
  and	
  to	
  
correct	
  and	
  integrate	
  previous	
  knowledge	
  
•  DIA	
  Process/Workflow/Pipeline:	
  a	
  sequence	
  of	
  
opera;ons	
  that	
  cons;tute	
  an	
  end-­‐to-­‐end	
  DIA	
  
from	
  source	
  data	
  to	
  a	
  quan;fied,	
  qualified	
  result	
  
My	
  Focus	
  is	
  Not	
  common	
  DIA	
  Use	
  Cases	
  
…	
  Nor	
  High	
  Impact	
  Organiza9onal	
  DIA	
  
Data(-­‐Intensive)	
  Analysis	
  Range	
  *	
  
•  Small	
  Data	
  ≠	
  (volume,	
  velocity,	
  variety)	
  98%	
  
–  Conven9onal	
  data	
  analysis:	
  1	
  K	
  years	
  -­‐	
  sta9s9cs,	
  spreadsheets,	
  databases,	
  …	
  
•  Big	
  Data	
  =	
  (volume,	
  velocity,	
  variety)	
  2%	
  
–  Simple	
  DIA:	
  “most	
  data	
  science	
  is	
  simple”	
  Jeff	
  Leek	
  96%	
  
•  Simple	
  models	
  &	
  methods,	
  single	
  user,	
  short	
  dura9on:	
  65+	
  self-­‐service	
  tools,	
  
ML,	
  widest-­‐usage	
  
•  Rela9ve	
  simplicity:	
  sales,	
  marke9ng,	
  &	
  social	
  trends,	
  defects,	
  …	
  	
  
–  Complex	
  DIA	
  4%	
  
•  Domains:	
  par9cle	
  physics,	
  economics,	
  stock	
  market,	
  genomics,	
  drug	
  
discovery,	
  weather,	
  boiling	
  water,	
  psychology,	
  …	
  
•  Models	
  &	
  Methods:	
  large,	
  collabora9ve	
  community,	
  long	
  dura9on,	
  very	
  large	
  
scale	
  
*	
  Many	
  more	
  factors	
  
Focus	
  
Why?	
  
A:	
  This	
  is	
  where	
  things	
  obviously	
  break	
  …	
  
Example	
  Scien9fic	
  Workflow	
  (Arvados)	
  
TOP-­‐QUARK,	
  LARGE	
  HADRON	
  
COLLIDER,	
  CERN,	
  SWITZERLAND	
  	
  
Complex	
  DIA	
  Use	
  Case	
  #1	
  
eScience,	
  Big	
  Science,	
  Networked	
  Science,	
  Community	
  Compu9ng,	
  
Science	
  Gateway	
  
	
  
Higg’s	
  Boson:	
  40	
  Year	
  Search	
  
LHC	
  Data	
  from	
  proton–proton	
  collisions	
  at	
  centre-­‐of-­‐mass	
  energies	
  of	
  
7	
  TeV	
  (2011)	
  and	
  8	
  TeV	
  (2012)	
  
	
  
How	
  do	
  you	
  Prove	
  Higgs	
  Boson	
  Exists?	
  
•  Standard	
  model	
  of	
  physics	
  predicts	
  (30	
  years)	
  
Higgs	
  Boson	
  characteris9cs	
  
– Mass	
  ~125	
  GeV	
  
– Decays	
  to	
  γγ,	
  WW	
  and	
  ZZ	
  boson	
  pairs	
  
– Couplings	
  to	
  W	
  and	
  Z	
  bosons	
  
– Spin	
  parity	
  
– Couples	
  to	
  up-­‐type	
  top-­‐quark	
  
– Couples	
  to	
  down-­‐type	
  fermions?	
  
– Decays	
  to	
  boLom	
  quarks	
  and	
  τ	
  leptons	
  	
  
5	
  Sigma=0.00001%	
  possible	
  error	
  (2012)	
  
10	
  Sigma	
  (2014)	
  	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
ROOT	
  in	
  
Oracle	
  
Root	
  
Model	
   	
  …	
  
Boson	
  
Model	
  
ROOT	
  
Source	
  Data	
  Sets	
  
Top	
  
Quark	
  
Model	
  
ROOT	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
Original	
  Big	
  Data	
  Applica9on,	
  e.g.,	
  ATLAS	
  high-­‐energy	
  physics	
  (CERN)	
  
Root	
  
Model	
  
1,000s	
  
Tier	
  1:	
  long-­‐term	
  cura9on;	
  RAW	
  
reprocessing	
  to	
  derived	
  formats	
  	
  
Tier	
  0:	
  Calibra9on	
  and	
  Alignment	
  
Express	
  Stream	
  Analysis	
  	
  
Hardware	
  filtering	
  
Raw	
  Data	
  Acquisi9on	
  &	
  Cura9on	
   Analy9cal	
  Data	
  Acquisi9on	
   Data-­‐Intensive	
  Analysis	
  
Author/more	
  info	
  	
   37	
  
Worldwide	
  LHC	
  Computing	
  Grid	
  
April	
  2012	
  
Tier	
  0	
  (CERN)	
  
•  Data	
  recording	
  
•  Ini9al	
  data	
  reconstruc9on	
  
•  Data	
  distribu9on	
  
Tier	
  1	
  (13	
  centers)	
  
•  Permanent	
  storage	
  
•  Re-­‐processing	
  
•  Analysis	
  
	
  
	
  
Tier	
  2	
  (~160	
  centers)	
  
•  Simula9on	
  
•  End-­‐user	
  analysis	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
ROOT	
  
(Oracle)	
  
Root	
  
Model	
   	
  …	
  
Source	
  Data	
  Sets	
  
Original	
  Big	
  Data	
  Applica9on,	
  e.g.,	
  ATLAS	
  high-­‐energy	
  physics	
  (CERN);	
  Oracle	
  +	
  DQ2	
  +	
  ROOT	
  
ATLAS	
  Distributed	
  Data	
  Management	
  System	
  (DQ2)	
  (Pig,	
  Hive,	
  Hadoop)	
  2007+	
  
On	
  the	
  Worldwide	
  LHC	
  Compu9ng	
  Grid	
  (WLCG)	
  
Root	
  
Model	
  
Boson	
  
Model	
  
ROOT	
  
Top	
  
Quark	
  
Model	
  
ROOT	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
1,000s	
  
Tier	
  1:	
  long-­‐term	
  cura9on;	
  RAW	
  
reprocessing	
  to	
  derived	
  formats	
  	
  
Tier	
  0:	
  Calibra9on	
  and	
  Alignment	
  
Express	
  Stream	
  Analysis	
  	
  
Hardware	
  filtering	
  
Meta-­‐Data	
  
/	
  Access	
  
DQ2	
  
(HDFS)	
  
Raw	
  Data	
  Acquisi9on	
  &	
  Cura9on	
   Analy9cal	
  Data	
  Acquisi9on	
  
Data-­‐Intensive	
  
Analysis	
  
LESSONS	
  LEARNED	
  
Based	
  on	
  ~30	
  Large-­‐Scale	
  DIA	
  Use	
  Cases	
  
DIA	
  Lessons	
  Learned	
  (What)	
  
•  A	
  Sosware	
  Ar9fact:	
  a	
  workflow	
  /	
  pipeline	
  
–  Data-­‐Intensive	
  Analysis	
  Workflow	
  
•  Data	
  Management	
  (80%)	
  
–  (Raw)	
  Data	
  Acquisi9on	
  and	
  Cura9on	
  
–  Analy9cal	
  Data	
  Acquisi9on	
  
•  Data-­‐Intensive	
  Analysis	
  (20%)	
  
–  Objec9ve:	
  switch	
  80:20	
  to	
  20:80	
   	
   	
  è Let	
  scien;sts	
  do	
  science	
  
–  Explore	
  (DIA)	
  vs	
  Build	
  (sosware	
  engineering)	
  
–  Dura9on:	
  years	
  
•  Emerging	
  Paradigm	
  
–  New	
  programming	
  paradigm	
  
–  Experiments	
  over	
  data	
  
–  Convergence	
  
•  Scien9fic	
  /	
  engineering	
  discovery	
  
•  ~10	
  programming	
  paradigms:	
  database,	
  IR,	
  BI,	
  DM,	
  …	
  	
  
DIA	
  Lessons	
  Learned	
  (How)	
  
•  Result	
  Types	
  
–  Provable	
  <-­‐>	
  Probabilis9c	
  <-­‐>	
  Specula9ve	
  
•  Nature	
  
–  Analy9cal	
  
•  Empirical:	
  complete	
  meta-­‐data	
  
•  Abstract:	
  incomplete	
  meta-­‐data	
  
–  Phases:	
  Explora9on,	
  Analysis,	
  Interpreta9on	
  	
  
–  Exploratory,	
  Itera9ve,	
  and	
  Incremental	
  
•  Users	
  
–  Individual	
  
–  Workgroup	
  
–  Organiza9on	
  /	
  Enterprise	
  
–  Community	
  	
  
DIA	
  Lessons	
  Learned	
  (People)	
  
•  Machine	
  +	
  Human	
  Intelligence	
  
–  Symbiosis	
  –	
  op9mized	
  
–  Domain	
  knowledge	
  cri9cal	
  
•  Mul9-­‐disciplinary,	
  Collabora9ve,	
  Itera9ve	
  
•  Community	
  Compu9ng:	
  DIA	
  Ecosystems	
  –	
  sharing	
  
–  Massive	
  resources	
  
–  Knowledge	
  
–  Costs	
  
–  Many	
  (~60):	
  eScience,	
  Science	
  Gateways,	
  Networked	
  Science,	
  …	
  
•  High-­‐energy	
  physics	
  (CERN:	
  ROOT)	
  
•  Astrophysics	
  (Gaia)	
  
•  Scien9fic	
  Workflow	
  Systems:	
  ~30	
  
•  Macroeconomics	
  
•  Global	
  Alliance	
  for	
  Genomics	
  and	
  Health	
  
•  Enterprise	
  Ecosystems,	
  e.g.,	
  Informa9on	
  Services:	
  Thomson	
  Reuters,	
  Bloomberg,	
  …	
  
•  Open-­‐Science-­‐Grid	
  
•  The	
  Cancer	
  Genome	
  Atlas	
  
•  The	
  Cancer	
  Genomics	
  Hub	
  
The	
  value	
  and	
  role	
  of	
  
truth	
  
evidence-­‐based	
  causality	
  
evidence-­‐based	
  correla9ons	
  
What	
  data	
  is	
  adequate	
  evidence	
  for	
  Q?	
  
DIA	
  Lessons	
  Learned	
  (Essence)	
  
DOW	
  JONES,	
  BLOOMBERG,	
  
THOMSON	
  REUTERS,	
  PEARSON,	
  …	
  
Complex	
  DIA	
  Use	
  Case	
  #2	
  
Informa9on	
  Services	
  
	
  
Informa9on	
  Services	
  Business	
  
Collect,	
  curate,	
  enrich,	
  augment	
  (IP)	
  &	
  disseminate	
  informa9on	
  
	
  
–  Financial	
  &	
  Risk	
  
•  Investors	
  	
  
•  News	
  &	
  press	
  releases	
  
•  Brokerage	
  research	
  
•  Instruments:	
  stocks,	
  
bonds,	
  loans,	
  …	
  
–  Legal	
  
•  Dockets	
  
•  Case	
  Law	
  
•  Public	
  records	
  
•  Law	
  firms	
  
•  Global	
  businesses	
  
–  Intellectual	
  Property	
  &	
  
Science	
  
•  Scien9fic	
  ar9cles	
  
•  Patents	
  
•  Trademarks	
  
•  Domain	
  names	
  
•  Clinical	
  trials	
  
–  Tax	
  &	
  Accoun9ng	
  	
  
•  Corporate	
  
•  Government	
  
•  Solu9ons	
  
Consequences	
  of	
  Errors	
  
DS1	
  
Knowledge	
  
Graph	
  
(linked	
  En99es)	
  
Knowledge	
  
Graph	
  Data	
  
Internal	
  
&	
  External	
  
Data	
  Sources	
  
Enterprise-­‐Scale	
  Big	
  Data	
  Architecture	
  (Informa9on	
  Services)	
  
Domain	
  specific	
  Data	
  Marts	
  
Domain	
  specific	
  Knowledge	
  Graphs	
  
data	
  
En9ty1	
  
data	
  
En9ty2	
  
Org	
  
data	
  
Organiza1on	
  
data	
  
En9ty3	
  
DSn	
  
Open	
  
Data	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
Directory	
  
Pre-­‐Big	
  Data	
  
Open	
  
Registry	
  
Big	
  Data	
  
Catalogue	
  
Big	
  Data	
   Domain1	
  
.	
  
.	
  
.	
  
Domaink	
  
.	
  
.	
  
.	
  
.	
  
.	
  
.	
  
KG1	
  
.	
  
.	
  
.	
  
Raw	
  Data	
  Acquisi9on	
  &	
  Cura9on	
  
Analy9cal	
  Data	
  
Acquisi9on	
  
Data-­‐Intensive	
  
Analysis	
  
Opinion	
  
IP	
  
✚	
  
DIA	
  Lessons	
  Learned	
  
•  Modelling	
  
–  Analy9cal	
  Models	
  and	
  Methods	
  
•  Selec9on	
  /	
  crea9on,	
  fi„ng	
  /	
  tuning	
  
•  Result	
  verifica9on	
  
•  Model	
  /	
  method	
  management	
  
–  Data	
  Models	
  
•  En99es	
  dominate	
  “Data	
  Lakes”	
  
•  Named	
  En99es	
  +	
  En9ty	
  (Graph)	
  Models	
  
•  Ontologies	
  (genomics),	
  Ensembles,	
  …	
  
•  Emerging	
  DIA	
  Ecosystems	
  Technology	
  
–  Languages	
  (~30)	
  
–  Analy9cs	
  Suites	
  /	
  Pla…orms	
  (~60)	
  
–  Big	
  Data	
  Management	
  (~30)	
  
WHAT	
  COULD	
  POSSIBLY	
  GO	
  WRONG?	
  
Veracity	
  
Do	
  We	
  Know	
  /	
  Can	
  We	
  Prove	
  
•  DIA	
  Result:	
  correct,	
  complete,	
  efficient?	
  
•  What	
  machines	
  /	
  algorithms	
  /	
  Machine	
  
Learning	
  /	
  Black	
  Boxes	
  /	
  DIA	
  do?	
  
•  High	
  Risk	
  /	
  High	
  Reward	
  Data-­‐Driven	
  Society	
  
– Risk:	
  drugs	
  or	
  medical	
  advice	
  that	
  cause	
  harm	
  
– Reward:	
  faster,	
  cheaper,	
  more	
  effec9ve	
  cancer	
  
cures,	
  drug	
  discovery,	
  personalized	
  medicine,	
  …	
  
Professional	
  Cau9ons	
  
•  Experienced	
  prac99oners	
  
•  Medicine	
  
– Few	
  data-­‐driven	
  results	
  opera9onalized	
  
– Mount	
  Sinai:	
  no	
  black	
  box	
  solu9ons	
  
•  Authorita9ve	
  organiza9ons	
  
–  NIH,	
  HHS,	
  EOCD,	
  Na9onal	
  Sta9s9cal	
  Organiza9ons,	
  …	
  
•  Legal:	
  Algorithmic	
  Accountability:	
  John	
  ZiLrain,	
  
Harvard	
  Law	
  School	
  
Q:	
  What	
  could	
  possibly	
  go	
  wrong?	
  
A:	
  Every	
  step	
  
•  Data	
  Sets	
  
–  All	
  measurements	
  approximate:	
  availability,	
  quality,	
  requirements,	
  
sparse/dense,	
  …	
  ;	
  How	
  much	
  can	
  we	
  tolerate?	
  What	
  is	
  the	
  impact	
  on	
  
the	
  result?	
  
•  Models:	
  “All	
  models	
  wrong	
  …”	
  George	
  Box	
  1974	
  
•  Methods	
  
–  Select	
  (1,000s),	
  tune,	
  verify	
  
–  Different	
  methods	
  èradically	
  different	
  results	
  
•  Results:	
  Probabilis9c,	
  error	
  bounds,	
  verifica9on,	
  …	
  
	
  
Data	
  Analysis	
  is	
  20%	
  of	
  the	
  story	
  
Pre	
  Big	
  Data	
  Challenges	
  	
  
•  Science:	
  Experimental	
  design:	
  hypotheses,	
  
null	
  hypotheses,	
  dependent	
  and	
  independent	
  
variables,	
  controls,	
  blocking,	
  randomiza9on,	
  
repeatability,	
  accuracy	
  
•  Analysis:	
  models,	
  methods	
  
•  Resources:	
  cost,	
  9me,	
  precision	
  
+Big	
  Data	
  Challenges	
  …	
  
•  Pre-­‐Big	
  Data	
  Challenges	
  @	
  scale:	
  volume,	
  velocity,	
  and	
  variety	
  
•  Complexity	
  
–  Data:	
  sources,	
  meta-­‐data,	
  3Vs	
  
–  Models	
  (reflec9ng	
  the	
  domain)	
  
–  Methods	
  (mul9variate	
  paLerns)	
  beyond	
  human	
  cogni9on	
  
–  Results:	
  Massive	
  numbers	
  of	
  correla9ons	
  
•  Unreliability	
  (sta9s9cs	
  @	
  scale)	
  
–  Reliability	
  decreases	
  as	
  the	
  number	
  of	
  variables	
  increases	
  (mul9variate	
  analysis)	
  
–  <<	
  10	
  variables	
  (science	
  &drug	
  discovery)	
  è	
  1,000s	
  to	
  millions	
  (Machine	
  Learning)	
  	
  
•  “In	
  science	
  and	
  medical	
  research,	
  we’ve	
  always	
  known	
  that”	
  
•  Misunderstood:	
  Self-­‐service,	
  Automated	
  Data	
  Science-­‐in-­‐a-­‐Box	
  
–  80%	
  unfamiliar	
  with	
  sta9s9cs,	
  error	
  bars,	
  causa9on/correla9on,	
  probabilis9c	
  
reasoning,	
  automated	
  data	
  cura9on	
  and	
  analysis	
  
–  Widespread	
  use	
  of	
  DIA:	
  self-­‐service,	
  “democra9za9on	
  of	
  analy9cs”	
  
–  Like	
  monkeys	
  playing	
  with	
  loaded	
  guns	
  	
  
Somewhere	
  over	
  there	
  …	
  
DIA	
  Verifica9on	
  
Principles	
  &	
  Techniques	
  
•  Conven9onal	
  disciplines	
  	
  
•  Man-­‐machine	
  symbiosis	
  
•  DIA	
  Result	
  èEmpirical	
  evidence	
  
–  Flashlight	
  analogy:	
  DIA	
  reduces	
  hypothesis	
  space	
  
•  Cross-­‐valida9on	
  
–  Validate	
  predic9ve	
  model:	
  avoid	
  overfi„ng,	
  will	
  model	
  work	
  on	
  
unseen	
  data	
  sets?	
  
–  Data	
  par99ons:	
  Training	
  Set,	
  Test	
  /Valida9on	
  Set,	
  ground	
  truth	
  	
  	
  
–  K-­‐fold	
  cross	
  valida9on	
  
•  Research	
  Direc9on	
  
–  New	
  measures	
  of	
  significance,	
  the	
  next	
  genera9on	
  P	
  value	
  
–  21st	
  Century	
  sta9s9cs	
  
SCIENTIFIC	
  METHOD	
  	
  è EMPIRICISM	
  
DATA	
  SCIENCE	
  	
   	
   	
  è DATA-­‐INTENSIVE	
  ANALYSIS	
  
	
  
I	
  Proposal	
  
Data	
  Science	
  is	
  …	
  
A	
  body	
  of	
  principles	
  and	
  techniques	
  for	
  applying	
  data-­‐
intensive	
  analysis	
  for	
  inves;ga;ng	
  phenomena,	
  
acquiring	
  new	
  knowledge,	
  and	
  correc;ng	
  and	
  
integra;ng	
  previous	
  knowledge	
  with	
  measures	
  of	
  
correctness,	
  completeness,	
  and	
  efficiency.	
  
DIA:	
  an	
  experiment	
  over	
  data	
  
Conclusions	
  
Big	
  Data	
  	
  &	
  Data-­‐Intensive	
  Analysis	
  
•  Value	
  of	
  evidence	
  (from	
  data)	
  
•  Emerging	
  reasoning	
  and	
  problem	
  solving	
  paradigm	
  
–  High	
  risk	
  /	
  high	
  reward	
  
–  Substan9al	
  results	
  already	
  
–  In	
  its	
  infancy,	
  not	
  yet	
  understood,	
  decades	
  to	
  go	
  
–  Overhyped	
  (short	
  term)	
  but	
  may	
  change	
  our	
  world	
  (long	
  term)	
  
•  è	
  Need	
  for	
  Data	
  Science	
  =	
  principles	
  &	
  guidelines	
  
–  “We’re	
  now	
  at	
  the	
  “what	
  are	
  the	
  principles?”	
  point	
  in	
  9me”	
  M.	
  Jordan	
  
–  Decades	
  of	
  research	
  and	
  prac9ce	
  
Thank You!

Más contenido relacionado

La actualidad más candente

Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learningGiuseppe Manco
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Ilkay Altintas, Ph.D.
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data scienceJordan Engbers
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science Mahesh Kumar CV
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Presentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical SocietyPresentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical Societyosimod
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositoriesChris Rusbridge
 
Science20brussels osimo april2013
Science20brussels osimo april2013Science20brussels osimo april2013
Science20brussels osimo april2013osimod
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypseENUG
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceDr. Haxel Consult
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Greg Landrum
 
Is the current measure of excellence perverting Science? A Data deluge is com...
Is the current measure of excellence perverting Science? A Data deluge is com...Is the current measure of excellence perverting Science? A Data deluge is com...
Is the current measure of excellence perverting Science? A Data deluge is com...Lourdes Verdes-Montenegro
 

La actualidad más candente (20)

Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data science
 
Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Presentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical SocietyPresentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical Society
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositories
 
Science20brussels osimo april2013
Science20brussels osimo april2013Science20brussels osimo april2013
Science20brussels osimo april2013
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
 
Fair by design
Fair by designFair by design
Fair by design
 
Is the current measure of excellence perverting Science? A Data deluge is com...
Is the current measure of excellence perverting Science? A Data deluge is com...Is the current measure of excellence perverting Science? A Data deluge is com...
Is the current measure of excellence perverting Science? A Data deluge is com...
 
Data science 101
Data science 101Data science 101
Data science 101
 

Similar a Dia sds2015 web version

Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptxshalini s
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxssuser1a4f0f
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxwahiba ben abdessalem
 
Data Science-1 (1).ppt
Data Science-1 (1).pptData Science-1 (1).ppt
Data Science-1 (1).pptSanjayAcharaya
 
Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptxAkhirulAminulloh2
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedPhilip Bourne
 
Human Genome and Big Data Challenges
Human Genome and Big Data ChallengesHuman Genome and Big Data Challenges
Human Genome and Big Data ChallengesPhilip Bourne
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptSangrangBargayary3
 
Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1sasi
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Big data, new epistemologies and paradigm shifts
Big data, new epistemologies and paradigm shiftsBig data, new epistemologies and paradigm shifts
Big data, new epistemologies and paradigm shiftsrobkitchin
 
Nicholas Jewell MedicReS World Congress 2014
Nicholas Jewell MedicReS World Congress 2014Nicholas Jewell MedicReS World Congress 2014
Nicholas Jewell MedicReS World Congress 2014MedicReS
 
Data Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything ChangeData Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything ChangePhilip Bourne
 

Similar a Dia sds2015 web version (20)

Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptx
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptx
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptx
 
DBMS
DBMSDBMS
DBMS
 
Data Science-1 (1).ppt
Data Science-1 (1).pptData Science-1 (1).ppt
Data Science-1 (1).ppt
 
Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptx
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 
Human Genome and Big Data Challenges
Human Genome and Big Data ChallengesHuman Genome and Big Data Challenges
Human Genome and Big Data Challenges
 
Research-Data-Management-and-your-PhD
Research-Data-Management-and-your-PhDResearch-Data-Management-and-your-PhD
Research-Data-Management-and-your-PhD
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
Claudia Bauzer Medeiros - Open Science meets Data Science: Some challenges to...
Claudia Bauzer Medeiros - Open Science meets Data Science: Some challenges to...Claudia Bauzer Medeiros - Open Science meets Data Science: Some challenges to...
Claudia Bauzer Medeiros - Open Science meets Data Science: Some challenges to...
 
Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1
 
Öppen data och forskningens genomslag
Öppen data och forskningens genomslagÖppen data och forskningens genomslag
Öppen data och forskningens genomslag
 
Research Data Management and your PhD
Research Data Management and your PhDResearch Data Management and your PhD
Research Data Management and your PhD
 
Unit 1
Unit 1Unit 1
Unit 1
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Big data, new epistemologies and paradigm shifts
Big data, new epistemologies and paradigm shiftsBig data, new epistemologies and paradigm shifts
Big data, new epistemologies and paradigm shifts
 
Nicholas Jewell MedicReS World Congress 2014
Nicholas Jewell MedicReS World Congress 2014Nicholas Jewell MedicReS World Congress 2014
Nicholas Jewell MedicReS World Congress 2014
 
Data Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything ChangeData Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything Change
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 

Último

Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 

Último (20)

Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 

Dia sds2015 web version

  • 1. The  Emerging  Discipline  of   Data  Science   Principles  and  Techniques   For   Data-­‐Intensive  Analysis    
  • 2. What  is  Big  Data  Analy9cs?   Is  this  a  new  paradigm?   What  is  the  role  of  data?   What  could  possibly  go  wrong?   What  is  Data  Science?  
  • 3. Big  Data  is  Hot!  
  • 4. Big  Data  Is  Important   Hot   •  Market   –  Results,  products,  jobs   •  Poten9al   –  4th  Paradigm   –  Accelerates  discovery  [urgent]   –  BeLer:  cost,  speed,  specificity   –  Change  80%  of  processes  [Gartner]   •  Government  Policy  (45+)   –  White  House;  most  US  Govt  agencies   •  Adop9on:  Most  Human  Endeavors   –  All  academic  disciplines   –  Computa9onal  X     Cool   •  Low  effec9ve  adop9on  [EMC]   –   60%  opera9onal   –  20%  significant  change   –  <  1%  effec9ve   •  Results  not  opera9onal   •  In  its  infancy  þ  lacking   –  Understanding   –  Concepts,  tools,  techniques   (methods)   •  21st  Century  Sta9s9cs     –  Theory:  principles,  guidelines  
  • 5. Healthcare  Poten9al:  BeLer  Health;  Faster,  Cheaper  Remedies  
  • 6. What  could  go  Wrong?   When  are  Correla9ons  Spurious?  
  • 7. Or  Just  Wrong?  E.g.  Google  Flu  Trends   Allegedly  Real-­‐9me,  Reliable  Predic9ons   High  100  out  of  108  weeks  
  • 8. Future  of  Life:  Ins9tute  to   “mi;gate  existen;al  risks  facing  humanity”  
  • 9. US  Legal  Community  Pursuing   Algorithmic  Accountability  
  • 10. Do  We  Know  /  Can  We  Prove?   •  DIA  Result:  correct,  complete,  efficient?   •  What  machines  /  algorithms  /  Machine   Learning  /  Black  Boxes  /  DIA  do?   •  Emergent  Data-­‐Driven  Society  with  High   – Reward:  Cancer  cures,  drug  discovery,  personalized   medicine,  …   – Risk:  errors  in  any  of  the  above    
  • 11. The  search  for   truth   evidence-­‐based  causality   evidence-­‐based  correla9ons  
  • 12. Model  /   Theory   Data   Hypotheses   Analysis  
  • 13. Long  Illustrious  Histories   Data  Analysis   •  Mathema9cs   •  Babylon  (17th-­‐12th  C  BCE)   •  India  (12th  C  BCE)   •  Mathema9cal  analysis  (17th  C,   Scien9fic  Revolu9on)   •  Sta9s9cs  (5th  C  BCE,  18th  C)     ~4,000  years   Scien1fic  Method   •  Empiricism   –  Aristotle  (384-­‐322  BCE)   –  Ptolemy  (1st  C)   –  Bacons  (13th,  16th  C)   ~2,000  years   •  Scien9fic  Discovery  Paradigms   1.  Theory   2.  Experimenta9on   3.  Simula9on   4.  eScience  /  Big  Data   ~  1,000  years  
  • 14. Fourth  Paradigm   Modern  Compu1ng   •  Hardware:  40s-­‐50s   •  FORTRAN:  50s     •  Spreadsheets:  70s   •  Databases:  70s-­‐80s   •  World  Wide  Web:  90s   ~  60  years   Data-­‐Intensive  Analysis  of  Everything   •  eScience  (~2000)   •  Big  Data  (~2007)   –  Par9cle  physics,  drug  discovery,  …   ~  15  years   Paradigms   –  Long  developments   –  Significant  shiss   •  Conceptual   •  Theore9cal   •  Procedural  
  • 15. Biopsy Sequence Compare TargetTest Treat Monitor Precision Oncology In silicoIn vivo In vitro Normal skin cell Sequencing Machines Chromosomes Normal cell Cancer cell Treated cell Scans Patient Biomarkers Original cancer cell Source: Marty Tenebaum, Cancer Commons
  • 16. Accelerating Scientific Discovery Experiment Model Correlations/ Hypotheses Probabilistic Results What: Correlation Why: Causation
  • 17. Accelerating Scientific Discovery Experiment Model Correlations/ Hypotheses Probabilistic Results What: Correlation Why: Causation WatsonBaylor Scientists
  • 18. Profound  Changes:  Paradigm  Shis  [Kuhn]   •  New  reasoning  /  problem  solving  model   –  Data            è Data-­‐Intensive  (Big  Data  –  4  Vs)   –  Why            è What   –  Strategic  (theory-­‐based)    è Tac9cal  (evidence-­‐based)   –  Theory-­‐driven  (top-­‐down)  è Data-­‐driven  (boLom-­‐up)   –  Hypothesis  tes9ng      è Hypothesis  genera9on   •  Enabling  Paradigm  Shiss  in  most  disciplines   –  Science          è        eScience   –  Accelera9ng  (scien9fic  /  engineering)  discovery   –  Most  domains   •  Personalized  medicine    •  Urban  Planning   •  Drug  interac9ons      •  Social  and  Economic  Planning   •  Beyond  Data-­‐Driven:  Symbiosis   –  What  +  Why   –  Human  intelligence  +  machine  intelligence  
  • 19. THE  BIG  PICTURE:  MY  PERSPECTIVE   Big  Data  and  Data-­‐Intensive  Analysis  
  • 20. DIA  Pipelines  /  Ecosystem   •  Q:  What  Big  Data  technologies  do  you  see  becoming   very  popular  within  the  next  five  years?     •  A:  I  don’t  like  to  say  that  there’s  a  specific  technology,  …  there   are  pipelines  that  you  would  build  that  have  pieces  to  them.   How  do  you  process  the  data,  how  do  you  represent  it,  how   do  you  store  it,  what  inferen9al  problem  are  you  trying  to   solve.  There’s  a  whole  toolbox  or  ecosystem  that  you  have   to  understand  if  you  are  going  to  be  working  in  the  field.   Michael  Jordan,  Pehong  Chen  Dis;nguished  Professor  at  the  University  of  California,  Berkeley    
  • 21. Data-­‐Intensive  Analysis   .   .   .   .   .   .   Analy9cal  Models   Analy9cal  Methods   Analy9cal   Results   Data  Science   Data-­‐Intensive  Analysis   .   .   .  
  • 22. Data-­‐Intensive  Analysis   .   .   .   .   .   .   .   .   .   Analy9cal  Models   Analy9cal  Methods   Analy9cal   Results   Data  Science   Data-­‐Intensive  Analysis  
  • 23. Data-­‐Intensive  Analysis   .   .   .   .   .   .   .   .   .   Analy9cal  Models   Analy9cal  Methods   Analy9cal   Results   Data  Science   Data  Management  for  Data-­‐Intensive  Analysis   .   .   .   .   .   .   External   Internal   .   .   .   Data  Sources   Global  Data   Catalogue  &   Grid  Access   Shared   Data  Repository   Rela9onships  En99es   Shared  Repository  Catalogue   Raw  Data  Acquisi9on  &   Cura9on   Analy9cal  Data  Acquisi9on   Data-­‐Intensive  Analysis   Data  Science  
  • 24. DATA-­‐INTENSIVE  ANALYSIS  (DIA)   DIA  PROCESS  (WORKFLOW  /  PIPELINE)   DIA  USE  CASE  RANGE   Research  Method:  Examine  Complex,  Large-­‐Scale  Use  Cases  that  push  limits  
  • 25. Data  Analysis  èData-­‐Intensive  Analysis   •  Common  defini9on–  far  too  simplis;c  :  extract   knowledge  from  data   •  DIA:  the  ac;vity  of  using  data  to  inves;gate   phenomena,  to  acquire  new  knowledge,  and  to   correct  and  integrate  previous  knowledge   •  DIA  Process/Workflow/Pipeline:  a  sequence  of   opera;ons  that  cons;tute  an  end-­‐to-­‐end  DIA   from  source  data  to  a  quan;fied,  qualified  result  
  • 26. My  Focus  is  Not  common  DIA  Use  Cases  
  • 27. …  Nor  High  Impact  Organiza9onal  DIA  
  • 28. Data(-­‐Intensive)  Analysis  Range  *   •  Small  Data  ≠  (volume,  velocity,  variety)  98%   –  Conven9onal  data  analysis:  1  K  years  -­‐  sta9s9cs,  spreadsheets,  databases,  …   •  Big  Data  =  (volume,  velocity,  variety)  2%   –  Simple  DIA:  “most  data  science  is  simple”  Jeff  Leek  96%   •  Simple  models  &  methods,  single  user,  short  dura9on:  65+  self-­‐service  tools,   ML,  widest-­‐usage   •  Rela9ve  simplicity:  sales,  marke9ng,  &  social  trends,  defects,  …     –  Complex  DIA  4%   •  Domains:  par9cle  physics,  economics,  stock  market,  genomics,  drug   discovery,  weather,  boiling  water,  psychology,  …   •  Models  &  Methods:  large,  collabora9ve  community,  long  dura9on,  very  large   scale   *  Many  more  factors   Focus   Why?   A:  This  is  where  things  obviously  break  …  
  • 30. TOP-­‐QUARK,  LARGE  HADRON   COLLIDER,  CERN,  SWITZERLAND     Complex  DIA  Use  Case  #1   eScience,  Big  Science,  Networked  Science,  Community  Compu9ng,   Science  Gateway    
  • 31.
  • 32. Higg’s  Boson:  40  Year  Search   LHC  Data  from  proton–proton  collisions  at  centre-­‐of-­‐mass  energies  of   7  TeV  (2011)  and  8  TeV  (2012)    
  • 33. How  do  you  Prove  Higgs  Boson  Exists?   •  Standard  model  of  physics  predicts  (30  years)   Higgs  Boson  characteris9cs   – Mass  ~125  GeV   – Decays  to  γγ,  WW  and  ZZ  boson  pairs   – Couplings  to  W  and  Z  bosons   – Spin  parity   – Couples  to  up-­‐type  top-­‐quark   – Couples  to  down-­‐type  fermions?   – Decays  to  boLom  quarks  and  τ  leptons    
  • 34. 5  Sigma=0.00001%  possible  error  (2012)   10  Sigma  (2014)    
  • 35.
  • 36. .   .   .   .   .   .   ROOT  in   Oracle   Root   Model    …   Boson   Model   ROOT   Source  Data  Sets   Top   Quark   Model   ROOT   .   .   .   .   .   .   .   .   .   Original  Big  Data  Applica9on,  e.g.,  ATLAS  high-­‐energy  physics  (CERN)   Root   Model   1,000s   Tier  1:  long-­‐term  cura9on;  RAW   reprocessing  to  derived  formats     Tier  0:  Calibra9on  and  Alignment   Express  Stream  Analysis     Hardware  filtering   Raw  Data  Acquisi9on  &  Cura9on   Analy9cal  Data  Acquisi9on   Data-­‐Intensive  Analysis  
  • 37. Author/more  info     37   Worldwide  LHC  Computing  Grid   April  2012   Tier  0  (CERN)   •  Data  recording   •  Ini9al  data  reconstruc9on   •  Data  distribu9on   Tier  1  (13  centers)   •  Permanent  storage   •  Re-­‐processing   •  Analysis       Tier  2  (~160  centers)   •  Simula9on   •  End-­‐user  analysis  
  • 38. .   .   .   .   .   .   ROOT   (Oracle)   Root   Model    …   Source  Data  Sets   Original  Big  Data  Applica9on,  e.g.,  ATLAS  high-­‐energy  physics  (CERN);  Oracle  +  DQ2  +  ROOT   ATLAS  Distributed  Data  Management  System  (DQ2)  (Pig,  Hive,  Hadoop)  2007+   On  the  Worldwide  LHC  Compu9ng  Grid  (WLCG)   Root   Model   Boson   Model   ROOT   Top   Quark   Model   ROOT   .   .   .   .   .   .   .   .   .   1,000s   Tier  1:  long-­‐term  cura9on;  RAW   reprocessing  to  derived  formats     Tier  0:  Calibra9on  and  Alignment   Express  Stream  Analysis     Hardware  filtering   Meta-­‐Data   /  Access   DQ2   (HDFS)   Raw  Data  Acquisi9on  &  Cura9on   Analy9cal  Data  Acquisi9on   Data-­‐Intensive   Analysis  
  • 39. LESSONS  LEARNED   Based  on  ~30  Large-­‐Scale  DIA  Use  Cases  
  • 40. DIA  Lessons  Learned  (What)   •  A  Sosware  Ar9fact:  a  workflow  /  pipeline   –  Data-­‐Intensive  Analysis  Workflow   •  Data  Management  (80%)   –  (Raw)  Data  Acquisi9on  and  Cura9on   –  Analy9cal  Data  Acquisi9on   •  Data-­‐Intensive  Analysis  (20%)   –  Objec9ve:  switch  80:20  to  20:80      è Let  scien;sts  do  science   –  Explore  (DIA)  vs  Build  (sosware  engineering)   –  Dura9on:  years   •  Emerging  Paradigm   –  New  programming  paradigm   –  Experiments  over  data   –  Convergence   •  Scien9fic  /  engineering  discovery   •  ~10  programming  paradigms:  database,  IR,  BI,  DM,  …    
  • 41. DIA  Lessons  Learned  (How)   •  Result  Types   –  Provable  <-­‐>  Probabilis9c  <-­‐>  Specula9ve   •  Nature   –  Analy9cal   •  Empirical:  complete  meta-­‐data   •  Abstract:  incomplete  meta-­‐data   –  Phases:  Explora9on,  Analysis,  Interpreta9on     –  Exploratory,  Itera9ve,  and  Incremental   •  Users   –  Individual   –  Workgroup   –  Organiza9on  /  Enterprise   –  Community    
  • 42. DIA  Lessons  Learned  (People)   •  Machine  +  Human  Intelligence   –  Symbiosis  –  op9mized   –  Domain  knowledge  cri9cal   •  Mul9-­‐disciplinary,  Collabora9ve,  Itera9ve   •  Community  Compu9ng:  DIA  Ecosystems  –  sharing   –  Massive  resources   –  Knowledge   –  Costs   –  Many  (~60):  eScience,  Science  Gateways,  Networked  Science,  …   •  High-­‐energy  physics  (CERN:  ROOT)   •  Astrophysics  (Gaia)   •  Scien9fic  Workflow  Systems:  ~30   •  Macroeconomics   •  Global  Alliance  for  Genomics  and  Health   •  Enterprise  Ecosystems,  e.g.,  Informa9on  Services:  Thomson  Reuters,  Bloomberg,  …   •  Open-­‐Science-­‐Grid   •  The  Cancer  Genome  Atlas   •  The  Cancer  Genomics  Hub  
  • 43. The  value  and  role  of   truth   evidence-­‐based  causality   evidence-­‐based  correla9ons   What  data  is  adequate  evidence  for  Q?   DIA  Lessons  Learned  (Essence)  
  • 44. DOW  JONES,  BLOOMBERG,   THOMSON  REUTERS,  PEARSON,  …   Complex  DIA  Use  Case  #2   Informa9on  Services    
  • 45. Informa9on  Services  Business   Collect,  curate,  enrich,  augment  (IP)  &  disseminate  informa9on     –  Financial  &  Risk   •  Investors     •  News  &  press  releases   •  Brokerage  research   •  Instruments:  stocks,   bonds,  loans,  …   –  Legal   •  Dockets   •  Case  Law   •  Public  records   •  Law  firms   •  Global  businesses   –  Intellectual  Property  &   Science   •  Scien9fic  ar9cles   •  Patents   •  Trademarks   •  Domain  names   •  Clinical  trials   –  Tax  &  Accoun9ng     •  Corporate   •  Government   •  Solu9ons  
  • 47. DS1   Knowledge   Graph   (linked  En99es)   Knowledge   Graph  Data   Internal   &  External   Data  Sources   Enterprise-­‐Scale  Big  Data  Architecture  (Informa9on  Services)   Domain  specific  Data  Marts   Domain  specific  Knowledge  Graphs   data   En9ty1   data   En9ty2   Org   data   Organiza1on   data   En9ty3   DSn   Open   Data   .   .   .   .   .   .   .   .   .   Directory   Pre-­‐Big  Data   Open   Registry   Big  Data   Catalogue   Big  Data   Domain1   .   .   .   Domaink   .   .   .   .   .   .   KG1   .   .   .   Raw  Data  Acquisi9on  &  Cura9on   Analy9cal  Data   Acquisi9on   Data-­‐Intensive   Analysis   Opinion   IP   ✚  
  • 48. DIA  Lessons  Learned   •  Modelling   –  Analy9cal  Models  and  Methods   •  Selec9on  /  crea9on,  fi„ng  /  tuning   •  Result  verifica9on   •  Model  /  method  management   –  Data  Models   •  En99es  dominate  “Data  Lakes”   •  Named  En99es  +  En9ty  (Graph)  Models   •  Ontologies  (genomics),  Ensembles,  …   •  Emerging  DIA  Ecosystems  Technology   –  Languages  (~30)   –  Analy9cs  Suites  /  Pla…orms  (~60)   –  Big  Data  Management  (~30)  
  • 49. WHAT  COULD  POSSIBLY  GO  WRONG?   Veracity  
  • 50. Do  We  Know  /  Can  We  Prove   •  DIA  Result:  correct,  complete,  efficient?   •  What  machines  /  algorithms  /  Machine   Learning  /  Black  Boxes  /  DIA  do?   •  High  Risk  /  High  Reward  Data-­‐Driven  Society   – Risk:  drugs  or  medical  advice  that  cause  harm   – Reward:  faster,  cheaper,  more  effec9ve  cancer   cures,  drug  discovery,  personalized  medicine,  …  
  • 51. Professional  Cau9ons   •  Experienced  prac99oners   •  Medicine   – Few  data-­‐driven  results  opera9onalized   – Mount  Sinai:  no  black  box  solu9ons   •  Authorita9ve  organiza9ons   –  NIH,  HHS,  EOCD,  Na9onal  Sta9s9cal  Organiza9ons,  …   •  Legal:  Algorithmic  Accountability:  John  ZiLrain,   Harvard  Law  School  
  • 52. Q:  What  could  possibly  go  wrong?   A:  Every  step   •  Data  Sets   –  All  measurements  approximate:  availability,  quality,  requirements,   sparse/dense,  …  ;  How  much  can  we  tolerate?  What  is  the  impact  on   the  result?   •  Models:  “All  models  wrong  …”  George  Box  1974   •  Methods   –  Select  (1,000s),  tune,  verify   –  Different  methods  èradically  different  results   •  Results:  Probabilis9c,  error  bounds,  verifica9on,  …     Data  Analysis  is  20%  of  the  story  
  • 53. Pre  Big  Data  Challenges     •  Science:  Experimental  design:  hypotheses,   null  hypotheses,  dependent  and  independent   variables,  controls,  blocking,  randomiza9on,   repeatability,  accuracy   •  Analysis:  models,  methods   •  Resources:  cost,  9me,  precision  
  • 54. +Big  Data  Challenges  …   •  Pre-­‐Big  Data  Challenges  @  scale:  volume,  velocity,  and  variety   •  Complexity   –  Data:  sources,  meta-­‐data,  3Vs   –  Models  (reflec9ng  the  domain)   –  Methods  (mul9variate  paLerns)  beyond  human  cogni9on   –  Results:  Massive  numbers  of  correla9ons   •  Unreliability  (sta9s9cs  @  scale)   –  Reliability  decreases  as  the  number  of  variables  increases  (mul9variate  analysis)   –  <<  10  variables  (science  &drug  discovery)  è  1,000s  to  millions  (Machine  Learning)     •  “In  science  and  medical  research,  we’ve  always  known  that”   •  Misunderstood:  Self-­‐service,  Automated  Data  Science-­‐in-­‐a-­‐Box   –  80%  unfamiliar  with  sta9s9cs,  error  bars,  causa9on/correla9on,  probabilis9c   reasoning,  automated  data  cura9on  and  analysis   –  Widespread  use  of  DIA:  self-­‐service,  “democra9za9on  of  analy9cs”   –  Like  monkeys  playing  with  loaded  guns    
  • 56. DIA  Verifica9on   Principles  &  Techniques   •  Conven9onal  disciplines     •  Man-­‐machine  symbiosis   •  DIA  Result  èEmpirical  evidence   –  Flashlight  analogy:  DIA  reduces  hypothesis  space   •  Cross-­‐valida9on   –  Validate  predic9ve  model:  avoid  overfi„ng,  will  model  work  on   unseen  data  sets?   –  Data  par99ons:  Training  Set,  Test  /Valida9on  Set,  ground  truth       –  K-­‐fold  cross  valida9on   •  Research  Direc9on   –  New  measures  of  significance,  the  next  genera9on  P  value   –  21st  Century  sta9s9cs  
  • 57. SCIENTIFIC  METHOD    è EMPIRICISM   DATA  SCIENCE        è DATA-­‐INTENSIVE  ANALYSIS     I  Proposal  
  • 58. Data  Science  is  …   A  body  of  principles  and  techniques  for  applying  data-­‐ intensive  analysis  for  inves;ga;ng  phenomena,   acquiring  new  knowledge,  and  correc;ng  and   integra;ng  previous  knowledge  with  measures  of   correctness,  completeness,  and  efficiency.   DIA:  an  experiment  over  data  
  • 59. Conclusions   Big  Data    &  Data-­‐Intensive  Analysis   •  Value  of  evidence  (from  data)   •  Emerging  reasoning  and  problem  solving  paradigm   –  High  risk  /  high  reward   –  Substan9al  results  already   –  In  its  infancy,  not  yet  understood,  decades  to  go   –  Overhyped  (short  term)  but  may  change  our  world  (long  term)   •  è  Need  for  Data  Science  =  principles  &  guidelines   –  “We’re  now  at  the  “what  are  the  principles?”  point  in  9me”  M.  Jordan   –  Decades  of  research  and  prac9ce