Keynote on 2015 Yale Day of Data

Big Data & Analytics: Five Trends and Five Research Challenges

Robert Grossman
University of Chicago & Open Data Group

September 18, 2015

Part 1

What is Big Data?

Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations.

Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.

•  Volume

•  Velocity

•  Variety

•  Veracity

•  Value

•  Megabytes

•  Gigabytes

•  Terabytes

•  Petabytes

•  Exabytes

•  Zettabytes

The Name Changes

1830  statistics
1980  computationally intensive statistics
1993  data mining & knowledge discovery in databases
1997  business analytics
2004  predictive analytics
2011  big data, data science & data analytics

Source: Google Trends, www.google.com/trends

What is Big Data? (Operations POV)

A marketing term introduced by O’Reilly:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.

What is Big Data? (POV: New Types of Data that IT Cannot Manage)

Period   New types of data                          Term Used
1990’s   Clicks on the Internet, POS transactions   Data mining
2000’s   Unstructured data, graph data              Predictive Analytics
2010’s   Mobile data, IoT data                      Big Data

What Is Small Data?

•  100 million movie ratings

•  480 thousand customers

•  17,000 movies

•  From 1998 to 2005

•  Less than 2 GB data.

•  Fits into memory, but very sophisticated models required to win.

What are the origins of big data?

Basic Choice with Hardware: Scale Up or Out

More memory, more processors, more disk ($K)
Specialized hardware (e.g. connects) ($100K)
Specialized devices ($M)

One machine
Cluster (racks) ($100K)
Cyber Pod ($M)
Distributed cyber pods ($10M+)

Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/

Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market).

The Google Data Stack

•  The Google File System (2003)

•  MapReduce: Simplified Data Processing… (2004)

•  BigTable: A Distributed Storage System… (2006)

Source: Terence Kawaja, http://www.slideshare.net/tkawaja

•  The leaders in big data analytics measure data in Megawatts.

– As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.

– Facebook’s new Prineville data center is 30 MW.

What is Big Data? (My computer is a data center POV)

Part 2

What is Analytics?

Source: Aaron Parecki, Everywhere I’ve Been, aaronparecki.com.

What is Analytics?

Short Definition

•  Using data to make decisions.

Longer Definition

•  Using data to take actions and make decisions using models that are statistically valid and empirically derived.

Definition of Statistics from ASA web page:

•  Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …

Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.

[Timeline figure: 1984 Computationally Intensive Statistics; 1993 Data Mining & KDD; 2004 Predictive Analytics; 2011 Big Data & Data Science. Other labels: Direct marketing, POS, Internet, Devices/IoT; ID3 & C4.5, PageRank, Spanner TX algorithm.]

1.  Given n planes A1, …, An. Assume each plane Ai has bij bullet holes in the tail, wing, fuselage and other (j=1, 2, 3, 4, respectively).

2.  Compute where to put additional armor to maximize the chance that planes return.
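
The slide poses the problem without stating a solution. The well-known observation usually attributed to Abraham Wald is that bullet holes can only be counted on planes that came back, so the counts are a biased sample. The following is a minimal simulation sketch of that effect, not a reconstruction of any historical analysis. The per-section lethality values, the number of hits per plane and the uniform distribution of hits are all hypothetical assumptions chosen for illustration.

```python
import random

SECTIONS = ["tail", "wing", "fuselage", "other"]  # j = 1, 2, 3, 4 on the slide

# Hypothetical probability that a single hit to a section downs the plane.
LETHALITY = {"tail": 0.05, "wing": 0.10, "fuselage": 0.15, "other": 0.50}


def simulate(n_planes=20000, hits_per_plane=4, seed=0):
    """Return (holes on all planes, holes on returning planes, planes returned)."""
    rng = random.Random(seed)
    holes_all = dict.fromkeys(SECTIONS, 0)
    holes_returned = dict.fromkeys(SECTIONS, 0)
    returned = 0
    for _ in range(n_planes):
        # Assumption: hits land uniformly at random across the four sections.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # A plane returns only if it survives every one of its hits.
        survived = all(rng.random() > LETHALITY[s] for s in hits)
        for s in hits:
            holes_all[s] += 1
            if survived:
                holes_returned[s] += 1
        returned += survived
    return holes_all, holes_returned, returned


holes_all, holes_returned, returned = simulate()
for s in SECTIONS:
    print(f"{s:9s} all={holes_all[s]:6d} returned={holes_returned[s]:6d}")
print("planes returned:", returned)
```

Under these assumptions every section is hit about equally often, yet the most vulnerable section shows the fewest holes on returning planes. Armoring where the observed holes are densest would therefore protect the wrong place.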

Part 3.

Data Science

A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion.

Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732

Some fields have (one) billion dollar (or more) instrument that generates big data.

A genomics sequencing facility might have 3-5 next generation sequencing instruments that cost $250,000 or more each.

Some fields have hundreds or thousands of million dollar instruments that in aggregate produce big data.

Some fields have millions of hundred dollar sensors that in aggregate produce big data.

[Diagram labels: Math & Statistics; Computer Science; Disciplinary Science; Data Science]

Understanding Salmon (A Cautionary Tale)

Source: Salmo salar, (Atlantic Salmon), wikipedia.org

Methods

Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.

Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.

Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.

Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.

The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
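
To make the multiple-testing point concrete, here is a minimal sketch using the 8064-voxel search volume quoted above. The per-voxel threshold of 0.001 and the family-wise error rate of 0.05 are illustrative assumptions (the slide only reports a cluster-level p-value), and Bonferroni is used only because it is the simplest correction, not because the study used it.

```python
import random

N_VOXELS = 8064          # search volume quoted on the slide
ALPHA_PER_VOXEL = 0.001  # assumed uncorrected per-voxel threshold
FAMILY_ALPHA = 0.05      # assumed family-wise error rate


def expected_false_positives(n_tests, alpha):
    """Expected number of true-null tests that pass an uncorrected threshold."""
    return n_tests * alpha


def bonferroni_threshold(family_alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate <= family_alpha."""
    return family_alpha / n_tests


def count_significant(p_values, threshold):
    """Number of p-values strictly below the threshold."""
    return sum(p < threshold for p in p_values)


rng = random.Random(0)
# With no real signal (a dead salmon), p-values are uniform on [0, 1).
null_p_values = [rng.random() for _ in range(N_VOXELS)]

corrected = bonferroni_threshold(FAMILY_ALPHA, N_VOXELS)
print("expected false positives:", expected_false_positives(N_VOXELS, ALPHA_PER_VOXEL))
print("uncorrected hits:", count_significant(null_p_values, ALPHA_PER_VOXEL))
print("Bonferroni threshold:", corrected)
print("corrected hits:", count_significant(null_p_values, corrected))
```

With these assumed numbers, about eight voxels are expected to look significant by chance alone, which is the same order of magnitude as the 16 reported.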

Part 4.

What Instrument Do we Use to Make Discoveries in Data Science?

How do we build a “datascope?”

[Figure: instruments and their gains, by kind of science]
experimental science: 1609, 30x; 1670, 250x
simulation science: 1976, 10x-100x
data science: 2004, 10x-100x, “Cyberpod”

Could we continuously re-analyze the world’s cancer data?

Complex statistical models over small data that are highly manual and updated infrequently.

Simpler statistical models over large data that are highly automated and updated frequently.

memory   databases   datapods / cyber pods
GB       TB          PB
W        KW          MW

Part 5

Five Trends

Source: Google Trends, for term “data commons”, www.google.com/trends.

Trend 1

Data Commons

Source: NEXRAD, NOAA, www.noaa.org

The Standard Model of Biomedical Computing No Longer Works

Public data repositories
Network download
Private local storage & compute
Local data ($1K)
Community software
Software, sweat and tears ($100K)

Data Commons

Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.

Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/

Open Science Data Cloud (Open Cloud Consortium, 2012)

NCI Data Commons (UChicago, Nov 2015)

Bionimbus Protected Data Cloud (UChicago, 2013)

NOAA Data Commons (Open Cloud Consortium, Oct 2015)

Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.

Hospitals, medical research centers and doctors

Data commons containing genomic and clinical data.

Patients

Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.

Trend 2

Analytics of Things, People and Places

Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/

People and things generating streaming data that are relevant for research.

Places that generate data

Source: Jane Macfarlane, Here, a Division of Nokia.

Trend 3

Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis

Source: M. Bostock, http://bl.ocks.org/mbostock/4063318

Portable Format for Analytics (PFA)

Predictive Model Markup Language (PMML)

Grammar of Graphics

d3.js

Trend 4

More Policies That Make Data Available and Analytics Repeatable

Executive Order 13642 (May 9, 2013)

Making Open and Machine Readable the Default for Government Information (“Open Data Policy”)

OMB Guidance

President’s Ex Order

Trend 5

Translational Data Science

How do we translate data driven discoveries into actions that impact society?

[Figure: Translational Informatics. Labels: Imaging Informatics; Clinical Informatics; Bioinformatics; Public Health Informatics; Basic Research; Applied Research; Practice (dx, treatment and prevention); Molecular & cellular processes; Tissues & organs; Individuals (patients); Groups & populations; Quality & outcomes]

New algorithms, new statistical models (data science)

Applications to genomics, analysis of EMR, etc.

Software stacks for data intensive computing (data engineering)

Data driven discoveries

Data driven diagnosis

Data driven therapeutics

Develop software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)

Translational Data Science

Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.

Part 6

Five Challenges

Challenge 1. Is More Different?

Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.

Do New Phenomena Emerge at Scale in Data?

Challenge 2. One Million Genomes

•  Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine.

•  The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).

•  One million genomes is about 1000 PB or 1 EB

•  With compression, it may be about 100 PB

•  At $1000/genome, the sequencing would cost about $1B

•  Think of this as one hundred studies with 10,000 patients each over three years.
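
The figures in this list follow from simple arithmetic, sketched below. Decimal units (1 PB = 1000 TB, 1 EB = 1000 PB) are assumed, and the 10:1 compression ratio is not stated on the slide; it is only the ratio implied by going from 1000 PB to 100 PB.

```python
N_GENOMES = 1_000_000        # one million genomes
TB_PER_PATIENT = 1           # tumor plus normal tissue, per the slide
COST_PER_GENOME_USD = 1000   # per the slide
PATIENTS_PER_STUDY = 10_000  # per the slide

TB_PER_PB = 1000             # assumed decimal units
PB_PER_EB = 1000
COMPRESSION_RATIO = 10       # assumption implied by 1000 PB -> 100 PB

total_pb = N_GENOMES * TB_PER_PATIENT // TB_PER_PB
total_eb = total_pb // PB_PER_EB
compressed_pb = total_pb // COMPRESSION_RATIO
sequencing_cost_usd = N_GENOMES * COST_PER_GENOME_USD
n_studies = N_GENOMES // PATIENTS_PER_STUDY

print(f"raw data: {total_pb} PB = {total_eb} EB")
print(f"compressed: {compressed_pb} PB")
print(f"sequencing cost: ${sequencing_cost_usd:,}")
print(f"studies of {PATIENTS_PER_STUDY:,} patients: {n_studies}")
```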

Challenge 3. Datapods

•  Databases have fundamentally changed the way we manage and analyze scientific data.

•  NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate.

•  If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod.

•  Call this a “datapod.”

•  It could support open source data commons and allow them to peer.

Challenge 4. A Billion Predictive Models

•  Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models

•  Applications

– George Church’s challenge: individual predictive models for each human genome (6.5 billion humans).

– 1 million cancer genomes x 1,000 models / genome.

– Urban science – instrumenting cities.

– Consumer marketing - large advertisers will see 1-3 billion different consumers
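
As a toy illustration of what "segmented models" means, the sketch below fits one small model per segment instead of one global model. It is an assumption-laden example rather than the approach the talk has in mind: the segment names and data are made up, and every segment gets the same model type (a least-squares line), whereas the slide calls for heterogeneous models. The scale in the list above is 1,000,000 genomes x 1,000 models each, or one billion models.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least-squares fit y = intercept + slope * x for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0  # constant model if x never varies
    return mean_y - slope * mean_x, slope


def fit_segmented(records):
    """Fit one model per segment; records are (segment, x, y) tuples."""
    by_segment = defaultdict(list)
    for segment, x, y in records:
        by_segment[segment].append((x, y))
    return {seg: fit_line(pts) for seg, pts in by_segment.items()}


def predict(models, segment, x):
    """Predict with the model that belongs to the given segment."""
    intercept, slope = models[segment]
    return intercept + slope * x


# Hypothetical data: two segments with opposite trends.
records = [
    ("segment-A", 0, 1.0), ("segment-A", 1, 3.0), ("segment-A", 2, 5.0),
    ("segment-B", 0, 10.0), ("segment-B", 1, 9.0), ("segment-B", 2, 8.0),
]
models = fit_segmented(records)
for seg, (intercept, slope) in models.items():
    print(f"{seg}: y = {intercept:.1f} + {slope:.1f} * x")
```

A single line fitted to all six points would miss both trends, which is the motivation for segmenting. The hard part at the scale of a billion models is generating and managing them automatically, and this sketch does not address that.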

Challenge 5. HDSI

•  Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert.

•  Think of Human Data Science Interaction (HDSI) as how humans interact with the software supporting the analysis of data science at the scale of datapods with billions of models and trillions of hypotheses.

•  How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?

Questions?

rgrossman.com
@bobgrossman

For More Information

cdis.uchicago.edu
www.opendatagroup.com
rgrossman.com
