Keynote on 2015 Yale Day of Data
1. Big Data & Analytics: Five Trends and Five Research Challenges. Robert Grossman, University of Chicago & Open Data Group. September 18, 2015
2. Part 1: What is Big Data?
Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations.
Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.
4. The Name Changes
1830: statistics
1980: computationally intensive statistics
1993: data mining & knowledge discovery in databases
1997: business analytics
2004: predictive analytics
2011: big data, data science & data analytics
Source: Google Trends, www.google.com/trends
5. What is Big Data? (Operations POV)
A marketing term introduced by O’Reilly: Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.
6. What is Big Data? (POV: New Types of Data that IT Cannot Manage)
Period | New types of data | Term used
1990’s | Clicks on the Internet, POS transactions | Data mining
2000’s | Unstructured data, graph data | Predictive analytics
2010’s | Mobile data, IoT data | Big data
7. What Is Small Data?
• 100 million movie ratings
• 480 thousand customers
• 17,000 movies
• From 1998 to 2005
• Less than 2 GB data.
• Fits into memory, but very sophisticated models required to win.
9. Basic Choice with Hardware: Scale Up or Out
Scale up: more memory, more processors, more disk ($K); specialized hardware (e.g. connects) ($100K); specialized devices ($M).
Scale out: one machine; cluster (racks) ($100K); cyber pod ($M); distributed cyber pods ($10M+).
10. Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market).
Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
11. The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
13. What is Big Data? (My computer is a data center POV)
• The leaders in big data analytics measure data in megawatts.
– As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
– Facebook’s new Prineville data center is 30 MW.
14. Part 2: What is Analytics?
Source: Aaron Parecki, Everywhere I’ve Been, aaronparecki.com.
15. What is Analytics?
Short Definition
• Using data to make decisions.
Longer Definition
• Using data to take actions and make decisions using models that are statistically valid and empirically derived.
Definition of Statistics from ASA web page:
• Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …
Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.
16. [Timeline figure] Eras: Computationally Intensive Statistics (1984), Data Mining & KDD (1993), Predictive Analytics (2004), Big Data & Data Science (2011). Data sources shown: direct marketing, POS, Internet, devices/IoT. Algorithms and systems shown: ID3 & C4.5, TX algorithm, PageRank, Spanner.
18. 1. Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively).
2. Compute where to put additional armor to maximize the chance that planes return.
20. Some fields have one billion-dollar (or more) instrument that generates big data.
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion.
Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
21. Some fields have hundreds or thousands of million-dollar instruments that in aggregate produce big data.
A genomics sequencing facility might have 3-5 next generation sequencing instruments that cost $250,000 or more each.
22. Some fields have millions of hundred-dollar sensors that in aggregate produce big data.
23. [Diagram labels] Math & Statistics; Computer Science; Disciplinary Science; Data Science
25. Methods
Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.
Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.
Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.
26. Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
27. The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
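To make the multiple-testing point concrete, here is a minimal sketch using the 8064-voxel search volume from the salmon study. The per-voxel threshold of 0.001 and the family-wise error rate of 0.05 are illustrative assumptions, not values taken from the study, and Bonferroni is just one simple correction chosen for the sketch. The helper names are hypothetical.

```python
import random


def count_significant(p_values, threshold):
    """Count p-values strictly below the threshold."""
    return sum(1 for p in p_values if p < threshold)


def bonferroni_threshold(family_alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at or below family_alpha."""
    return family_alpha / n_tests


N_VOXELS = 8064          # search volume quoted on the slide
PER_VOXEL_ALPHA = 0.001  # assumed uncorrected per-voxel threshold (illustrative)
FAMILY_ALPHA = 0.05      # assumed family-wise error rate (illustrative)

# With no real signal at all, p-values are uniformly distributed on [0, 1).
rng = random.Random(0)
null_p_values = [rng.random() for _ in range(N_VOXELS)]

# Expected number of "significant" voxels from pure noise, without correction.
expected_false_positives = N_VOXELS * PER_VOXEL_ALPHA

uncorrected_hits = count_significant(null_p_values, PER_VOXEL_ALPHA)
corrected_cutoff = bonferroni_threshold(FAMILY_ALPHA, N_VOXELS)
corrected_hits = count_significant(null_p_values, corrected_cutoff)

print(f"expected false positives without correction: {expected_false_positives:.1f}")
print(f"simulated uncorrected hits: {uncorrected_hits}")
print(f"Bonferroni per-voxel cutoff: {corrected_cutoff:.2e}")
print(f"simulated corrected hits: {corrected_hits}")
```

Under these assumptions roughly eight voxels are expected to pass the uncorrected threshold by chance alone, which is why the number of tests has to enter the threshold.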
28. Part 4. What Instrument Do We Use to Make Discoveries in Data Science?
How do we build a “datascope?”
32. Complex statistical models over small data that are highly manual and updated infrequently.
Simpler statistical models over large data that are highly automated and updated frequently.
[Figure axis labels: memory, databases, datapods, cyber pods; GB, TB, PB; W, KW, MW]
33. Part 5: Five Trends
Source: Google Trends, for term “data commons”, www.google.com/trends.
34. Trend 1: Data Commons
Source: NEXRAD, NOAA, www.noaa.org
35. The Standard Model of Biomedical Computing No Longer Works
[Diagram labels] Public data repositories; Private local storage & compute; Network download; Local data ($1K); Community software; Software, sweat and tears ($100K)
36. Data Commons
Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.
Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
37. Open Science Data Cloud (Open Cloud Consortium, 2012)
NCI Data Commons (UChicago, Nov 2015)
Bionimbus Protected Data Cloud (UChicago, 2013)
NOAA Data Commons (Open Cloud Consortium, Oct 2015)
40. Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.
41. [Diagram labels] Hospitals, medical research centers and doctors; Data commons containing genomic and clinical data; Patients.
Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.
42. Trend 2: Analytics of Things, People and Places
Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/
43. People and things generating streaming data that are relevant for research.
45. Trend 3: Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis
Source: M. Bostock, http://bl.ocks.org/mbostock/4063318
46. Portable Format for Analytics (PFA); Predictive Model Markup Language (PMML); Grammar of Graphics; d3.js
47. Trend 4: More Policies That Make Data Available and Analytics Repeatable
48. Executive Order 13642 (May 9, 2013): Making Open and Machine Readable the Default for Government Information (“Open Data Policy”)
[Image labels] OMB Guidance; President’s Ex Order
49. Trend 5: Translational Data Science
How do we translate data driven discoveries into actions that impact society?
51. Translational Data Science
[Diagram labels] New algorithms, new statistical models (data science); Applications to genomics, analysis of EMR, etc.; Software stacks for data intensive computing (data engineering); Data driven discoveries; Data driven diagnosis; Data driven therapeutics.
Develop software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)
52. Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.
54. Challenge 1. Is More Different?
Do New Phenomena Emerge at Scale in Data?
Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
55. Challenge 2. One Million Genomes
• Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB
• With compression, it may be about 100 PB
• At $1000/genome, the sequencing would cost about $1B
• Think of this as one hundred studies with 10,000 patients each over three years.
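The arithmetic behind these estimates can be checked in a few lines. This is a back-of-the-envelope sketch: the per-patient size, per-genome cost and study size come from the slide, while decimal units (1000 TB per PB) and a 10x compression ratio are assumptions inferred from the "1000 PB" and "100 PB" figures.

```python
# Figures quoted on the slide
GENOMES = 1_000_000
TB_PER_PATIENT = 1
COST_PER_GENOME_USD = 1_000
PATIENTS_PER_STUDY = 10_000

# Assumptions for this sketch: decimal units and the 10x compression
# implied by going from about 1000 PB to about 100 PB
TB_PER_PB = 1_000
PB_PER_EB = 1_000
COMPRESSION_RATIO = 10

raw_pb = GENOMES * TB_PER_PATIENT // TB_PER_PB       # uncompressed size in PB
raw_eb = raw_pb // PB_PER_EB                         # uncompressed size in EB
compressed_pb = raw_pb // COMPRESSION_RATIO          # compressed size in PB
sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD  # total sequencing cost
studies = GENOMES // PATIENTS_PER_STUDY              # studies of 10,000 patients

print(f"raw: {raw_pb} PB = {raw_eb} EB")
print(f"compressed: {compressed_pb} PB")
print(f"sequencing cost: ${sequencing_cost_usd:,}")
print(f"equivalent studies: {studies}")
```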
56. Challenge 3. Datapods
• Databases have fundamentally changed the way we manage and analyze scientific data.
• NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate.
• If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod.
• Call this a “datapod.”
• It could support open source data commons and allow them to peer.
57. Challenge 4. A Billion Predictive Models
• Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models
• Applications
– George Church’s challenge: individual predictive models for each human genome (6.5 billion humans).
– 1 million cancer genomes x 1,000 models / genome.
– Urban science: instrumenting cities.
– Consumer marketing: large advertisers will see 1-3 billion different consumers
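As a toy illustration of what "segmented models" means, the sketch below fits one model per segment and looks up the right one at prediction time. The per-segment mean is only a stand-in for whatever model each segment would actually use, and the function names and sample records are hypothetical. The last lines check the cancer-genome count from the list above.

```python
from collections import defaultdict
from statistics import mean


def fit_segmented_models(records):
    """Fit one trivial model (the segment mean) per segment key.

    records is an iterable of (segment, value) pairs.
    """
    by_segment = defaultdict(list)
    for segment, value in records:
        by_segment[segment].append(value)
    return {segment: mean(values) for segment, values in by_segment.items()}


def predict(models, segment, default=None):
    """Return the prediction of the model for this segment, or default if none exists."""
    return models.get(segment, default)


# Hypothetical sample data: two segments, three observations.
records = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
models = fit_segmented_models(records)
print(predict(models, "a"))  # mean of 1.0 and 3.0
print(predict(models, "z"))  # no model for this segment

# Model count for the cancer genomics application on the slide.
cancer_genomes = 1_000_000
models_per_genome = 1_000
total_models = cancer_genomes * models_per_genome
print(f"{total_models:,} models")
```

At a billion segments the hard part is no longer fitting any single model but generating, storing and updating them automatically, which is the challenge the slide poses.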
58. Challenge 5. HDSI
• Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert.
• Think of Human Data Science Interaction (HDSI) as how humans interact with the software supporting the analysis of data science at the scale of datapods with billions of models and trillions of hypotheses.
• How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.