Joshua Bloom Data Science at Berkeley

Data Science at Berkeley
Joshua Bloom
UC Berkeley, Astronomy
@profjsb
PyData, May 4, 2014

The First Rule of Data Science...

Diagram from Drew Conway
Data Science as a Discipline

• What are the Core Principles?
• Is it an academic pursuit to be taught or a skillset to be trained?
• When should they be taught? At what level of depth/breadth?
• Who should teach them and who should know data science?
• Where should investments be made? Does Data Science need an intellectual home within institutions?

“I love working with astronomers, since their data is worthless.”
- Jim Gray, Microsoft

Data Inference Space
Bayesian / Frequentist
Theory/Hypothesis Driven / Data Driven
non-parametric / parametric
Hardware - laptops → clusters/supercomputers
Software - Python/Scipy, R, ...
Carbonware - (astro) grad students, postdocs

Bayesian Distance Ladder
Some Variable Stars show a Period-Brightness Correlation
“Leavitt Law”
Henrietta Swan Leavitt

\[
m_{ij} = \mu_i + M_{0,j} + \alpha_j \log_{10}(P_i/P_0) + E(B-V)_i \times \left[ R_V \times a(1/\lambda_i) + b(1/\lambda_i) \right] + \epsilon_{ij}
\]

i indexes over individual stars
j indexes over wavebands
a and b are fixed constants at each color
Data - 134 RR Lyrae (ultraviolet to infrared)
Fit - 307-dimensional model parameter inference
- deterministic MCMC model with PyMC
- ~6 days for a single run (one core)
- parallelism for convergence tests
Klein+12; Klein, JSB+14
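The period-brightness model on this slide can be sketched as a forward calculation with NumPy. This is only an illustration of how the terms combine across stars (i) and wavebands (j), not the PyMC model used in the cited work. All numerical values, the function name `predicted_magnitudes`, the reference period `P0 = 0.5`, and `R_V = 3.1` are assumptions chosen for the example; `a` and `b` are treated as one fixed constant per waveband, as the slide states, and the noise term is left out.

```python
import numpy as np

def predicted_magnitudes(mu, period, ebv, M0, alpha, a, b, R_V=3.1, P0=0.5):
    """Noise-free model magnitudes m[i, j] for star i in waveband j.

    mu, period, ebv : per-star distance modulus, period, and E(B-V)
    M0, alpha, a, b : per-band zero point, slope, and extinction constants
    R_V, P0         : illustrative defaults (assumed, not from the slides)
    """
    mu = np.asarray(mu, dtype=float)[:, None]          # shape (n_stars, 1)
    log_p = np.log10(np.asarray(period, dtype=float) / P0)[:, None]
    ebv = np.asarray(ebv, dtype=float)[:, None]
    M0 = np.asarray(M0, dtype=float)[None, :]          # shape (1, n_bands)
    alpha = np.asarray(alpha, dtype=float)[None, :]
    extinction = (R_V * np.asarray(a, dtype=float)
                  + np.asarray(b, dtype=float))[None, :]
    # Broadcasting combines the per-star and per-band terms.
    return mu + M0 + alpha * log_p + ebv * extinction

# Hypothetical inputs: two stars, three wavebands.
mu = [14.0, 15.5]
period = [0.5, 0.65]
ebv = [0.0, 0.1]
M0 = [0.6, -0.3, -0.4]
alpha = [-0.4, -1.9, -2.3]
a = [1.0, 0.4, 0.1]
b = [0.0, -0.3, -0.1]

m = predicted_magnitudes(mu, period, ebv, M0, alpha, a, b)
print(m.shape)
```

For the first star the period equals `P0` and the reddening is zero, so its row reduces to `mu + M0`, which is a quick sanity check on the broadcasting.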

Bayesian Distance Ladder
Figure 6. Multi-band period–luminosity relations. RRab stars are in blue, RRc stars in red. Blazhko-affected stars are denoted with
• Sub-1% distance uncertainty
• Precision 3D dust in the Milky Way
Figure labels: Brightness, Period; Milky Way projection, distance, dust
Klein+12; Klein, JSB+14

Morphology of LMC RR Lyrae stars 3
the 62 individual science CCDs2
were pro-
andard reduction algorithms (bias subtrac-
g, etc.) using the computational resources at
nergy Research Scientific Computing Center
y on individual frames was calibrated with
ence sources from the Two Micron All Sky
S; Skrutskie et al. 2006) using the astrom-
re package (Lang et al. 2010). Photometric
performed using same-night observations of
Sloan Digital Sky Survey Stripe 82 Standard
Ivezić et al. 2007). Applying standard cali-
dology (e.g., Ofek et al. 2012), we find that
a robust scatter in our absolute photometric
≲ 0.02 mag on clear nights (≳ 50% of the ob-
om the Science Verification run). While this
be improved with more advanced modeling
signatures (e.g., Tucker et al. 2014), we find
on is sufficient for our scientific objectives.
m observing program produced on average
Figure 3. V-, I-, and z-band period–magnitude relations (solid lines) derived for the LMC RR Lyrae population, superimposed on scatter plots of the RR Lyrae posteriors (M computed using µPost). The dashed lines denote the 1σ prediction intervals for a new RR Lyrae star with known period.
used for all of the µi. This standard deviation was selected to be a fractional distance error of 10 per cent (≈ 5 kpc), which is much larger than the depth of the LMC and significantly larger than (> 2 times) the median posterior µi.
To fit the model given by equation 6, ten identical MCMC traces were run, each generating 3.5 million iterations. The first 0.5 million were discarded as burn-in and the remaining 3 million were thinned by 300 to result in ten traces of 10,000 iterations each. The Gelman-Rubin conver-
now with 15,040 stars...
scipy.sparse, hdf5
Klein..JSB+14
Dust Map: basemap
4× improvement in distance error
3D density projection: mayavi2
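The burn-in and thinning figures quoted in the paper excerpt above, and the Gelman-Rubin check it begins to mention, can be sketched with NumPy. This is a minimal illustration under stated assumptions: the chains here are synthetic Gaussian draws standing in for real post-thinning MCMC traces, and `gelman_rubin` is a hypothetical helper implementing the standard potential scale reduction factor, not the code used in the paper.

```python
import numpy as np

N_CHAINS = 10          # ten identical traces (from the excerpt)
N_ITER = 3_500_000     # iterations per trace
BURN_IN = 500_000      # discarded as burn-in
THIN = 300             # keep every 300th sample

# 3 million post-burn-in samples thinned by 300 leaves 10,000 per trace.
kept = (N_ITER - BURN_IN) // THIN

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance W
    between = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance B
    pooled = (n - 1) / n * within + between / n       # pooled variance estimate
    return np.sqrt(pooled / within)

rng = np.random.default_rng(42)
# Stand-in for ten converged, thinned traces of one parameter (assumption).
chains = rng.normal(size=(N_CHAINS, kept))
r_hat = gelman_rubin(chains)

# Chains stuck at different values give a factor well above 1.
r_hat_bad = gelman_rubin(chains + np.arange(N_CHAINS)[:, None])

print(kept, r_hat, r_hat_bad)
```

Values close to 1 suggest the traces are sampling the same distribution; the shifted chains show what a failed convergence test looks like.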

Astronomical Data Deluge
Serious Challenge to Traditional Approaches & Toolkits
Large Synoptic Survey Telescope (LSST) - 2020
    Light curves for 800M sources every 3 days
    10^6 supernovae/yr, 10^5 eclipsing binaries
    3.2 gigapixel camera, 20 TB/night
LOFAR & SKA
    150 Gps (27 Tflops) → 20 Pps (~100 Pflops)
Gaia space astrometry mission - 2014
    1 billion stars observed ∼70 times over 5 years
    Will observe 20K supernovae
Many other astronomical surveys are already producing data:
SDSS, iPTF, CRTS, Pan-STARRS, Hipparcos, OGLE, ASAS, Kepler, LINEAR, DES, etc.
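The scale of the figures on this slide is easier to see with a little arithmetic. The sketch below uses only the numbers quoted above; observing on all 365 nights and decimal units (1 PB = 1000 TB) are assumptions, so the yearly volume is an upper bound rather than a survey specification.

```python
# Figures quoted on the slide.
LSST_TB_PER_NIGHT = 20
LSST_SOURCES = 800_000_000
LSST_CADENCE_DAYS = 3
GAIA_STARS = 1_000_000_000
GAIA_VISITS_PER_STAR = 70

# Assumption: observing every night of the year (an upper bound).
NIGHTS_PER_YEAR = 365

lsst_pb_per_year = LSST_TB_PER_NIGHT * NIGHTS_PER_YEAR / 1000
lsst_light_curve_updates_per_day = LSST_SOURCES / LSST_CADENCE_DAYS
gaia_measurements = GAIA_STARS * GAIA_VISITS_PER_STAR

print(f"LSST raw data: up to {lsst_pb_per_year:.1f} PB per year")
print(f"LSST light-curve updates: ~{lsst_light_curve_updates_per_day:.2e} per day")
print(f"Gaia measurements over 5 years: {gaia_measurements:.1e}")
```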

Towards a Fully Automated Scientific Stack for Transients
Stack stages: strategy, scheduling, observing, reduction, finding, discovery, classification, followup, inference
current state of the art stack
Legend: automated; not (yet) automated; published work
NSF/CDI
NSF/BIGDATA

Our ML framework found the Nearest Supernova in 3 Decades ..
‣ Built & Deployed robust, real-time machine learning framework, discovering >10,000 events in > 10 TB of imaging
    → 50+ journal articles
‣ Built Probabilistic Event classification catalogs with innovative active learning
http://timedomain.org https://www.nsf.gov/news/news_summ.jsp?cntn_id=122537

What is the toolbox of the modern (data-driven) scientist?
domain
training
statistics
advanced
computing
database
GUI
parallel
visualization
Bayesian
machine learning
Physics
laboratory techniques
MCMC
MapReduce

And...How do we teach this with what little time the students have?

Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series
BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...]
parallel programming bootcamp
...and entire degree programs

2010: 85 campers 2012a: 135 campers
Python Bootcamps at Berkeley

Python: a modern superglue computing language for (data) science
‣ high-level scripting language
‣ open source, huge & growing community in academia & industry
‣ Just in time compilation but also fast numerical computation
‣ Extensive interfaces to 3rd party frameworks

A reasonable lingua franca for scientists...

2012b: 210 campers
Python Bootcamps at Berkeley
2013a: 253 campers

‣ 3 days of live/archive streamed lectures
‣ all open material in GitHub
‣ widely disseminated (e.g., @ NASA)
http://pythonbootcamp.info

Part of the Designated Emphasis in Computational Science & Engineering at Berkeley
visualization
machine learning
database interaction
user interface & web frameworks
timeseries & numerical computing
interfacing to other languages
Bayesian inference & MCMC
hardware control
parallelism

“Are we alone in the universe? What makes up the missing mass of
the universe? ... And maybe the biggest question of all: How in the
wide world can you add $3 billion in market capitalization simply
by adding .com to the end of a name?”
President William Jefferson Clinton
Science and Technology Policy Address
21 January 2000
“Add Data Science or Big Data to your course name to increase
enrollment by tenfold.”
Joshua Bloom
Just Now

Python for Data Science @ Berkeley [Sept 2013]

Pie chart (gender): female, male; slices 64%, 36%
Pie chart (department), slices 8%, 4%, 8%, 12%, 4%, 12%, 8%, 16%, 16%, 12%: Psychology, Astronomy, Neuroscience, Biostatistics, Physics, Chemical Engineering, ISchool, Earth and Planetary Sciences, Industrial Engineering, Mechanical Engineering
“Parallel Image Reconstruction from Radio Interferometry Data”
“Graph Theory Analysis of Growing Graphs”
http://mb3152.github.io/Graph-Growth/
“Realtime Prediction of Activity Behavior from Smartphone”
“Bus Arrival Time Prediction in Spain”

TERRA – optimized for small planets
Time domain preprocessing
- Start with raw photometry
- Gaussian process detrending
- Calibration
- Petigura & Marcy 2012
Transit search
- Matched filter
- Similar to BLS algorithm (Kovács+ 2002)
- Leverages Fast-Folding Algorithm: O(N^2) → O(N log N) (Staelin+ 1968)
Data validation
- Significant peaks in periodogram, but inconsistent with exoplanet transit
Figure panels: Raw Flux (ppt); Calibrated Flux; Detrended/calibrated photometry
Erik Petigura
Berkeley Astro Grad Student
Prevalence of Earth-size planets orbiting Sun-like stars
Erik A. Petigura (a,b,1), Andrew W. Howard (b), and Geoffrey W. Marcy (a)
(a) Astronomy Department, University of California, Berkeley, CA 94720; and (b) Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822
Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013)
Determining whether Earth-like planets are common or rare looms as a touchstone in the question of life in the universe. We searched for Earth-size planets that cross in front of their host stars by examining the brightness measurements of 42,000 stars from National Aeronautics and Space Administration’s Kepler mission. We found 603 planets, including 10 that are Earth size (1 − 2 R⊕) and receive comparable levels of stellar energy to that of Earth (0.25 − 4 F⊕). We account for Kepler’s imperfect detectability of such planets by injecting synthetic planet–caused dimmings into the Kepler brightness measurements and recording the fraction detected. We find that 11 ± 4% of Sun-like stars harbor an Earth-size planet receiving between one and four times the stellar intensity as Earth. We also find that the occurrence of Earth-size planets is constant with increasing orbital period (P), within equal intervals of logP up to ∼200 d. Extrapolating, one finds $5.7^{+1.7}_{-2.2}\%$ of Sun-like stars harbor an Earth-size planet with orbital periods of 200–400 d.
extrasolar planets | astrobiology
The National Aeronautics and Space Administration’s (NASA’s) Kepler mission was launched in 2009 to search for planets that transit (cross in front of) their host stars (1–4). The resulting dimming of the host stars is detectable by measuring their brightness, and Kepler monitored the brightness of 150,000 stars every 30 min for 4 y. To date, this exoplanet survey has detected more than 3,000 planet candidates (4).
We searched for transiting planets in Kepler brightness measurements using our custom-built TERRA software package described in previous works (6, 9) and in SI Appendix. In brief, TERRA conditions Kepler photometry in the time domain, removing outliers, long timescale variability (>10 d), and systematic errors common to a large number of stars. TERRA then searches for transit signals by evaluating the signal-to-noise ratio (SNR) of prospective transits over a finely spaced 3D grid of orbital period, P, time of transit, t0, and transit duration, ΔT. This grid-based search extends over the orbital period range of 0.5–400 d.
TERRA produced a list of “threshold crossing events” (TCEs) that meet the key criterion of a photometric dimming SNR ratio SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this criterion, many of which are inconsistent with the periodic dimming profile from a true transiting planet. Further vetting was performed by automatically assessing which light curves were consistent with theoretical models of transiting planets (10). We also visually inspected each TCE light curve, retaining only those exhibiting a consistent, periodic, box-shaped dimming, and rejecting those caused by single epoch outliers, correlated noise, and other data anomalies. The vetting process was applied homogeneously to all TCEs and is described in further detail in SI Appendix.
To assess our vetting accuracy, we evaluated the 235 Kepler objects of interest (KOIs) among Best42k stars having P > 50 d, which had been found by the Kepler Project and identified as planet candidates in the official Exoplanet Archive (exoplanetarchive.
Bootcamp/Seminar Alum
Python
DOE/NERSC computation
PNAS [2014]
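The grid-based transit search described in the excerpt above can be illustrated with a much simpler stand-in. The sketch below is not TERRA and does not use the Fast-Folding Algorithm: it is a brute-force phase-folding box search over trial periods, run on a synthetic light curve. The injected planet parameters, the noise level, the period grid, and the helper name `box_search` are all assumptions made for the example; only the 30-minute cadence and the SNR > 12 threshold come from the text.

```python
import numpy as np

def box_search(t, flux, sigma, periods, n_bins=50):
    """Return (best_snr, best_period, best_phase) of the deepest folded dip.

    For each trial period the light curve is folded and binned in phase;
    the SNR of a bin is its mean dimming divided by the error on that mean.
    """
    best_snr, best_period, best_phase = -np.inf, None, None
    for period in periods:
        phase = (t % period) / period
        idx = np.minimum((phase * n_bins).astype(int), n_bins - 1)
        counts = np.bincount(idx, minlength=n_bins)
        sums = np.bincount(idx, weights=flux, minlength=n_bins)
        # -mean / (sigma / sqrt(n)) simplifies to -sum / (sigma * sqrt(n)).
        snr = -sums / (sigma * np.sqrt(np.maximum(counts, 1)))
        k = int(np.argmax(snr))
        if snr[k] > best_snr:
            best_snr, best_period, best_phase = snr[k], period, (k + 0.5) / n_bins
    return best_snr, best_period, best_phase

# Synthetic light curve: 90 days at 30-minute cadence (all values assumed).
rng = np.random.default_rng(0)
t = np.arange(0.0, 90.0, 1.0 / 48.0)
SIGMA, DEPTH = 1e-4, 5e-4
P_TRUE, T0_TRUE, DURATION = 7.3, 1.1, 0.2
flux = rng.normal(0.0, SIGMA, t.size)
flux[((t - T0_TRUE) % P_TRUE) < DURATION] -= DEPTH   # inject box-shaped transits
flux -= np.median(flux)                               # crude stand-in for detrending

periods = np.arange(5.0, 10.0, 0.01)
best_snr, best_period, best_phase = box_search(t, flux, SIGMA, periods)
is_tce = best_snr > 12   # threshold quoted in the excerpt
print(best_period, best_snr, is_tce)
```

Each trial period costs one pass over the data here, which is why a folding scheme that reuses partial sums across periods becomes attractive for long light curves and fine period grids.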

Dangers of a Cursory Training/Teaching Curriculum

first this... ...then this.
Undergraduate & Graduate Training Mission
Thinking Data Literacy before Thinking Big Data Proficiency

Figure (p. 34, “Fitting a straight line to data”): two scatter plots of y (0 to 700) against x (0 to 300); one panel is annotated y = 1.33 x + 164.
Data analysis recipes:
Fitting a model to data
David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut für Astronomie, Heidelberg
Jo Bovy
Center for Cosmology and Particle Physics, Department of Physics, New York University
Dustin Lang
Department of Computer Science, University of Toronto
Princeton University Observatory
Abstract
We go through the many considerations involved in fitting a model
to data, using as an example the fit of a straight line to a set of points
in a two-dimensional plane. Standard weighted least-squares fitting
is only appropriate when there is a dimension along which the data
points have negligible uncertainties, and another along which all the
uncertainties can be described by Gaussians of known variance; these
arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010
Statistical Inference
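The standard weighted least-squares case named in the abstract above (negligible uncertainty in x, Gaussian uncertainties of known variance in y) can be sketched in a few lines of NumPy. The data here are synthetic, generated around the line annotated in the figure (slope 1.33, intercept 164) purely for illustration; the per-point uncertainties and the helper name `fit_line_wls` are assumptions, and none of the harder cases the paper goes on to discuss are handled.

```python
import numpy as np

def fit_line_wls(x, y, sigma_y):
    """Weighted least-squares fit of y = m x + b with known Gaussian sigma_y.

    Solves (A^T C^-1 A) theta = A^T C^-1 y, where C = diag(sigma_y^2).
    Returns the slope, the intercept, and the 2x2 parameter covariance
    (ordered intercept, slope).
    """
    A = np.column_stack([np.ones_like(x), x])
    w = 1.0 / sigma_y**2
    cov = np.linalg.inv(A.T @ (A * w[:, None]))
    b, m = cov @ (A.T @ (w * y))
    return m, b, cov

# Synthetic data (assumed): known, unequal uncertainties in y only.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 300.0, 40)
sigma_y = rng.uniform(10.0, 40.0, x.size)
y = 1.33 * x + 164.0 + rng.normal(0.0, sigma_y)

m, b, cov = fit_line_wls(x, y, sigma_y)
m_err, b_err = np.sqrt(cov[1, 1]), np.sqrt(cov[0, 0])
print(f"m = {m:.3f} +/- {m_err:.3f}, b = {b:.1f} +/- {b_err:.1f}")
```

The covariance matrix is what makes this more than a curve fit: it carries the parameter uncertainties that follow from the stated noise model, and it is only meaningful when that noise model actually holds.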

Versioning & Reproducibility
“Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible.” McNutt, 2014
http://www.sciencemag.org/content/343/6168/229.summary
- Git has emerged as the de facto versioning tool
- Berkeley Common Environment (BCE) Software Stack
- “Reproducible and Collaborative Statistical Data Science” (Statistics 157: P. Stark)

The IPython Notebook – Supporting Research at Berkeley
• Designing nuclear reactor cores
• Simulating electron flow in plasmas
• Computing supernovae spectra
• Analyzing brain activity
• Modeling neural networks
• Calculating quantum dynamics and spectroscopy
• Visualizing MRI results

Berkeley Data Analytics Stack
A Comprehensive Big Data Reference Architecture
Legend: AMP Alpha or Soon; AMP Released BSD/Apache; 3rd Party Open Source
Applications: Traffic, Carat, Genomics, 3rd Party
Tools: Visualization, Data Cleaning, …
Analytics Frameworks: Shark (SQL), BlinkDB, GraphX, MLBase, Spark-R
Processing and Data Management: Apache Spark, Spark Streaming, ML-lib
Storage: Tachyon, HDFS / Hadoop Storage
Resource: Apache Mesos, YARN Resource Manager
Methodology innovation: Inventing What’s Next in Big Data Analytics
Testing the Vision/Early Adopters/Momentum 2010+

(Recent) Data Science Industry Spinoffs at Berkeley
http://berkeleystartupcluster.com/

Data Science growing organically everywhere
Feb 15, 2013
AMP Lab
Ion Stoica, CS
Michael Franklin, CS
Adam Arkin, Bioengineering
Emmanuel Saez, Economics
Reconstructing the movies in your mind
Bin Yu, Statistics
Jack Gallant, Neuroscience
Earthquake: Strong Shaking in 11 seconds
Richard Allen, Earth & Plan. Science
Geospatial Lab
Fernando Perez, Brain Imaging Center
IPython tools and community
Charles Marshall
Rosie Gillespie
Integrative Biology
Digitized Museum

Created by Natalia Bilenko.
Data source: PubMed Central. http://sciencereview.berkeley.edu/bsr_design/Issue26/datascience/cover/

Established CS/Stats/Math in Service of novelty in domain science
vs.
Novelty in domain science driving & informing novelty in CS/Stats/Math
“novelty² problem”
an extra Burden for Forefront Scientists
https://medium.com/tech-talk/dd88857f662

Berkeley Institute for Data Science
http://bitly.com/bundles/fperezorg/1
“Bold new partnership launches to harness potential of data scientists and big data”
Founded in December 2013 as a result of a year+ long national selection process
$37.8M over 5 years, along with University of Washington & NYU
‣ An accelerator for data-driven discovery
‣ An agent of change in the modern university as Data Science takes hold
‣ An incubator for the next generation of Data Science technology and practice

Leadership from across the spectrum
Joshua Bloom, Professor, Astronomy; Director, Center for Time Domain Informatics
Henry Brady, Dean, Goldman School of Public Policy
Cathryn Carson, Associate Dean, Social Sciences; Acting Director of Social Sciences Data Laboratory "D-Lab"
David Culler, Chair, EECS
Michael Franklin, Professor, EECS; Co-Director, AMP Lab
Erik Mitchell, Associate University Librarian
Faculty Lead/PI: Saul Perlmutter, Physics, Berkeley Center for Cosmological Physics
Fernando Perez, Researcher, Henry H. Wheeler Jr. Brain Imaging Center
Jasjeet Sekhon, Professor, Political Science and Statistics; Center for Causal Inference and Program Evaluation
Jamie Sethian, Professor, Mathematics
Kimmen Sjölander, Professor, Bioengineering, Plant and Microbial Biology
Philip Stark, Chair, Statistics
Ion Stoica, Professor, EECS; Co-Director, AMP Lab

BIDS goals
‣ Support meaningful and sustained interactions and collaborations between Methodology fields & Science domains to recognize what it takes to move these fields forward
‣ Establish new Data Science career paths that are long-term and sustainable
    • A generation of multi-disciplinary scientists in data-intensive science
    • A generation of data scientists focused on tool development
‣ Build an ecosystem of analytical tools, teaching, & research practices
    • Sustainable, reusable, extensible, easy to learn and to translate across research domains
    • Enables scientists to spend more time focusing on their science

A place to bring it all together at the Center
Vibrant nexus in the heart of campus
Doe Library
Enhancing strengths of:
• Simons Institute for the Theory of Computing
• AMP Lab
• CITRIS
• etc.

Doe Memorial Library
@ the center of UC Berkeley

Berkeley Institute for Data Science Opening

Berkeley/UW/NYU Working Groups as Bridges
Applied Math

Towards an Inclusive Ecosystem
Expanding Participation Among Underrepresented Groups
2013 Python bootcamp (pie chart): female, male, decline to state; slices 11%, 56%, 33%
- 2013 AMP Camp: < 5% women
- Today @ PyData: 1 woman out of 18 speakers
- 2013 Python Seminar: 36% women

Chris Mentzel
Moore Foundation
@NYU, on Monday
Josh Greenberg
Sloan Foundation
Yann LeCun
NYU/Facebook

Summary
Data science at Berkeley is thriving and is getting an intellectual home
- incubating novel science & methodologies
- teaching & training
- innovate environments, interactions, & networks
A data scientist is a unicorn, but...
Looking for foundational industry partners to participate and help us grow

@profjsb
Thank you.
PyData, May 4, 2014
