1. FROM PHYSICS TO DATA SCIENCE
Martina Pugliese
17 December 2015
Scotland Data Science & Technology
2. An outline of what we will discuss
THE PARTS ABOUT ME, MY JOB, MY BACKGROUND
WHAT (I LEARNED) IT MEANS TO DO DATA SCIENCE
WHAT IS DATA SCIENCE AND ITS (AMBIGUOUS) RELATIONSHIP TO RESEARCH
4. THE BORING BACKGROUND
➤ I did a Bachelor’s degree in Physics
I thought I wanted to do particle physics
➤ Then I did a Master’s degree in Physics (Statistical Mechanics)
I studied the evolution of the Influenza virus
[Figure: simulation result plots; axis labels S, E, βM and pM0.55]
Numerical model (using a genetic
algorithm) simulating how
the pathogen creates new variants
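As a toy illustration only (this is not the model from the thesis), the general shape of a genetic algorithm can be sketched in a few lines of Python. Everything here is an assumption made for the example: variants are bit strings, the `fitness` function simply counts 1-bits, and the names `evolve`, `mutate` and `crossover` are hypothetical.

```python
import random


def fitness(genome):
    # Hypothetical fitness: the number of 1-bits in the genome.
    return sum(genome)


def mutate(genome, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [bit ^ 1 if rng.random() < rate else bit for bit in genome]


def crossover(parent_a, parent_b, rng):
    # Single-point crossover: head of one parent, tail of the other.
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def evolve(pop_size=30, genome_len=20, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)
    ]
    # Track the best fitness seen in each generation.
    history = [max(fitness(g) for g in population)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # Reproduction: refill the population with mutated offspring.
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rate, rng))
        population = survivors + children
        history.append(max(fitness(g) for g in population))
    return population, history


population, history = evolve()
print("best fitness at start:", history[0], "at end:", history[-1])
```

Because the fitter half survives unchanged, the best fitness never decreases from one generation to the next; new variants appear only through crossover and mutation.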
5. THE BORING BACKGROUND
➤ Then I did a PhD in Physics
I explored how natural language evolves over time
[Figure: four panels, one per verb (burn, dwell, hide, sing); axes I and fsum]
Verbs changing inflection in time:
burn stayed regular
dwell oscillates
hide became irregular
sing stayed irregular
Data Mining
&
Simulations
6. THE BORING BACKGROUND
➤ I wanted a job in industry, as a Data Scientist, so …
I did a bootcamp in London, S2DS, working on a
commercial DS problem [1]
Physics gave me:
the ability to model reality
(mathematically)
a brain trained to deal with data
ideas about many more things to study
the scientific method to carry out
experiments
8. “The key word in ‘Data Science’ is not
Data, it is Science.”
- Jeff Leek
9. DATA SCIENCE: A BABY COME OF AGE?
NGram Viewer data
There’s lots of talk these days on several buzzwords containing “data”
But the science of extracting information out of raw data
is much older than some think
10. A WEE BIT OF HISTORY
➤ The ’60s: Data Analysis bashfully starts branching out of Statistics as an
empirical science [1]
➤ The ’70s: Establishing the idea of converting data into knowledge
➤ The ’80s: G. Piatetsky-Shapiro founds the KDD (Knowledge Discovery in
Databases) conferences
➤ The ’90s: companies have lots of data on customers! The term Data Science is
first used in a conference name [2]
➤ The 2000s: Academic endeavours to define the field [3]
Statistical models (the “irrelevant theory”) vs. Algorithms
➤ The 2010s: the BOOM!
The “sexiest job of the 21st century” [4]
Big Data is the new innovation [5]
Growth in “analytics” and Data Science educational programs [6]
Data Science in Business should be called “Decision Science” [7]
11. But today, this is what’s happening:
[Intel, What happens in an Internet Minute? 2012]
12. So industry came to need (many more) specialised people to
understand this dirty, varied, large data and leverage it to provide solutions
The data we agree to give to services we use
(social networks, apps …) is used to sell us
tailored experiences
There is a saying in Italian which goes
(translated) as:
“I know you like my pockets”
It should now become something like “I know you like your phone”
Where to get all these people from?
DS academic programs
Research
on the rise
???
14. The ugly fact: research has no room for all PhD graduates
Growth of PhD graduates in S&E fields in time
vs. growth of research positions [8]
The academic bottleneck comes after the PhD
PhDs do not have real “transferable” skills (The Economist, [8])
15. Is this alone a reason to move PhDs into industry?
NO
A PhD is an academic qualification
It is meant to train people for research
And for the new challenges ahead,
we need lots of scientists
to study new solutions
climate change
ageing of population
sustainable energy sources
the human brain
data science algorithms
…
Does it mean access to PhD programs should change?
MAYBE
16. Can we suggest Academia and industry should cooperate more?
CERTAINLY
Google cooperates with (and hires from) Academia a lot
They’re shaping the
innovation landscape
Considering them as separate worlds does not help
They’re contributing to
“traditional”
academic research
(Quantum Annealing, [9])
They’re pushing the current
borders of AI
(deep learning, anyone?)
17. THE (OBVIOUS) DISADVANTAGES OF A PHD GRADUATE
➤ The “overqualified and inexperienced” curse
➤ The “age” and “expectations” problems
www.phdcomics.com
THE (NOT-SO-OBVIOUS) ADVANTAGES OF A PHD GRADUATE
➤ Research trains you to sustain and cope with failure
➤ You know how to quickly learn new stuff alone
I’d argue this is the best skill to have today
➤ You have a long history of communicating your findings
19. I believe the main and most important skill
one needs in this role is that of being able to
learn quickly and having the passion for doing so
20. BUT PRACTICALLY SPEAKING…
➤ Mathematics & Statistics foundations
This is the brain training you need to understand it all. I won’t list all the needed stuff because it
wouldn't make sense, but in short:
Linear Algebra (matrix operations)
Probability Theory, the concepts
Graph Theory, the concepts
Be proficient with Calculus and Mathematical Methods
Statistical Tests and Techniques
…
➤ Machine Learning
You need to be able to understand an algorithm with pen and paper, otherwise it’s just pushing a button
on a ML library. With practice you learn which to choose for what and how to assess its performance.
As for libraries, it depends, but scikit-learn is very well documented, including the Maths
behind the algorithms, so it’s a great resource.
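To make the "pen and paper" point concrete, here is a minimal sketch of an algorithm small enough to derive by hand: a least-squares straight line fit, evaluated on held-out data. The data points, the train/test split and the function names `fit_line` and `mean_squared_error` are made up for illustration; a library such as scikit-learn would do the same job behind a single call.

```python
def fit_line(xs, ys):
    # Closed-form least squares for y = slope * x + intercept:
    # slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    # Average squared gap between observed and predicted values.
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Made-up data, roughly following y = 2x + 1 with some noise.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.0, 13.1, 14.9]

# Fit on the first six points, assess on the last two.
slope, intercept = fit_line(xs[:6], ys[:6])
test_error = mean_squared_error(xs[6:], ys[6:], slope, intercept)
print("slope:", slope, "intercept:", intercept, "held-out MSE:", test_error)
```

Assessing on points the fit has not seen is the simplest form of the performance check mentioned above; with more data one would typically use cross-validation instead of a single split.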
21. BUT PRACTICALLY SPEAKING…
➤ Programming
It’s essential to code quickly and produce reusable, robust scripts.
I have a thing for Python.
I also use R sometimes for stats analyses.
Proficiency with shell commands saves a lot of time
Numerical simulations: something like C++ is very useful
Basics of web development and of the software development process
➤ Data visualisation tools
Visualisations help you and others around you understand information
I use Python libraries for simple things, but the beauty of D3 is unbeatable
➤ Big Data Technologies
This is the bit about which there’s lots of talk these days. Having analytical skills also means you
learn the technologies (Hadoop/Spark/Mahout…) with practice.
23. Mallzee is the fashion app for everyone
You swipe a product right (like)
or left (dislike)
You can create your own style feeds
You can search for specific products
and favourite brands
You can buy products
We have millions of “swipes” plus user data
24. WHAT I DO IN MY JOB
Follow the DS mantra:
Data pre-processing (takes a long time… [8])
Exploratory Analyses
Model
Model Validation
Product Insights
produce visualisations
produce software
25. THE ROLE CONSISTS OF SEVERAL THINGS
Understand user behaviour in all parts of the app
Predict what drives retention/usage
Analyse numerical data on swipes to see what’s hot this season
Improve product with tailored-to-you features
Computer Vision to see what image features perform best for what sorts and whom
Measure all indicators across the business
Recommendations
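As a rough sketch of the swipe analysis, one could compute a like rate per brand and rank brands by it. The record layout `(user, brand, direction)`, the sample swipes and the function names below are hypothetical and chosen only for this example; they do not describe Mallzee's actual data or code.

```python
from collections import defaultdict


def like_rates(swipes):
    # Fraction of right swipes (likes) per brand.
    likes = defaultdict(int)
    totals = defaultdict(int)
    for _user, brand, direction in swipes:
        totals[brand] += 1
        if direction == "right":
            likes[brand] += 1
    return {brand: likes[brand] / totals[brand] for brand in totals}


def rank_brands(swipes):
    # Brands sorted from highest to lowest like rate.
    rates = like_rates(swipes)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical swipe records: (user, brand, direction).
sample_swipes = [
    ("u1", "BrandA", "right"),
    ("u2", "BrandA", "left"),
    ("u3", "BrandA", "right"),
    ("u1", "BrandB", "left"),
    ("u2", "BrandB", "right"),
]

for brand, rate in rank_brands(sample_swipes):
    print(brand, round(rate, 2))
```

With millions of swipes, a raw rate like this would need a minimum number of swipes per brand before being trusted, since a brand with very few swipes can look artificially hot.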
26. THE REFERENCES
➤ [1] Something I wrote for S2DS
➤ [1] Tukey, The Future of Data Analysis
➤ [2] Data Science, Classification and related methods, Kobe, Japan, 1996
➤ [3] Leo Breiman, Statistical Modeling, the Two Cultures
➤ [4] HBR, Data Scientist: The Sexiest Job of the 21st Century
➤ [5] McKinsey, Big Data, the next frontier for innovation
➤ [6] KDNuggets, the boom in analytics education
➤ [7] TechCrunch, Why Decision Science matters
➤ [8] Nature Biotechnology, The missing piece to changing the university culture
➤ [8] The Economist, the disposable academic
➤ [9] What is the computational value of finite range tunnelling?
➤ [8] NY Times, the "Janitor work" is key hurdle for insight
➤ [8] M. Loukides, What is Data Science?
➤ [9] The Edison European Project