What is Data Science
1. What is Data Science
Looking for an objective, complete, inclusive, accurate and succinct definition of this emerging field
Ioannis Kourouklides
www.kourouklides.com
2. Contents
• Introduction
• History
• Related terms
• Definitions by various individuals
• Domain expertise
• Data Science in the job market
• How Data Scientists are self-defined
• Summary
• Conclusion
• References & Bibliography
3. Introduction
• In a Forbes article, Gil Press (2013), among others, admits that Data Science (DS) is a buzzword without a clear definition
• A quick search in online and print resources confirms this lack of a clear description
• Several people and companies have expressed their own opinions on the matter
• Nonetheless, most definitions overlap with each other
• Data Science is not concerned with everything that has to do with data
• A brief look at the recent history can give more insight
• The proper (concrete) definition of this science would have to come from industry rather than academia, and it might keep evolving through time
4. History
• The term “Data Science” has been around for more than 30 years
• It has not always had the same meaning, but its use has picked up over that time
• Gil Press (2013) authored an article about the evolution of the term
• 1966: Peter Naur used the term “Science of Data” interchangeably with “Datalogy” as a synonym of Computer Science in his courses (Naur, 1968)
• 1974: Naur published the book ‘Concise Survey of Computer Methods’, which is a survey of modern data processing methods
• 1989: Gregory Piatetsky-Shapiro organized and chaired the first Knowledge Discovery in Databases workshop. In 1995, it became the annual ACM Conference on Knowledge Discovery and Data Mining (KDD).
5. History
• 1996: International Federation of Classification Societies (IFCS) used the term “Data Science” for the first time in the title of the conference (“Data science, classification, and related methods”)
• 1997: C.F. Jeff Wu gave his inaugural lecture entitled ‘Statistics = Data Science?’ (“Identity of statistics in science examined,” 1997)
• 2001: William S. Cleveland published ‘Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics’
• 2002: Launch of ‘Data Science Journal’ by CODATA of ICSU
• 2003: Launch of ‘Journal of Data Science’ by Columbia University
• 2005: National Science Board defined what a Data Scientist is
• 2007: Nathan Yau wrote about the “Rise of the Data Scientist”
6. Related terms
• But let’s look at some related (possibly overlapping) terms:
• Machine Learning
• Data Mining
• Predictive Analytics
• Statistics
• Big Data
• Data Analysis
• Business Intelligence
• Data Engineering
• Business Analytics
• Knowledge Discovery in Databases
• For a comparison of these terms with Data Science: http://goo.gl/uW15El
7. Definition by M. Loukides
• Loukides (2010) wrote an article titled ‘What is data science?’
• “Data science requires skills ranging from traditional computer science to mathematics to art.”
• “Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions.”
• This is not a very precise definition, but it is insightful enough
• He also highlighted the industry’s perspective and the rising job trends
8. Definition by D. Conway
• Conway (2010) gave a less vague definition: “…one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram… hacking skills, math and stats knowledge, and substantive expertise.”
9. Definition by P. Warden
• Another description of DS (Warden, 2011) is the following:
• “There is no widely accepted boundary for what’s inside and outside of data science’s scope. Is it just a faddish rebranding of statistics? I don’t think so, but I also don’t have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don’t fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data, processing it at scale, visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist’s approach of choosing the problem first and then finding data to shed light on it.”
12. Definition by F. Lo
• When searching online for the phrase ‘define data science’, an excellent article (Lo, 2013) appears as the answer suggested by Google
• “Data science is multidisciplinary; the skill set of a data scientist lies at the intersection of 3 main competencies.”
• “Also, a big misconception is that data science is all about statistics. While statistics are important, it is not the only type of mathematics that should be well-understood by a data scientist.”
• “A defining personality trait of data scientists is they are deep thinkers with intense intellectual curiosity.”
13. Definition by M. Mut
• Mut (2013) went a step further and classified Data Scientists into 3 distinct specialties with very little overlap:
• “Advanced Analysis: Math, Stats, Pattern Recognition/Learning, Uncertainty, Visualization, Data Mining” – let’s call them Data Researchers
• “Computer Systems - Advanced Computing, High Performance Computing, Visualization, Data Mining” – let’s call them Data Hackers
• “Databases - Data Engineering, Data Warehousing” – let’s call them Data Developers
• He claimed that DS is defined to include all these specialties, which makes life confusing for employers and applicants
• He proposed that a solution would be to educate HR and employers that they need to break DS into specialties
14. Definition by V. Granville
• However, Granville (2014) and others disagreed with Mut. They maintained that combining these different areas is not impossible, and they forecast that in the future there will be more skills overlap within individuals
• In his book ‘Developing Analytic Talent: Becoming a Data Scientist’ he seems to provide the most convincing and conforming definition:
• “Data Science is the intersection of computer science, business engineering, statistics, data mining, machine learning, operations research, Six Sigma, automation and domain expertise.”
• “… people interested in a data science career don’t need to learn […] everything from these domains.”
15. Domain expertise
• Domain expertise and business acumen are essential for DS
• The relevant domain depends on the kind of data and their source, such as:
• Bioinformatics & Genomics
• Information Security
• Computer Vision & Image Processing
• Finance & Econometrics
• Insurance
• Marketing
• Medicine, Health & Biomedical applications
• Particle Physics
• Social Networks
• Telecoms & Utilities
• Web & Text Mining
16. Data Science in the job market
• Data Scientist roles can be referred to by various names according to the seniority level, the specific skillset and the area of expertise
• Frequently required skills are:
• Hadoop/MapReduce/MongoDB/Hive (not always necessary; sometimes listed as a plus)
• SQL (though less popular than NoSQL)
• Perl/Java/PHP/.NET/Ruby/C++
• Machine Learning techniques
• Python/R/MATLAB/Octave/SPSS/SAS/Stata/Mathematica
• Advanced level degree: MSc or PhD
• Work experience (typically more than 1-3 years)
• Communication skills
18. How Data Scientists are self-defined
• Harris et al. (2013) identified four clusters (latent factors) of Data Scientists in their book, using Non-negative Matrix Factorization
• Three of the clusters overlap with the specializations mentioned by Mut (2013); their self-identified titles are:
  Data Researcher: Researcher, Scientist, Statistician
  Data Hacker: Hacker, Artist, Jack of All Trades
  Data Developer: Developer, Engineer
• The fourth one refers mostly to CDOs (Chief Data Officers), self-identified as: Leaders, Businesspersons, or Entrepreneurs
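Non-negative Matrix Factorization, the technique named on this slide, approximates a non-negative matrix V by the product of two smaller non-negative matrices W and H, so that each row of V is described by a few latent factors. The following is a minimal sketch of the idea using the standard multiplicative update rules; it is not the procedure or the data of Harris et al. The `survey` matrix, the function names and the choice of two factors are hypothetical and made up purely for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W*H (the reconstruction error)."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m).

    Uses multiplicative updates, which keep every entry non-negative.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical toy data: rows are respondents, columns are skill ratings.
survey = [
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [0, 0, 4, 5],
]
w, h = nmf(survey, rank=2)
# Each row of w gives one respondent's weight on the two latent factors;
# the larger weight can be read as that respondent's cluster.
clusters = [row.index(max(row)) for row in w]
```

In this reading, the rows of H play the role of the latent factors (skill profiles) and the rows of W say how strongly each respondent loads on each factor.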
19. How Data Scientists are self-defined
• The three specializations have started to emerge as three job positions:
  Data Researcher: Data Scientist
  Data Hacker: Machine Learning Engineer
  Data Developer: Data Engineer
• Nothing stops a person who studied Science from becoming a Data Developer or Data Hacker, and nothing stops a person who studied Engineering from becoming a Data Researcher
• Thus, it is the author’s belief that the terms ‘Scientist’ and ‘Engineer’ should not have been used, as they are misleading
20. Summary
• In brief, one can break down the skills defining DS into three groups:
Soft skills
• Communication
• Business knowledge
• Domain expertise
Knowledge & Research skills
• Machine Learning – Data Mining
• Statistics & other Maths
• Relational Databases
• High Performance Computing
• Data Visualization
Coding skills
• Perl/Java/C#/PHP/Ruby/C++
• Python/R/MATLAB/Octave
• SPSS/SAS/Stata/Mathematica
• Hadoop/MongoDB/Hive
• SQL/JSON/XML/HTML/CSS
Note: The groups are independent; an item in one group is not related to the item at the same position in another group
21. Conclusion
• To sum up, DS is an interdisciplinary science, but without a clear definition
• It can be defined as a set of skills from Computer Science, Statistics, …
• It definitely requires some Research qualities, but also Domain Expertise
• Machine Learning is at the epicentre of this newly coined term
• Data Scientists have typically focused on or specialized in one area of expertise
• It is the author’s belief that future Data Professionals will be divided into three distinct specializations, similar to Quantitative Professionals, i.e. Quant Researchers, Quant Traders and Quant Developers, corresponding to Data Scientists, Machine Learning Engineers and Data Engineers respectively
• More resources can be found on the next slides
22. References & Bibliography
1. Gil Press http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
2. Naur, P., “'Datalogy', the science of data and data processes.” IFIP Congress 2, 1968, pp. 1383-1387.
3. "Identity of statistics in science examined". The University Records, 9 November 1997, The University of Michigan. http://ur.umich.edu/9899/Nov09_98/4.htm Retrieved 8 August 2014.
4. Cleveland, W. S. (2001). "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". International Statistical Review / Revue Internationale de Statistique 69 (1).
5. http://radar.oreilly.com/2010/06/what-is-data-science.html