Why Data Science is a Science
Dr. Christoforos Anagnostopoulos
Founder and Chief Data Scientist, Mentat Innovations
Lecturer in Statistics (on leave), Imperial College London
Mentat Innovations
Credentials
BA Mathematics at Cambridge University
MSc Machine Learning at Edinburgh University
MSc Logic and Computer Science at Athens University
PhD in Machine Learning for Data Streams at Imperial
Postdoc Fellow at Statistical Laboratory, Cambridge Uni.
Lecturer in Statistics at Imperial College
Founder and Chief Scientist of Mentat Innovations
Numerous consulting projects in real-time data analysis:
• social media analysis, sensor network telemetry, online
RTB advertising, cybersecurity and fraud, retail banking
• engaged with data journalism on several occasions
(The Independent, The Guardian, BBC, …)
Mentat Innovations is pioneering real-time anomaly
detection on network, application and telemetry data
This talk
This talk has been given around the world
Much of the thinking in this talk comes from colleagues
that I have had the privilege to work with over the years:
Prof. David Hand, OBE (Chairman of Advisory Board of Mentat)
Renowned statistician, twice president of Royal Statistical Society
Authority on pattern recognition and data mining for retail finance
Professor Niall Adams, Imperial College London
Machine Learning expert
Data Mining in CyberSecurity pioneer
Professor David Leslie, Lancaster University
World-wide expert in machine learning within game theory
George Cotsikis (CEO and co-Founder of Mentat)
Entrepreneur, 17 years’ experience in quantitative finance
Data Science: the origins
Courtesy of Cathy O’Neil and Rachel Schutt
Data Science: the origins
Many rediscoveries of data
analysis in the last 20 years:
• Data Mining
• Pattern Recognition
• Machine Learning
• Statistical Modelling
• Analytics
• Business Intelligence
• Predictive Analytics
• Big Data
• Search and Information Retrieval
• Natural Language Processing
• Neural Nets
• Deep Learning
• Learning from Data
• Knowledge Discovery
Data Science: the origins
Many rediscoveries of data
analysis in the last 20 years
1970s: Peter Naur introduces “data science” as a
synonym for “computer science”
1997: Jeff Wu claims “statisticians” are “data scientists”.
2001: William Cleveland introduces data science as an
independent discipline, extending statistics.
2008: DJ Patil (LinkedIn) and Jeff
Hammerbacher (Facebook) describe their job
role as that of “Data Scientist”
Data Science: the origins
The term has been trending since 2008
38 years
What about Big Data?
• Volume: SQL, HDFS
• Velocity: complex event processing, Apache Storm, Apache Spark Streaming
• Variety: structured, semi-structured and unstructured data
(social graphs, system logs, tweets/blogs, CCTV);
many variables, sampling variability (e.g., spatiotemporal)
What about Big Data?
Volume
Velocity
Variety
Veracity
Value
Nobody wants data.
Everybody wants data-driven
reliable actionable insights.
Big Data in Science
CERN
1 Petabyte per day
10 GB per second
Astrostatistics
Biomedical
Climatology
Big Data in Science
Models guided by theory
Well formulated questions
Big Data in the Commercial World
Little to no theory
“Needle in the haystack”
Big Data in the Commercial World
Example: car loan provider
• Online advertising: saw an ad, clicked, browsed, converted; cookie info
• Credit scoring data: application data submitted, credit bureau queried,
credit score computed, interest rate tailored, loan offered
• Behavioural data: timely payments for 3 months, delayed 4th payment,
delayed 5th payment
• External data: social media data, public info about employer,
demographic data, macroeconomic data
• Collections: sent letter, no reply; telephoned, non-cooperative;
in-person visit
Data silos
No substantive theory
Often question is unclear (“fishing”)
Data quality low
Not necessarily that Big
Variety of data
Statistical Methodology
Formulate question, get data
Exploratory Data Analysis: histograms, density plots, xy-plots, summary stats
Model and Variable Selection, Model Fitting, Model Diagnostics:
variable selection, dimensionality reduction, model averaging (ensembles),
cross-validation, bootstrapping, QQ plots, outlier detection, …
Inference: X, Y, Z have an effect on W
Prediction: classification, regression, forecasting
Anomaly / Change Detection
Statistical Methodology
Bayesian vs Classical
Classical: data are noisy, parameters are fixed but unknown.
We use probability distributions to model the noise.
Bayesian: we use probability distributions to model our
uncertainty about both the data and the parameters
In practice:
Bayesians “average” over their uncertainty a lot. This means
they use a lot of numerical integration (recently: Monte Carlo).
Everything has a probability distribution. Some are subjective.
Frequentists usually report “their best guess”. They use a lot of
classical optimisation (gradient descent etc.) - faster. In cases
where the variation is simple/physical, less subjective.
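The contrast above can be sketched on a toy problem that is not in the original talk: estimating the bias of a coin from 7 heads in 10 flips. The classical route reports the single value that maximises the likelihood; the Bayesian route averages over a posterior distribution, here by Monte Carlo. The Beta(1, 1) prior, the data and the function names are illustrative assumptions.

```python
import random

def mle_bernoulli(successes, trials):
    """Classical point estimate: the value of p that maximises the likelihood."""
    return successes / trials

def posterior_mean_mc(successes, trials, a=1.0, b=1.0, draws=20000, seed=0):
    """Bayesian estimate: average over the Beta posterior by Monte Carlo.

    With a Beta(a, b) prior and Bernoulli data the posterior is
    Beta(a + successes, b + failures). We sample from it instead of
    using the closed form, to mimic the numerical integration step.
    """
    rng = random.Random(seed)
    alpha = a + successes
    beta = b + (trials - successes)
    samples = [rng.betavariate(alpha, beta) for _ in range(draws)]
    return sum(samples) / draws

best_guess = mle_bernoulli(7, 10)      # a single "best guess"
averaged = posterior_mean_mc(7, 10)    # an average over uncertainty
print(best_guess, round(averaged, 2))
```

The prior is the subjective ingredient: changing `a` and `b` moves the Bayesian answer, while the classical estimate depends on the data alone.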
Statistical Methodology
Data Mining and Pattern Recognition
• Focus on pattern extraction rather than inference
• Often no question formulated in advance
Machine Learning
• Focus on prediction (out-of-sample error)
• Largely more automatic, black-box techniques are OK
• Huge success stories in stylised worlds
• Onus on the user to fit their problem into one of only a few
“templates” (classification, regression) - carries big risks.
Deep Learning and Cognitive AI
• Aims to replicate human cognition, low to mid-level faculties
such as vision, hearing, natural language understanding.
• Can share methods with statistics/probabilistic modelling,
but is mostly fundamentally different in its approach.
Statistical Methodology
ANALYTICS vs LEARNING
Analytics: retrospective summaries
a matter of resources to
compute the exact answer
(storage, distributed queries,
parallel computation, …)
Learning: generalisation
mathematics
probability theory
numerical optimisation
logic and algorithms
no “exact” answer
Statistical Methodology
Takeaways:
• Black boxes aren’t enough
• More Data != More Information
• Big Data needs Big Models
• Quantity vs Quality vs Homogeneity


Black boxes aren’t enough
Peter Norvig:
Statement largely driven by “quantum step” in machine translation
offered by black-box (neural net) techniques, compared to explicit
grammar models and classical natural language processing tools
Black-box AI is experiencing a second coming. However, it does
rely on (nearly commoditised) natural language preprocessing
tools for keyword extraction, named entity recognition etc.




Almost never true. Even if generalisation is not needed, there are
always sources of error (measurement, nonresponse), as well as
latent factors (e.g., the effect of X on Y, correlation, causality).
More Data != More Information
20 years’ worth of credit scoring data, but …
• Only one snapshot of each applicant’s behaviour
• Unknown levels of demographic variability
• Unknown levels of temporal variability
With more data (usually) comes more heterogeneity:
one could say that Big Data = Many Small Datasets
Databases went from flat to relational to noSQL, but
most commodity models are pre-relational!
Models are not as re-usable as people think (for
example, a decision tree might be a good predictor
but a poor customer segmentation tool)
More Data != More Information
The signal sometimes simply isn’t there
Substantive theory (and common
sense) are still needed.
External (unobserved) factors,
inherent unpredictability
Biased sampling (observational vs
prospective - e.g., A/B testing). The
lost art of survey sampling (elections?)
Big Data needs Big Models
With enough data, everything is significant
This assumes the model is right and the data i.i.d.
• Bigger data typically means more sources of variation
• Model complexity should grow with the data (Kolmogorov)
[Figure: two panels, “Small Data” and “Bigger Data”, plotting Response against Attribute; each panel shows the data points, the Truth, a Complex model and a Simple model]
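The claim that with enough data everything is significant can be sketched with a two-sided z-test. This is a simplified illustration under stated assumptions: the standard deviation is known, the observed mean is taken to equal a hypothetical shift of 0.01 exactly, and only the sample size changes.

```python
import math

def z_test_p_value(sample_mean, null_mean, sd, n):
    """Two-sided p-value of a z-test for the mean with known standard deviation."""
    z = (sample_mean - null_mean) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: a practically negligible shift of 0.01 standard deviations.
effect, sd = 0.01, 1.0
p_values = {n: z_test_p_value(effect, 0.0, sd, n) for n in (100, 10_000, 1_000_000)}
for n, p in p_values.items():
    print(n, p)
```

The effect is identical in all three cases; only the sample size moves the p-value from unremarkable to overwhelming. The test also takes the model for granted, which is the assumption the slide questions.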
Big Data needs Big Models
Personally a big fan of Bayesian non-parametrics.
Zoubin Ghahramani thinks it’s
“the rise of the automated statistician”
Big Data needs Big Models
Fat Data vs Tall Data
Sometimes bigger means more features for the same
examples: curse of dimensionality. Modern techniques for
sparse learning (p >> n) are a great aid (e.g., Lasso)
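[Figure: a “fat” table with few rows (ID 1, 2, 3, 4, ...) and many columns (ID, Age, Income, Tweet, Tweet, Tweet, ...), next to a “tall” table with many rows (ID 1 to 8, ...) and few columns (ID, Age, Income)]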
Big Data needs Big Models
Fat Data vs Tall Data
Consider recommender systems. As data grows:
• more items, more users
• each user ranks a fixed number of items: sparser matrices
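The sparsity argument above can be checked with simple arithmetic. The sketch below assumes, hypothetically, that every user rates exactly 20 items while the numbers of users and items grow together; the sizes are made up for illustration.

```python
def rating_matrix_density(n_users, n_items, ratings_per_user):
    """Fraction of the user-by-item matrix that is actually observed."""
    observed = n_users * min(ratings_per_user, n_items)
    return observed / (n_users * n_items)

# Hypothetical growth path: users and items grow, ratings per user stay at 20.
sizes = [(1_000, 500), (100_000, 50_000), (10_000_000, 5_000_000)]
densities = [rating_matrix_density(u, i, 20) for u, i in sizes]
for (u, i), d in zip(sizes, densities):
    print(u, i, d)
```

The total number of ratings grows, yet the share of the matrix that is filled in shrinks in proportion to the number of items.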
Big Data needs Big Models
Temporal homogeneity: the hidden bottleneck
At one extreme, one could ignore all past data as irrelevant
At the other one could assume the future is like the past
Solutions in the middle include dynamic modelling (very
complicated and computationally expensive), and exponential
filters of various specifications (my field of expertise)
[Figure: Density against X, showing the Prior, the Posterior, the Posterior with power prior and the Posterior with flat prior]
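A minimal sketch of one exponential filter, assuming the simplest specification (an exponentially weighted mean with a fixed forgetting factor). The function name, the parameter name `forgetting` and the test stream are illustrative and do not come from the talk.

```python
def exponential_filter(stream, forgetting=0.9):
    """Exponentially weighted mean of a stream.

    forgetting = 1.0 weights all past data equally (the future is assumed
    to be like the past); forgetting = 0.0 keeps only the latest
    observation (all past data are ignored).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    estimates = []
    for x in stream:
        # Down-weight everything seen so far, then add the new observation.
        weighted_sum = forgetting * weighted_sum + x
        total_weight = forgetting * total_weight + 1.0
        estimates.append(weighted_sum / total_weight)
    return estimates

# Hypothetical stream whose level jumps from 0 to 10 halfway through.
stream = [0.0] * 50 + [10.0] * 50
for lam in (1.0, 0.9, 0.0):
    print(lam, round(exponential_filter(stream, lam)[-1], 3))
```

The two extremes on the slide correspond to `forgetting = 0.0` and `forgetting = 1.0`; values in between trade responsiveness against stability.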
Big Data needs Big Models
Temporal homogeneity: the hidden bottleneck
Sometimes there is nothing to do
[Figure: two scatter plots of X2 against X1, each showing points from Class 1 and Class 2]
Big Data needs Big Models
Temporal homogeneity: the hidden bottleneck
What looks like drift for one model might not be for another,
especially when the population, not the concept, is drifting
[Figure: scatter plot of y against X, distinguishing old data from new data]
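The point about population drift can be sketched with a noise-free toy example. Assume, hypothetically, that the relationship is y = x² throughout and that only the range of x moves between the old and the new data. A straight-line model then appears to drift, while a quadratic model does not.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def fit_quadratic_coefficient(xs, ys):
    """Least squares for y = c * x**2 (one free parameter)."""
    return sum(x * x * y for x, y in zip(xs, ys)) / sum(x ** 4 for x in xs)

def concept(x):
    # Hypothetical fixed relationship: the same for old and new data.
    return x * x

old_x = [-3 + 0.1 * i for i in range(21)]  # population on [-3, -1]
new_x = [1 + 0.1 * i for i in range(21)]   # population has moved to [1, 3]
old_y = [concept(x) for x in old_x]
new_y = [concept(x) for x in new_x]

old_slope = fit_line(old_x, old_y)[1]
new_slope = fit_line(new_x, new_y)[1]
c_old = fit_quadratic_coefficient(old_x, old_y)
c_new = fit_quadratic_coefficient(new_x, new_y)
print(old_slope, new_slope, c_old, c_new)
```

The fitted slope of the linear model changes sign between the two samples, which a monitoring system would flag as drift, whereas the quadratic coefficient is the same on both.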
Big Data needs Big Models
Robustness
Important to have built-in guarantees. Robustness and model
diagnostics are the unsung heroes of classical statistics.
Complicating the assumption set sometimes leads to overly
complex models. Robustness is often the expedient solution.
Do not torture the data
The Wall Street Journal:
“Big Data Unveils Some Weird Correlations”
• orange used cars are more reliable
• taller people are better at repaying loans
[Figure: Density against X, showing the Prior, the Posterior, the Posterior with power prior and the Posterior with flat prior]
• http://www.tylervigen.com 
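How “weird correlations” arise from searching many variables can be sketched with pure noise. In the simulation below every feature is generated independently of the target, so any correlation that passes the significance threshold is spurious by construction. The sample size, the number of features and the seed are arbitrary choices for illustration.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n_obs, n_features = 20, 1000
target = [rng.gauss(0, 1) for _ in range(n_obs)]
# Every feature is pure noise, independent of the target by construction.
features = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_features)]

# |r| > 0.444 is approximately the two-sided 5% critical value for n = 20.
threshold = 0.444
spurious = sum(abs(pearson(f, target)) > threshold for f in features)
print(spurious, "of", n_features, "noise features look significant")
```

Roughly one feature in twenty passes the test although none carries any signal, which is why hypotheses are best specified before looking at the data.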

Streaming data
Exact answers are sometimes possible (e.g., running mean)
But sometimes they are not (e.g., top-K, median)
Streaming approximate algorithms are fast, and can be very
accurate, but they can be complicated (e.g., hyperloglog).
Keep constant memory footprint.
Keep up (do not queue)
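The running mean mentioned above is a case where the exact answer is available in constant memory. A minimal sketch follows; the class name and the data are illustrative.

```python
class RunningMean:
    """Exact streaming mean in constant memory (one count, one float)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental form of the mean: no past observations are stored.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
rm = RunningMean()
for x in data:
    rm.update(x)

batch_mean = sum(data) / len(data)
print(rm.mean, batch_mean)
```

Each update costs a fixed amount of work, so the computation keeps up with the stream. The median has no comparable constant-memory update in general, which is why approximate algorithms are used for it.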
Streaming data
However, in Machine Learning, there are no “exact” answers.
Will batch always outperform streaming (more resources)?
• Temporal heterogeneity (drift)
• Simulated annealing
• Overfitting (prequential learning)
www.ment.at/blog.html
Keep constant memory footprint.
Keep up (do not queue)
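Prequential learning evaluates a model on each observation before training on it, so every data point is an out-of-sample test. The sketch below applies this to a hypothetical stream with a level shift and compares a model that weights all history equally with one that forgets. The predictor (an exponentially weighted mean) and the forgetting factor 0.8 are assumptions chosen for illustration.

```python
def prequential_error(stream, forgetting=1.0):
    """Mean squared one-step-ahead error of an exponentially weighted mean."""
    weighted_sum, total_weight = 0.0, 0.0
    squared_errors = []
    for x in stream:
        if total_weight > 0:
            # Test first: predict x before it is used for training.
            prediction = weighted_sum / total_weight
            squared_errors.append((x - prediction) ** 2)
        # Then train: fold x into the model.
        weighted_sum = forgetting * weighted_sum + x
        total_weight = forgetting * total_weight + 1.0
    return sum(squared_errors) / len(squared_errors)

# Hypothetical drifting stream: the level jumps from 0 to 10 halfway through.
stream = [0.0] * 50 + [10.0] * 50
batch_like = prequential_error(stream, forgetting=1.0)  # all history, equal weight
streaming = prequential_error(stream, forgetting=0.8)   # recent data weighted more
print(batch_like, streaming)
```

Under drift the forgetting model has the lower prequential error, so access to more history does not by itself guarantee better predictions.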
Infrastructure
I haven’t discussed infrastructure as much. It’s critical.
If you are late, sometimes you might as well give up.
Parallelisation (e.g., GPUs), distribution (e.g., HDFS),
streaming (e.g., Spark Streaming), λ-architectures …
Algorithms often need to be designed from scratch.
Great progress in this direction. Keep working on it!
datastream.io
additional deployment options
How to manage data scientists
Treat negative results like you treat positive results
Encourage lab reports: data analysis is a process.
Do not overfit. Do not fish for p-values. Do not torture the data.
Specify hypotheses in advance whenever possible. Then test.
Black box solutions are great for prediction. Only.
Do not silo data scientists. Incorporate expert knowledge
whenever possible. Explicit prior beliefs are not a bias risk.
Conclusions
• Knowledge is power. Knowledge relies on data. 

• The process of extracting knowledge from data has
become more efficient and more powerful than ever –
but it’s still far from automatic (we are working on it ...) 

• Big Data needs Big Models 

• More Data != More Information 

• A Data Scientist is a team, not an individual 

Afterthought
What about strong Artificial Intelligence?
Machines are outperforming
humans in an increasingly broad
array of cognitive tasks.
Last time this happened we had the
Industrial Revolution.
Data Science is at the cusp of this wave. This is an
exciting time, but it also carries a lot of responsibility.
Afterthought
If machines replace us, there will only be one profession left
AI programmers and Data Scientists

Más contenido relacionado

La actualidad más candente

Data+Science : A First Course
Data+Science : A First CourseData+Science : A First Course
Data+Science : A First CourseArnab Majumdar
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG DataPrasant Misra
 
Data science
Data scienceData science
Data science9diov
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
hariri2019.pdf
hariri2019.pdfhariri2019.pdf
hariri2019.pdfAkuhuruf
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataJongwook Woo
 
Data and Knowledge as Commodities
Data and Knowledge as CommoditiesData and Knowledge as Commodities
Data and Knowledge as CommoditiesMathieu d'Aquin
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science Mahesh Kumar CV
 
From Rocket Science to Data Science
From Rocket Science to Data ScienceFrom Rocket Science to Data Science
From Rocket Science to Data ScienceSanghamitra Deb
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIJongwook Woo
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Edureka!
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learningGiuseppe Manco
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI Webina...
Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI  Webina...Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI  Webina...
Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI Webina...Pistoia Alliance
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceFerdin Joe John Joseph PhD
 
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroKeynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroData ScienceTech Institute
 

La actualidad más candente (20)

Data+Science : A First Course
Data+Science : A First CourseData+Science : A First Course
Data+Science : A First Course
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
 
Data science
Data scienceData science
Data science
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
hariri2019.pdf
hariri2019.pdfhariri2019.pdf
hariri2019.pdf
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
Data and Knowledge as Commodities
Data and Knowledge as CommoditiesData and Knowledge as Commodities
Data and Knowledge as Commodities
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
 
From Rocket Science to Data Science
From Rocket Science to Data ScienceFrom Rocket Science to Data Science
From Rocket Science to Data Science
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AI
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI Webina...
Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI  Webina...Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI  Webina...
Pistoia Alliance Webinar Demystifying AI: Centre of Excellence for AI Webina...
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroKeynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
 

Similar a Why Data Science is a Science

Data Science versus Artificial Intelligence: a useful distinction
Data Science versus Artificial Intelligence: a useful distinctionData Science versus Artificial Intelligence: a useful distinction
Data Science versus Artificial Intelligence: a useful distinctionChristoforos Anagnostopoulos
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data scienceJordan Engbers
 
Introduction to Data Science 1115.pptx
Introduction to Data Science 1115.pptxIntroduction to Data Science 1115.pptx
Introduction to Data Science 1115.pptxmark828
 
Introduction to Data Science 1117.pptx
Introduction to Data Science 1117.pptxIntroduction to Data Science 1117.pptx
Introduction to Data Science 1117.pptxmark828
 
Introduction to Data Science 1116.pptx
Introduction to Data Science 1116.pptxIntroduction to Data Science 1116.pptx
Introduction to Data Science 1116.pptxmark828
 
SuanIct-Bigdata desktop-final
SuanIct-Bigdata desktop-finalSuanIct-Bigdata desktop-final
SuanIct-Bigdata desktop-finalstelligence
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfArmyTrilidiaDevegaSK
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedPhilip Bourne
 
Introduction to Data Science 1114.pptx
Introduction to Data Science 1114.pptxIntroduction to Data Science 1114.pptx
Introduction to Data Science 1114.pptxmark828
 
Introduction to Data Science 1113.pptx
Introduction to Data Science 1113.pptxIntroduction to Data Science 1113.pptx
Introduction to Data Science 1113.pptxmark828
 
MBA-TU-Thailand:BigData for business startup.
MBA-TU-Thailand:BigData for business startup.MBA-TU-Thailand:BigData for business startup.
MBA-TU-Thailand:BigData for business startup.stelligence
 
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018PAÍS DIGITAL
 
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIMAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIBig Data Week
 
Introduction to Data Science 1118.pptx
Introduction to Data Science 1118.pptxIntroduction to Data Science 1118.pptx
Introduction to Data Science 1118.pptxmark828
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and butest
 
Introduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxIntroduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxdatapro2
 
Introduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxIntroduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxSanmati Jain
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Natalino Busa
 

Similar a Why Data Science is a Science (20)

Data Science versus Artificial Intelligence: a useful distinction
Data Science versus Artificial Intelligence: a useful distinctionData Science versus Artificial Intelligence: a useful distinction
Data Science versus Artificial Intelligence: a useful distinction
 
Why Data Science is a Science

  • 1. Why Data Science is a Science Dr. Christoforos Anagnostopoulos Founder and Chief Data Scientist, Mentat Innovations Lecturer in Statistics (on leave), Imperial College London Mentat Innovations
  • 2. Credentials BA Mathematics at Cambridge University MSc Machine Learning at Edinburgh University MSc Logic and Computer Science at Athens University PhD in Machine Learning for Data Streams at Imperial Postdoc Fellow at Statistical Laboratory, Cambridge Uni. Lecturer in Statistics at Imperial College Founder and Chief Scientist of Mentat Innovations
  • 3. Credentials PhD in Machine Learning for Data Streams at Imperial Postdoc Fellow at Statistical Laboratory, Cambridge Uni. Lecturer in Statistics at Imperial College Founder and Chief Scientist of Mentat Innovations Numerous consulting projects in real-time data analysis: • social media analysis, sensor network telemetry, online RTB advertising, cybersecurity and fraud, retail banking • engaged with data journalism on several occasions (The Independent, The Guardian, BBC, …) Mentat Innovations is pioneering real-time anomaly detection on network, application and telemetry data
  • 4. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: Prof. David Hand, OBE (Chairman of Advisory Board of Mentat) Renowned statistician, twice president of Royal Statistical Society Authority on pattern recognition and data mining for retail finance
  • 5. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: Professor Niall Adams, Imperial College London Machine Learning expert Data Mining in CyberSecurity pioneer
  • 6. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: Professor David Leslie, Lancaster University World-wide expert in machine learning within game theory
  • 7. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: George Cotsikis (CEO and co-Founder of Mentat) Entrepreneur, 17 years' experience in quantitative finance
  • 9. Data Science: the origins Courtesy of Cathy O’Neil and Rachel Schutt
  • 10. Data Science: the origins Data Mining Pattern Recognition Statistical Modelling Business Intelligence Many rediscoveries of data analysis in the last 20 years Neural Nets Knowledge Discovery
  • 11. Data Science: the origins Data Mining Pattern Recognition Statistical Modelling Analytics Business Intelligence Predictive Analytics Many rediscoveries of data analysis in the last 20 years Big Data Search and Information Retrieval Neural Nets Knowledge Discovery
  • 12. Data Science: the origins Data Mining Pattern Recognition Machine Learning Statistical Modelling Analytics Business Intelligence Predictive Analytics Many rediscoveries of data analysis in the last 20 years Big Data Search and Information Retrieval Natural Language Processing Neural Nets Deep Learning Knowledge Discovery
  • 13. Data Science: the origins Data Mining Pattern Recognition Machine Learning Statistical Modelling Analytics Business Intelligence Predictive Analytics Many rediscoveries of data analysis in the last 20 years Big Data Search and Information Retrieval Natural Language Processing Neural Nets Deep Learning Learning from Data Knowledge Discovery
  • 14. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym for “computer science”
  • 15. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym for “computer science” 1997: Jeff Wu claims “statisticians” are “data scientists”.
  • 16. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym for “computer science” 1997: Jeff Wu claims “statisticians” are “data scientists”. 2001: William Cleveland introduces data science as an independent discipline, extending statistics.
  • 17. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym for “computer science” 1997: Jeff Wu claims “statisticians” are “data scientists”. 2001: William Cleveland introduces data science as an independent discipline, extending statistics. 2008: DJ Patil (LinkedIn) and Jeff Hammerbacher (Facebook) describe their job role as that of “Data Scientist”
  • 18. Data Science: the origins Term became trending since 2008 38 years
  • 19. What about Big Data? Volume SQL HDFS
  • 20. What about Big Data? Volume SQL HDFS Velocity complex event processing Apache Storm Apache Spark Streaming
  • 21. What about Big Data? Volume SQL HDFS Velocity complex event processing Apache Storm Apache Spark Streaming Variety structured semi-structured unstructured social graphs, system logs, tweets/blogs, CCTV many variables, sampling variability (e.g., spatiotemporal)
  • 22. What about Big Data? Volume Velocity Variety Veracity Value Nobody wants data. Everybody wants data-driven reliable actionable insights.
  • 23. Big Data in Science CERN 1 Petabyte per day 10 GB per second Astrostatistics Biomedical Climatology
  • 24. Big Data in Science Models guided by theory Well formulated questions Big Data in the Commercial World Little to no theory “Needle in the haystack”
  • 25. Big Data in the Commercial World Example: car loan provider Online advertising Saw an ad Clicked Browsed Converted Cookie Info
  • 26. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Application data submitted Credit bureau queried Credit scoring computed Interest rate tailored Loan offered
  • 27. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data Timely payments for 3 months Delayed 4th payment Delayed 5th payment
  • 28. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data External data Social media data Public info about employer Demographic data Macroeconomic data
  • 29. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data External data Collections Sent letter, no reply Telephoned, non-cooperative In-person visit
  • 30. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data External data Collections Data silos No substantive theory Often question is unclear (“fishing”) Data quality low Not necessarily that Big Variety of data
  • 31. Statistical Methodology Exploratory Data Analysis Formulate question, get data
  • 32. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Statistical Methodology Formulate question, get data
  • 33. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data
  • 34. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats
  • 35. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats variable selection, dimensionality reduction, model averaging (ensembles), Cross-Validation, bootstrapping, QQ plots, outlier detection,…
  • 36. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats variable selection, dimensionality reduction, model averaging (ensembles), Cross-Validation, bootstrapping, QQ plots, outlier detection,… classification regression forecasting X,Y,Z have an effect on W
  • 37. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats variable selection, dimensionality reduction, model averaging (ensembles), Cross-Validation, bootstrapping, QQ plots, outlier detection,… classification regression forecasting X,Y,Z have an effect on W Anomaly / Change Detection
  • 38. Statistical Methodology Bayesian vs Classical Classical: data are noisy, parameters are fixed but unknown. We use probability distributions to model the noise. Bayesian: we use probability distributions to model our uncertainty about both the data and the parameters
  • 39. Statistical Methodology Bayesian vs Classical Classical: data are noisy, parameters are fixed but unknown. We use probability distributions to model the noise. Bayesian: we use probability distributions to model our uncertainty about both the data and the parameters In practice: Bayesians “average” over their uncertainty a lot. This means they use a lot of numerical integration (recently: Monte Carlo). Everything has a probability distribution. Some are subjective. Frequentists usually report “their best guess”. They use a lot of classical optimisation (gradient descent etc.) - faster. In cases where the variation is simple/physical, less subjective.
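The contrast above can be sketched in a few lines. This is a minimal illustration, not something taken from the talk: the click-through counts and the flat Beta(1, 1) prior are assumptions chosen for the example. The classical answer is a single best guess, while the Bayesian answer is a whole distribution whose mean averages over the remaining uncertainty.

```python
def mle_rate(successes: int, trials: int) -> float:
    """Classical point estimate: the maximum-likelihood 'best guess'."""
    return successes / trials


def beta_posterior(successes: int, trials: int, a: float = 1.0, b: float = 1.0):
    """Bayesian conjugate update of a Beta(a, b) prior after binomial data.

    Returns the two parameters of the Beta posterior distribution.
    """
    return a + successes, b + (trials - successes)


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b): the rate averaged over the uncertainty."""
    return a / (a + b)


if __name__ == "__main__":
    k, n = 3, 10  # hypothetical data: 3 conversions in 10 impressions
    print("Classical best guess:", mle_rate(k, n))
    a_post, b_post = beta_posterior(k, n)  # flat prior is an assumption
    print("Posterior parameters:", a_post, b_post)
    print("Posterior mean:", beta_mean(a_post, b_post))
```

With little data the two answers differ noticeably, and the gap shrinks as the number of trials grows, because the data then dominate the prior.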
  • 40. Statistical Methodology
 Data Mining and Pattern Recognition
 • Focus on pattern extraction rather than inference
 • Often no question formulated in advance
 Machine Learning
 • Focus on prediction (out-of-sample error)
 • Largely more automatic, black-box techniques are OK
 • Huge success stories in stylised worlds
 • Onus on the user to fit their problem into one of only a few “templates” (classification, regression) - carries big risks.
 Deep Learning and Cognitive AI
 • Aims to replicate human cognition, low to mid-level faculties such as vision, hearing, natural language understanding.
 • Can share methods with statistics/probabilistic modelling, but is mostly fundamentally different in its approach.
  • 43. Statistical Methodology ANALYTICS vs LEARNING
 Analytics: retrospective summaries; a matter of resources to compute the exact answer (storage, distributed queries, parallel computation, …)
 Learning: generalisation; mathematics, probability theory, numerical optimisation, logic and algorithms; no “exact” answer
  • 44. Statistical Methodology Takeaways: • Black boxes aren’t enough • More Data != More Information • Big Data needs Big Models • Quantity vs Quality vs Homogeneity 

  • 45. Black boxes aren’t enough Peter Norvig: Statement largely driven by “quantum step” in machine translation offered by black-box (neural net) techniques, compared to explicit grammar models and classical natural language processing tools Black-box AI is experiencing a second coming. However, it does rely on (nearly commoditised) natural language preprocessing tools for keyword extraction, named entity recognition etc. 
 
 Almost never true. Even if generalisation is not needed, there are always sources of error (measurement, nonresponse), as well as latent factors (e.g., the effect of X on Y, correlation, causality).
  • 46. More Data != More Information 20 years' worth of credit scoring data, but … • Only one snapshot of each applicant’s behaviour • Unknown levels of demographic variability • Unknown levels of temporal variability With more data (usually) comes more heterogeneity: one could say that Big Data = Many Small Datasets Databases went from flat to relational to NoSQL, but most commodity models are pre-relational! Models are not as re-usable as people think (for example, a decision tree might be a good predictor but a poor customer segmentation tool)
  • 47. More Data != More Information The signal sometimes simply isn’t there Substantive theory (and common sense) are still needed. External (unobserved) factors, inherent unpredictability Biased sampling (observational vs prospective - e.g., A/B testing). The lost art of survey sampling (elections?)
  • 48. Big Data needs Big Models With enough data, everything is significant This assumes the model is right and the data i.i.d. • Bigger data typically means more sources of variation • Model complexity should grow with the data (Kolmogorov) [Figure: two panels, “Small Data” and “Bigger Data”, plotting Response against Attribute; each shows the Truth, a Complex model and a Simple model]
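To make "with enough data, everything is significant" concrete, here is a small sketch under stated assumptions: a two-sided z-test with known noise level, where the observed mean shift is taken to equal a fixed, practically negligible effect. The function name and the numbers are illustrative and not from the talk.

```python
import math


def two_sided_p_value(effect: float, sigma: float, n: int) -> float:
    """p-value of a two-sided z-test for an observed mean shift `effect`,
    with known noise standard deviation `sigma` and sample size `n`."""
    z = effect / (sigma / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # The same tiny effect (1% of a standard deviation) at growing n.
    for n in (100, 10_000, 1_000_000):
        print(n, two_sided_p_value(0.01, 1.0, n))
```

The effect never changes, only the sample size does, yet the p-value collapses towards zero. Significance here says nothing about whether the effect matters, and it is only valid if the model is right and the data are i.i.d.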
  • 49. Big Data needs Big Models
  • 50. Big Data needs Big Models Personally a big fan of Bayesian non-parametrics. Zoubin Ghahramani thinks it’s “the rise of the automated statistician”
  • 51. Big Data needs Big Models Fat Data vs Tall Data Sometimes bigger means more features for the same examples: curse of dimensionality. Modern techniques for sparse learning (p >> n) are a great aid (e.g., Lasso) [Figure: a fat table (columns ID, Age, Income, Tweet, Tweet, Tweet, ...; rows 1 to 4, ...) next to a tall table (columns ID, Age, Income; rows 1 to 8, ...)]
  • 52. Big Data needs Big Models Fat Data vs Tall Data Consider recommender systems. As data grows: • more items, more users • each user ranks a fixed number of items: sparser matrices
  • 53. Big Data needs Big Models Temporal homogeneity: the hidden bottleneck At one extreme, one could ignore all past data as irrelevant At the other one could assume the future is like the past Solutions in the middle include dynamic modelling (very complicated and computationally expensive), and exponential filters of various specifications (my field of expertise) [Figure: Density against X, showing the Prior, the Posterior, the Posterior with power prior and the Posterior with flat prior]
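One simple instance of an exponential filter is a mean with a forgetting factor. The sketch below is an illustration only, not the speaker's own method; the class name and the parameter `lam` are assumptions. Setting `lam = 0` ignores all past data, `lam = 1` treats the future as being like the past, and values in between discount old observations geometrically.

```python
class ForgettingMean:
    """Exponentially weighted mean with forgetting factor `lam` in [0, 1].

    An observation that is k steps old carries weight lam ** k.
    """

    def __init__(self, lam: float = 0.9):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lam must lie in [0, 1]")
        self.lam = lam
        self.weight = 0.0  # total (discounted) weight seen so far
        self.mean = 0.0

    def update(self, x: float) -> float:
        # Discount the accumulated weight, then add the new observation.
        self.weight = self.lam * self.weight + 1.0
        self.mean += (x - self.mean) / self.weight
        return self.mean


if __name__ == "__main__":
    stream = [0.0] * 20 + [5.0] * 20  # hypothetical level shift halfway
    for lam in (0.0, 0.8, 1.0):
        f = ForgettingMean(lam)
        for x in stream:
            f.update(x)
        print(lam, round(f.mean, 3))
```

The update uses constant memory per stream, which is one reason filters of this kind are attractive compared with full dynamic models.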
  • 54. Big Data needs Big Models Temporal homogeneity: the hidden bottleneck Sometimes there is nothing to do [Figure: two scatter plots of X2 against X1, each showing Class 1 and Class 2]
  • 55. Big Data needs Big Models Temporal homogeneity: the hidden bottleneck What looks like drift for one model might not be for another, especially when the population, not the concept, is drifting [Figure: scatter plot of y against X, distinguishing old data from new data]
  • 56. Big Data needs Big Models Robustness Important to have built-in guarantees. Robustness and model diagnostics are the unsung heroes of classical statistics. Complicating the assumption set sometimes leads to overly complex models. Robustness is often the expedient solution.
  • 57. Do not torture the data The Wall Street Journal: “Big Data Unveils Some Weird Correlations” • orange used cars are more reliable • taller people are better at repaying loans [Figure: Density against X, showing the Prior, the Posterior, the Posterior with power prior and the Posterior with flat prior] • http://www.tylervigen.com
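A short calculation shows why weird correlations are bound to turn up when enough of them are tested. This sketch assumes independent tests in which no real effect exists, which is a simplification; the function name is illustrative.

```python
def chance_of_false_discovery(alpha: float, n_tests: int) -> float:
    """Probability of at least one 'significant' result among `n_tests`
    independent tests at level `alpha` when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        print(m, round(chance_of_false_discovery(0.05, m), 4))
```

At the usual 5% level, a hundred unplanned comparisons make a spurious finding close to certain, which is the arithmetic behind specifying hypotheses in advance.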

  • 58. Streaming data Exact answers are sometimes possible (e.g., running mean) But sometimes they are not (e.g., top-K, median) Streaming approximate algorithms are fast, and can be very accurate, but they can be complicated (e.g., hyperloglog). Keep constant memory footprint. Keep up (do not queue)
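The running mean mentioned above is a case where the exact answer is available in constant memory. The sketch below uses Welford's update, which also yields the exact variance in a single pass; the class and method names are assumptions made for the example.

```python
class RunningStats:
    """Exact mean and variance of a stream in O(1) memory (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self) -> float:
        """Population variance of everything seen so far."""
        return self._m2 / self.n if self.n else float("nan")


if __name__ == "__main__":
    stats = RunningStats()
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(x)  # each item is seen once and never stored
    print(stats.mean, stats.variance())
```

Quantities such as the median or the top-K items have no comparable constant-memory exact update, which is where approximate streaming algorithms come in.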
  • 59. Streaming data However, in Machine Learning, there are no “exact” answers. Will batch always outperform streaming (more resources)? • Temporal heterogeneity (drift) • Simulated annealing • Overfitting (prequential learning) www.ment.at/blog.html Keep constant memory footprint. Keep up (do not queue)
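Prequential ("test, then train") evaluation can be sketched as follows. This is a minimal illustration with a deliberately trivial model: the `MeanPredictor` class and the function name are hypothetical and stand in for any learner that exposes `predict` and `update`.

```python
from typing import Iterable


class MeanPredictor:
    """Toy streaming model: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self) -> float:
        return self.mean

    def update(self, y: float) -> None:
        self.n += 1
        self.mean += (y - self.mean) / self.n


def prequential_mse(stream: Iterable[float], model: MeanPredictor) -> float:
    """Score each observation BEFORE the model is allowed to learn from it."""
    total, count = 0.0, 0
    for y in stream:
        error = y - model.predict()  # test first, on unseen data
        total += error * error
        count += 1
        model.update(y)              # then train
    return total / count if count else float("nan")


if __name__ == "__main__":
    print(prequential_mse([1.0, 1.0, 1.0], MeanPredictor()))
```

Because every prediction is made on data the model has not yet seen, the accumulated error is an out-of-sample measure, so overfitting shows up directly rather than being hidden by in-sample fit.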
  • 61. Infrastructure I haven’t discussed infrastructure as much. It’s critical. If you are late, sometimes you might as well give up. Parallelisation (e.g., GPUs), distribution (e.g., HDFS), streaming (e.g., Spark Streaming), λ-architectures … Algorithms often need to be designed from scratch. Great progress in this direction. Keep working on it!
  • 64. How to manage data scientists Treat negative results like you treat positive results Encourage lab reports: data analysis is a process. Do not overfit. Do not fish for p-values. Do not torture the data. Specify hypotheses in advance whenever possible. Then test. Black box solutions are great for prediction. Only. Do not silo data scientists. Incorporate expert knowledge whenever possible. Explicit prior beliefs are not a bias risk.
  • 65. Conclusions • Knowledge is power. Knowledge relies on data. 
 • The process of extracting knowledge from data has become more efficient and more powerful than ever – but it’s still far from automatic (we are working on it ...) 
 • Big Data needs Big Models 
 • More Data != More Information 
 • A Data Scientist is a team, not an individual 

  • 66. Afterthought What about strong Artificial Intelligence? Machines are outperforming humans in an increasingly broad array of cognitive tasks. Last time this happened we had the Industrial Revolution. Data Science is at the cusp of this wave. This is an exciting time, but it also carries a lot of responsibility.
  • 67. Afterthought If machines replace us, there will only be one profession left: AI programmers and Data Scientists