Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro

https://www.datasciencetech.institute/

Data Science:
Past, Present, and Future
Gregory Piatetsky-Shapiro
KDnuggets
© KDnuggets 2016
(French title: La Science des données: passé, présent et futur)

Predicting Behavior –
Key to Survival
Better prediction – better intelligence

“Predictions”: Astrology
My May 26 Horoscope:
So what if things aren't
completely wonderful in your
life right now? Just keep your
hopes high, and your fingers
crossed. … Being with the
people who make you feel good
about yourself will help keep
your thoughts bright, so get
together with your closest
friend as soon as you can.
www.astrology.com/horoscope/daily/aries.html

“Predictions”: Turkish Coffee Grounds
If a big chunk of the coffee
grounds falls down on the saucer
then it is taken as the first positive
sign of your reading. “Trouble and
worries are leaving you”.

Pundits' “Predictions”
• Nate Silver FiveThirtyEight.com prediction for
Trump winning Republican nomination:
• Aug 2015: 2%
• Sep 2015: 5%
• Nov 2015: 6%
• Jan 2016: 12%
• May 2016: 99%

Desire to Predict – Deep Human Trait
Lessons:
• People are hard-wired to see patterns
• People want predictions
• Human intuition does not work well on large-scale data or for understanding probability
• A good story is essential to a convincing prediction (whether true or false)

Data Science
Data-Driven, Scientific
approach to prediction
and data analysis

Outline
• Intro, Data Science History and Terms
• 10 Real-World Data Science Lessons
• Data Science Now: Polls & Trends
• Data Science Roles
• Data Science Job Trends
• Data Science Future

What do we call it?
• Statistics
• Data Mining
• Knowledge Discovery in Data
(KDD)
• Predictive Analytics
• Data Analytics
• Data Science
• …?
Core Idea: Finding Useful Patterns in Data

Pre-history (1800-2008): Statistics
From Google Ngram viewer – English language books
Search case insensitive.
Other languages need to be considered for full picture
• “Statistics” is the most frequent term in the 20th century
• “Analytics” is used increasingly through the 20th century
• “Data mining” appears in the late 1990s

French Books, 1800-2008
Statistiques vs Mathematiques

“Data Mining” Surges in 1996
Advances in Knowledge Discovery and
Data Mining, AAAI/MIT Press, 1996, Eds:
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy
Analytics
Data Mining
KDD-95, 1st Conference on Knowledge
Discovery and Data Mining, Montreal
Google N-grams search case insensitive, smoothing 1

Earliest use of “data mining”: 1962
Source: Google Books
After eliminating many “following data. Mining cost is ” examples
which refer to Mining of minerals,
and books from “1958” that have a CD attached (errors in book year)
The earliest “data mining” reference I found is

Very Recent History
Using Google Trends

Google Trends, 2005-2016:
After 2006, Analytics > Data Mining
Global – all regions

>50% of “Analytics” searches are for
“Google Analytics”
Google Analytics introduced,
Dec 2005

Google Trends, 2005-2016
[Chart series: data science, analytics - Google, big data, data mining]

2012: Analytics down, Big Data up

2013: Data Science grows

Google Trends:
Machine Learning, Data Science,
Deep Learning

Google Trends: Machine Learning
Machine Learning ~ “Machine Learning”

Google Trends: Data Science
[Data Science] != “Data Science”
Lesson: Examine assumptions carefully
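
The difference between the two query forms can be illustrated with a toy matcher. This is only a sketch under the assumption that an unquoted query matches searches containing all of its words in any order, while a quoted query matches the exact phrase. The function names and sample searches are hypothetical, and this is not how Google Trends is implemented.

```python
def matches_all_words(query: str, search: str) -> bool:
    """Unquoted-style match: every query word appears, in any order."""
    words = search.lower().split()
    return all(w in words for w in query.lower().split())


def matches_exact_phrase(query: str, search: str) -> bool:
    """Quoted-style match: the query words appear adjacent and in order."""
    q = query.lower().split()
    words = search.lower().split()
    return any(words[i:i + len(q)] == q for i in range(len(words) - len(q) + 1))


# Hypothetical search strings, for illustration only
searches = [
    "data science jobs",
    "computer science data structures",
    "science of data",
    "what is data science",
]

loose = [s for s in searches if matches_all_words("data science", s)]
exact = [s for s in searches if matches_exact_phrase("data science", s)]

print(len(loose), "searches match [data science]")
print(len(exact), "searches match the quoted phrase")
```

In this toy data the unquoted form also counts searches such as "computer science data structures", which is why the two trend lines can diverge.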

Regional Interest in
“Data Science” in 2015
Google Trends
Note: search for “Data Science” is
different from [Data Science]

KDnuggets Audience by Region, Q1 2016

Data Science History
• < 1900 - Statistics
• 1960s Data Mining = bad activity, data “dredging”
• 1990 - “Data Mining” is good, surges in 1996
• 2003 - “Data Mining” peaks (bad in press, invasion of
privacy?), slowly declines, but still popular
• 2006 - Google Analytics
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data
• 2014 - Data Science
• 2015 - Deep Learning
• 2018 - ??

10 Real-World Lessons
from the Art & Practice
of Data Science &
Data Mining

Lesson 1: It Is an Iterative, Circular Process
Waterfall model does NOT work for Data Science

CRISP-DM: Iterative, Circular Process
See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
Data Mining Process – CRISP-DM, 1998
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Academic Data Science Process
See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
Harvard, 2013

Machine Learning Workflow, MS Azure
See www.kdnuggets.com/2016/04/developers-need-know-about-machine-learning.html
blogs.msdn.microsoft.com/continuous_learning/2014/11/15/end-to-end-predictive-model-in-azureml-using-linear-regression/

Lesson 2: Data Engineering Takes the Bulk of Time
• Building Machine Learning/Predictive Models is the key (and most fun) part, but only a small part of the whole process
• 60-80% (?) of the time is spent on Data Preparation/Engineering

Competitions are different
March Machine Learning Mania 2016,
Winner's Interview: 1st Place, Miguel Alomar
https://twitter.com/kdnuggets/status/730417186167263232
http://blog.kaggle.com/2016/05/10/march-machine-learning-mania-2016-winners-interview-1st-place-miguel-alomar/
How #MachineLearning @Kaggle winner spent time:
• 35% read forums
• 25% build models
• 25% evaluate results
• 15% data preparation

Lesson 3: Question Assumptions
Problem:
Deciles not uniform
Decile 1 is too rare,
Decile 0 – too frequent?
Why ?
* Not actual data
Measurement

Mass Spectrometry
Mass spectrometry (MS) is an
analytical technique that ionizes
chemical species and sorts the
ions based on their mass to
charge ratio.
Can produce a large number
(~ 20,000) of
m/z values for a sample
Goal: find biomarkers for
disease, test, condition

Question Assumptions
Instead of measurement deciles, examine the actual value ranges, including 0:
• Nothing between 1 and 14
• Value 0 is too frequent
Why?
* Not actual data

Someone added a rule to round
raw measurement values
below 15 to zero
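
A check of this kind can be sketched in a few lines. The measurement values below are made up for illustration, and the helper name `missing_values` is hypothetical. The only detail taken from the slide is the rule that raw values below 15 were rounded to zero.

```python
def missing_values(values, low, high):
    """Return the integers in [low, high] that never occur in the data."""
    present = set(values)
    return [v for v in range(low, high + 1) if v not in present]


# Hypothetical raw measurements (not the data from the talk)
raw = [3, 7, 12, 15, 18, 22, 40, 9, 16, 0, 31]

# The rule described on the slide: values below 15 are stored as 0
stored = [0 if v < 15 else v for v in raw]

gap = missing_values(stored, 1, 14)
zero_share = stored.count(0) / len(stored)

print("values never observed between 1 and 14:", gap)
print("share of zeros: {:.0%}".format(zero_share))
```

Looking at raw value counts rather than deciles makes both symptoms visible at once: an empty range just above zero and an inflated count at zero.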

“The best data scientists have one thing in common – unbelievable curiosity”
DJ Patil, first US Chief Data Scientist, April 2016
http://www.sciencefriday.com/articles/10-questions-for-the-nations-first-chief-data-scientist

Lesson 4: Focus on the Right Metric - Actionable
• Consumer: Churn may depend on age, region, usage, and rate plan. The rate plan is the easiest to change.
• Uplift Modeling in Marketing and Politics:
focus on persuadables

Right Metric: Uplift Modeling
Don’t model whether the consumer will buy.
Model whether the consumer will buy in response
to an offer
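
One simple way to see the idea is to compare response rates with and without the offer inside each customer segment. The sketch below assumes a randomized test with a treated group and a control group. The segment names and counts are invented, and this per-segment difference is a stand-in for a real uplift model, not the method used in the cited presentations.

```python
def response_rate(outcomes):
    """Share of customers who bought (outcomes are 0/1)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def uplift_by_segment(records):
    """records: iterable of (segment, got_offer, bought) tuples."""
    groups = {}
    for segment, got_offer, bought in records:
        groups.setdefault(segment, {True: [], False: []})[got_offer].append(bought)
    # Uplift = response rate with the offer minus response rate without it
    return {
        seg: response_rate(g[True]) - response_rate(g[False])
        for seg, g in groups.items()
    }


# Hypothetical test results: (segment, got_offer, bought)
records = (
    [("loyal", True, 1)] * 8 + [("loyal", True, 0)] * 2
    + [("loyal", False, 1)] * 8 + [("loyal", False, 0)] * 2
    + [("persuadable", True, 1)] * 5 + [("persuadable", True, 0)] * 5
    + [("persuadable", False, 1)] * 1 + [("persuadable", False, 0)] * 9
)

uplift = uplift_by_segment(records)
for segment, value in uplift.items():
    print(segment, round(value, 2))
```

In this made-up data the "loyal" segment buys most often, yet the offer changes nothing for it, while the "persuadable" segment is where the offer pays off.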

Right Metric: Uplift Modeling
From Eric Siegel presentation at PAW, 2011
In Obama 2012 Campaign
www.thefiscaltimes.com/Articles/2013/01/21/The-Real-Story-Behind-Obamas-Election-Victory

Lesson 5: Be a Fox, not a Hedgehog
Read Isaiah Berlin's 1953 essay, The Hedgehog and the Fox
A fox knows many things, but
a hedgehog knows one important thing.

Lesson 5: Modeling
No Free Lunch Theorem – no method is universally the best (Wolpert)
In Kaggle competitions, there are 2 ways to win (Anthony Goldbloom, 2016):
• Handcrafted feature engineering
• Or Deep Learning Neural Networks
www.kdnuggets.com/2016/01/anthony-goldbloom-secret-winning-kaggle-competitions.html
• XGBoost – winning method in many recent Kaggle competitions
• Ensemble methods
For Structured Data (Sebastian Raschka):
• SVM (Support Vector Machines) for smaller data
• Random Forests – more data, more automated
www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html
Unstructured:
• Deep Learning

Lesson 6: Avoid Overfitting
http://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
Many examples at http://tylervigen.com/spurious-correlations

Avoid Overfitting
“Irreproducible” results - a BIG problem in social
sciences, medicine:
John P. A. Ioannidis's famous paper Why Most Published
Research Findings Are False (PLoS Medicine, 2005).
Due to
• Small samples
• Testing too many hypotheses
• Confirmation bias (explicit or implicit)
• Poor training

How to Avoid Overfitting
• If it is too good to be true, it probably is
• Find the simplest possible hypothesis
• Adjusting the False Discovery Rate
• Randomization Testing
• Nested cross-validation (train, test, holdout)
• Regularization (adding a penalty for
complexity)
www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
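
As an illustration of adjusting for the false discovery rate, the sketch below implements the Benjamini-Hochberg step-up procedure, one common choice that the slides do not name explicitly. The p-values are invented for the example.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * fdr
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff
    return sorted(order[:cutoff_rank])


# Hypothetical p-values from testing ten hypotheses on the same data
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [i for i, p in enumerate(p_values) if p < 0.05]
adjusted = benjamini_hochberg(p_values, fdr=0.05)

print("significant without adjustment:", naive)
print("significant after adjustment:", adjusted)
```

With these numbers, five results pass the unadjusted 0.05 threshold but only two survive the adjustment, which is the "testing too many hypotheses" problem in miniature.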

Lesson 7: Tell a story
• Combine facts into a story
• Combine visual and text presentation
• Explanation gives credibility
• Dynamic / Interactive
• Examples: Kefir, Google Analytics, Quill

KEFIR (KEy FInding Reporter), 1994
• Overview report
www.kdnuggets.com/data_mining_course/kefir/overview.htm
• Inpatient admissions
www.kdnuggets.com/data_mining_course/kefir/s2.htm

Quill report for KDnuggets
• Sessions Stay Flat, But Way Higher Than 12-Month Weekly Average
• Sessions remained flat compared to the prior week. The 121,040
sessions, however, were above your 85,105-session weekly average
for the year. Your site's total pageviews stayed flat last week at
206,124, while pages per session grew less than a percent to 1.7.
That's equal to your weekly average for the year.
• Among all your pages, Analytics, Data Mining, and Data Science had
both the highest bounce rate (43%) and the most pageviews (8,734)
last week.

The Fortune Teller (La Diseuse de bonne aventure),
Caravaggio, 1595 (Louvre)
Beware of Fortune tellers!

Lesson 8: Limits to Predicting Human Behavior?
• Inherent randomness, complexity in human
behavior
• Individual predictions have limited accuracy
(but can still be better than random and very
useful for consumer analytics)
• Aggregate predictions (e.g. who will win the
election) are more accurate, because individual
randomness cancels out
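
The cancelling-out effect can be made concrete with the standard error of a proportion. This is a sketch under the strong assumption that individuals behave independently with the same probability, which real electorates do not strictly satisfy. The function name is hypothetical.

```python
import math


def std_error_of_share(p, n):
    """Standard error of the observed share among n independent individuals."""
    return math.sqrt(p * (1 - p) / n)


# Assumed probability that any one person makes a given choice
p = 0.5

for n in (1, 100, 10_000):
    print(n, round(std_error_of_share(p, n), 4))
```

Under these assumptions the uncertainty about one person is as large as it can be, while the uncertainty about the share in a group of 10,000 is about half a percentage point.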

Example: Netflix Prize, 2006
• The most advanced algorithms were only a few percent better
than basic algorithms
See Gregory Piatetsky, “Big Data: Hype & Reality”, Harvard Business
Review 2012, https://hbr.org/2012/10/big-data-hype-and-reality/

Direct Marketing Lift:
Random and Model-sorted Lists
[Chart: cumulative percent of hits (CPH) vs. percent of list, for Random and Model-sorted lists]
• 5% of the random list has 5% of the hits
• 5% of the model-score-ranked list has 21% of the hits
• Lift(5%) = 21%/5% = 4.2
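
The lift arithmetic above can be reproduced on a small synthetic list. The sketch below assumes 2,000 prospects with 100 hits, arranged so that the top 5% by model score contains 21 of them. These counts are invented to match the slide's percentages, and `cumulative_lift` is a hypothetical helper name.

```python
def cumulative_lift(scores, hits, fraction):
    """Lift in the top `fraction` of a list ranked by model score (descending)."""
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    hits_in_top = sum(hit for _, hit in ranked[:top_n])
    pct_hits = hits_in_top / sum(hits)      # cumulative percent of hits (CPH)
    pct_list = top_n / len(ranked)          # percent of the list contacted
    return pct_hits / pct_list


# Synthetic data: 2,000 prospects, 100 hits, 21 of them in the top 100 scores
n = 2000
scores = [n - i for i in range(n)]          # already in descending order
hits = [1] * 21 + [0] * 79 + [1] * 79 + [0] * (n - 179)

lift_at_5_pct = cumulative_lift(scores, hits, 0.05)
print(round(lift_at_5_pct, 2))
```

A randomly ordered list would be expected to give a lift close to 1 at any depth, which is the baseline the model-sorted list is compared against.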

Most lift curves are surprisingly similar:
a limit to human predictability?
Study of lift curves in banking,
telecom
Best lift curves are similar
Special point T=Target
percentage
Lift(T) ~ sqrt (1/T)
G. Piatetsky-Shapiro, B. Masand,
Estimating Campaign Benefits and
Modeling Lift, in Proceedings of
KDD-99 Conference, ACM Press,
1999.
[Chart: Actual lift(T) vs. Est. lift(T), plotted against 100*T%]
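
The rule of thumb Lift(T) ~ sqrt(1/T) is easy to tabulate. The sketch below reads T as the target percentage expressed as a fraction, which is an interpretation of the slide. The function name is hypothetical, and the output is the estimate only, not the actual lift values from the cited study.

```python
import math


def estimated_lift(target_fraction):
    """Rule-of-thumb lift at the target percentage T: sqrt(1 / T)."""
    return math.sqrt(1.0 / target_fraction)


for t in (0.01, 0.04, 0.10, 0.25):
    print("T = {:.0%}: estimated lift {:.1f}".format(t, estimated_lift(t)))
```

The estimate falls quickly as the target becomes more common, so rare targets leave the most room for a model to improve on random selection.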

More recent data is more predictive!
• Real-time behavior data more predictive than
historical, demographic data
• Ad retargeting

Lesson 9: Deployment & Maintenance
• Netflix Prize winning algorithm not deployed
• Technical debt of Machine Learning
– (Google: research.google.com/pubs/pub43146.html)
“… the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.”
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html

Modeling in Real World vs Kaggle
• ROI of extra accuracy vs cost of maintenance
• Is the model explainable? (legal, acceptance reasons)
• Does the model discriminate on the basis of race, gender, …?
• Netflix Prize algorithm, which won $1M, was not implemented
• In the real world, simpler is usually better

Deployment: Test and Monitor
• Monitor assumptions
– Do fields have the same value distributions?
• Detect when the model is no longer valid and needs rebuilding
• Automatic model re-build
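
One way to check whether a field still has the same value distribution is the Population Stability Index (PSI), a technique brought in here for illustration and not named in the slides. The sketch compares a categorical field at training time with the same field in production. The data, function names, and the 0.25 alert level (a commonly quoted rule of thumb) are all assumptions.

```python
import math
from collections import Counter


def category_shares(values, categories):
    """Share of each category among the given values."""
    counts = Counter(values)
    return {c: counts.get(c, 0) / len(values) for c in categories}


def population_stability_index(train_values, live_values, eps=1e-6):
    """PSI = sum over categories of (actual - expected) * ln(actual / expected)."""
    categories = sorted(set(train_values) | set(live_values))
    expected = category_shares(train_values, categories)
    actual = category_shares(live_values, categories)
    psi = 0.0
    for c in categories:
        e = max(expected[c], eps)   # avoid log(0) for unseen categories
        a = max(actual[c], eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical rate-plan field at training time and in production
train = ["A"] * 50 + ["B"] * 50
live_same = ["A"] * 50 + ["B"] * 50
live_shifted = ["A"] * 90 + ["B"] * 10

psi_same = population_stability_index(train, live_same)
psi_shifted = population_stability_index(train, live_shifted)

ALERT_LEVEL = 0.25  # assumed threshold; tune for your own data
print("unchanged field:", round(psi_same, 3), psi_same > ALERT_LEVEL)
print("shifted field:", round(psi_shifted, 3), psi_shifted > ALERT_LEVEL)
```

A check like this could run on a schedule for every model input, with an alert above the threshold prompting a review or a rebuild.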

Lesson 10: Don’t just predict, optimize
• Prediction is usually just one part of making a
decision
• Consider cost, frequency, latency, human
behavior, etc
• Goal: Optimization
• From Data Science to Decision Science
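
The step from prediction to decision can be sketched as an expected-profit rule: contact a customer only if the predicted response probability times the value of a response exceeds the cost of the contact. The probabilities, value, and cost below are invented, and the function names are hypothetical.

```python
def expected_profit(p_response, value_per_response, cost_per_contact):
    """Expected profit of contacting one customer."""
    return p_response * value_per_response - cost_per_contact


def select_contacts(predictions, value_per_response, cost_per_contact):
    """Keep customers whose expected profit from a contact is positive."""
    return [
        customer
        for customer, p in predictions.items()
        if expected_profit(p, value_per_response, cost_per_contact) > 0
    ]


# Hypothetical model output: predicted probability of responding to an offer
predictions = {"a": 0.30, "b": 0.04, "c": 0.10}

chosen = select_contacts(predictions, value_per_response=20.0, cost_per_contact=1.5)
print(chosen)
```

The same predictions lead to different decisions when the cost or the value changes, which is why the prediction is only one input to the decision.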

Privacy in the age of Big Data
• Privacy laws are much stricter in Europe
• Individual Privacy vs Benefits for all (e.g.
aggregated health-care data)
• Image and Face recognition (e.g. Facebook)
• Very hard to keep privacy with so many digital
breadcrumbs
• Privacy vs Security (e.g. FBI vs Apple)
• Politicians are behind the technology curve –
researchers should help society and politicians make
informed decisions

When Is It Ethical to Analyze
a Particular Dataset?

Data Ethics Golden Rule
Don’t do with someone else's data
what you don’t want done
with your data

Data Science Now
What, Where, How
KDnuggets Polls Findings
www.KDnuggets.com/polls/

www.kdnuggets.com/2016/01/poll-analytics-data-mining-data-science-applied-2015.html
Where did you apply Analytics,
Data Mining, Data Science ?
Avg. Number of Industries 2.7
Most Popular:
- CRM
- Finance
- Banking
- Health Care
- Science
- e-commerce
Highest growth in:
Games, 121%
Entertainment / Music 74%
Social Good/Non-profit, 68%
Finance, 42%
Education, 30%

Data Types
Analyzed/Mined
www.kdnuggets.com/polls/2014/data-types-sources-analyzed.html
Most popular:
- Table data
- Time series
- Text
- itemsets/transactions
Most growing:
- music/audio
- JSON

Largest Dataset Analyzed?
www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html

Python swallowed an Elephant?
Antoine de Saint-Exupery

Big Data Miners – elite group
www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
Median in 11-100 GB
range, slight increase.

Largest Dataset Analyzed by Region
Big Data Miners:
TeraBytes and
Petabytes
10-25%

4 Main Languages of Data Science
www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html

4 Main Languages of Data Science, 2

R vs Python
http://www.kdnuggets.com/2015/07/poll-primary-analytics-language-r-python.html
Surprising Stability:
88% of R users stayed with R
and 91% of Python users stayed with Python.
% of primary R and Python users up,
while % of Other or None down.

Data Science Roles

Data Science Roles
• Data Analyst
• (Big) Data Engineer
• Data Scientist
• Machine Learning Researcher
• Data Science Manager/Director
• Chief Data Officer
• Company Founder

Data Science Venn Diagram, 2010
Drew Conway, 2010

LinkedIn Data Skills
LinkedIn has 334,000 Titles with “Data”
• Data Analyst 60,273
• Data Scientist 12,680
• Database Analyst 4,357
• Business Data Analyst 1,709
• Senior Data Scientist 1,691
• Sr. Data Analyst 1,131
Thanks to Lutz Finger, Director of Analytics at LinkedIn for
this custom study

LinkedIn: 4 Groups of Skills
Skills of people with “Data” in the title grouped into dedicated clusters - using similarity of members with similar skills.
Database Management and Software
• Access Database, BTEQ, Cubes, Data Center, Data Modeling, Database Admin, Database Administration, Database Design, Databases, DB2, Embedded SQL, FastExport, FastLoad, MDX, Memcached, Microsoft SQL Server, MLOAD, MongoDB, Multiload, MySQL, NoSQL, OA Framework, Oracle, Oracle Developer Suite, Oracle Discoverer, Oracle Enterprise Manager, Oracle PL/SQL Development, Oracle RAC, Oracle SQL Developer, Performance Tuning, PhpMyAdmin, PL/SQL, PostgreSQL, RDBMS, Redis, Relational Databases, Replication, RMAN, SQL, SQL Server Management Studio, SQL*Plus, SQL400, SQLite, Stored Procedures, Sybase, T-SQL, Teradata, Toad, TPT, TPUMP
Machine Learning
• Computational Linguistics, Data Visualization, Information Retrieval, Machine Learning, Natural Language Processing, Research Design, Sentiment Analysis, Structural Bioinformatics, Text Mining
Mathematics
• Algebra, Applied Mathematics, Calculus, Differential Equations, Fortran, Geometry, Image Analysis, LabVIEW, Linear Algebra, Maple, Mathematica, Mathematical Modeling, Mathematics, Matlab, Monte Carlo Simulation, Numerical Analysis, Numerical Simulation, Operations Research, Partial Differential Equations, Pre-Calculus, Scientific Computing, Simulations, Trigonometry
Statistical Analysis and Data Mining
• A/B Testing, Analytics, ANOVA, Business Analytics, Cluster Analysis, Data Analysis, Data Mining, Decision Trees, Design of Experiments, Economic Modeling, Experimental Design, Factor Analysis, Google Analytics, JMP, Linear Regression, Logistic Regression, Marketing Analytics, Minitab, Pattern Recognition, Predictive Analytics, Predictive Modeling, Primary Research, Questionnaire Design, Questionnaires, R, Sampling, SAS, SAS Programming, SDTM, Secondary Research, SPSS, Statistical Consulting, Statistical Data Analysis, Statistical Modeling, Statistical Programming, Statistics, Survey Research, Survival Analysis, Time Series Analysis, Web Analytics

LinkedIn Skills
N. Skills relating to Data    Number of LinkedIn Members
1                             9,708,214
2                             3,870,376
3                             2,065,318
4                             1,097,849
5                             576,310
6                             305,266
7                             169,351
8                             98,284
9                             60,419
10                            37,689

Database,
Coding
Skills
Domain/Business
Expertise
Data Analyst/BI Analyst
Data Analyst
Glassdoor, Apr 2016
US Avg Salary:
$60-70,000
Positions: 13,000

Database,
Coding
Skills
Data Engineer
Domain/Business
Expertise
Data Engineer
Glassdoor, Apr 2016
US Salary: $95,500
Jobs: 40,296
Ingénieur … Data
France: 5K Jobs

“Unicorn” Data Scientist
Database,
Coding
Skills
Domain/Business
Expertise
Glassdoor, Apr 2016
US Salary: $113,400
Jobs: 2572
France: €43,500
Jobs: 180
“Unicorn”
Data Scientist

Data Science Manager/Director
Database,
Coding
Skills
Domain/
Business
Expertise
People
Management
Skills
Data Science
Leader

Company Founder
Database,
Coding
Skills
Domain/
Business
Expertise
People
Management
Skills + Vision
Founder

Data Career Progression
BI/Data Analyst Data Engineer
Data Scientist
Machine Learning
Researcher
Data Science
Manager/Director
Company Founder/CEO
Chief Data Officer
Chief
Scientist

DATA SCIENCE
JOB TRENDS

Shortage of Data Scientists?
• McKinsey (2011): shortage by 2018 in US
– 140-190,000 people with deep analytical skills
– 1.5 M managers/analysts with the know-how to
use the analysis of big data to make effective
decisions.
Source:
www.mckinsey.com/mgi/publications/big_data/

Data Scientist –
Sexiest Job of the 21st Century?
• Thomas H. Davenport and D.J. Patil, (Harvard
Business Review, 2012)

“Data Scientist” - leading job trend
“Data Scientist” Job has grown 1,700% from 2012 to 2016
Top 5 Tech Job Trends in 2016:
Data Scientist, Devops, Puppet, PaaS, Hadoop
?
Indeed.com/jobtrends

Attention to Detail:
[Data Scientist] != “Data Scientist”
Indeed.com/jobtrends
Data Scientist
“Data Scientist” = “data scientist”

“Data Scientist” vs Statistician
Indeed.com job trends
“Data Scientist”
Statistician

Data Scientist jobs on KDnuggets
[Chart: % of Data Scientist jobs on KDnuggets, 2010-2015; includes Senior, Junior, Principal, Chief DS, …]

LinkedIn 25 Hot Skills
2015
2014

Big Data
• Next Industrial Revolution
• Data Science is the Engine of Big Data

Doing Old Things Better
Application areas
– Direct marketing/Customer modeling
– Recommendations
– Fraud detection
– Security/Intelligence
– Healthcare
– …
• Competition will level companies

Big Data Enables New Things !
• Google – first big success of big data
• Social networks (Facebook, Twitter, LinkedIn,
…) success depends on network size, i.e. big
data
• Big Data in Health-care
– image analysis, diagnosis,
– Personalized medicine
• Recommendations - Netflix streaming

New services, products, platforms
• Image recognition – FB uses to decide what to
show users
• Face recognition - security
• Location-based services – Tinder
• Big Data to Power AI and Machine Learning
– Imagine Google DeepMind, IBM Watson, Siri in 2020?

Gartner Hype Cycle
Big Data
www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html
Citizen
Data
Science
Machine
Learning

“Citizen” Data Science
This is Bob, our new Citizen Data Scientist.
He previously worked as a citizen dentist
and a citizen pilot.

Golden Age of Data Science,
Machine Learning
• Amazing New Tools
• Very Complex Algorithms are very easy to use
• scikit-learn, iPython notebooks, etc
• One-Click deployment of TensorFlow on AWS
with GPU

Data Science Automated ?
Expert Human Ability
Current
Computer
Ability

Data Science Automated ?
Expert Human Ability

Data Science Automated By 2025?
KDnuggets Poll in 2015:
51% of voters expect Data Science Automation to happen in 10 years or less -
www.kdnuggets.com/2015/05/data-scientists-automated-2025.html

Data Science Automation
I remember when only a Deep Learning
supercomputer could beat
me in a Data Science competition

KDnuggets: Software: Automated Data Science:
• AutoDiscovery from ButlerScientifics
• Automatic Business Modeler from Algolytics
• Automatic Statistician project
• DataRobot
• DMWay
• ForecastThis DSX
• FeatureLab
• Loom Systems
• machineJS: Automated machine learning
• Quill from Narrative Science
• SAP Predictive Analytics
• Savvy from Yseop
• Skytree Machine Learning Software
• Tree-based Pipeline Optimization Tool (TPOT)

• New tools make Data Scientists more
productive
• Make data results more widely available
• Automate lower-level Data Science tasks

“Soft” Data Science Skills
Harder to Automate
• Curiosity
• Intuition
• Business Knowledge
• Selecting a good metric
• Posing the right question
• Presentation Skills
Data Science – still a great profession

Questions?
KDnuggets: Analytics, Big Data, Data Science
• Subscribe to KDnuggets News email at
www.KDnuggets.com/subscribe.html
• Email to editor1@kdnuggets.com
• Twitter: @kdnuggets
• facebook.com/kdnuggets
• LinkedIn group: KDnuggets
