Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
2. Data Science:
Past, Present, and Future
Gregory Piatetsky-Shapiro
KDnuggets
© KDnuggets 2016
La Science des données:
passé, présent et futur
4. “Predictions”: Astrology
My May 26 Horoscope:
So what if things aren't
completely wonderful in your
life right now? Just keep your
hopes high, and your fingers
crossed. … Being with the
people who make you feel good
about yourself will help keep
your thoughts bright, so get
together with your closest
friend as soon as you can.
www.astrology.com/horoscope/daily/aries.html
5. “Predictions”: Turkish Coffee Grounds
If a big chunk of the coffee
grounds falls down on the saucer
then it is taken as the first positive
sign of your reading. “Trouble and
worries are leaving you”.
6. Pundits “Predictions”
• Nate Silver's FiveThirtyEight.com prediction for Trump winning the Republican nomination:
– Aug 2015: 2%
– Sep 2015: 5%
– Nov 2015: 6%
– Jan 2016: 12%
– May 2016: 99%
7. Desire to Predict – Deep Human Trait
• People are hard-wired to see patterns
• People want predictions
• Human intuition does not work well on large-scale
data or for understanding probability
• Good story is essential to a convincing
prediction (whether true or false)
Lessons
9. Outline
• Intro, Data Science History and Terms
• 10 Real-World Data Science Lessons
• Data Science Now: Polls & Trends
• Data Science Roles
• Data Science Job Trends
• Data Science Future
10. What do we call it?
• Statistics
• Data Mining
• Knowledge Discovery in Data
(KDD)
• Predictive Analytics
• Data Analytics
• Data Science
• …?
Core Idea:
Finding
Useful
Patterns
in Data
11. Pre-history (1800-2008): Statistics
From Google Ngram viewer – English language books
Search case insensitive.
Other languages need to be considered for full picture
“Statistics” is the most frequent term in the 20th century;
“analytics” is used increasingly through the 20th century;
“data mining” appears in the late 1990s.
13. “Data Mining” Surges in 1996
Advances in Knowledge Discovery and
Data Mining, AAAI/MIT Press, 1996, Eds:
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy
[Chart series: Analytics, Data Mining]
KDD-95, 1st Conference on Knowledge
Discovery and Data Mining, Montreal
Google N-grams search case insensitive, smoothing 1
14. Earliest use of “data mining”: 1962
Source: Google Books
After eliminating many “following data. Mining cost is ” examples
which refer to Mining of minerals,
and books from “1958” that have a CD attached (errors in book year)
The earliest “data mining” reference I found is
17. >50% of “Analytics” searches are for
“Google Analytics”
Google Analytics introduced,
Dec 2005
23. Google Trends: Data Science
[Data Science] != “Data Science”
Lesson: Examine assumptions carefully
[Chart: Google Trends, 2009 to 2015]
24. Regional Interest in
“Data Science” in 2015
Google Trends
Note: search for “Data Science” is
different from [Data Science]
26. Data Science History
• < 1900 - Statistics
• 1960s Data Mining = bad activity, data “dredging”
• 1990 - “Data Mining” is good, surges in 1996
• 2003 - “Data Mining” peaks (bad in press, invasion of
privacy?), slowly declines, but still popular
• 2006 - Google Analytics
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data
• 2014 - Data Science
• 2015 - Deep Learning
• 2018 - ??
28. Lesson 1: It is an Iterative, Circular Process
Waterfall model does NOT work for Data Science
29. CRISP-DM: Iterative, Circular Process
See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
Data Mining Process – CRISP-DM, 1998
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
31. Machine Learning Workflow, MS Azure
See
www.kdnuggets.com/2016/04/developers-need-know-about-machine-learning.html
blogs.msdn.microsoft.com/continuous_learning/2014/11/15/end-to-end-predictive-model-in-azureml-using-linear-regression/
32. Lesson 2: Data Engineering
Takes The Bulk of Time
• Building Machine Learning/Predictive Models
is the key (and most fun) part, but only a small
part of the whole process
• 60-80% (?) of time is spent on Data
Preparation/Engineering
33. Competitions are different
March Machine Learning Mania 2016,
Winner's Interview: 1st Place, Miguel Alomar
https://twitter.com/kdnuggets/status/730417186167263232
http://blog.kaggle.com/2016/05/10/march-machine-learning-mania-2016-winners-interview-1st-place-miguel-alomar/
How #MachineLearning @Kaggle
winner spent time:
35% read forums,
25% build models,
25% evaluate results,
15% data preparation
34. Lesson 3: Question Assumptions
Problem:
Deciles not uniform
Decile 1 is too rare,
Decile 0 is too frequent?
Why?
* Not actual data
[Chart axis: Measurement]
35. Mass Spectrometry
Mass spectrometry (MS) is an analytical technique that ionizes chemical species and sorts the ions based on their mass-to-charge ratio.
Can produce a large number (~ 20,000) of m/z values for a sample
Goal: find biomarkers for disease, test, condition
37. Question Assumptions
Instead of Measurement Deciles, examine actual ranges, including 0:
• Nothing between 1 and 14
• Value 0 is too frequent
Why?
* Not actual data
Someone added a rule to round raw measurement values below 15 to zero
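A minimal Python sketch of this check, looking at actual values rather than deciles. The numbers are made up for illustration (the slide's own chart is marked as not actual data); only the threshold of 15 comes from the slide, and the names `raw`, `recorded`, and `missing` are hypothetical.

```python
from collections import Counter

# Made-up raw measurements for illustration; NOT data from the talk.
raw = [3, 7, 12, 14, 15, 18, 22, 40, 9, 1, 16, 30, 5, 11, 25, 35]

# The hidden preprocessing rule described on the slide:
# raw values below 15 are rounded to zero.
recorded = [0 if v < 15 else v for v in raw]

# Look at the actual values, including 0, instead of decile summaries.
counts = Counter(recorded)
zero_share = counts[0] / len(recorded)
smallest_nonzero = min(v for v in recorded if v > 0)

# Integer values between 0 and the smallest nonzero value that never occur.
missing = [v for v in range(1, smallest_nonzero) if counts[v] == 0]

print(f"share of zeros: {zero_share:.0%}")
print(f"smallest nonzero value: {smallest_nonzero}")
print(f"values never observed below it: {missing[0]}..{missing[-1]}")
```

A large share of zeros combined with an empty range just above zero is the kind of pattern that should prompt the question "Why?" before any modeling starts.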
38. The best data scientists have one
thing in common –
unbelievable curiosity
DJ Patil, first US Chief Data Scientist
http://www.sciencefriday.com/articles/10-questions-for-the-nations-first-chief-data-scientist
April 2016
39. Lesson 4: Focus on the Right Metric -
Actionable
• Consumer: Churn may depend on age, region,
usage, and rate plan. Rate plan easiest to
change.
• Uplift Modeling in Marketing and Politics:
focus on persuadables
40. Right Metric: Uplift Modeling
Don’t model whether a consumer will buy;
model whether a consumer will buy in response
to an offer
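A rough Python sketch of the idea, assuming a campaign with a treated group (received the offer) and a control group (did not). It estimates uplift per segment as the difference in purchase rates. The records and segment names are invented for illustration, and this simple difference-of-rates estimate stands in for a real uplift model.

```python
from collections import defaultdict

# Made-up campaign records: (segment, received_offer, bought). NOT real data.
records = [
    ("loyal", True, True), ("loyal", True, True),
    ("loyal", False, True), ("loyal", False, True),
    ("persuadable", True, True), ("persuadable", True, True),
    ("persuadable", False, False), ("persuadable", False, True),
    ("lost_cause", True, False), ("lost_cause", True, False),
    ("lost_cause", False, False), ("lost_cause", False, False),
]

def uplift_by_segment(rows):
    """Uplift = P(buy | offer) - P(buy | no offer), estimated per segment.

    Assumes every segment has at least one treated and one control record.
    """
    # tallies[segment][received_offer] = [buyers, total]
    tallies = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, offered, bought in rows:
        tallies[segment][offered][0] += int(bought)
        tallies[segment][offered][1] += 1
    result = {}
    for segment, groups in tallies.items():
        treated_rate = groups[True][0] / groups[True][1]
        control_rate = groups[False][0] / groups[False][1]
        result[segment] = treated_rate - control_rate
    return result

uplift = uplift_by_segment(records)
for segment, value in sorted(uplift.items(), key=lambda kv: -kv[1]):
    print(f"{segment:12s} uplift = {value:+.2f}")
```

In this toy data the "loyal" segment buys at a high rate with or without the offer, so its uplift is zero; only the "persuadable" segment changes behavior in response to the offer.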
41. Right Metric: Uplift Modeling
From Eric Siegel presentation at PAW, 2011
In Obama 2012 Campaign
www.thefiscaltimes.com/Articles/2013/01/21/The-Real-Story-Behind-Obamas-Election-Victory
42. Lesson 5: Be a Fox, not a Hedgehog
Read Isaiah Berlin's 1953 essay, The Hedgehog and the Fox
A fox knows many things, but
a hedgehog knows one important thing.
43. Lesson 5: Modeling
No Free Lunch Theorem – no method is universally the best (Wolpert)
In Kaggle competitions, there are 2 ways to win (Anthony Goldbloom, 2016):
• Handcrafted feature engineering
• Or Deep Learning Neural Networks
www.kdnuggets.com/2016/01/anthony-goldbloom-secret-winning-kaggle-competitions.html
• XGBoost – winning method in many recent Kaggle competitions
• Ensemble methods
For Structured Data (Sebastian Raschka)
• SVM (Support Vector Machines) for smaller data
• Random Forests – more data, more automated
www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html
Unstructured:
• Deep Learning
44. Lesson 6: Avoid Overfitting
http://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
Many examples at http://tylervigen.com/spurious-correlations
45. Avoid Overfitting
“Irreproducible” results are a BIG problem in social
sciences and medicine:
John P. A. Ioannidis's famous paper, Why Most Published
Research Findings Are False (PLoS Medicine, 2005).
Due to
• Small samples
• Testing too many hypotheses
• Confirmation bias (explicit or implicit)
• Poor training
46. How to Avoid Overfitting
• If it is too good to be true, it probably is
• Find the simplest possible hypothesis
• Control the False Discovery Rate
• Randomization Testing
• Nested cross-validation (train, test, holdout)
• Regularization (adding a penalty for complexity)
www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
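As an illustration of one item on this slide, randomization testing, here is a minimal Python sketch. It shuffles group labels many times and asks how often a difference in means at least as large as the observed one appears by chance. The function name, the scores, and the number of permutations are assumptions made for this example.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided randomization test for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Shuffling breaks any real link between group label and value.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_large + 1) / (n_permutations + 1)

# Made-up accuracy scores with and without a candidate feature.
with_feature = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82]
without_feature = [0.80, 0.78, 0.82, 0.81, 0.79, 0.83]

p = permutation_p_value(with_feature, without_feature)
print(f"p-value for the observed improvement: {p:.3f}")
```

If shuffled labels often produce an improvement as large as the observed one, the improvement is weak evidence, which is the "too good to be true" check in executable form.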
47. Lesson 7: Tell a story
• Combine facts into a story
• Combine visual and text presentation
• Explanation gives credibility
• Dynamic / Interactive
• Examples: KEFIR, Google Analytics, Quill
48. KEFIR (KEy FInding Reporter), 1994
• Overview report
www.kdnuggets.com/data_mining_course/kefir/overview.htm
• Inpatient admissions
www.kdnuggets.com/data_mining_course/kefir/s2.htm
49. Quill report for KDnuggets
• Sessions Stay Flat, But Way Higher Than 12-Month Weekly Average
• Sessions remained flat compared to the prior week. The 121,040
sessions, however, were above your 85,105-session weekly average
for the year. Your site's total pageviews stayed flat last week at
206,124, while pages per session grew less than a percent to 1.7.
That's equal to your weekly average for the year.
• Among all your pages, Analytics, Data Mining, and Data Science had
both the highest bounce rate (43%) and the most pageviews (8,734)
last week.
50. La Diseuse de bonne aventure (The Fortune Teller),
Caravaggio, 1595 (Louvre)
Beware of Fortune tellers!
51. Lesson 8: Limits to Predicting Human
Behavior?
• Inherent randomness and complexity in human behavior
• Individual predictions have limited accuracy (but can still be better than random and very useful for consumer analytics)
• Aggregate predictions (e.g., who will win the election) are more accurate, because individual randomness cancels out
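The contrast between individual and aggregate predictions can be sketched with a small simulation. It assumes independent voters who each choose candidate A with the same probability, which is a strong simplification; the 55% preference, electorate size, and function name are hypothetical.

```python
import random

def simulate(n_voters=2001, p_vote_a=0.55, n_elections=200, seed=1):
    """Compare accuracy of per-voter guesses with accuracy of the winner call."""
    rng = random.Random(seed)
    individual_hits = 0
    aggregate_hits = 0
    for _ in range(n_elections):
        votes_for_a = sum(rng.random() < p_vote_a for _ in range(n_voters))
        # Individual-level prediction: guess "A" for every single voter.
        individual_hits += votes_for_a
        # Aggregate prediction: guess that A wins the election.
        aggregate_hits += votes_for_a > n_voters / 2
    individual_accuracy = individual_hits / (n_voters * n_elections)
    aggregate_accuracy = aggregate_hits / n_elections
    return individual_accuracy, aggregate_accuracy

individual_accuracy, aggregate_accuracy = simulate()
print(f"accuracy per individual voter: {individual_accuracy:.1%}")
print(f"accuracy of the election call: {aggregate_accuracy:.1%}")
```

Under these assumptions the per-voter guess is right only slightly more often than a coin flip, while the call on the overall winner is right in nearly every simulated election.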
52. Example: Netflix Prize, 2006
• The most advanced algorithms were only a few
percent better than basic algorithms
See Gregory Piatetsky, “Big Data: Hype & Reality”, Harvard Business
Review 2012, https://hbr.org/2012/10/big-data-hype-and-reality/
53. Direct Marketing Lift:
Random and Model-sorted Lists
[Chart: CPH (Cumulative Pct Hits) vs Pct list, for Random and Model-sorted lists]
5% of the random list has 5% of the hits.
5% of the model-score-ranked list has 21% of the hits.
Lift(5%) = 21%/5% = 4.2
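The lift calculation can be written out as a short Python sketch. The function name `lift_at` and the 20-customer example are made up for illustration; only the 21%/5% arithmetic comes from the slide.

```python
def lift_at(scores, hits, fraction):
    """Share of all hits found in the top `fraction` of the score-ranked list,
    divided by the share of the list that was examined."""
    ranked = sorted(zip(scores, hits), key=lambda pair: -pair[0])
    top_n = max(1, round(len(ranked) * fraction))
    hits_in_top = sum(hit for _, hit in ranked[:top_n])
    total_hits = sum(hits)
    return (hits_in_top / total_hits) / (top_n / len(ranked))

# Made-up example: 20 customers ranked by model score, 4 of them are hits.
scores = [i / 20 for i in range(20, 0, -1)]
hits = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

lift_top_quarter = lift_at(scores, hits, 0.25)
lift_whole_list = lift_at(scores, hits, 1.0)

# The slide's own arithmetic: 21% of hits in the top 5% of the list.
slide_lift = 21 / 5

print(f"lift at 25% of list: {lift_top_quarter}")
print(f"lift at 100% of list: {lift_whole_list}")
print(f"slide example, Lift(5%): {slide_lift}")
```

Lift over the whole list is 1 by construction, so the interesting part of a lift curve is how far above 1 the model gets at small list fractions.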
54. Most lift curves are surprisingly similar:
a limit to human predictability?
Study of lift curves in banking, telecom
Best lift curves are similar
Special point T = Target percentage
Lift(T) ~ sqrt(1/T)
G. Piatetsky-Shapiro, B. Masand,
Estimating Campaign Benefits and
Modeling Lift, in Proceedings of
KDD-99 Conference, ACM Press,
1999.
[Chart: Actual lift(T) and Est. lift(T) vs 100*T%]
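To get a feel for the rule of thumb Lift(T) ~ sqrt(1/T), the sketch below evaluates it for a few target percentages. It assumes T is expressed as a fraction between 0 and 1; the function name is hypothetical, and the formula is the slide's approximation, not an exact law.

```python
import math

def estimated_lift(target_fraction):
    """Rule of thumb from the slide: Lift(T) ~ sqrt(1/T), with T in (0, 1]."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt(1 / target_fraction)

for pct in (1, 5, 10, 25):
    print(f"T = {pct:>2}%  estimated lift ~ {estimated_lift(pct / 100):.1f}")
```

The rarer the target, the higher the lift a good model can be expected to reach at that point, which matches the downward-sloping shape of the curve on the slide.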
55. More recent data is more predictive!
• Real-time behavior data more predictive than
historical, demographic data
• Ad retargeting
56. Lesson 9: Deployment & Maintenance
• Netflix Prize winning algorithm not deployed
• Technical debt of Machine Learning
– (Google: research.google.com/pubs/pub43146.html)
… the additional accuracy gains that we
measured did not seem to justify the
engineering effort needed to bring them
into a production environment. Also, our
focus on improving Netflix personalization
had shifted to the next level by then.
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
57. Modeling in Real World vs Kaggle
• ROI of extra accuracy vs cost of maintenance
• Is the model explainable? (legal, acceptance reasons)
• Does the model discriminate on the basis of race, gender, …?
• Netflix Prize algorithm, which won $1M, was not implemented
• In the real world, simpler is usually better
58. Deployment Test and Monitor
• Monitor assumptions
– Do fields have the same value distributions?
• Detect when model is no longer valid and needs rebuilding
• Automatic model re-build
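One way to check whether a field still has the same value distribution is the Population Stability Index (PSI), a technique not named in the talk and used here only as an example. The bin edges, sample data, and the 0.25 alert threshold (a common convention, not a rule from the slides) are all assumptions.

```python
import math

def psi(expected, actual, bin_edges):
    """Population Stability Index between two samples of one numeric field.

    `bin_edges` must be sorted; values are assigned to len(bin_edges) + 1 bins.
    """
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(v >= edge for edge in bin_edges)  # index of the bin
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Made-up field values at training time and in production.
training = list(range(100))
live_same = list(range(100))
live_shifted = [v + 40 for v in range(100)]
edges = [25, 50, 75]

ALERT_THRESHOLD = 0.25  # assumed convention for "significant shift"
for name, sample in (("same", live_same), ("shifted", live_shifted)):
    value = psi(training, sample, edges)
    status = "rebuild candidate" if value > ALERT_THRESHOLD else "ok"
    print(f"{name:8s} PSI = {value:.3f} -> {status}")
```

Running a check like this on a schedule for each input field is one simple trigger for the rebuild step listed on the slide.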
59. Lesson 10: Don’t just predict, optimize
• Prediction is usually just one part of making a
decision
• Consider cost, frequency, latency, human
behavior, etc.
• Goal: Optimization
• From Data Science to Decision Science
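A small Python sketch of turning a prediction into a decision, assuming a marketing contact with a fixed cost and a fixed value per response. The customer IDs, probabilities, and amounts are invented for illustration.

```python
def expected_profit(p_response, value_if_response, contact_cost):
    """Expected profit of contacting one customer."""
    return p_response * value_if_response - contact_cost

def choose_contacts(predictions, value_if_response, contact_cost):
    """Keep only customers whose expected profit from a contact is positive."""
    return [
        customer_id
        for customer_id, p in predictions.items()
        if expected_profit(p, value_if_response, contact_cost) > 0
    ]

# Made-up predicted response probabilities from some model.
predictions = {"c1": 0.30, "c2": 0.05, "c3": 0.12, "c4": 0.01}

selected = choose_contacts(predictions, value_if_response=50.0, contact_cost=4.0)
print("contact:", selected)
```

With these numbers the break-even response probability is 4/50 = 8%, so the decision depends on cost and value as much as on the prediction itself.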
60. Privacy in the age of Big Data
• Privacy laws are much stricter in Europe
• Individual Privacy vs Benefits for all (e.g., aggregated health-care data)
• Image and Face recognition (e.g., Facebook)
• Very hard to keep privacy with so many digital breadcrumbs
• Privacy vs Security (e.g., FBI vs Apple)
• Politicians are behind the technology curve – researchers should help society and politicians make informed decisions
61. When Is It Ethical To Analyze
A Particular Dataset?
62. Data Ethics Golden Rule
Don’t do with someone else’s data
what you don’t want done
with your data
65. Data Types
Analyzed/Mined
www.kdnuggets.com/polls/2014/data-types-sources-analyzed.html
Most popular:
- Table data
- Time series
- Text
- itemsets/transactions
Fastest growing:
- music/audio
- JSON
68. Largest Dataset Analyzed?
Big Data Miners – elite group.
www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
Median in 11-100 GB range, slight increase.
70. 4 Main Languages of Data Science
www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
72. R vs Python
http://www.kdnuggets.com/2015/07/poll-primary-analytics-language-r-python.html
Surprising Stability:
88% of R users stayed with R
and 91% stayed with Python.
% of primary R and Python users is up,
while % of Other or None is down.
74. Data Science Roles
• Data Analyst
• (Big) Data Engineer
• Data Scientist
• Machine Learning Researcher
• Data Science Manager/Director
• Chief Data Officer
• Company Founder
76. LinkedIn Data Skills
LinkedIn has 334,000 Titles with “Data”
• Data Analyst 60,273
• Data Scientist 12,680
• Database Analyst 4,357
• Business Data Analyst 1,709
• Senior Data Scientist 1,691
• Sr. Data Analyst 1,131
Thanks to Lutz Finger, Director of Analytics at LinkedIn for
this custom study
77. LinkedIn: 4 Groups of Skills
Skills of people with “Data” in the title grouped into dedicated clusters, using similarity of members with similar skills.
Database Management and Software
• Access Database, BTEQ, Cubes, Data Center, Data Modeling, Database Admin, Database Administration, Database Design, Databases, DB2, Embedded SQL, FastExport, FastLoad, MDX, Memcached, Microsoft SQL Server, MLOAD, MongoDB, Multiload, MySQL, NoSQL, OA Framework, Oracle, Oracle Developer Suite, Oracle Discoverer, Oracle Enterprise Manager, Oracle PL/SQL Development, Oracle RAC, Oracle SQL Developer, Performance Tuning, PhpMyAdmin, PL/SQL, PostgreSQL, RDBMS, Redis, Relational Databases, Replication, RMAN, SQL, SQL Server Management Studio, SQL*Plus, SQL400, SQLite, Stored Procedures, Sybase, T-SQL, Teradata, Toad, TPT, TPUMP
Machine Learning
• Computational Linguistics, Data Visualization, Information Retrieval, Machine Learning, Natural Language Processing, Research Design, Sentiment Analysis, Structural Bioinformatics, Text Mining
Mathematics
• Algebra, Applied Mathematics, Calculus, Differential Equations, Fortran, Geometry, Image Analysis, LabVIEW, Linear Algebra, Maple, Mathematica, Mathematical Modeling, Mathematics, Matlab, Monte Carlo Simulation, Numerical Analysis, Numerical Simulation, Operations Research, Partial Differential Equations, Pre-Calculus, Scientific Computing, Simulations, Trigonometry
Statistical Analysis and Data Mining
• A/B Testing, Analytics, ANOVA, Business Analytics, Cluster Analysis, Data Analysis, Data Mining, Decision Trees, Design of Experiments, Economic Modeling, Experimental Design, Factor Analysis, Google Analytics, JMP, Linear Regression, Logistic Regression, Marketing Analytics, Minitab, Pattern Recognition, Predictive Analytics, Predictive Modeling, Primary Research, Questionnaire Design, Questionnaires, R, Sampling, SAS, SAS Programming, SDTM, Secondary Research, SPSS, Statistical Consulting, Statistical Data Analysis, Statistical Modeling, Statistical Programming, Statistics, Survey Research, Survival Analysis, Time Series Analysis, Web Analytics
78. LinkedIn Skills
N. Skills relating to Data | Number of LinkedIn Members
1 | 9,708,214
2 | 3,870,376
3 | 2,065,318
4 | 1,097,849
5 | 576,310
6 | 305,266
7 | 169,351
8 | 98,284
9 | 60,419
10 | 37,689
79. Data Science Skills, Updated
[Diagram labels: Database, Coding Skills; Domain/Business Expertise]
83. “Unicorn” Data Scientist
[Diagram labels: Database, Coding Skills; Domain/Business Expertise; “Unicorn” Data Scientist]
Glassdoor, Apr 2016
US Salary: $113,400
Jobs: 2572
France: €43,500
Jobs: 180
86. Data Career Progression
[Diagram labels:]
BI/Data Analyst
Data Engineer
Data Scientist
Machine Learning Researcher
Data Science Manager/Director
Company Founder/CEO
Chief Data Officer
Chief Scientist
88. Shortage of Data Scientists?
• McKinsey (2011): shortage by 2018 in US
– 140,000-190,000 people with deep analytical skills
– 1.5 M managers/analysts with the know-how to
use the analysis of big data to make effective
decisions.
Source:
www.mckinsey.com/mgi/publications/big_data/
89. Data Scientist –
Sexiest Job of the 21st Century?
• Thomas H. Davenport and D.J. Patil, (Harvard
Business Review, 2012)
90. “Data Scientist” - leading job trend
“Data Scientist” Job has grown 1,700% from 2012 to 2016
Top 5 Tech Job Trends in 2016:
Data Scientist, Devops, Puppet, PaaS, Hadoop
Indeed.com/jobtrends
91. Attention to Detail:
[Data Scientist] != “Data Scientist”
Indeed.com/jobtrends
Data Scientist
“Data Scientist” = “data scientist”
92. “Data Scientist” vs Statistician
Indeed.com job trends
“Data Scientist”
Statistician
93. Data Scientist jobs on KDnuggets
[Chart: % Data Scientist jobs on KDnuggets, 2010 to 2015; y-axis from 0% to 40%]
Including Senior, Junior, Principal, Chief DS, …
96. Big Data
• Next Industrial Revolution
• Data Science is the Engine of Big Data
97. Doing Old Things Better
Application areas
– Direct marketing/Customer modeling
– Recommendations
– Fraud detection
– Security/Intelligence
– Healthcare
– …
• Competition will level companies
98. Big Data Enables New Things !
• Google – first big success of big data
• Social networks (Facebook, Twitter, LinkedIn,
…) success depends on network size, i.e. big
data
• Big Data in Health-care
– image analysis, diagnosis,
– Personalized medicine
• Recommendations - Netflix streaming
99. New services, products, platforms
• Image recognition – FB uses to decide what to
show users
• Face recognition - security
• Location-based services – Tinder
• Big Data to Power AI and Machine Learning
– Imagine Google DeepMind, IBM Watson, Siri in
2020 ?
102. Gartner Hype Cycle, 2014
[Chart labels: Big Data, Data Science]
See http://diggdata.in/ which has 4 years of Gartner Hype Cycle
103. Gartner Hype Cycle, 2015
[Chart labels: Big Data, Citizen Data Science, Machine Learning]
www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html
104. “Citizen” Data Science
This is Bob, our new Citizen Data Scientist.
He previously worked as a citizen dentist
and a citizen pilot.
105. Golden Age of Data Science,
Machine Learning
• Amazing New Tools
• Very Complex Algorithms are very easy to use
• scikit-learn, IPython notebooks, etc.
• One-Click deployment of TensorFlow on AWS
with GPU
108. Data Science Automated By 2025?
KDnuggets Poll in 2015:
51% of voters expect Data Science Automation to happen in 10 years or less -
www.kdnuggets.com/2015/05/data-scientists-automated-2025.html
109. Data Science Automation
I remember when only a Deep Learning
supercomputer could beat
me in a Data Science competition
110. Data Science Automation
KDnuggets: Software: Automated Data Science:
• AutoDiscovery from ButlerScientifics
• Automatic Business Modeler from Algolytics
• Automatic Statistician project
• DataRobot
• DMWay
• ForecastThis DSX
• FeatureLab
• Loom Systems
• machineJS: Automated machine learning
• Quill from Narrative Science
• SAP Predictive Analytics
• Savvy from Yseop
• Skytree Machine Learning Software
• Tree-based Pipeline Optimization Tool (TPOT)
111. Data Science Automation
• New tools make Data Scientists more
productive
• Make data results more widely available
• Automate lower-level Data Science tasks
112. “Soft” Data Science Skills
Harder to Automate
• Curiosity
• Intuition
• Business Knowledge
• Selecting a good metric
• Posing the right question
• Presentation Skills
Data Science – still a great profession
113. Questions?
KDnuggets: Analytics, Big Data, Data Science
• Subscribe to KDnuggets News email at
www.KDnuggets.com/subscribe.html
• Email to editor1@kdnuggets.com
• Twitter: @kdnuggets
• facebook.com/kdnuggets
• LinkedIn group: KDnuggets
Editor's Notes
Churn: the best algorithms for predicting churn have a lift of 5-7, i.e., 5-7 times better than random.
Behavioral advertising: 2-3% CTR, 10 times better than random.
The future is bright for Big Data, but use caution when evaluating claims.