
Data Science: Past, Present, and Future



Presentation at Data ScienceTech Institute campuses, Paris and Nice, May 2016, including Intro, Data Science History and Terms; 10 Real-World Data Science Lessons; Data Science Now: Polls & Trends; Data Science Roles; Data Science Job Trends; and Data Science Future


Data Science: Past, Present, and Future

  1. 1. Data Science: Past, Present, and Future Gregory Piatetsky-Shapiro KDnuggets 1© KDnuggets 2016 La Science des données: passé, présent et futur
  2. 2. Predicting Behavior – Key to Survival © KDnuggets 2016 2 Better prediction – better intelligence
  3. 3. “Predictions”: Astrology © KDnuggets 2016 3 My May 26 Horoscope: So what if things aren't completely wonderful in your life right now? Just keep your hopes high, and your fingers crossed. … Being with the people who make you feel good about yourself will help keep your thoughts bright, so get together with your closest friend as soon as you can.. www.astrology.com/horoscope/daily/aries.html
  4. 4. “Predictions” : Turkish Coffee Grinds © KDnuggets 2016 4 If a big chunk of the coffee grounds falls down on the saucer then it is taken as the first positive sign of your reading. “Trouble and worries are leaving you”.
  5. 5. Pundits' “Predictions” • Nate Silver's FiveThirtyEight.com prediction for Trump winning the Republican nomination: • Aug 2015: 2% • Sep 2015: 5% • Nov 2015: 6% • Jan 2016: 12% • May 2016: 99%
  6. 6. Desire to Predict – Deep Human Trait • Lessons: • People are hard-wired to see patterns • People want predictions • Human intuition does not work well on large-scale data or for understanding probability • A good story is essential to a convincing prediction (whether true or false)
  7. 7. Data Science Data-Driven, Scientific approach to prediction and data analysis 7
  8. 8. Outline • Intro, Data Science History and Terms • 10 Real-World Data Science Lessons • Data Science Now: Polls & Trends • Data Science Roles • Data Science Job Trends • Data Science Future © KDnuggets 2016 8
  9. 9. What do we call it? • Statistics • Data Mining • Knowledge Discovery in Data (KDD) • Predictive Analytics • Data Analytics • Data Science • …? © KDnuggets 2016 9 Core Idea: Finding Useful Patterns in Data
  10. 10. Pre-history (1800-2008): Statistics • From Google Ngram Viewer, English-language books, case-insensitive search. Other languages need to be considered for the full picture. • “Statistics” is the biggest term in the 20th century; “analytics” is used increasingly through the 20th century; “data mining” appears in the late 1990s.
  11. 11. French Books, 1800-2008 Statistiques vs Mathematiques © KDnuggets 2016 11
  12. 12. “Data Mining” Surges in 1996 • KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal • Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy • Chart series: Analytics, Data Mining (Google N-grams, case-insensitive search, smoothing 1)
  13. 13. Earliest use of “data mining”: 1962 • Source: Google Books, after eliminating many “following data. Mining cost is” examples, which refer to the mining of minerals, and books from “1958” that have a CD attached (errors in book year). • The earliest “data mining” reference I found is
  14. 14. Very Recent History Using Google Trends (c) KDnuggets 2016 15
  15. 15. Google Trends, 2005-2016: After 2006, Analytics > Data Mining • Global, all regions
  16. 16. >50% of “Analytics” searches are for “Google Analytics” • Google Analytics introduced, Dec 2005
  17. 17. Google Trends, 2005-2016 • Chart series: data science; analytics - Google; big data; data mining (axis ticks: 2010, 2012, 2014)
  18. 18. Google Trends, 2005-2016 • 2012: Analytics down, Big Data up (axis ticks: 2005, 2015)
  19. 19. Google Trends, 2005-2016 • 2013: Data Science grows (axis ticks: 2005, 2013)
  20. 20. Google Trends: Machine Learning, Data Science, Deep Learning © KDnuggets 2016 21 2009 2011 2013 2015
  21. 21. Google Trends: Machine Learning © KDnuggets 2016 22 Machine Learning ~ “Machine Learning”
  22. 22. Google Trends: Data Science © KDnuggets 2016 23 [Data Science] != “Data Science” Lesson: Examine assumptions carefully 2009 2011 2013 2015
  23. 23. Regional Interest in “Data Science” in 2015 • Google Trends • Note: a search for “Data Science” is different from a search for [Data Science]
  24. 24. KDnuggets Audience by Region, Q1 2016 © KDnuggets 2016 25
  25. 25. Data Science History • < 1900 - Statistics • 1960s Data Mining = bad activity, data “dredging” • 1990 - “Data Mining” is good, surges in 1996 • 2003 - “Data Mining” peaks (bad in press, invasion of privacy?), slowly declines, but still popular • 2006 - Google Analytics • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data • 2014 - Data Science • 2015 - Deep Learning • 2018 - ?? 26© KDnuggets 2016
  26. 26. 10 Real-World Lessons from the Art & Practice of Data Science & Data Mining 27© KDnuggets 2016
  27. 27. Lesson 1: It is an Iterative, Circular Process • Waterfall model does NOT work for Data Science
  28. 28. CRISP-DM: Iterative, Circular Process • Data Mining Process – CRISP-DM, 1998: 1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Modeling 5. Evaluation 6. Deployment • See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
  29. 29. Academic Data Science Process © KDnuggets 2016 30 See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html Harvard, 2013
  30. 30. Machine Learning Workflow, MS Azure • See www.kdnuggets.com/2016/04/developers-need-know-about-machine-learning.html • blogs.msdn.microsoft.com/continuous_learning/2014/11/15/end-to-end-predictive-model-in-azureml-using-linear-regression/
  31. 31. Lesson 2: Data Engineering Takes The Bulk of Time • Building Machine Learning/Predictive Models is the key (and most fun) part, but only a small part of the whole process • 60-80% (?) of the time is spent on Data Preparation/Engineering
  32. 32. Competitions are different • March Machine Learning Mania 2016, Winner's Interview: 1st Place, Miguel Alomar • How #MachineLearning @Kaggle winner spent time: 35% read forums, 25% build models, 25% evaluate results, 15% data preparation • https://twitter.com/kdnuggets/status/730417186167263232 • http://blog.kaggle.com/2016/05/10/march-machine-learning-mania-2016-winners-interview-1st-place-miguel-alomar/
  33. 33. Lesson 3: Question Assumptions • Problem: the Measurement deciles are not uniform • Decile 1 is too rare and Decile 0 too frequent. Why? (* Not actual data)
  34. 34. Mass Spectrometry © KDnuggets 2016 35 Mass spectrometry (MS) is an analytical technique that ionizes chemical species and sorts the ions based on their mass to charge ratio. Can produce a large number (~ 20,000) of m/z values for a sample Goal: find biomarkers for disease, test, condition
  35. 35. Question Assumptions • Instead of Measurement deciles, examine the actual ranges, including 0 • Nothing between 1 and 14 • Value 0 is too frequent • Why? (* Not actual data)
  36. 36. Question Assumptions • Instead of Measurement deciles, examine the actual ranges, including 0 • Nothing between 1 and 14 • Value 0 is too frequent • Why? (* Not actual data) • Someone added a rule to round raw measurement values below 15 to zero
  37. 37. “The best data scientists have one thing in common – unbelievable curiosity” • DJ Patil, first US Chief Data Scientist, April 2016 • http://www.sciencefriday.com/articles/10-questions-for-the-nations-first-chief-data-scientist
  38. 38. Lesson 4: Focus on the Right Metric - Actionable • Consumer: Churn may depend on age, region, usage, and rate plan. Rate plan easiest to change. • Uplift Modeling in Marketing and Politics: focus on persuadables © KDnuggets 2016 39
  39. 39. Right Metric: Uplift Modeling © KDnuggets 2016 40 Don’t model if consumer will buy – Model if consumer will buy in response to an offer
  40. 40. Right Metric: Uplift Modeling © KDnuggets 2016 41 From Eric Siegel presentation at PAW, 2011 In Obama 2012 Campaign www.thefiscaltimes.com/Articles/2013/01/21/The-Real-Story-Behind-Obamas-Election-Victory
  41. 41. Lesson 5: Be a Fox, not a Hedgehog • Read Isaiah Berlin's 1953 essay, The Hedgehog and the Fox • A fox knows many things, but a hedgehog knows one important thing.
  42. 42. Lesson 5: Modeling • No Free Lunch Theorem – no method is universally the best (Wolpert) • In Kaggle competitions, there are 2 ways to win (Anthony Goldbloom, 2016): handcrafted feature engineering, or Deep Learning Neural Networks www.kdnuggets.com/2016/01/anthony-goldbloom-secret-winning-kaggle-competitions.html • XGBoost – winning method in many recent Kaggle competitions • Ensemble methods • For structured data (Sebastian Raschka): SVM (Support Vector Machines) for smaller data; Random Forests for more data, more automated www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html • For unstructured data: Deep Learning
  43. 43. Lesson 6: Avoid Overfitting © KDnuggets 2016 44 http://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html Many examples at http://tylervigen.com/spurious-correlations
  44. 44. Avoid Overfitting • “Irreproducible” results are a BIG problem in social sciences and medicine: see John P. A. Ioannidis's famous paper, Why Most Published Research Findings Are False (PLoS Medicine, 2005). • Due to: • Small samples • Testing too many hypotheses • Confirmation bias (explicit or implicit) • Poor training
  45. 45. How to Avoid Overfitting • If it is too good to be true, it probably is • Find the simplest possible hypothesis • Adjusting the False Discovery Rate • Randomization Testing • Nested cross-validation (train, test, holdout) • Regularization (adding a penalty for complexity) © KDnuggets 2016 46 www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
  46. 46. Lesson 7: Tell a story • Combine facts into a story • Combine visual and text presentation • Explanation gives credibility • Dynamic / Interactive • Examples: Kefir, Google Analytics, Quill © KDnuggets 2016 47
  47. 47. KEFIR (KEy FInding Reporter), 1994 • Overview report www.kdnuggets.com/data_mining_course/kefir/overview.htm • Inpatient admissions www.kdnuggets.com/data_mining_course/kefir/s2.htm © KDnuggets 2016 48
  48. 48. Quill report for KDnuggets • Sessions Stay Flat, But Way Higher Than 12-Month Weekly Average • Sessions remained flat compared to the prior week. The 121,040 sessions, however, were above your 85,105-session weekly average for the year. Your site's total pageviews stayed flat last week at 206,124, while pages per session grew less than a percent to 1.7. That's equal to your weekly average for the year. • Among all your pages, Analytics, Data Mining, and Data Science had both the highest bounce rate (43%) and the most pageviews (8,734) last week. © KDnuggets 2016 49
  49. 49. La Diseuse de bonne aventure, Caravaggio, 1595 (Louvre) © KDnuggets 2016 50 Beware of Fortune tellers!
  50. 50. Lesson 8: Limits to Predicting Human Behavior? • Inherent randomness, complexity in human behavior • Individual predictions have limited accuracy (but can still be better than random and very useful for consumer analytics) • Aggregate predictions (e.g. who will win the election) are more accurate, because individual randomness cancels out
  51. 51. Example: Netflix Prize, 2006 • The most advanced algorithms were only a few percent better than basic algorithms • See Gregory Piatetsky, “Big Data: Hype & Reality”, Harvard Business Review, 2012, https://hbr.org/2012/10/big-data-hype-and-reality/
  52. 52. Direct Marketing Lift: Random and Model-sorted Lists • 5% of a random list has 5% of hits • 5% of the model-score-ranked list has 21% of hits • Lift(5%) = 21%/5% = 4.2 • [Chart: CPH (Cumulative Pct Hits) vs. Pct of list, for Random and Model lists]
  53. 53. Most lift curves are surprisingly similar – a limit to human predictability? • Study of lift curves in banking, telecom • Best lift curves are similar • Special point T = target percentage • Lift(T) ~ sqrt(1/T) • G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999. • [Chart: actual lift(T) and estimated lift(T) vs. 100*T%]
  54. 54. More recent data is more predictive! • Real-time behavior data more predictive than historical, demographic data • Ad retargeting © KDnuggets 2016 55
  55. 55. Lesson 9: Deployment & Maintenance • Netflix Prize winning algorithm not deployed: “… the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.” http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html • Technical debt of Machine Learning (Google, research.google.com/pubs/pub43146.html)
  56. 56. Modeling in Real World vs Kaggle • ROI of extra accuracy vs cost of maintenance • Is model explainable ? (legal, acceptance reasons) • Does model discriminate on basis of race, gender,…? • Netflix Prize algorithm which won $1M - not implemented • In real-world, simpler is usually better © KDnuggets 2016 57
  57. 57. Deployment Test and Monitor • Monitor assumptions – Do fields have the same value distributions • Detect when model is no longer valid, needs rebuilding • Automatic model re-build © KDnuggets 2016 58
  58. 58. Lesson 10: Don’t just predict, optimize • Prediction is usually just one part of making a decision • Consider cost, frequency, latency, human behavior, etc • Goal: Optimization • From Data Science to Decision Science © KDnuggets 2016 59
  59. 59. Privacy in the age of Big Data • Privacy laws much stricter in Europe • Individual Privacy vs Benefits for all (e.g. aggregated health-care data) • Image and Face recognition (e.g. Facebook) • Very hard to keep privacy with so many digital breadcrumbs • Privacy vs Security (e.g. FBI vs Apple) • Politicians are behind the technology curve – researchers should help society and politicians make informed decisions
  60. 60. When Is It Ethical To Analyze A Particular Dataset?
  61. 61. Data Ethics Golden Rule • Don’t do with someone else’s data what you don’t want done with your data
  62. 62. Data Science Now What, Where, How KDnuggets Polls Findings www.KDnuggets.com/polls/ 63(c) KDnuggets 2016
  63. 63. Where did you apply Analytics, Data Mining, Data Science? • Avg. number of industries: 2.7 • Most popular: CRM, Finance, Banking, Health Care, Science, e-commerce • Highest growth in: Games, 121%; Entertainment / Music, 74%; Social Good/Non-profit, 68%; Finance, 42%; Education, 30% • www.kdnuggets.com/2016/01/poll-analytics-data-mining-data-science-applied-2015.html
  64. 64. Data Types Analyzed/Mined • Most popular: table data, time series, text, itemsets/transactions • Fastest growing: music/audio, JSON • www.kdnuggets.com/polls/2014/data-types-sources-analyzed.html
  65. 65. Largest Dataset Analyzed? © KDnuggets 2016 66 www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
  66. 66. Largest Dataset Analyzed? © KDnuggets 2016 67 Python swallowed an Elephant? Antoine de Saint-Exupery
  67. 67. Largest Dataset Analyzed? • Big Data Miners – elite group • Median in 11-100 GB range, slight increase • www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
  68. 68. Largest Dataset Analyzed by Region © KDnuggets 2016 69 Big Data Miners: TeraBytes and Petabytes 10-25%
  69. 69. 4 Main Languages of Data Science © KDnuggets 2016 70 www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
  70. 70. 4 Main Languages of Data Science, 2 © KDnuggets 2016 71
  71. 71. R vs Python • Surprising stability: 88% of R users stayed with R and 91% of Python users stayed with Python. • % of primary R and Python users up, while % of Other or None down. • http://www.kdnuggets.com/2015/07/poll-primary-analytics-language-r-python.html
  72. 72. Data Science Roles 76(c) KDnuggets 2016
  73. 73. Data Science Roles • Data Analyst • (Big) Data Engineer • Data Scientist • Machine Learning Researcher • Data Science Manager/Director • Chief Data Officer • Company Founder © KDnuggets 2016 77
  74. 74. Data Science Venn Diagram, 2010 © KDnuggets 2016 78 Drew Conway, 2010
  75. 75. LinkedIn Data Skills • LinkedIn has 334,000 Titles with “Data”: • Data Analyst 60,273 • Data Scientist 12,680 • Database Analyst 4,357 • Business Data Analyst 1,709 • Senior Data Scientist 1,691 • Sr. Data Analyst 1,131 • Thanks to Lutz Finger, Director of Analytics at LinkedIn, for this custom study
  76. 76. LinkedIn: 4 Groups of Skills Skills of people with “Data” in the title grouped into dedicated clusters - using similarity of members with similar skills. Database Management and Software • Access Database BTEQ Cubes Data Center Data Modeling Database Admin Database Administration Database Design Databases DB2 Embedded SQL FastExport FastLoad MDX Memcached Microsoft SQL Server MLOAD MongoDB Multiload MySQL NoSQL OA Framework Oracle Oracle Developer Suite Oracle Discoverer Oracle Enterprise Manager Oracle PL/SQL Development Oracle RAC Oracle SQL Developer Performance Tuning PhpMyAdmin PL/SQL PostgreSQL RDBMS Redis Relational Databases Replication RMAN SQL SQL Server Management Studio SQL*Plus SQL400 SQLite Stored Procedures Sybase T-SQL Teradata Toad TPT TPUMP Machine Learning • Computational Linguistics Data Visualization Information Retrieval Machine Learning Natural Language Processing Research Design Sentiment Analysis Structural Bioinformatics Text Mining Mathematics • Algebra Applied Mathematics Calculus Differential Equations Fortran Geometry Image Analysis LabVIEW Linear Algebra Maple Mathematica Mathematical Modeling Mathematics Matlab Monte Carlo Simulation Numerical Analysis Numerical Simulation Operations Research Partial Differential Equations Pre-Calculus Scientific Computing Simulations Trigonometry Statistical Analysis and Data Mining • A/B Testing Analytics ANOVA Business Analytics Cluster Analysis Data Analysis Data Mining Decision Trees Design of Experiments Economic Modeling Experimental Design Factor Analysis Google Analytics JMP Linear Regression Logistic Regression Marketing Analytics Minitab Pattern Recognition Predictive Analytics Predictive Modeling Primary Research Questionnaire Design Questionnaires R Sampling SAS SAS Programming SDTM Secondary Research SPSS Statistical Consulting Statistical Data Analysis Statistical Modeling Statistical Programming Statistics Survey Research Survival Analysis Time Series Analysis Web Analytics 
  77. 77. LinkedIn Skills • Number of skills relating to Data: number of LinkedIn members • 1: 9,708,214 • 2: 3,870,376 • 3: 2,065,318 • 4: 1,097,849 • 5: 576,310 • 6: 305,266 • 7: 169,351 • 8: 98,284 • 9: 60,419 • 10: 37,689
  78. 78. Data Science Skills, Updated © KDnuggets 2016 83 Database, Coding Skills Domain/Business Expertise
  79. 79. Data Analyst/BI Analyst • Glassdoor, Apr 2016: US avg salary $60-70,000; positions: 13,000 • [Diagram labels: Database, Coding Skills; Domain/Business Expertise; Data Analyst]
  80. 80. Data Engineer • Glassdoor, Apr 2016: US salary $95,500; jobs: 40,296 • France (Ingénieur … Data): 5K jobs • [Diagram labels: Database, Coding Skills; Domain/Business Expertise; Data Engineer]
  81. 81. Machine Learning Researcher © KDnuggets 2016 86 Database, Coding Skills Domain/Business Expertise ML Researcher
  82. 82. “Unicorn” Data Scientist • Glassdoor, Apr 2016: US salary $113,400, jobs: 2572; France €43,500, jobs: 180 • [Diagram labels: Database, Coding Skills; Domain/Business Expertise; “Unicorn” Data Scientist]
  83. 83. Data Science Manager/Director © KDnuggets 2016 88 Database, Coding Skills Domain/ Business Expertise People Management Skills Data Science Leader
  84. 84. Company Founder © KDnuggets 2016 89 Database, Coding Skills Domain/ Business Expertise People Management Skills + Vision Founder
  85. 85. Data Career Progression © KDnuggets 2016 90 BI/Data Analyst Data Engineer Data Scientist Machine Learning Researcher Data Science Manager/Director Company Founder/CEO Chief Data Officer Chief Scientist
  86. 86. DATA SCIENCE JOB TRENDS (c) KDnuggets 2016 91
  87. 87. Shortage of Data Scientists? • McKinsey (2011): shortage by 2018 in US – 140-190,000 people with deep analytical skills – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions. Source: www.mckinsey.com/mgi/publications/big_data/ 92(c) KDnuggets 2016
  88. 88. Data Scientist – Sexiest Job of the 21st Century? • Thomas H. Davenport and D.J. Patil, (Harvard Business Review, 2012) 93(c) KDnuggets 2016
  89. 89. “Data Scientist” - leading job trend • “Data Scientist” job has grown 1,700% from 2012 to 2016 • Top 5 Tech Job Trends in 2016: Data Scientist, Devops, Puppet, PaaS, Hadoop? • Indeed.com/jobtrends
  90. 90. Attention to Detail: [Data Scientist] != “Data Scientist” © KDnuggets 2016 95 Indeed.com/jobtrends Data Scientist “Data Scientist” = “data scientist”
  91. 91. “Data Scientist” vs Statistician © KDnuggets 2016 96 Indeed.com job trends “Data Scientist” Statistician
  92. 92. Data Scientist jobs on KDnuggets • % Data Scientist jobs on KDnuggets, including Senior, Junior, Principal, Chief DS, … • [Chart: y-axis 0% to 40%, x-axis 2010 to 2015]
  93. 93. LinkedIn 25 Hot Skills © KDnuggets 2016 98 2015 2014
  94. 94. Data Science Future 99
  95. 95. Big Data • Next Industrial Revolution • Data Science is the Engine of Big Data 100(c) KDnuggets 2016
  96. 96. Doing Old Things Better Application areas – Direct marketing/Customer modeling – Recommendations – Fraud detection – Security/Intelligence – Healthcare – … • Competition will level companies 101(c) KDnuggets 2016
  97. 97. Big Data Enables New Things ! • Google – first big success of big data • Social networks (Facebook, Twitter, LinkedIn, …) success depends on network size, i.e. big data • Big Data in Health-care – image analysis, diagnosis, – Personalized medicine • Recommendations - Netflix streaming 102(c) KDnuggets 2016
  98. 98. New services, products, platforms • Image recognition – FB uses to decide what to show users • Face recognition - security • Location-based services – Tinder • Big Data to Power AI and Machine Learning – Imagine Google DeepMind, IBM Watson, Siri in 2020 ? © KDnuggets 2016 103
  99. 99. Gartner Hype Cycle, 2012 © 2016 KDnuggets 104 Gartner Hype Cycle Big Data
  100. 100. Gartner Hype Cycle, 2013 © 2016 KDnuggets 105 Gartner Hype Cycle Big Data
  101. 101. Gartner Hype Cycle, 2014 • Chart labels: Big Data, Data Science • See http://diggdata.in/ which has 4 years of Gartner Hype Cycle
  102. 102. Gartner Hype Cycle, 2015 • Chart labels: Big Data, Citizen Data Science, Machine Learning • www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html
  103. 103. “Citizen” Data Science © KDnuggets 2016 109 This is Bob, our new Citizen Data Scientist. He previously worked as a citizen dentist and a citizen pilot.
  104. 104. Golden Age of Data Science, Machine Learning • Amazing New Tools • Very complex algorithms are very easy to use • scikit-learn, IPython notebooks, etc. • One-click deployment of TensorFlow on AWS with GPU
  105. 105. Data Science Automated ? © KDnuggets 2016 111 Expert Human Ability Current Computer Ability
  106. 106. Data Science Automated ? © KDnuggets 2016 112 Expert Human Ability
  107. 107. Data Science Automated By 2025? © KDnuggets 2016 113 KDnuggets Poll in 2015: 51% of voters expect Data Science Automation to happen in 10 years or less - www.kdnuggets.com/2015/05/data-scientists-automated-2025.html
  108. 108. Data Science Automation © KDnuggets 2016 114 I remember when only a Deep Learning supercomputer could beat me in a Data Science competition
  109. 109. Data Science Automation • KDnuggets: Software: Automated Data Science: • AutoDiscovery from ButlerScientifics • Automatic Business Modeler from Algolytics • Automatic Statistician project • DataRobot • DMWay • ForecastThis DSX • FeatureLab • Loom Systems • machineJS: Automated machine learning • Quill from Narrative Science • SAP Predictive Analytics • Savvy from Yseop • Skytree Machine Learning Software • Tree-based Pipeline Optimization Tool (TPOT)
  110. 110. Data Science Automation • New tools make Data Scientists more productive • Make data results more widely available • Automate lower-level Data Science tasks © KDnuggets 2016 116
  111. 111. “Soft” Data Science Skills Harder to Automate • Curiosity • Intuition • Business Knowledge • Selecting a good metric • Posing the right question • Presentation Skills Data Science – still a great profession © KDnuggets 2016 117
  112. 112. Questions? KDnuggets: Analytics, Big Data, Data Science • Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html • Email to editor1@kdnuggets.com • Twitter: @kdnuggets • facebook.com/kdnuggets • LinkedIn group: KDnuggets 118© KDnuggets 2016
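
Slides 33 to 36 (Lesson 3) describe a measurement whose deciles looked wrong because an upstream rule rounded raw values below 15 to zero. A minimal Python sketch of that kind of check follows: it looks at the actual stored values instead of deciles. The data are synthetic (the slides themselves note "Not actual data"), and the names `summarize_values` and `threshold` are hypothetical, introduced only for illustration.

```python
from collections import Counter
import random


def summarize_values(values):
    """Return the share of exact zeros and the smallest non-zero value."""
    counts = Counter(values)
    zero_share = counts[0] / len(values)
    nonzero = [v for v in values if v != 0]
    smallest_nonzero = min(nonzero) if nonzero else None
    return zero_share, smallest_nonzero


# Synthetic data (an assumption, not from the talk): integer readings 0..99,
# passed through a hypothetical upstream rule that zeroes anything below 15.
random.seed(0)
threshold = 15
raw = [random.randint(0, 99) for _ in range(10_000)]
stored = [0 if v < threshold else v for v in raw]

zero_share, smallest_nonzero = summarize_values(stored)
print(f"share of zeros: {zero_share:.1%}")
print(f"smallest non-zero value: {smallest_nonzero}")
```

A gap between 0 and the smallest non-zero value, together with an inflated share of zeros, is the kind of pattern that points to a preprocessing rule rather than to the underlying measurement.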

Editor's notes

  • Churn: the best algorithms for predicting churn have a lift of 5-7, i.e. 5-7 times better than random.
    Behavioral advertising: 2-3% CTR, 10 times better than random
  • The future is bright for Big Data, but we need to use caution when evaluating claims
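
The lift figures in these notes and on slides 52 and 53 can be reproduced with a short calculation. The Python sketch below computes lift at a target fraction T as the share of all hits captured in the top T of a model-ranked list, divided by T, and compares it with the slide 53 rule of thumb Lift(T) ~ sqrt(1/T). The function names `lift_at` and `estimated_lift` and the toy list of prospects are hypothetical, chosen only for illustration.

```python
import math


def lift_at(scores, hits, target_fraction):
    """Lift = (share of all hits in the top fraction of the ranked list) / fraction."""
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * target_fraction))
    hits_in_top = sum(hit for _, hit in ranked[:top_n])
    total_hits = sum(hits)
    return (hits_in_top / total_hits) / (top_n / len(ranked))


def estimated_lift(target_fraction):
    """Rule of thumb from slide 53: Lift(T) ~ sqrt(1/T)."""
    return math.sqrt(1.0 / target_fraction)


# Slide 52 arithmetic: the top 5% of the model-ranked list holds 21% of hits.
slide_52_lift = 0.21 / 0.05

# Toy list of 20 prospects (illustrative only): 4 hits, one of them ranked first.
scores = [20 - i for i in range(20)]
hits = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

print(round(slide_52_lift, 2))
print(round(lift_at(scores, hits, 0.05), 2))
print(round(estimated_lift(0.05), 2))
```

A lift of 1 means the model-ranked list is no better than a random list at that depth, which is the baseline the "times better than random" figures above are measured against.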
