
Data Quality: The Data Science struggle nobody mentions - Data Science MeetUp Twente 7 Sep 2017

Presentation about data quality at the second Data Science MeetUp Twente https://www.meetup.com/Data-Meetup-Twente/events/241545781/ on "Responsible Data Analytics", 7 Sep 2017.



  1. DATA QUALITY: THE DATA SCIENCE STRUGGLE NOBODY MENTIONS. Maurice van Keulen, Associate Professor Data Management Technology, Head of Database Group.
  2. WHAT IS THIS TALK ABOUT? First, a little story; data imperfections; some general mechanisms; my research. (Slide footer throughout the deck: "Data quality: the data science struggle nobody mentions", 7 Sept 2017.)
  3. The data science cycle: Domain understanding, Data understanding, Data preparation, Modeling (Analysis), Evaluation (Interpretation), Deployment (Use), centered on the Data. Annotation: "Where the magic happens".
  4. (Image-only slide.)
  5. Research on pregnancy processes based on Electronic Patient Dossiers (EPDs) of some population of women.
  6. The start: 1. The data scientist is a parent him/herself. 2. Gather records from patients' EPDs for the pregnancy period from multiple sources. 3. Fairly straightforward to identify consults, tests, scans, conditions. 4. Extract and store them.
  7. First analysis and evaluation: 1. Analysis with a process mining tool. 2. Interaction with the visualization. 3. Interpret the results. 4. (S)he already sees some interesting patterns.
  8. But then (s)he notices a broken leg and dozens of specialists, and realizes that the assumption "all records that belong to a pregnant woman are related to pregnancy" is wrong: too many records were selected during preparation, and there is no objective means to ascertain this, since there is no field 'related to pregnancy'.
  9. ... and a painstaking process starts: specify complex filter rules, inspect samples of (not) selected records, repeat! Quick and dirty or thorough? Never perfect! What is good enough? How does it affect the results? (Example: fatigue can show later in a pregnancy; is it pregnancy-related or not?)
  10. Fast forward a bit ... Re-perform the analysis and evaluation: 1. Interaction with the mining tool and visualization. 2. Something strange in the times of the consults: a consult appears after the blood test it prescribed???
  11. Realization, and more cleaning: 1. A clinician is contacted for an explanation: notes taken during consults are put into the EPD in the evenings! 2. Modification time of the EPD record (what is recorded) ≠ actual moment of the activity (its semantics): sequence and duration noise. 3. More data cleaning ensues.
  12. Complex: many unexpected surprises in domain/data understanding. Time-consuming: most time is spent on data preparation (up to 50-80%).
  13. A data scientist should know, and tell you, about the deficiencies in the data and the results. Quick and dirty or thorough? Never perfect! What is good enough? How does it affect the results?
  14. When presented with analytics results / visualizations, let people know you don't trust any results if they don't have good answers to questions like these: How was the data cleaned? What data quality problems were there? What data was discarded? How does this affect the results? How reliable are these results?
  15. DATA IMPERFECTIONS
  16. PRODUCT DATA: WHAT IS IT AND WHY IS IT A PROBLEM? What is it: data and specifications on parts, substances, etc. Why is it a problem: there are high requirements on data quality, and errors and duplicates may be costly or even pose health risks. Even so, it is a mess.
  17. PRODUCT INFORMATION CLEANING AND ENRICHMENT. Proposed approach: given a catalogue / database with data on products, gather data on the same products from websites (many more or less independent sources), then consolidate: match, merge and clean into one enriched description for each product.
  18. PILOT: BALL BEARINGS. 1. Given: a catalogue / database with data on products.
  19. PILOT: BALL BEARINGS. 2. Gather data on the same products from websites: get the product pages, extract the data. 3. Consolidate (match, merge, clean).
  20. PILOT EXPERIENCES: THE DIRT WE FOUND!!
  21. DATA IMPERFECTIONS. SELECT SUM(Sales) FROM CustSales → 3423000
      CustID  Sales  Name
      1234    6000   John
      2345    5000   Mary
      3456    12000  Bart
      …       …      …
      The data sample looks fine, and the result of the analysis looks perfectly reasonable. But if you don't look hard enough, if you don't properly pay attention, you will be unaware that you are possibly looking at significantly erroneous figures!!!
  22. DATA IMPERFECTIONS (continued). The same query, SELECT SUM(Sales) FROM CustSales → 3423000, but a closer look at the table:
      CustID  Sales  Name
      6789    2      Tom      ← wrong value included
      4567    6000   Jon      ← double counting
      5678    NULL   Nina     ← missing data
      …       …      …
      And so on: many more problems at the value, record, schema, source, and trust levels.
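The point of these two slides can be reproduced in a few lines. This is a sketch with toy numbers: the table contents are assumed, only the kinds of defects come from the slides.

```python
import sqlite3

# The aggregate looks plausible, yet silently absorbs a wrong value (2),
# skips a NULL, and double-counts a customer who appears under two IDs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CustSales (CustID INT, Sales INT, Name TEXT)")
con.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", [
    (1234, 6000, "John"), (2345, 5000, "Mary"), (3456, 12000, "Bart"),
    (6789, 2, "Tom"),      # wrong value included
    (4567, 6000, "Jon"),   # likely the same person as John: double counting
    (5678, None, "Nina"),  # missing data: NULL rows simply vanish from SUM
])
(total,) = con.execute("SELECT SUM(Sales) FROM CustSales").fetchone()
print(total)  # a reasonable-looking number, built on dirty rows
```

Nothing in the query result betrays that three of the six rows are problematic.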
  23. DATA INTEGRATION: SEMANTIC ENTITY LINKING, ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS. Data integration is very important in Data Science (example purpose: data enrichment). Many names for the same problem: record linkage, entity resolution, entity linking, semantic duplicates, data coupling, data fusion, etc. Example: a customer table with CustIDs 1234 "P. Jansen", 2345 "P. Janssen", 3456 "P. Janssen", and an employee table with EmpNrs 6789 "P. Jansen", 4567 "P. Jansen", 5678 "P. Janssen". They could all be the same person, or all different persons!
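Similarity matching, the usual entry point to entity linking, can be sketched with the standard library. The 0.9 threshold is an arbitrary assumption, and real record linkage would combine more attributes than the name; the sketch only shows why the names alone cannot decide whether "P. Jansen" and "P. Janssen" denote the same person.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

customers = ["P. Jansen", "P. Janssen"]
employees = ["P. Jansen", "P. Janssen"]

# Every cross pair scores above the threshold: all of them are candidate
# links that a human or further evidence must confirm or reject.
candidates = [(c, e, round(similarity(c, e), 2))
              for c in customers for e in employees
              if similarity(c, e) >= 0.9]
print(candidates)
```

High similarity makes a pair a *candidate* duplicate, never a certain one; that residual doubt is exactly what the probabilistic approach later in the talk keeps around.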
  24. CATEGORIZATION OF DATA QUALITY PROBLEMS. Source: P. Woodall, M. Oberhofer, A. Borek, "A Classification of Data Quality Assessment and Improvement Methods", JDIQ 3(4), 2014.
      Context-independent, data perspective: spelling error; missing data; incorrect value; duplicate data; inconsistent data format; syntax violation; violation of integrity constraints; heterogeneity of measurement units; existence of synonyms and homonyms.
      Context-independent, user perspective: information is inaccessible; information is insecure; information is hardly retrievable; information is difficult to aggregate; errors in the information transformation.
      Context-dependent, data perspective: violation of domain constraints; violation of the organization's business rules; violation of company and government regulations; violation of constraints provided by the database administrator.
      Context-dependent, user perspective: the information is not based on fact; information is of doubtful credibility; information presents an impartial view; information is irrelevant to the work; information is incomplete; information is compactly represented; information is hard to manipulate; information is hard to understand; information is outdated.
  25. DATA QUALITY DIMENSIONS AND METRICS. Data/information quality cannot be measured with one value. There is a lot of research on describing desirable properties, for example accuracy, correctness, currency, completeness, relevance, reliability, etc.; metrics are concrete means to estimate a score on these dimensions. A quite futile endeavor in my opinion: 200+ such dimensions have been identified, with even more possible metrics, there is little agreement, and measuring them is expensive and too complex.
  26. DATA QUALITY DEPENDS ON THE PURPOSE. "The data quality of a data set" is a meaningless notion: data quality depends on the purpose! What is good enough for one purpose may be insufficient for another. An alternative means of measurement: define the purpose as a set of queries on the data and measure the quality of the answers; the purpose determines the relevant dimensions & metrics.
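The "purpose as queries" idea can be made concrete: run each purpose query on the data and on a (sampled) gold standard and score the answers. A minimal sketch with assumed toy data:

```python
# One dirty table, two purposes. Tom's sales were mistyped as 2 instead of
# 2000 (toy data; the gold standard stands in for manually verified records).
dirty = [("John", 6000), ("Mary", 5000), ("Bart", 12000), ("Tom", 2)]
gold  = [("John", 6000), ("Mary", 5000), ("Bart", 12000), ("Tom", 2000)]

def relative_error(query):
    """Quality of the answer to one purpose query, as relative error."""
    truth, answer = query(gold), query(dirty)
    return abs(answer - truth) / truth

# Purpose 1: total sales. Purpose 2: the smallest individual sale.
err_sum = relative_error(lambda rows: sum(s for _, s in rows))
err_min = relative_error(lambda rows: min(s for _, s in rows))
print(err_sum, err_min)  # same data: small error for one purpose, huge for the other
```

The same typo costs under 10% for the totals purpose but nearly 100% for the minimum purpose, which is exactly the slide's claim that quality is relative to purpose.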
  27. SOME GENERAL MECHANISMS
  28. STANDARDIZATION. Wikipedia: the "process of implementing and developing technical standards". It requires consensus, is typically quite costly, and has benefits but also downsides (lack of variety, exceptions, slow evolution, defectors, etc.). It is in essence a non-technical solution, but it can be supported by technology. Useful, but inherently imperfect itself.
  29. ERROR DETECTION BY VERIFICATION & PROFILING. One can do quite a bit of checking: profiling of domains (detects improper values); profiling of keys (are the values in a column unique, as expected for a key?); verification of foreign key / primary key relationships; verification of constraints and business rules; verification of expected dependencies, i.e. inclusion and (conditional) functional dependencies (detects missing / erroneous rows); matching with reference data; etc.
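A few of these checks fit in a short sketch (toy tables and rules assumed; real profiling tools run such checks at scale):

```python
customers = [
    {"CustID": 1234, "Country": "NL"},
    {"CustID": 2345, "Country": "XX"},  # improper value: not a known country
    {"CustID": 2345, "Country": "DE"},  # duplicate key value
]
sales = [{"CustID": 1234, "Sales": 6000},
         {"CustID": 9999, "Sales": 50}]  # refers to a non-existent customer

# Domain profiling: values outside the expected domain.
known_countries = {"NL", "DE", "BE"}
bad_domain = [r for r in customers if r["Country"] not in known_countries]

# Key profiling: is the would-be key column actually unique?
ids = [r["CustID"] for r in customers]
duplicate_keys = {i for i in ids if ids.count(i) > 1}

# Foreign key verification: sales rows must point at an existing customer.
orphans = [r for r in sales if r["CustID"] not in set(ids)]

print(bad_domain, duplicate_keys, orphans)
```

Each check flags rows for inspection; as the talk stresses, none of them proves the rest of the data is clean.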
  30. ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY. What can be automatically discovered? Detection of outliers: possible erroneous values. Similarity matching: possible duplicates. Inclusion detection: possible inclusion dependencies, hence possible missing or erroneous rows. Dependency detection: possible (conditional) functional dependencies, hence possible erroneous values and possible duplicates. Prediction: machine learning to predict a value based on other data, hence possible erroneous values (e.g., categorizations). And there is much, much more.
  31. (The same list repeated, with the punchline:) Very useful, but again inherently imperfect itself.
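Outlier detection, the first item on the slide, can be sketched with a robust median-based rule (toy data; the factor 5 is an arbitrary threshold, and the flagged values are only candidates for inspection, in line with the slide's "possible erroneous values"):

```python
import statistics

sales = [6000, 5000, 12000, 2, 6000, 5500, 7000]

# Median absolute deviation: a spread estimate that is itself robust
# against the very outliers we are trying to find.
median = statistics.median(sales)
mad = statistics.median(abs(x - median) for x in sales)

# Flag values far from the median relative to the robust spread.
outliers = [x for x in sales if abs(x - median) > 5 * mad]
print(outliers)  # the suspiciously small and suspiciously large values
```

Note that both flagged values might still be correct; the tool only discovers candidates, which is why the slide calls the mechanism "inherently imperfect itself".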
  32. DATA CLEANING. Manually / semi-automatically / automatically. The techniques of the previous slides can be used not only to discover errors, but also to clean them automatically: if sufficiently probable, duplicate rows can be merged, erroneous rows deleted, and erroneous values corrected. For missing rows / values there is data imputation: fill them with predicted values.
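The simplest form of the data imputation mentioned here, filling missing values with a predicted value, can be sketched as mean imputation (toy data assumed; any predictive model could take the place of the mean):

```python
import statistics

sales = [6000, 5000, None, 12000, None]  # None marks a missing value

# Predict a missing value as the mean of the observed ones.
observed = [s for s in sales if s is not None]
fill = statistics.mean(observed)
imputed = [s if s is not None else fill for s in sales]
print(imputed)
```

Imputation keeps the rows usable for analysis, but the filled-in values are guesses; an honest pipeline would record which values were imputed, echoing the talk's plea to keep information about data problems in the data.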
  33. MY RESEARCH. How do we humans find our way in this mess?
  34. HOW DO WE HUMANS FIND OUR WAY IN THIS MESS? We have domain knowledge; we know what to expect; we know what is likely and what is not; we compare with what we know and expect; we doubt; we (dis)trust; we learn from others; we reconsider; etc. The slide maps these onto technical counterparts: probabilities, uncertainty, metadata of context (source, situation, ...), evidence gathering, user feedback.
  35. THREE PRINCIPLES. Let me illustrate principles 2 & 3 with an example of combining data.
  36. EXAMPLE: COMBINING DATA. Objective: determine the preferred customers (those with sales > 100). Reference: Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54(3), pp. 138-146. ISSN 1611-2776.
      Source 1: B.M.W. 25; Mercedes 32; Renault 10.
      Source 2: BMW 72; Mercedes-Benz 39; Renault 20.
      Source 3: Bayerische Motoren Werke 8; Mercedes 35; Renault 15.
      Combined: B.M.W. 25; Bayerische Motoren Werke 8; BMW 72; Mercedes 67; Mercedes-Benz 39; Renault 45.
  37. VOILÀ! THE SEMANTIC DUPLICATES PROBLEM. Preferred customers: SELECT SUM(Sales) FROM CarSales WHERE Sales>100 → 0, i.e. "no preferred customers".
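The failing query is easy to reproduce (using SQLite in memory; the table is the naively combined one from the slide):

```python
import sqlite3

# Three spellings of BMW and two of Mercedes are still separate rows,
# so no single row exceeds the preferred-customer threshold.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CarSales (brand TEXT, sales INT)")
con.executemany("INSERT INTO CarSales VALUES (?, ?)", [
    ("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72),
    ("Mercedes", 67), ("Mercedes-Benz", 39), ("Renault", 45),
])
(total,) = con.execute(
    "SELECT COALESCE(SUM(sales), 0) FROM CarSales WHERE sales > 100"
).fetchone()
print(total)  # 0: 'no preferred customers', although BMW really sold 105
```

The COALESCE is needed because SUM over an empty selection is NULL, which is itself a small data quality trap.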
  38. SEMANTIC DUPLICATES. The six database rows d1-d6 (B.M.W. 25, Bayerische Motoren Werke 8, BMW 72, Mercedes 67, Mercedes-Benz 39, Renault 45) refer to only four real-world car brands o1-o4.
  39. MOST DATA IMPERFECTIONS CAN BE MODELED AS UNCERTAINTY IN DATA. Run some duplicate detection tool (similarity matching) and record its outcome as random variables instead of committing to one answer:
      X=0: rows 4 ("Mercedes" 67) and 5 ("Mercedes-Benz" 39) are different, probability 0.2.
      X=1: rows 4 and 5 are the same, probability 0.8; the merged row then has Sales 106.
      Y=0: "Mercedes" is the correct name, probability 0.5. Y=1: "Mercedes-Benz" is the correct name, probability 0.5.
      B.M.W. / BMW / Bayerische Motoren Werke are handled analogously.
  40. WHAT I HAVE NOW IS A PROBABILISTIC DATABASE. It looks like an ordinary database, but queries get several "possible", "most likely" or "approximate" answers. Important: scalability (big data!). Sales of the "preferred customers": SELECT SUM(sales) FROM carsales WHERE sales ≥ 100. Answer: 106 (most likely).
      SUM(sales)   P
      0            14%
      105           6%
      106          56%
      211          24%
      The second most likely answer, at 24%, differs by a factor of 2 in sales (211 vs 106).
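The answer distribution above can be recomputed by enumerating the possible worlds. The 0.8 for the Mercedes pair is stated on the previous slide; the 0.3 for the BMW triple is an assumption chosen to be consistent with the slide's answer table:

```python
from itertools import product

p_mercedes_same = 0.8  # stated on the slide (X=1)
p_bmw_same = 0.3       # assumption, consistent with the slide's distribution

dist = {}
for merc_same, bmw_same in product([True, False], repeat=2):
    # Probability of this possible world (merge decisions independent).
    p = (p_mercedes_same if merc_same else 1 - p_mercedes_same) \
        * (p_bmw_same if bmw_same else 1 - p_bmw_same)
    # Per-brand sales in this world.
    sales = ([67 + 39] if merc_same else [67, 39]) \
          + ([25 + 72 + 8] if bmw_same else [25, 72, 8]) \
          + [45]  # Renault
    # The purpose query: SUM(sales) over preferred customers (sales >= 100).
    answer = sum(s for s in sales if s >= 100)
    dist[answer] = dist.get(answer, 0.0) + p

print(dist)  # probabilities for the answers 211, 106, 105 and 0
```

Four worlds suffice here; a real probabilistic DBMS must get the same distribution without enumerating exponentially many worlds, which is the scalability point the slide makes.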
  41. (The semantic entity linking slide from before is shown again: "Remember this slide?")
  42. DATA INTEGRATION THE PROBABILISTIC WAY. Let's go for an initial integration that can readily and meaningfully be used, and let it improve during use: "good is good enough". Initial integration: partial data integration; enumerate the cases for the remaining problems; store the data with its uncertainty in a UDBMS. Continuous improvement: use, gather evidence, improve data quality.
  43. TRUST AND DOUBT THE PROBABILISTIC WAY.
      The user perceives (part of) a source as of doubtful credibility; we humans say "I don't trust this data": rows are correct or not, so each row gets a variable & probability.
      Sources contain conflicting data about (possibly) the same entities; we humans say "I'm left with some doubt": we already saw this one, the car brand example!
      An automatic error discovery tool (AEDT) detects possible erroneous values or rows; we humans say "I doubt that this is correct": alternative rows, each with a probability of correctness.
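A minimal sketch of the first pattern (toy trust probabilities assumed): attach to each row a probability that it is correct, and aggregate in expectation rather than trusting every row equally:

```python
# Each row carries a probability that it is correct (toy numbers).
rows = [
    (6000, 0.90),   # fairly trusted source
    (5000, 0.95),
    (12000, 0.60),  # source of doubtful credibility
]

# Expected value of SUM(sales): each row contributes its sales weighted
# by the probability that the row is correct.
expected_sum = sum(sales * p for sales, p in rows)
print(expected_sum)
```

A single expected value is of course the crudest summary; as the preceding slides show, one can also keep the full distribution over answers.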
  44. CONCLUSIONS
  45. WHAT HAVE I TALKED ABOUT. A little story: a peek into the life and frustrations of a data scientist; awareness for responsible analytics. Data imperfections: what a mess! The ball bearings pilot; models for data quality (problems). Some general mechanisms: still quite a mess! Standardization, error detection & discovery, cleaning. ... and I talked about my own research.
  46. With this, a data scientist can: not trust certain parts of the input data as much as others; integrate data in a quick-and-dirty but responsible way, because the information about the data problems is *in* the data; solve data quality problems only to the degree needed; measure data uncertainty and quality; document the data manipulation and its reasons; and thus know how DQ problems affect the reliability of the results.
  47. (Image slide with closing quotations: Francis Bacon, 1605; Jorge Luis Borges, 1979.)
  48. SOURCES OF IMAGES: dripping clock: http://www.gemfive.com; flower: http://cdn.tinybuddha.com; big data: http://tr1.cbsistatic.com; awareness: http://7minutesinthemorning.com
