CommonAnalyticMistakes_v1.17_Unbranded


  1. 1. Common Analytic Mistakes Jim Parnitzke Webinar Series September, 2014
  2. 2. Introduction Jim Parnitzke Business Intelligence and Enterprise Architecture Advisor, Expert, Trusted Partner, and Publisher
  3. 3. Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Not Asking the Hard Questions Focusing on Yesterday's Metrics Understanding Metrics and Their Methodology Bottlenecking the Value to the Organization Overvaluing Data Visualization The Conference Table Compromise Compromising Data Ownership Confusing Insight With Action Common Analytic Mistakes
  4. 4. Analytics projects driven by nothing more than management pronouncements that "we need metrics, get us some" don’t ever end well. • Analysis for analysis sake is ridiculous. • Ask the right questions to learn what data and metrics are important and can make a difference. • Know which quantitative measures don't matter and can be disregarded. Not Asking the Hard Questions
  5. 5. • Analytics projects can be driven by answering yesterday's questions, not tomorrow's. It's an easy trap to fall into. People are comfortable with what they know. • A backwards-optimized view provides comfort... at a cost of NOT tracking metrics that help drive business forward, numbers the organization doesn't yet know, or what's unpredictable and uncomfortable but exactly where focus is needed. Focusing on Yesterday's Metrics Historical Reporting “What happened” Real Time Reporting “What are we doing right now” Modelling “What if we did this or that” Predictive Analytics “What can we expect” Prescriptive Analytics “What is the optimal solution” -- Descriptive -- -- Prescriptive ---- Predictive --
  6. 6. No easy answer to this. Try to understand how various systems count, resolve that with your own needs, then accept the compromises you have no control over. • Example: Online media and marketing are immature. Systems that measure them are immature. Don’t assume systems and their outputs are fully baked and results can be taken at face value. Reality check: they aren't, and they can't. A page view in one system isn't a page view in another. The definitions and methods behind metrics such as page views, visits, unique visitors, and ad deliveries are different in every system. Without a standard definition for an ad impression, you can be sure there aren't standards for all the peripheral metrics. Counts can differ between products from the same vendor by as much as 30%. Misunderstanding Metrics
  7. 7. Many well-intentioned project owners kill the project value because they bottleneck their own success. Well-meaning professionals, usually the systems' power users, create processes and procedures that place themselves at the center of everything regarding the systems: report creation, running ad hoc queries, report distribution, and troubleshooting. This often prevents others from taking real advantage of the systems' intelligence. The only solution is to train others, then step back and let them at it. Mistakes and misinterpretations will happen, but the benefits of widespread adoption across the enterprise will always outweigh them. Bottlenecking Value to the Organization
  8. 8. When it comes to analytics, there's nothing we love more than lots of pretty pictures and dazzling graphics. There is nothing wrong with pleasing visualization tools in data presentation. Too many worry about pictures first, data and analysis second. Pictures cloud their minds and vision, which is exactly why vendors put them there. Graphics are great at grabbing attention, but not always great at putting data into action. What looks good in reports should be a means to the end, not the end in itself. Overvaluing Data Visualization http://www.wheels.org/monkeywrench/?p=413
  9. 9. One reason analytics projects lose focus is they begin compromised. Too many follow the conference table consensus approach. The Conference Table Compromise Every department gets a seat at the table. Everyone contributes suggestions. The final product is a compendium of all requests. Although this method tends to create lots of good feelings, it rarely results in the best set of metrics with which to run the business. You will find it tracking silly pet metrics that someone at the table thought were important but that are completely irrelevant. To keep projects focused, decide which metrics are important and stay, and which are distracting and go.
  10. 10. When budgets are tight and all are clamoring for better analytics, it's understandable that not everyone reads or fully comprehends the fine print associated with vendor "partnerships." The nuances of data ownership may seem innocuous, but there are consequences. Many will use analytics services to build databases of anonymous consumer profiles and their behavior, then use them in ad targeting when those consumers visit other sites, without compensation to the publishers whose sites were harvested. Be careful with this one… Compromising Data Ownership
  11. 11. What good are analysis and insight if you can't act on them? • Almost all analytics systems bill themselves as actionable. Many claim they're real time. Learn what they really mean. • Few systems can enable an enterprise to take immediate, tactical steps to leverage data for value. For most, "actionable" means the system can generate reports, such as user navigation patterns publishers can mull over in meetings, then plan the changes needed to improve. • While that may meet the definition of actionable, it doesn't necessarily mean real-time or even right-time action. Bottom line: Understand, don't assume. Confusing Insight With Action
  12. 12. Data Quality and Context Never Compare Apples to Oranges Don't Overstate (alarm) Unnecessarily Calibrate Your Time Series Always Make Your Point Clearly (and Colors Matter.) Statistical Significance Correlation vs. Causation Improper use of averages There is Such a Thing as Too Little Data! Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  13. 13. While a poor font choice can ruin a meeting, a poor interpretation of statistics or data can kill you (or someone else, for that matter). Proper use of data is a viciously complicated topic. See the section on Statistical Mistakes for examples. If your findings would lead to the wrong conclusions, not presenting the data at all would be a better choice. Here is a simple rule for designers: Data Quality and Context Your project isn’t ready if you have spent more time choosing a font or color scheme than choosing your data.
  14. 14. • Real data is ugly – but there’s no substitute for real data • Provenance - Critical Questions to Ask • Who collected it? • Why was it collected? • What is its context in a broader subject? • What is its context in the field of research that created it? • How was it collected? • What are its limitations? • Which mathematical transformations are appropriate? • Which methods of display are appropriate? Data Quality and Context
  15. 15. Cleaning and formatting a single data set is hard. What if you’re: • Building a live visualization that will run with many different data sets? • Without time to manually clean each data set? There is still no substitute for real data. Demo or sample data doesn’t help you plan for data discrepancies, null values, outliers, or real-world problems. • Use several random samples of real data if you cannot access the entire data set. • Invalid and missing data are a guarantee. If your data won’t be cleaned before being graphed, do not clean your sample data either. • Real data may be so large that it overwhelms the process of generating your visualization. If you use a sample of the data, be sure to scale the sample size up or down appropriately before creating the final visualization. There’s no substitute for real data
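A minimal pandas sketch of the sampling advice above (the file name and column handling are hypothetical; the point is to sample real data and leave its flaws intact):

```python
import pandas as pd

# Illustrative sketch (hypothetical file): sample the real, messy data set
# rather than inventing demo data, and keep the nulls and outliers intact.
df = pd.read_csv("orders.csv")

# Several independent random samples preserve a realistic mix of problems.
samples = [df.sample(n=min(1000, len(df)), random_state=seed) for seed in (1, 2, 3)]

for s in samples:
    # Do not pre-clean: count the missing values so the design plans for them.
    print(s.isna().sum().sum(), "missing values in this sample")

# If the final visualization will use the full data set, remember each sample is
# only ~1,000 rows; scale counts (not rates) up accordingly before judging sizes.
```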
  16. 16. Four different segments are being compared, but they are calibrated incorrectly. On the surface this is hard to detect. • The clean part is that there is very little overlap between Search Traffic and Referral Traffic. But Mobile is a platform. Never Compare Apples to Oranges Mobile traffic (conversions in this case) is most likely already counted in both Referrals and Search. It is unclear what to make of that orange stacked bar. The graph is showing conversions already included in Search and Referral (double counting), and because you have no idea what it represents, it is impossible to know what action to take. Would you recommend a higher investment in Mobile based on this graph? • The same goes for Social Media. It is likely that the Social Media conversions are already included in Referrals and in Mobile, so the green bar in the graph is useless. Recommending a massive increase in investment in Social Media from this graph would be an equally imprecise conclusion.
  17. 17. What do you think is wrong with this graph? It artificially inflates the importance of a change in the metric that might not be all that important. In this case the data is not statistically significant, but there is no way we can know that just from the data. Yet the scale used for the y-axis implies that something huge has happened. Don't Overstate or Alarm Unnecessarily Try to avoid being so dramatic in your presentation. It causes people to read things into the performance that they should most likely not read. Setting the y-axis at zero may not be necessary every time, but dramatizing this 1.5 point difference is a waste of everyone's time. One more important thing: label your x-axis. Please.
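A small matplotlib sketch of the same idea, using made-up monthly numbers: the left panel exaggerates a ~1.5 point movement with a zoomed-in y-axis, the right panel keeps it in context and labels the axes:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly conversion rates: a ~1.5 point spread that should not
# be made to look dramatic.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
rate = [21.0, 21.4, 22.1, 21.8, 22.5, 22.3]

fig, (ax_bad, ax_ok) = plt.subplots(1, 2, figsize=(9, 3))

ax_bad.plot(months, rate)
ax_bad.set_ylim(20.8, 22.6)      # zoomed-in axis exaggerates the change
ax_bad.set_title("Overstated")

ax_ok.plot(months, rate)
ax_ok.set_ylim(0, 25)            # honest scale for a small difference
ax_ok.set_title("In context")

for ax in (ax_bad, ax_ok):
    ax.set_xlabel("Month")       # always label the x-axis
    ax.set_ylabel("Conversion rate (%)")

plt.tight_layout()
plt.show()
```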
  18. 18. This chart shows nine months of performance… by day! The "trend" is completely useless. • Looking at individual days over such a long time period can hide insights and important changes. It can be near impossible to find anything of value. • Try switching to the exact same time period, but by week. Now you can see some kind of trend, especially towards the end of the graph (even this simple insight was hidden before). Calibrate Your Time Series See: http://square.github.io/crossfilter for another good example.
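If the data lives in pandas, re-calibrating the granularity is a one-line resample; a sketch with simulated daily values:

```python
import numpy as np
import pandas as pd

# Hypothetical nine months of daily values; the point is the resampling step.
days = pd.date_range("2014-01-01", "2014-09-30", freq="D")
daily = pd.Series(np.random.default_rng(0).poisson(200, len(days)), index=days)

weekly = daily.resample("W").sum()   # same data, weekly granularity
print(weekly.head())                 # plot this instead of the daily noise
```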
  19. 19. What do you think the two colors in this graph represent? How come only 29 percent of the organizations have more than one person?! Problem one is that "red" denotes "good" in this graph and "green" represents "bad." Here's something very, very simple you should understand: red is bad and green is good. Always. Period. People instinctively think this way. So show "good" in green and "bad" in red. It will communicate your point clearly and faster. Always Make Your Point Clearly Problem two, much worse, is that it is harder than it should be to understand this data. The first stacked bar above says that for 71 percent of the organizations the answer is "Yes, more than one person." So what is the 29 percent? If the question is how many people are directly responsible for improving conversion rates, and 71 percent have more than one person, is the 29 percent those that have exactly one person, fewer than one, or no one at all? Unclear (and frustrating).
  20. 20. We all make this mistake. We create a table like the one below. We create a "heat map" in the table highlighting where conversion rates appear good. We declare Organic to be the winner, with Direct close behind, then the other two, and we recommend doing more SEO. Statistical Significance None of this data may be significant – the fact that the numbers seem so different might not mean anything. It is entirely possible that it is completely immaterial that Direct is 34% and Email is 10%, or that Referral is 7%. We should evaluate the raw numbers to see if the percentages are meaningful at all. • The data in the Direct row could represent conversions out of 10 visits, while all the Referral data could represent conversions from 1,000,000 visits. Compute statistical significance to identify which comparison sets we can be confident are different, and in which cases we simply don't have enough confidence. Do you see the problem?
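One way to run that check, sketched in Python with hypothetical raw counts (Fisher's exact test is just one reasonable choice when one group has very few visits):

```python
from scipy.stats import fisher_exact

# Hypothetical raw counts behind two conversion rates that "look" very different:
# Direct: 3 conversions out of 10 visits; Referral: 70,000 out of 1,000,000 visits.
table = [
    [3, 7],                 # Direct: converted, not converted
    [70_000, 930_000],      # Referral: converted, not converted
]

odds_ratio, p_value = fisher_exact(table)
print(f"p-value = {p_value:.3f}")
# Compare the p-value to your threshold (say 0.05) before concluding the channels
# really differ; with only 10 Direct visits, an apparent gap can easily be noise.
```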
  21. 21. Confusing correlation and causation is one of the most overlooked problems. In the Cheese and Employment Status percentage graph, it is clear that retired Redditors prefer cheddar cheese and freelance Redditors prefer brie. • This does not mean that once an individual retires he or she will develop a sudden affinity for cheddar. • Nor does becoming a freelancer cause one to suddenly prefer brie. Correlation vs. Causation
  22. 22. Improper use of averages Averages can be a great way to get a quick overview of some key areas, but use them wisely. For example, average order value is a useful metric. If we look only at average order value by month it’s enlightening because it shows an increase over time, which indicates a move in the right direction. However, it’s more useful to look at average order value by department by month, because this shows us where the increase in average order value is coming from: the women’s shoes department. If we looked only at average order value by month, we might spread marketing across all departments, which is not the most efficient allocation of resources.
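A small pandas sketch of the same drill-down, using made-up order data:

```python
import pandas as pd

# Hypothetical order-level data: the breakdown by department reveals where an
# increase in average order value actually comes from.
orders = pd.DataFrame({
    "month":       ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "department":  ["Womens Shoes", "Menswear", "Womens Shoes",
                    "Womens Shoes", "Menswear", "Womens Shoes"],
    "order_value": [80.0, 60.0, 95.0, 120.0, 58.0, 140.0],
})

overall = orders.groupby("month")["order_value"].mean()
by_dept = orders.groupby(["month", "department"])["order_value"].mean().unstack()

print(overall)   # the overall average rises from Jan to Feb...
print(by_dept)   # ...but only Womens Shoes drives it; Menswear is flat
```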
  23. 23. • Do other useful things. Look at your search keyword reports. Do you see a few people arriving on keywords you SEOed the site for? Look at the keywords your site is showing up for in Google search results. Are they the ones you were expecting? • Even better… spend time with competitive intelligence tools like Insights for Search, Ad Planner, and others to seek clues from your competitors and your industry ecosystem. At this stage you can learn a lot more from their data than your data… There is Such a Thing as Too Little Data! Another "simple" mistake. We get excited about having data, especially if we are new at this. We get our tables and charts together, we report data, and we have a lot of fun. This is very dangerous. You see, there is such a thing as too little data. You don't want to wait until you've collected millions of rows of data to make any decision, but the table here is nearly useless. So can you do anything with data like this?
  24. 24. Drawing Conclusions from Incomplete Information Assuming a Lab is a Reasonable Substitute Forgetting the Real Experts are Your Customers Gaming the System Sampling Problems Intelligence is not binary Keeping it Simple - Too Simple Beware the long tail Simpson's paradox Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  25. 25. Data may not tell the full story. For example, your analytics show visitors spend a relatively high amount of time on a particular page. • Is that page great – or is it problematic? Maybe visitors simply love the content. • Or, maybe they are getting stuck due to a problem with the page. Your call center statistics show average call time has decreased. • Is a decrease in average call time good news or bad news? • When calls end more quickly, costs go down, but have you actually satisfied callers or left them disgruntled, dissatisfied, and on their way to your competition? Never draw conclusions from any analysis that does not tell the whole story. Use bar charts to visualize relative sizes. If you see one bar that is twice as long as another, you expect the underlying quantity to be twice as big. This relative sizing fails if you do not start your bar chart axis at 0. Rule of thumb: if you want to illustrate a small change, use a line chart rather than a bar chart whose y-axis starts anywhere other than 0. Conclusions from Incomplete Information
  26. 26. The chart on the left compares Redditors who like dogs vs. Redditors who prefer cats. With the y-axis starting at 9,000, it looks like dog lovers outnumber cat lovers by three times. However, the graph on the right is a more accurate representation of the data. There are not quite twice as many Redditors who prefer dogs to cats. A limitation of bar charts that start at 0 is they do not show small percent differences. If you need to change the start of your axis in order to highlight small changes, switch to a line chart. Conclusions from Incomplete Information
  27. 27. Usability groups and panels are certainly useful and have their place; the problem is the sample sizes are small and the testing takes place in a controlled environment. You bring people into a lab and tell them what you want them to do. • Does that small group of eight participants represent your broader audience? • Does measuring and observing them when they do what we tell them to do provide the same results as real users who do what they want to do? Observation is helpful, but applying science to the voice of customer and measuring the customer experience through the lens of customer satisfaction is a better way to achieve successful results. Assuming a Lab is a Reasonable Substitute
  28. 28. Experts, like usability groups, have their place. But who knows customer intentions, needs, and attitudes better than actual customers? When you really want to know, go to the source. It takes more time and work, but the results are much more valuable. Experts and consultants certainly have their place, but their advice and recommendations must be driven by customer needs as much if not more than by organizational needs. The Real Experts are Your Customers
  29. 29. Many feedback and measurement systems create bias and inaccuracy. How? Ask the wrong people, bias their decisions, or give them incentives for participation. Measuring correctly means creating as little measurement bias as possible while generating as little measurement noise as possible. • Avoid incenting people to complete surveys, especially when there is no need. Never ask for personal data; some will decline to participate if only for privacy concerns. • Never measure with the intent to prove a point. We may, unintentionally, create customer measurements to prove our opinions are correct or support our theories, but to what end? • Customer measurements must measure from the customers’ perspective and through the customers’ eyes, not through a lens of preconceived views. Gaming the System
  30. 30. Sampling works well when sampling is done correctly. Sample selection and sample size are critical to creating a: • credible, • reliable, • accurate, • precise, and predictive methodology. Sampling is a science in and of itself. You need samples representative of the larger population that are randomly selected. See the section on Statistical Mistakes for more on this. Sampling Problems
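A minimal sketch of simple random sampling in pandas (file and column names are hypothetical), including a sanity check that the sample resembles the population:

```python
import pandas as pd

# Draw a simple random sample so every row has an equal chance of selection,
# rather than taking "the first 500 rows" or only the customers who answered
# a survey. File and "region" column are hypothetical.
population = pd.read_csv("customers.csv")

sample = population.sample(n=500, random_state=42)  # without replacement by default

# A random sample *can* still be unrepresentative by chance, so compare key
# dimensions of the sample against the population before trusting it.
print(population["region"].value_counts(normalize=True))
print(sample["region"].value_counts(normalize=True))
```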
  31. 31. Taking a binary approach to measuring satisfaction – in effect, asking whether a customer is or is not satisfied – leads to simplistic and inaccurate measurement. Intelligence is not binary. • People are not just smart or stupid. • People are not just tall or short. • Customers are not just satisfied or dissatisfied. “Yes” and “no” do not accurately explain or define levels or nuances of customer satisfaction. The degree of satisfaction with the experience is what determines the customer’s level of loyalty and positive word of mouth. Claiming 97% of your customers are satisfied certainly makes for a catchy marketing slogan but is far from a metric you can use to manage your business forward. If you cannot trust and use the results, why do the research? Intelligence is not binary
  32. 32. The “keep it simple” approach does not work for measuring customer satisfaction (or for measuring anything regarding customer attitudes and behaviors). • Customers are complex; they make decisions based on a number of criteria, most rational, some less so. Asking three or four questions does not create a usable metric or help to develop actionable intelligence. Measuring customer satisfaction by itself will not provide the best view forward. Use a complete satisfaction measurement system – including future behaviors and predictive metrics such as likelihood to return or to purchase again – to generate leading indicators that complement the lagging ones. Many will take the simple approach and make major strategic decisions based on a limited and therefore flawed approach to measurement. Great managers do not make decisions based on hunches or limited data; “directionally accurate” is simply not good enough. Keeping it Simple - Too Simple http://experiencematters.wordpress.com/tag/lowes/page/2/
  33. 33. In statistics, the long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the "head" or central part of the distribution. A probability distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution. Beware the long tail A long-tail distribution will arise when many values are unusually far from the mean, which increases the magnitude of the skewness of the distribution. Top 10,000 Popular Keywords
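An illustrative simulation (not real keyword data) contrasting a normal distribution with a long-tailed one and measuring the skew:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Illustrative only: a normal distribution vs. a long-tailed (lognormal) one,
# e.g. keyword traffic where a handful of head terms dwarf everything else.
normal_sample = rng.normal(loc=100, scale=15, size=100_000)
long_tail_sample = rng.lognormal(mean=3, sigma=1.2, size=100_000)

print("normal skewness:   ", round(skew(normal_sample), 2))     # ~0
print("long-tail skewness:", round(skew(long_tail_sample), 2))  # strongly positive
print("share of total held by the top 1%:",
      round(np.sort(long_tail_sample)[-1000:].sum() / long_tail_sample.sum(), 2))
```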
  34. 34. The term long tail has gained popularity in describing the retailing strategy of selling a large number of unique items with relatively small quantities sold of each, in addition to selling fewer popular items in large quantities. The distribution and inventory costs of businesses successfully applying this strategy allow them to realize significant profit out of selling small volumes of hard-to-find items to many customers, instead of only selling large volumes of a reduced number of popular items. The total sales of this large number of "non-hit items" is called "the long tail". • See also: • Black swan theory • Kolmogorov's zero–one law, which concerns what are known as tail events • Mass customization • Micropublishing • Swarm intelligence Source: http://en.wikipedia.org/wiki/Long_tail Beware the long tail (cont’d)
  35. 35. Simpson's paradox, or the Yule–Simpson effect, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. Encountered in social-science and medical-science statistics, this effect is confounding when frequency data are unduly given causal interpretations. Using professional baseball as an example, it is possible for one player to hit for a higher batting average than another player during a given year, and to do so again during the next year, but to have a lower batting average when the two years are combined. Simpson's paradox
  Player | 1995 Avg. | 1996 Avg. | Combined Avg.
  Derek Jeter | 12/48 (0.250) | 183/582 (0.314) | 195/630 (0.310)
  David Justice | 104/411 (0.253) | 45/140 (0.321) | 149/551 (0.270)
  36. 36. This phenomenon occurs when there are large differences in the number of at-bats between the years. The same situation applies to calculating batting averages for the first half of the baseball season and the second half, and then combining all of the data for the season's batting average. Simpson's paradox (cont’d)
  Player | 1995 Avg. | 1996 Avg. | Combined Avg.
  Derek Jeter | 12/48 (0.250) | 183/582 (0.314) | 195/630 (0.310)
  David Justice | 104/411 (0.253) | 45/140 (0.321) | 149/551 (0.270)
  If weighting is used this phenomenon disappears. The table below has been normalized to the larger at-bat totals so the same things are compared.
  Player | 1995 Avg. (weighted) | 1996 Avg. (weighted) | Combined Avg.
  Derek Jeter | 12/48*411 (0.250) | 183/582*582 (0.314) | 285.75/993 (0.288)
  David Justice | 104/411*411 (0.253) | 45/140*582 (0.321) | 291/993 (0.293)
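The paradox can be reproduced directly from the numbers in the table; a short Python sketch:

```python
# The Jeter/Justice numbers from the slide, as (hits, at_bats) per season.
jeter   = {"1995": (12, 48),   "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def avg(hits, at_bats):
    return hits / at_bats

for year in ("1995", "1996"):
    print(year, "Jeter", round(avg(*jeter[year]), 3),
                "Justice", round(avg(*justice[year]), 3))
    # Justice wins both individual years...

def combined(seasons):
    return avg(sum(h for h, _ in seasons.values()),
               sum(ab for _, ab in seasons.values()))

print("combined: Jeter", round(combined(jeter), 3),
      "Justice", round(combined(justice), 3))
# ...yet Jeter's combined average (0.310) beats Justice's (0.270), because the
# at-bat counts differ so much between the two years.
```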
  37. 37. Expecting too much certainty Misunderstanding probability Mistakes in thinking about causation Problematical choice of measure Errors in sampling Over-interpretation Mistakes involving limitations of hypothesis tests/confidence intervals Using an inappropriate model or research design Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  38. 38. One consequence of not taking uncertainty seriously enough is that we often write results in terms that misleadingly suggest certainty. For example, some might conclude from a study that a hypothesis is true or has been proved, when it would be more correct to say that the evidence supports the hypothesis or is consistent with the hypothesis. Another mistake is misinterpreting results of statistical analyses in a deterministic rather than probabilistic (also called stochastic) manner. Expecting too much certainty " ... as far as the propositions of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." Albert Einstein, Geometry and Experience, Lecture before the Prussian Academy of Sciences, January 27, 1921 Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  39. 39. There are four perspectives on probability that are commonly used: • Classical, • Empirical (or Frequentist), • Subjective, and • Axiomatic. Using one perspective when another is intended can lead to serious errors. Common misunderstanding: If there are only two possible outcomes, and you don't know which is true, the probability of each of these outcomes is 1/2. In fact, probabilities in such "binary outcome" situations could be anything from 0 to 1. For example, if the outcomes of interest are "has cancer" and "does not have cancer," the probabilities of having cancer are (in most cases) much less than 1/2. The empirical (frequentist) perspective allows us to estimate such probabilities. Misunderstanding probability Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  40. 40. Confusing correlation with causation. Example: Students' shoe sizes and scores on a standard reading exam. They are correlated, but saying that larger shoe size causes higher reading scores is as absurd as saying that high reading scores cause larger shoe size. There is a clear lurking variable, namely, age. As the child gets older, both their shoe size and reading ability increase. Do not interpret causality deterministically when the evidence is statistical. Mistakes in thinking about causation Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
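A purely simulated illustration of the lurking variable: both measures are driven by age, so they correlate strongly until age is controlled for:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data only: age drives both shoe size and reading score, so the two
# are correlated even though neither causes the other.
age = rng.uniform(6, 12, size=2_000)                      # school-age children
shoe_size = 0.8 * age + rng.normal(0, 0.5, size=age.size)
reading = 10 * age + rng.normal(0, 5, size=age.size)

print("raw correlation:", round(np.corrcoef(shoe_size, reading)[0, 1], 2))

# Remove the linear effect of age from both variables and correlate the residuals:
# the apparent relationship all but disappears.
resid_shoe = shoe_size - np.polyval(np.polyfit(age, shoe_size, 1), age)
resid_read = reading - np.polyval(np.polyfit(age, reading, 1), age)
print("controlling for age:", round(np.corrcoef(resid_shoe, resid_read)[0, 1], 2))
```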
  41. 41. In most research, one or more outcome variables are measured. Statistical analysis is done on the outcome measures, and conclusions are drawn from the statistical analysis. The analysis itself involves a choice of measure, called a summary statistic. Misleading results occur when inadequate attention is paid to the choice of either outcome variables or summary statistics. • Example: What is a good outcome variable for deciding whether cancer treatment in a country has been improving? A first thought might be "number of deaths in the country from cancer in one year." But the number of deaths might increase simply because the population is increasing. Or it might go down if cancer incidence is decreasing. "Percent of the population that dies of cancer in one year" would take care of the first problem, but not the second. In this case a rate is a better measure than a count. Problematical choice of measure
  42. 42. Believing that a "random sample will be representative of the population." In fact, this statement is false -- a random sample might, by chance, turn out to be anything but representative. For example, it is possible that if you toss a coin ten times, all the tosses will come up heads. A slightly better explanation that is partly true: "Random sampling eliminates bias by giving all individuals an equal chance to be chosen." Here is the very important reason why random sampling matters: the mathematical theorems which justify most statistical procedures apply only to random samples. Errors in sampling
  43. 43. • Extrapolation to a larger population than the one studied Example: running a marketing experiment with undergraduates enrolled in marketing classes and drawing a conclusion about people in general. • Extrapolation beyond the range of data Similar to extrapolating to a larger population, but concerns the values of the variables rather than the individuals. • Ignoring Ecological Validity Involves the setting (i.e., the "ecology") rather than the individuals studied, or it may involve extrapolation to a population having characteristics very different from the population that is relevant for application. Over-interpretation
  44. 44. • Using overly strong language in stating results Statistical procedures do not prove results. They only give us information on whether or not the data support or are consistent with a particular conclusion. There is always uncertainty involved. Acknowledge this uncertainty. • Considering statistical significance but not practical significance Example: Suppose that a well-designed, well-carried out, and carefully analyzed study shows that there is a statistically significant difference in life span between people engaging in a certain exercise regime at least five hours a week for at least two years and those not following the exercise regime. If the difference in average life span between the two groups is three days... Over-interpretation (cont’d) Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  45. 45. Type I Error • Rejecting the null hypothesis when it is in fact true is known as a Type I error. Many people decide, before doing a hypothesis test, on a maximum p-value for which they will reject the null hypothesis. This value is often denoted α (alpha) and is the significance level. Type II Error • Not rejecting the null hypothesis when in fact the alternate hypothesis is true is called a Type II error. An analogy helpful in understanding the two types of error is to consider a defendant in a trial. The null hypothesis is "defendant is not guilty;" the alternate is "defendant is guilty." • Type I error would correspond to convicting an innocent person; • Type II error would correspond to setting a guilty person free. Test/Hypothesis confidence intervals
  46. 46. Two drugs are known to be equally effective for a certain condition. • Drug 1 has been used for decades with no reports of the side effect • Drug 2 may cause serious side effects The null hypothesis is "the incidence of the side effect in both drugs is the same"; the alternate hypothesis is "the incidence of the side effect in Drug 2 is greater than that in Drug 1." Falsely rejecting the null hypothesis when it is in fact true (Type I error) would have no great consequences for the consumer, but a Type II error (i.e., failing to reject the null hypothesis when in fact the alternate is true, which would mean deciding that Drug 2 is no more harmful than Drug 1 when it is in fact more harmful) could have serious consequences. Setting a large significance level is appropriate in this case. Test/Hypothesis confidence intervals
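An illustrative simulation (made-up effect size and sample size) showing how often a standard two-sample t-test commits each type of error at a given significance level:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha = 0.05
n, trials = 30, 2_000

# Simulation sketch: estimate Type I and Type II error rates for a t-test.
type1 = type2 = 0
for _ in range(trials):
    # Null true: both groups from the same distribution -> rejecting is a Type I error.
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
    if ttest_ind(a, b).pvalue < alpha:
        type1 += 1
    # Alternative true: real difference of 0.5 -> failing to reject is a Type II error.
    a, b = rng.normal(0, 1, n), rng.normal(0.5, 1, n)
    if ttest_ind(a, b).pvalue >= alpha:
        type2 += 1

print("estimated Type I rate :", type1 / trials)  # close to alpha (~0.05)
print("estimated Type II rate:", type2 / trials)  # depends on effect size and n
```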
  47. 47. Each inference technique (hypothesis test or confidence interval) you select has model assumptions. Different techniques have very different model assumptions. The validity of the technique depends on whether the model assumptions fit the context of the data being analyzed. • Common Mistakes Involving Model Assumptions • Using a two-sample test comparing means when cases are paired • Comparisons of treatments applied to people, animals, etc. (Intent to Treat; Comparisons involving Drop-outs) • Fixed vs Random Factors • Analyzing Data without Regard to How the Data was Collected • Dividing a Continuous Variable into Categories ("Chopped Data", Cohorts) • Pseudo-replication • Mistakes in Regression • Dealing with Missing Data Inappropriate model or research design Source: Martha Smith, University of Texas http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
  48. 48. Bad Math, Bad Geography Misrepresenting Data Serving the Presentation Without the Data Pie Charts – Why? Using Non-Solid Lines in a Line Chart Bar Charts with Erroneous Scale Arranging Data Non-Intuitively Obscuring Your Data, Making the Reader Do More Work Using Different Colors on a Heat Map Making it Hard to Compare Data, Showing Too Much Detail Not Explaining the Interactivity Keep It Simple Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Summary Common Analytic Mistakes
  49. 49. Visualization is a tool to aid analysis, not a substitute for analytical skill. It is not a substitute for statistics: Really understanding your data generally requires a combination of • analytical skills, • domain expertise, and • effort. Strategy: • Be careful about promising real insight. Work with a statistician or a domain expert if you need to offer reliable conclusions. • Small design decisions - the color palette you use, or how you represent a particular variable can skew the conclusions a visualization suggests. • If using visualizations for analysis, try a variety of options, rather than relying on a single view Visualization is not analysis
  50. 50. The infographic seems informative enough. The self-perceptions of “Baby Boomers” make for an interesting data set to visualize. The problem, however, is that this graphic represents 243% of responses. This is not necessarily indicative of faulty data; it is a poor representation of the data. Displaying the phrases individually, with size determined by percentage, would make the phrases easier to compare to each other. Removing the percentages from a single body also clarifies that the percentages aren’t mutually exclusive. Bad Math
  51. 51. Credibility? Gone in an instant. Bad Geography - Just Plain Dumb Presented without comment… speechless, how does this happen?
  52. 52. And don't do something like this... ever. Make sure all representations are accurate. For example, bubbles should be scaled according to area, not diameter. Misrepresenting Data
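A quick matplotlib sketch of the bubble-scaling point: the marker size parameter is an area, so scale it linearly with the value rather than with the value squared:

```python
import matplotlib.pyplot as plt

values = [10, 20, 40]    # quantities the bubbles represent
x = [1, 2, 3]

fig, (ax_wrong, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Wrong: making the *diameter* proportional to the value inflates big numbers,
# because perceived size (area) then grows with the square of the value.
ax_wrong.scatter(x, [1, 1, 1], s=[(v * 3) ** 2 for v in values])
ax_wrong.set_title("Diameter ~ value (misleading)")

# Right: matplotlib's `s` is the marker area (points^2), so scale it linearly.
ax_right.scatter(x, [1, 1, 1], s=[v * 30 for v in values])
ax_right.set_title("Area ~ value (accurate)")

plt.show()
```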
  53. 53. The graph is clearly trying to make the point that Obamacare enrollment numbers for March 27 are well below the March 31 goal. Take a closer look: how come the first bar is less than half the size of the second one? This is a common problem. Depending on the scale (there isn’t one in this example), the comparison among data points can look very deceiving. Misrepresenting Data (Why?) • If you’re working with data points that are really far apart or really close together, pay attention to the scale so that the data is accurately portrayed. • If there is no scale, compare the different slices, tiles, bubbles, or bars against each other. Do the numbers match up to them?
  54. 54. Which comes first: the presentation or the data? In an effort to make a more “interesting” or “cool” design, don’t allow the presentation layer of a visualization to become more important than the data itself. A considerable amount of work clearly went into this example, and there are parts that are informative, like the summary counters at the top left. However, without a scale or axis, the time series on the bottom right is meaningless, and the 3D chart in the center is even more opaque. Tooltips (pop-ups) would help, if they were there. Serving the Presentation Without the Data
  55. 55. They are useful on rare occasions. But most of the time they do not actually communicate anything of value. Pie Charts – Why? Beyond the obvious point made by the line graph in the background (we are storing more data now than we used to), this graph seems to tell us “…we don’t know if any of this matters, so we’re going to print everything.” Sometimes it is just better to use a table... really.
  56. 56. A common mistake with pie charts is to divide them into percentages that simply do not add up. The basic rule of a pie chart is that the sum of all percentages included should be 100%. • In this example, the percentages fall short of 100%, and the segment sizes do not match their values. This happens due to rounding errors or when non-mutually exclusive categories are plotted on the same chart. Unless the included categories are mutually exclusive, their percentages cannot be plotted as separate slices of the same chart. Pie Charts - Error in Chart Percentages Here is another gem…
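A trivial guard worth adding before any pie chart is drawn (segment names and values are hypothetical):

```python
# Sanity check before drawing a pie chart: slices must be mutually exclusive
# and sum to 100%. Segment names and values are made up for illustration.
segments = {"Organic": 46.0, "Direct": 28.0, "Referral": 18.0, "Other": 8.0}

total = sum(segments.values())
if abs(total - 100.0) > 0.5:   # allow a little rounding slack
    raise ValueError(f"Pie segments sum to {total}%, not 100% - rethink the chart")
```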
  57. 57. Always ask yourself when considering a potential design: Why is this better than a bar chart? If you’re visualizing a single quantitative measure over a single categorical dimension, there is rarely a better option. Pie Charts – Why? • Line charts are preferred when using time-based data • Scatter plots are best for exploring correlations between two linear measures. • Bubble charts support more data points with a wider range of values • Tree maps support hierarchical categories If you have to use a pie chart: • Don’t include more than five segments • Place the largest section at 12 o’clock, going clockwise. • Place the second largest section at 12 o’clock, going counterclockwise. • The remaining sections can be placed below, continuing counterclockwise.
  58. 58. Comparison is a valuable way to showcase differences, but it's useless if your viewer can’t easily compare. Make sure all data is presented in a way that allows the reader to compare data side-by-side. Making it Hard to Compare Data Is this clear to you?
  59. 59. • Using Non-Solid Lines in a Line Chart Dashed and dotted lines can be distracting. Instead, use a solid line and colors that are easy to distinguish from each other. • Making the Reader Do More Work Make it as easy as possible to understand data by aiding the reader with graphic elements. For example, add a trend line to a scatter plot to highlight trends. • Obscuring Your Data Make sure no data is lost or obstructed by design. For example, use transparency in a standard area chart to make sure the viewer can see all data • Using Different Colors on a Heat Map Some colors stand out more than others, giving unnecessary weight to that data. Instead, use a single color with varying shades or a spectrum between two analogous colors to show intensity. Other Common Mistakes
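A minimal sketch of the single-hue heat map advice, with random placeholder values:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((6, 8))      # hypothetical intensity values

# Use a single-hue colormap ("Blues") so darker simply means "more";
# a multi-color map would give some cells unintended visual weight.
fig, ax = plt.subplots()
im = ax.imshow(data, cmap="Blues")
fig.colorbar(im, ax=ax, label="Intensity")
plt.show()
```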
  60. 60. • Arranging the Data Non-Intuitively Content should be presented in a logical and intuitive way to guide readers through the data. Order categories alphabetically, sequentially, or by value. • Not Explaining Interactivity Enabling users to interact with a visualization makes it more engaging, but if you don’t tell them how to use that interactivity you risk limiting them to the initial view. How you label the interactivity is just as important as providing it in the first place. • Informing the user at the top of the visualization is good practice. • Call out the interaction on or near the tools that use it. • Using a common design concept, such as underlining words to indicate a hyperlink, is also helpful. See http://www.visualizing.org/full-screen/39118 Other Common Mistakes (cont’d)
  61. 61. Hard to resist… the temptation with a dataset that has numerous usable categorical and numerical fields is to show everything at once and allow users to drill down to the finest level of detail. Then the visualization is superfluous; the user could simply look at the dataset itself if they wanted to see the finest level of detail. Show enough detail to tell a story, but not so much that the story is convoluted and hidden. Showing Too Much Detail
  62. 62. We all want specific, relevant answers. The closer you can get to providing exactly what is wanted, the less effort we expend looking for answers. Irrelevant data makes finding the relevant information more difficult; irrelevant data is just noise. • Showing several closely related graphs can be a nice compromise between showing too much in one graph and not showing enough overall. • A few clean, clear graphs are better than a single complicated view. • Try to represent your data in the simplest way possible to avoid this. Showing Too Much Detail (cont’d)
  63. 63. Data visualization is about simplicity. Embellished or artistic representations may look appealing but usually distract from the actual data. Look at the example below. Keep it Simple • Why is the first image blue when the rest are red? • The number in the second image is against the paintbrush and not against the head, while in all other columns it is against the head. Is this meaningful? • We might just appreciate the different figures, think about the real-life characters represented by them, and move on without understanding the data. Really… • It’s important that the visual representation of data is free of the pitfalls that make it ambiguous and irrelevant.
  64. 64. Summary Common Analytic Mistakes Common Organizational Blunders Analytic Fundamentals Common Measurement Errors Statistical Mistakes Visualization Faults Thank You!
  65. 65. Thank You… Mr. Parnitzke is a hands-on technology executive, trusted partner, advisor, software publisher, and widely recognized database management and enterprise architecture thought leader. Over his career he has served in executive, technical, publisher (commercial software), and practice management roles across a wide range of industries. Now a highly sought-after technology management advisor and hands-on practitioner, his customers include many of the Fortune 500 as well as emerging businesses, where he is known for taking complex challenges and solving them across all levels of the customer’s organization, delivering distinctive value and lasting relationships. Contact: j.parnitzke@comcast.net Blogs: Applied Enterprise Architecture (pragmaticarchitect.wordpress.com) Essential Analytics (essentialanalytics.wordpress.com) The Corner Office (cornerofficeguy.wordpress.com) Data management professional (jparnitzke.wordpress.com) The program office (theprogramoffice.wordpress.com) Data Science Page (http://www.datasciencecentral.com/profile/JamesParnitzke)

Editor's notes

  • Mr. Parnitzke is a hands-on technology executive, trusted partner, advisor, software publisher, and widely recognized database management and enterprise architecture thought leader. Over his career he has served in executive, technical, publisher (commercial software), and practice management roles across a wide range of industries.
  • Retail Example:
    According to recent research by IDC Retail Insights, the omnichannel shopper is the gold standard consumer. An omnichannel shopper will:
    spend, on average, 15 percent to 30 percent more than someone using just one channel, and
    outspend simple multichannel shoppers by over 20 percent.

    What’s more, multichannel shoppers exhibit strong loyalty and are more likely to influence others to endorse a retailer.
    So, the hard questions are really in generating this kind of lift, and traditional simple focused marketing is not enough…
  • One reason analytics projects lose focus is they begin compromised. When it's time to decide what metrics a company should track, too many follow the conference table consensus approach. They worry more about consensus than about value and accuracy.
  • When budgets are tight and all are clamoring for better analytics, it's understandable that not everyone reads or fully comprehends the fine print associated with some vendors' "partnerships." In these models, the vendor may reserve ownership rights of the data, data aggregates, and the metadata derived from providing analytics services.

    It's not uncommon for companies to use subsidized analytics services to create aggregated research products they sell back to the marketplace (without compensation to publishers whose sites were harvested for the data).
  • Cleaning and formatting a single data set is hard enough, but what if you’re building a live visualization that will run with many different data sets? Do you have time to manually clean each data set? Your first instinct may be to grab some demo data and use that to build your visualization; your library may even come with standard sample data.
  • How come only 29 percent of the organizations have more than one person! That is bad. Wait. That did not make sense. Back to read the question. Then the graph. Then the legend. Then back to the question. Then the legend…
  • Measuring customer satisfaction by itself will not provide the best view forward.  Using a complete satisfaction measurement system – including future behaviors and predictive metrics such as likelihood to return to the site or likelihood to purchase again – generates leading indicators that complement and illuminate lagging indicators.
  • Simpson's Paradox disappears when causal relations are understood and accounted for in the analysis.
  • If you agree that increasing age (for school children) causes increasing foot size, and therefore increasing shoe size, then you expect a correlation between age and shoe size. Correlation is symmetric, so shoe size and age are correlated. But it would be absurd to say that shoe size causes age.
  • It is true that sampling randomly will eliminate systematic bias. This is the best plausible explanation that is acceptable to someone with little mathematical background. This statement could easily be misinterpreted as the myth above.
  • Example: An experiment designed to study whether an abstract or concrete approach works better for teaching abstract concepts used computer-delivered instruction. This was done to avoid confounding variables such as the effect of the teacher. However, the study then lacked ecological validity for most real-life classroom use in instruction.
  • Another example:
    Does an increase in tire pressure cause an increase in tread wear? What is the X? Tire Pressure. What is the Y? Tread Wear.
    State the Null (Ho) and Alternative (Ha) Hypothesis  The Null Hypothesis is "r=0" (there is no correlation) Null Hypothesis (Ho) = There is no relationship between pressure and tread wear Alt Hypothesis (Ha) = There is a relationship between pressure and tread wear
    Gather data, run the analysis, and determine the P-Value  Run a correlation (r=.554, p-Value = .0228).
    Determine the Alpha Risk The Confidence Interval was 95%, therefore the Alpha Risk is 5% (or 0.05)
    What does the P-Value tell us? (Reject or Accept the Null)  Reject the Null (because the P-Value (.0228) is lower than the Alpha Risk (0.05))
  • It's a central tenet of the field that data visualization can yield meaningful insight. While there’s a great deal of truth to this, it’s important to remember that visualization is a tool to aid analysis, not a substitute for analytical skill. It’s also not a substitute for statistics: your chart may highlight differences or correlations between data points, but to reliably draw conclusions from these insights often requires a more rigorous statistical approach. (The reverse can also be true - as Anscombe’s Quartet demonstrates, visualizations can reveal differences statistics hide.) Really understanding your data generally requires a combination of analytical skills, domain expertise, and effort. Don’t expect your visualizations to do this work for you, and make sure you manage the expectations of your clients and your CEO when creating or commissioning visualizations.

    Tools and strategies
    Unless you’re a data analyst, be very careful about promising real insight. Consider working with a statistician or a domain expert if you need to offer reliable conclusions
    Small design decisions - the color palette you use, or how you represent a particular variable - can skew the conclusions a visualization suggests. If you’re using visualizations for analysis, try a variety of options, rather than relying on a single view

    Stephen Few’s Now You See It offers a good practical introduction to using visualization for business analysis, including suggestions for developers on how to design analytically-valid visualization tools
  • You don’t have to be a mathematics major to see what is wrong with an aggregate response of 243%.
