carlos@bueno.org
How to Sample Data Like a Pro

Editor's Notes

  1. I’ve been doing performance and optimization work at Facebook for about three years now, and off & on at other companies before that. I didn’t invent anything I’m going to talk about today; it’s just a collection of ideas that seem to work. The strategy is pretty simple, almost stupidly simple. But statistical thinking does not come naturally. You have to practice, watch your biases, and reevaluate constantly to make sure you aren’t fooling yourself.
  2. The Elements of Programming Style was written 5 years before I was born, and this rule is as true now as it was then.
  3. If you went to university you probably saw a graph like this. Complexity! Big O! For large values of N, complexity dominates. This is a good lesson, and important to know. You should be solving your complexity problems when you’re still in the whiteboard stage. But the way this lesson is taught is terrible for two reasons.
  4. One is that code doesn’t execute on whiteboards. It executes on computers, and computers aren’t free.
  5. If you are the one writing checks to Amazon, which line would you rather be on?
  6. The second problem: it’s easy to convince yourself that with sufficient cleverness you can predict how a system will behave.
  7. This is absolutely not true. You can’t predict in fine detail how a complicated program will behave. That would be equivalent to solving the Halting Problem. Or it means you can run the entire program in your head, in which case you don’t need a computer at all. Right? Everything else in performance work, and I mean everything, follows as a consequence of this one fact. You are not psychic. You cannot predict the behavior of a large system. Therefore you have to observe how it does behave. And the quality and accuracy of your measurements determine success.
  8. That means the strategy for perf is very simple.
  9. One thing I’ve found very useful, and that the industry as a whole is moving towards, is the idea of keeping raw samples around for ad-hoc analysis [cf. mbostock’s Cube]. This follows directly from the “no clairvoyance” rule. Instead of guessing ahead of time which averages and metrics will be important, you generate them ad hoc. Cube is a system for collecting timestamped events and deriving metrics. By collecting events rather than metrics, Cube lets you compute aggregate statistics post hoc. It also enables richer analysis, such as quantiles and histograms of arbitrary event sets.
  10. It sounds kind of stupid when you say this out loud. Of course you want to measure the thing you want to improve. But sometimes you might spend a lot of effort simulating traffic or doing “load tests”. That’s fine if you have nothing better to do. But live traffic is best because that’s actually what you are trying to improve. This is also a corollary of the “You are not psychic” rule. It’s just as hard to predict how users will behave as it is to predict how computers will.
  11. If some measurement is important enough to worry about, it’s important enough to monitor all the time. I can’t remember how many times I’ve noticed something wrong and diagnosed it by looking at historical data. If you are manually spot-checking you’re missing way too many things.
  12. Just like you are not supposed to proofread your own words, you shouldn’t vet your own measurements. It’s way too easy to believe numbers you’ve made just because, well, they are numbers and you made them. You have to cultivate a healthy suspicion and always be ready to state the methodology that produced them. The only way I know to avoid fooling yourself too often is to be scientific about it. Write down and test every single assumption that goes into a measurement. Keep it simple. Get a friend to knock holes in it. A really good technique is to have some overlap in your tools, preferably using different measurement methods, and cross-check. At the very least you’ll catch implementation bugs.
  13. These are pretty hard to root out. A good example that came up a while ago at Facebook was CPU time. It turns out that CPU time is a terrible measurement of the work performed by modern CPUs. As system utilization goes up, more time is spent on the paperwork of context-switching, so 1ms at peak is not the same as 1ms at the trough. Also, obviously, as time goes on you will have different generations of machines in the fleet running at different clock speeds. And then there is “TurboBoost”, which automatically speeds up and slows down the CPU, just to make life interesting. All of that leads to nonsense measurements of “work performed”. After a while it was clear that we couldn’t remove or compensate for the confounding factors in CPU time, so we switched to CPU instructions instead. This has its own problems, but it’s much more stable and invariant to load. Another kind of confounding factor is transactions that are not actually similar but are grouped together. A classic example is averaging hits that come from logged-in users and logged-out users. In general you give logged-in users a richer experience. It’s actually a different algorithm being run, with different performance characteristics, so you should group them separately. Or perhaps you have a search page which sometimes returns 10 results per page and sometimes 30.
  14. You never know what will correlate with a performance regression. Since we’re keeping the raw data around and admitting we’re not clairvoyant, it makes sense to record a lot of information about each hit, just in case.
  15. Sums and averages are great, but deceptive. Absent anything fancy, you can check how much to trust them with stddev/avg, also known as the Noise-To-Signal Ratio or the Relative Standard Deviation. Another good one is to take the geometric mean. The NSR is a pretty good test of both variability and confounding factors. If your NSR is above 0.5 you either have a wildly variable system, or the distribution of hits has more than one peak, which means you should subdivide further.
  16. If at all possible, look at histograms. Even the mighty stddev() will fail if your distribution is multi-modal or otherwise not bell-shaped.
  17. Once you start getting a lot of traffic, the work of storing and analyzing a huge set of data can become alarming. Eventually you’ll realize your analysis system will have to be almost as big as the production system it measures. The good news is that you can sample the data down. The statistical power of 100 million data points isn’t much greater than 10 million, in the same way (and for the same reason) that there isn’t a 4x difference in quality between 128kbps audio and 512kbps. You’ll get a lot more bang for your buck by regularly marking and separating your samples into homogeneous groups with low variability. You only need enough samples at your most granular level of detail to represent that variability. For example, say that you draw timeseries graphs where each point on the line represents ten minutes of data. Each line represents a different URL, like /home.php or /posts.php. As a rough rule of thumb you want a thousand or more samples per point per line. If you want to get really clever you can vary the sample rate dynamically to ensure that you always log a consistent amount. This could help at night or during periods of low traffic. But remember to record the sample rate and weight your metrics! [But then how do you tackle extreme outliers, etc.... can leave that for Q&A if it comes up. TLDR: Nyquist.]
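
Notes 13, 15 and 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling described in the talk: the traffic is synthetic, and every name here (noise_to_signal, sample_hits, estimated_total, weighted_mean, the "ms" and "sample_rate" fields) is hypothetical. The first part shows the NSR flagging a mix of logged-in and logged-out hits and dropping once they are split. The second part samples busy daytime traffic at a lower rate than quiet nighttime traffic, records the rate on each sample, and weights by 1/rate when computing metrics.

```python
import math
import random


def noise_to_signal(values):
    """Return stddev / mean (the relative standard deviation) of values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


def sample_hits(hits, sample_rate, rng):
    """Keep each hit with probability sample_rate; record the rate on the sample."""
    return [{**hit, "sample_rate": sample_rate}
            for hit in hits if rng.random() < sample_rate]


def estimated_total(samples):
    """Estimate the original hit count: each sample stands for 1/rate hits."""
    return sum(1.0 / s["sample_rate"] for s in samples)


def weighted_mean(samples, field):
    """Mean of a field, weighting each sample by 1 / its recorded sample rate."""
    weights = [1.0 / s["sample_rate"] for s in samples]
    total = sum(w * s[field] for w, s in zip(weights, samples))
    return total / sum(weights)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Part 1: two kinds of hit with different latency profiles (synthetic numbers).
logged_in = [rng.gauss(300, 30) for _ in range(20000)]
logged_out = [rng.gauss(60, 6) for _ in range(20000)]
nsr_mixed = noise_to_signal(logged_in + logged_out)  # two peaks lumped together
nsr_in = noise_to_signal(logged_in)
nsr_out = noise_to_signal(logged_out)

# Part 2: busy daytime traffic sampled at 5%, quiet nighttime traffic at 100%.
day_hits = [{"ms": rng.gauss(200, 20)} for _ in range(40000)]
night_hits = [{"ms": rng.gauss(100, 10)} for _ in range(2000)]
samples = sample_hits(day_hits, 0.05, rng) + sample_hits(night_hits, 1.0, rng)

true_total = len(day_hits) + len(night_hits)
true_mean = sum(h["ms"] for h in day_hits + night_hits) / true_total
naive_mean = sum(s["ms"] for s in samples) / len(samples)  # ignores the rates
est_total = estimated_total(samples)
est_mean = weighted_mean(samples, "ms")

print(f"NSR mixed={nsr_mixed:.2f} logged-in={nsr_in:.2f} logged-out={nsr_out:.2f}")
print(f"hits: true={true_total} estimated={est_total:.0f}")
print(f"mean ms: true={true_mean:.1f} weighted={est_mean:.1f} naive={naive_mean:.1f}")
```

The unweighted mean over-represents the nighttime hits because they were kept at a higher rate. Weighting by the recorded rate brings the estimate back in line with the full data set, which is why the rate has to be stored with every sample.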