I’ve been doing performance and optimization work at Facebook for about three years now, and off & on at other companies before that. I didn’t invent anything I’m going to talk about today; it’s just a collection of ideas that seem to work. The strategy is pretty simple, almost stupidly simple. But statistical thinking does not come naturally. You have to practice, watch your biases, and reevaluate constantly to make sure you aren’t fooling yourself.
The Elements of Programming Style was written 5 years before I was born, and this rule is as true now as it was then.
If you went to university you probably saw a graph like this. Complexity! Big O! For large values of N, complexity dominates. This is a good lesson, and important to know. You should be solving your complexity problems when you’re still in the whiteboard stage. But the way this lesson is taught is terrible for two reasons.
One is that code doesn’t execute on whiteboards. It executes on computers, and computers aren’t free.
If you are the one writing checks to Amazon, which line would you rather be on?
The second problem: it’s easy to convince yourself that with sufficient cleverness you can predict how a system will behave.
This is absolutely not true. You can’t predict in fine detail how a complicated program will behave. That would be equivalent to solving the Halting Problem. Or it means you can run the entire program in your head, in which case you don’t need a computer at all. Right? Everything else in performance work, everything, follows as a consequence of this one fact. You are not psychic. You cannot predict the behavior of a large system. Therefore it follows that you have to observe how it does behave. And the quality and accuracy of your measurements determine success.
That means the strategy for perf is very simple.
One thing I’ve found very useful, and the industry as a whole is moving towards, is the idea of keeping raw samples around for ad-hoc analysis. [cf mbostock’s Cube] This follows directly from the “no clairvoyance” rule. Instead of guessing ahead of time which averages and metrics will be important, you generate them ad-hoc. Cube is a system for collecting timestamped events and deriving metrics. By collecting events rather than metrics, Cube lets you compute aggregate statistics post hoc. It also enables richer analysis, such as quantiles and histograms of arbitrary event sets.
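The events-not-metrics idea can be sketched in a few lines. This is a toy illustration in the spirit of Cube, not its actual API; the event shape and the `metric` helper are hypothetical.

```python
import statistics
from datetime import datetime, timezone

# Hypothetical raw events in the Cube spirit: timestamped facts, not
# pre-aggregated metrics. Keep the facts; derive the metrics later.
events = [
    {"time": datetime(2013, 5, 1, 12, 0, tzinfo=timezone.utc),
     "type": "request", "data": {"url": "/home.php", "ms": 184}},
    {"time": datetime(2013, 5, 1, 12, 1, tzinfo=timezone.utc),
     "type": "request", "data": {"url": "/home.php", "ms": 203}},
    {"time": datetime(2013, 5, 1, 12, 2, tzinfo=timezone.utc),
     "type": "request", "data": {"url": "/posts.php", "ms": 977}},
]

def metric(events, event_type, field, agg):
    """Derive a metric after the fact from raw events."""
    values = [e["data"][field] for e in events if e["type"] == event_type]
    return agg(values)

# Any aggregate you think of later runs over the same raw data:
print(metric(events, "request", "ms", statistics.mean))    # average latency
print(metric(events, "request", "ms", max))                # worst case
print(metric(events, "request", "ms", statistics.median))  # p50
```

The point is that the decision about *which* statistic matters is deferred until you have a question, instead of being baked in at collection time.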
It sounds kind of stupid when you say this out loud. Of course you want to measure the thing you want to improve. But sometimes you might spend a lot of effort simulating traffic or doing “load tests”. That’s fine if you have nothing better to do. But live traffic is best because that’s actually what you are trying to improve. This is also a corollary of the “You are not psychic” rule. It’s just as hard to predict how users will behave as it is to predict how computers will.
If some measurement is important enough to worry about, it’s important enough to monitor all the time. I can’t remember how many times I’ve noticed something wrong and diagnosed it by looking at historical data. If you are manually spot-checking you’re missing way too many things.
Just like you are not supposed to proofread your own words, you shouldn’t vet your own measurements. It’s way too easy to believe numbers you’ve made just because, well, they are numbers and you made them. You have to cultivate a healthy suspicion and always be ready to state the methodology that produced them. The only way I know to avoid fooling yourself too often is to be scientific about it. Write down and test every single assumption that goes into a measurement. Keep it simple. Get a friend to knock holes in it. A really good technique is to have some overlap in your tools, preferably using different measurement methods, and cross-check. At the very least you’ll catch implementation bugs.
These are pretty hard to root out. A good example that came up a while ago at Facebook was CPU time. It turns out that CPU time is a terrible measurement of the work performed by modern CPUs. As system utilization goes up, more time is spent on the paperwork of context-switching, so 1ms at peak is not the same as 1ms at the trough. Also, obviously, as time goes on you will have different generations of machines in the fleet running at different clock speeds. And then there is “Turbo Boost”, which automatically speeds up and slows down the CPU, just to make life interesting. All of that leads to nonsense measurements of “work performed”. After a while it was clear that we couldn’t remove or compensate for the confounding factors in CPU time, and instead switched to CPU instructions. This has its own problems but it’s much more stable and invariant to load. Another kind of confounding factor is transactions which are not actually similar, but are grouped together. A classic example is averaging together hits from logged-in users and logged-out users. In general you give logged-in users a richer experience. It’s actually a different algorithm being run, with different performance characteristics, so you should group them separately. Or perhaps you have a search page which sometimes returns 10 results per page and sometimes 30.
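Here is what that blending looks like with made-up numbers. The hit records and their latencies are hypothetical; the point is only that a mixed average describes an experience no user actually had.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hits: logged-in users run a richer (slower) code path.
hits = [
    {"logged_in": True,  "ms": 900},
    {"logged_in": True,  "ms": 1100},
    {"logged_in": False, "ms": 100},
    {"logged_in": False, "ms": 120},
    {"logged_in": False, "ms": 80},
]

# One blended average describes nobody's experience:
print(mean(h["ms"] for h in hits))  # 460 -- no user actually saw this

# Grouped by the algorithm actually being run, the picture is clear:
groups = defaultdict(list)
for h in hits:
    groups[h["logged_in"]].append(h["ms"])
for logged_in, times in groups.items():
    print(logged_in, mean(times))  # True ~1000ms, False ~100ms
```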
You never know what will correlate with a performance regression. Since we’re keeping the raw data around and admitting we’re not clairvoyant, it makes sense to record a lot of information about each hit, just in case.
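Concretely, a per-hit record might look something like the sketch below. Every field name here is hypothetical; the idea is just to log more dimensions than you currently have a use for, so a future regression can be sliced by any of them.

```python
# A hypothetical per-hit record. Most of these fields will never matter.
# The one that does, you can't predict -- so record them all.
hit = {
    "timestamp": 1367409600,
    "url": "/home.php",
    "ms": 184,
    "cpu_instructions": 91_000_000,
    "logged_in": True,
    "server": "web042",
    "build": "deadbeef",       # code version serving the request
    "datacenter": "ash",
    "sample_rate": 100,        # 1-in-100 sampling; needed to weight metrics later
}
```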
Sums and averages are great, but deceptive. Absent anything fancy you can use stddev/avg, also known as the Noise-To-Signal Ratio or the Relative Standard Deviation. Another good one is to take the geometric mean. The NSR is a pretty good test of both variability and confounding factors. If your NSR is above 0.5 you either have a wildly variable system, or the distribution of hits has more than one peak, which means you should subdivide further.
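Both checks are one-liners. A minimal sketch, with made-up sample sets chosen to show a quiet distribution versus a lumped-together one:

```python
from statistics import mean, stdev
from math import prod

def nsr(samples):
    """Noise-to-signal ratio (relative standard deviation): stddev / mean."""
    return stdev(samples) / mean(samples)

def geomean(samples):
    """Geometric mean: less swayed by a few huge outliers than the average."""
    return prod(samples) ** (1 / len(samples))

quiet = [100, 105, 95, 102, 98]          # one tight peak
mixed = [100, 105, 95, 1000, 1020, 980]  # two peaks lumped together

print(nsr(quiet))  # well under 0.5: probably one kind of transaction
print(nsr(mixed))  # well over 0.5: subdivide further
```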
If at all possible, look at histograms. Even the mighty stddev() will fail if your distribution is multi-modal or otherwise not bell-shaped.
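A quick illustration of why, using invented latencies with two modes (think cache hits versus cache misses):

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical latencies: cache hits near 10ms, cache misses near 200ms.
samples = [9, 10, 11, 10, 12, 198, 201, 205, 199, 202]

# The summary statistics describe a bell curve that doesn't exist:
print(mean(samples), stdev(samples))  # mean ~106ms -- a latency nobody saw

# A crude text histogram (50ms buckets) makes the two modes obvious:
buckets = Counter((s // 50) * 50 for s in samples)
for lo in sorted(buckets):
    print(f"{lo:4d}-{lo + 49:<4d} {'#' * buckets[lo]}")
```

The mean sits in a valley between the two peaks, where there are no samples at all; only the histogram shows that.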
Once you start getting a lot of traffic, the work of storing and analyzing a huge set of data can become alarming. Eventually you’ll realize your analysis system will have to be almost as big as the production system it measures. The good news is that you can sample the data down. The statistical power of 100 million data points isn’t much greater than 10 million, in the same way (and for the same reason) that there isn't a 4x difference in quality between 128kbps audio and 512. You’ll get a lot more bang for your buck by regularly marking and separating your samples into homogeneous groups with low variability. You only need enough samples at your most granular level of detail to represent that variability. For example, say that you draw timeseries graphs where each point on the line represents ten minutes of data. Each line represents a different URL, like /home.php or /posts.php. As a rough rule of thumb you want a thousand or more samples per point per line. If you want to get really clever you can vary the sample rate dynamically to ensure that you always log a consistent amount. This could help at night or during other periods of low traffic. But remember to record the sample rate and weight your metrics! [But then how do you tackle extreme outliers, etc.... can leave that for Q & A if it comes up. TLDR: Nyquist.]
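The weighting step is easy to get wrong, so here is a sketch. It assumes each logged record carries the rate it was sampled at (1-in-N); the records themselves are made up.

```python
# Hypothetical sampled records, each tagged with its own sample rate.
records = [
    {"ms": 120, "sample_rate": 10},  # daytime: 1-in-10 sampling
    {"ms": 130, "sample_rate": 10},
    {"ms": 90,  "sample_rate": 2},   # night: sample more of the sparse traffic
]

# Each record stands in for `sample_rate` real hits, so weight by it.
total_hits = sum(r["sample_rate"] for r in records)
weighted_avg = sum(r["ms"] * r["sample_rate"] for r in records) / total_hits

print(total_hits)    # estimated real traffic: 22 hits
print(weighted_avg)  # ~121.8ms, versus a naive unweighted mean of ~113.3ms
```

Without the weights, the heavily-sampled nighttime records would count the same as the daytime ones and drag the average toward whatever the low-traffic period looks like.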