I’ve been doing performance and optimization work at Facebook for about three years now, and off & on at other companies before that. I didn’t invent anything I’m going to talk about today; it’s just a collection of ideas that seem to work. The strategy is pretty simple, almost stupidly simple. But statistical thinking does not come naturally. You have to practice, watch your biases, and reevaluate constantly to make sure you aren’t fooling yourself.
The Elements of Programming Style was written 5 years before I was born, and this rule is as true now as it was then.
If you went to university you probably saw a graph like this. Complexity! Big O! For large values of N, complexity dominates. This is a good lesson, and important to know. You should be solving your complexity problems when you’re still in the whiteboard stage. But the way this lesson is taught is terrible for two reasons.
One is that code doesn’t execute on whiteboards. It executes on computers, and computers aren’t free.
If you are the one writing checks to Amazon, which line would you rather be on?
The second problem: it’s easy to convince yourself that with sufficient cleverness you can predict how a system will behave.
This is absolutely not true. You can’t predict in fine detail how a complicated program will behave. That would be equivalent to solving the Halting Problem. Or it means you can run the entire program in your head, in which case you don’t need a computer at all. Right? Everything else in performance work, everything, follows as a consequence of this one fact. You are not psychic. You cannot predict the behavior of a large system. Therefore it follows that you have to observe how it does behave. And the quality and accuracy of your measurements determine success.
That means the strategy for perf is very simple.
One thing I’ve found very useful, and the industry as a whole is moving towards, is the idea of keeping raw samples around for ad-hoc analysis. [cf mbostock’s Cube] This follows directly from the “no clairvoyance” rule. Instead of guessing ahead of time which averages and metrics will be important, you generate them ad-hoc. Cube is a system for collecting timestamped events and deriving metrics. By collecting events rather than metrics, Cube lets you compute aggregate statistics post hoc. It also enables richer analysis, such as quantiles and histograms of arbitrary event sets.
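The events-not-metrics idea can be sketched in a few lines. This is a toy illustration in the spirit of Cube, not its actual API; the event shape and the `metric` helper are hypothetical.

```python
import statistics
from datetime import datetime, timezone

# Hypothetical raw events in the Cube spirit: timestamped facts, not
# pre-aggregated metrics. Keep the facts; derive the metrics later.
events = [
    {"time": datetime(2013, 5, 1, 12, 0, tzinfo=timezone.utc),
     "type": "request", "data": {"url": "/home.php", "ms": 184}},
    {"time": datetime(2013, 5, 1, 12, 1, tzinfo=timezone.utc),
     "type": "request", "data": {"url": "/home.php", "ms": 203}},
    {"time": datetime(2013, 5, 1, 12, 2, tzinfo=timezone.utc),
     "type": "request", "data": {"url": "/posts.php", "ms": 977}},
]

def metric(events, event_type, field, agg):
    """Derive a metric after the fact from raw events."""
    values = [e["data"][field] for e in events if e["type"] == event_type]
    return agg(values)

# Any aggregate you think of later runs over the same raw data:
print(metric(events, "request", "ms", statistics.mean))    # average latency
print(metric(events, "request", "ms", max))                # worst case
print(metric(events, "request", "ms", statistics.median))  # p50
```

The point is that the decision about *which* statistic matters is deferred until you have a question, instead of being baked in at collection time.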
It sounds kind of stupid when you say this out loud. Of course you want to measure the thing you want to improve. But sometimes you might spend a lot of effort simulating traffic or doing “load tests”. That’s fine if you have nothing better to do. But live traffic is best because that’s actually what you are trying to improve. This is also a corollary of the “You are not psychic” rule. It’s just as hard to predict how users will behave as it is to predict how computers will.
If some measurement is important enough to worry about, it’s important enough to monitor all the time. I can’t remember how many times I’ve noticed something wrong and diagnosed it by looking at historical data. If you are manually spot-checking you’re missing way too many things.
Just like you are not supposed to proofread your own words, you shouldn’t vet your own measurements. It’s way too easy to believe numbers you’ve made just because, well, they are numbers and you made them. You have to cultivate a healthy suspicion and always be ready to state the methodology that produced them. The only way I know to avoid fooling yourself too often is to be scientific about it. Write down and test every single assumption that goes into a measurement. Keep it simple. Get a friend to knock holes in it. A really good technique is to have some overlap in your tools, preferably using different measurement methods, and cross-check. At the very least you’ll catch implementation bugs.
These are pretty hard to root out. A good example that came up a while ago at Facebook was CPU time. It turns out that CPU time is a terrible measurement of the work performed by modern CPUs. As system utilization goes up, more time is spent on the paperwork of context-switching, so 1ms at peak is not the same as 1ms at the trough. Also, obviously, as time goes on you will have different generations of machines in the fleet running at different clock speeds. And then there is “Turbo Boost”, which automatically speeds up and slows down the CPU, just to make life interesting. All of that leads to nonsense measurements of “work performed”. After a while it was clear that we couldn’t remove or compensate for the confounding factors in CPU time, and instead switched to CPU instructions. This has its own problems but it’s much more stable and invariant to load. Another kind of confounding factor is transactions which are not actually similar, but are grouped together. A classic example is averaging together hits from logged-in users and logged-out users. In general you give logged-in users a richer experience. It’s actually a different algorithm being run, with different performance characteristics, so you should group them separately. Or perhaps you have a search page which sometimes returns 10 results per page and sometimes 30.
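Here is what that blending looks like with made-up numbers. The hit records and their latencies are hypothetical; the point is only that a mixed average describes an experience no user actually had.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hits: logged-in users run a richer (slower) code path.
hits = [
    {"logged_in": True,  "ms": 900},
    {"logged_in": True,  "ms": 1100},
    {"logged_in": False, "ms": 100},
    {"logged_in": False, "ms": 120},
    {"logged_in": False, "ms": 80},
]

# One blended average describes nobody's experience:
print(mean(h["ms"] for h in hits))  # 460 -- no user actually saw this

# Grouped by the algorithm actually being run, the picture is clear:
groups = defaultdict(list)
for h in hits:
    groups[h["logged_in"]].append(h["ms"])
for logged_in, times in groups.items():
    print(logged_in, mean(times))  # True ~1000ms, False ~100ms
```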
You never know what will correlate with a performance regression. Since we’re keeping the raw data around and admitting we’re not clairvoyant, it makes sense to record a lot of information about each hit, just in case.
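Concretely, a per-hit record might look something like the sketch below. Every field name here is hypothetical; the idea is just to log more dimensions than you currently have a use for, so a future regression can be sliced by any of them.

```python
# A hypothetical per-hit record. Most of these fields will never matter.
# The one that does, you can't predict -- so record them all.
hit = {
    "timestamp": 1367409600,
    "url": "/home.php",
    "ms": 184,
    "cpu_instructions": 91_000_000,
    "logged_in": True,
    "server": "web042",
    "build": "deadbeef",       # code version serving the request
    "datacenter": "ash",
    "sample_rate": 100,        # 1-in-100 sampling; needed to weight metrics later
}
```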
Sums and averages are great, but deceptive. Absent anything fancy you can use stddev/avg, also known as the Noise-To-Signal Ratio or the Relative Standard Deviation. Another good one is to take the geometric mean. The NSR is a pretty good test of both variability and confounding factors. If your NSR is above 0.5 you either have a wildly variable system, or the distribution of hits has more than one peak, which means you should subdivide further.
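Both checks are one-liners. A minimal sketch, with made-up sample sets chosen to show a quiet distribution versus a lumped-together one:

```python
from statistics import mean, stdev
from math import prod

def nsr(samples):
    """Noise-to-signal ratio (relative standard deviation): stddev / mean."""
    return stdev(samples) / mean(samples)

def geomean(samples):
    """Geometric mean: less swayed by a few huge outliers than the average."""
    return prod(samples) ** (1 / len(samples))

quiet = [100, 105, 95, 102, 98]          # one tight peak
mixed = [100, 105, 95, 1000, 1020, 980]  # two peaks lumped together

print(nsr(quiet))  # well under 0.5: probably one kind of transaction
print(nsr(mixed))  # well over 0.5: subdivide further
```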
If at all possible, look at histograms. Even the mighty stddev() will fail if your distribution is multi-modal or otherwise not bell-shaped.
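A quick illustration of why, using invented latencies with two modes (think cache hits versus cache misses):

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical latencies: cache hits near 10ms, cache misses near 200ms.
samples = [9, 10, 11, 10, 12, 198, 201, 205, 199, 202]

# The summary statistics describe a bell curve that doesn't exist:
print(mean(samples), stdev(samples))  # mean ~106ms -- a latency nobody saw

# A crude text histogram (50ms buckets) makes the two modes obvious:
buckets = Counter((s // 50) * 50 for s in samples)
for lo in sorted(buckets):
    print(f"{lo:4d}-{lo + 49:<4d} {'#' * buckets[lo]}")
```

The mean sits in a valley between the two peaks, where there are no samples at all; only the histogram shows that.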
Once you start getting a lot of traffic, the work of storing and analyzing a huge set of data can become alarming. Eventually you’ll realize your analysis system will have to be almost as big as the production system it measures. The good news is that you can sample the data down. The statistical power of 100 million data points isn’t much greater than 10 million, in the same way (and for the same reason) that there isn't a 4x difference in quality between 128kbps audio and 512. You’ll get a lot more bang for your buck by regularly marking and separating your samples into homogeneous groups with low variability. You only need enough samples at your most granular level of detail to represent that variability. For example, say that you draw timeseries graphs where each point on the line represents ten minutes of data. Each line represents a different URL, like /home.php or /posts.php. As a rough rule of thumb you want a thousand or more samples per point per line. If you want to get really clever you can vary the sample rate dynamically to ensure that you always log a consistent amount. This could help at night or during other periods of low traffic. But remember to record the sample rate and weight your metrics! [But then how do you tackle extreme outliers, etc.... can leave that for Q & A if it comes up. TLDR: Nyquist.]
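The weighting step is easy to get wrong, so here is a sketch. It assumes each logged record carries the rate it was sampled at (1-in-N); the records themselves are made up.

```python
# Hypothetical sampled records, each tagged with its own sample rate.
records = [
    {"ms": 120, "sample_rate": 10},  # daytime: 1-in-10 sampling
    {"ms": 130, "sample_rate": 10},
    {"ms": 90,  "sample_rate": 2},   # night: sample more of the sparse traffic
]

# Each record stands in for `sample_rate` real hits, so weight by it.
total_hits = sum(r["sample_rate"] for r in records)
weighted_avg = sum(r["ms"] * r["sample_rate"] for r in records) / total_hits

print(total_hits)    # estimated real traffic: 22 hits
print(weighted_avg)  # ~121.8ms, versus a naive unweighted mean of ~113.3ms
```

Without the weights, the heavily-sampled nighttime records would count the same as the daytime ones and drag the average toward whatever the low-traffic period looks like.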