Tips on how to improve how you use your CDN. Condensed from a lot of material, this talk was crammed into 20 minutes.
More info available at http://mikebrittain.com.
Thanks. So earlier this year I started kicking around ideas for a talk about CDNs, and I thought, “How about I go through an outline of everything I’ve learned about CDNs over the last 4 years?”
And then I got an email from Steve and Jesse saying… That sounds great!
You’ve got 20 minutes. How’s that sound?
And I said, “Perfect”. So this is the short version. I’m going to quickly talk about just a few topics that hopefully will be of use to you. I’m going to start with a little groundwork.
In a typical setup for your CDN, your user makes requests for HTML pages directly to your web server. That page includes references to various other assets, like images, stylesheets, and javascript.
Those assets are loaded from your CDN. Files are delivered faster from the CDN because its servers are closer to your end user. Assuming that your CDN bandwidth is cheaper than bandwidth at your datacenter, you stand to save money by delivering content through the CDN. How do your files get onto the CDN?
Traditionally, you would push your files onto the CDN’s storage network, where they are available to all cache servers. If you have frequently changing files, or are handling user-generated content, this can become a hassle. Additionally, you get billed for storage on the CDN.
Instead, I prefer to use a service called origin-pull. This is CDN lingo for reverse-proxy. As users make requests for files from the CDN, the CDN fetches those files from your webserver (the origin) and then returns them to the end user. The CDN caches those files to handle subsequent requests without having to re-fetch from your server. The benefit is that you simply serve files from your origin server as you would if the CDN wasn’t in place. It’s like a transparent layer between you and your user.
Now, you control your content in the CDN by setting response headers. You should already recognize these because you already use them to control content in the browser’s cache. They work the same for CDNs. An important point to be made about these headers is that they are pretty simple to configure for static files. This happens in your web server config. But for dynamic content, your application code is responsible for generating these headers. Let’s look at an example.
This is the conversation that the CDN has with your server to re-fetch a file that has expired from cache. The CDN says, “I need a new copy of global.css, but only if it’s different than the version I already have.” Your origin server either says, “OK, here’s a new 35 K file”, or “no problem, just keep serving the latest version, which you already have in storage.” (ZERO K). If you’re not handling this negotiation in your application code, you’re wasting bandwidth, money, and time to re-send this file to the CDN every time it expires from the cache. So make sure you are handling revalidation properly for dynamic content.
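To make that negotiation concrete, here is a minimal sketch of how application code might answer a revalidation request for dynamic content. It is not tied to any framework: the function names (make_etag, respond), the choice of a SHA-1 hash of the body as the ETag, and the 300-second max-age are all assumptions for illustration. It only handles the If-None-Match header and ignores weak validators and If-Modified-Since.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a validator from the response body (assumption: a content hash)."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'


def respond(request_headers: dict, body: bytes, max_age: int = 300):
    """Return (status, headers, payload) for a GET of generated content.

    If the CDN already holds the current version, answer 304 with an
    empty payload instead of re-sending the whole file.
    """
    etag = make_etag(body)
    headers = {
        "ETag": etag,
        "Cache-Control": f"public, max-age={max_age}",
    }

    # If-None-Match may carry several validators separated by commas, or "*".
    sent = request_headers.get("If-None-Match", "")
    candidates = [token.strip() for token in sent.split(",") if token.strip()]
    if etag in candidates or "*" in candidates:
        return 304, headers, b""  # "keep serving the version you have"

    headers["Content-Length"] = str(len(body))
    return 200, headers, body  # "here's a new copy"


if __name__ == "__main__":
    css = b"body { margin: 0 }"
    status, headers, _ = respond({}, css)
    print(status, headers["ETag"])
    status, _, payload = respond({"If-None-Match": headers["ETag"]}, css)
    print(status, len(payload))
```

The first call is the CDN's initial fetch, and the second is the re-fetch after expiry, which costs zero bytes of body as long as the content has not changed.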
You should periodically review your origin server logs for any files that are being requested too often. These generally stick out pretty badly. In this example, three of the files are being cached and one is not: an unexpected file type that was added by our editors. With origin-pull, server logs are a good place to find misconfigured headers. So is this a problem? The file is still being delivered to the user. BUT, we’re double-paying for delivery.
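A log review like this can be partly automated. The sketch below counts how often each path is requested from the origin and lists the ones above a threshold. It assumes access logs where the request appears as a quoted "GET /path HTTP/1.1" string (as in the common log format); the function names and the threshold are hypothetical, and a sensible threshold depends on your TTLs and on how many CDN cache servers pull from you.

```python
import re
from collections import Counter

# Matches the quoted request portion of an access log line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def count_requests(log_lines):
    """Count origin requests per path."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def frequently_fetched(log_lines, threshold):
    """Return (path, count) pairs requested more than `threshold` times,
    most frequent first. These are candidates for broken caching headers."""
    counts = count_requests(log_lines)
    return [(path, n) for path, n in counts.most_common() if n > threshold]


if __name__ == "__main__":
    import sys

    # Usage: python review_logs.py access.log
    with open(sys.argv[1]) as log_file:
        for path, n in frequently_fetched(log_file, threshold=100):
            print(n, path)
```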
See … we are paying for bandwidth to deliver from the CDN to the user, but we’re also paying for bandwidth from the origin server to the CDN. These prices are arbitrary, but they show a common scenario where CDN bandwidth is cheaper than origin bandwidth. It costs us more than double what it normally should to serve this from the CDN.
So we go back and fix the caching headers for powerpoint files. Assuming a ratio of 200 cache hits to 1 miss, we see that the price of origin bandwidth becomes almost negligible. So look through your server logs, find bad caching, and avoid double paying for delivery.
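The arithmetic behind double paying is easy to check. This sketch uses made-up prices ($0.10 per GB on the CDN and $0.15 per GB at the origin, both assumptions chosen only so that CDN bandwidth is cheaper than origin bandwidth) and computes the effective cost per GB delivered to users for a given number of cache hits per miss.

```python
def cost_per_gb_delivered(cdn_price, origin_price, hits_per_miss):
    """Effective price of delivering 1 GB to users through the CDN.

    Every GB is paid for once at the CDN. On a cache miss, the same GB
    is also paid for at the origin.
    """
    miss_fraction = 1 / (hits_per_miss + 1)
    return cdn_price + origin_price * miss_fraction


CDN_PRICE = 0.10     # hypothetical $/GB
ORIGIN_PRICE = 0.15  # hypothetical $/GB

# Broken caching headers: every request is a miss.
uncached = cost_per_gb_delivered(CDN_PRICE, ORIGIN_PRICE, hits_per_miss=0)

# Fixed headers: 200 cache hits for every miss.
cached = cost_per_gb_delivered(CDN_PRICE, ORIGIN_PRICE, hits_per_miss=200)

if __name__ == "__main__":
    print(f"uncached: ${uncached:.4f}/GB, cached: ${cached:.4f}/GB")
```

With these example prices the uncached file costs 2.5 times the plain CDN rate, while at 200 hits per miss the origin adds less than a tenth of a cent per GB.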
For the second half, I want to talk about caching HTML pages. Previously we looked at a setup where HTML was served directly from your web server, and images were served through the CDN. But now we’re going to use the CDN to cache HTML requests as well.
Point your www hostname at the CDN, and the CDN proxies those requests to your web server, which we have renamed “origin”. Just like we did for images and other files, we need to specify caching headers for the HTML that we serve from origin.
So here’s a sample web page that we might want to cache. Your sites probably have some of these features:
- Personalized content
- User feedback
- Real-time data
This is a page built specifically for me. We obviously don’t want to cache my version of this page for the next 100 visitors to see. So how do we deal with this?
We can take out the personalized sections and serve a generic page to the CDN, which then goes out to our users. Using logic on the client-side, we tailor the page for whoever is looking at it.
We strip out my name from the page, and serve sign-up and registration links at the top of the page … even if I’m already signed in. Reading my username from a cookie, we can replace the sign-in links with our personalized welcome message. Similarly, store ad targeting details (demographic data) in a cookie and write in your ad tags on the client side.
For many sites… users view a lot more pages than they interact with. Most pages will look fine even though they’re technically a few minutes old. When users do submit feedback, track a short history of these events, and the pages where they occurred, in a cookie. When a user returns to a page where they submitted some kind of update, use Ajax to fetch a fragment of HTML that can be used to overwrite that section of the cached page. These changes, however, won’t be visible to other users until the version of the file on the CDN expires and is then updated from origin.
It’s very easy to update view counts at the point when you generate an HTML page for a user, but when dealing with cached HTML you don’t generate most of those pages anymore. Use tracking pixels to log view counts to your origin server. When a user loads a cached version of the page, you can either use ajax to fetch the real-time count from your server, or just fake the increasing count on the client side.
Once you’re running your www host through the CDN, there are suddenly a lot of requests that become candidates for caching. Search results, ajax responses, and public APIs can all potentially be cached, as long as you use GETs (and not POSTs) to retrieve them.
If you’ve got large areas of your site that you would never want to cache, split them out to separate hostnames and serve those directly from your origin server. This way you get page caching where you need it, and it’s not in the way where you don’t want it.
For full page caching, I generally use expirations of between 3 minutes and an hour. TTLs should make sense for how often content might change on your pages, along with users’ expectations for change. Short TTLs are used for areas of the site that change often or where users are interacting. If you’re serving breaking news, stick to those short TTLs. Long TTLs can be used for pages that contain more general information.
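One way to keep those TTLs in one place is a small table that maps areas of the site to an expiration and turns it into a Cache-Control header. This is only a sketch: the section names and the specific values are hypothetical (picked from within the 3 minute to 1 hour range above), and falling back to the shortest TTL for unknown sections is my own conservative assumption.

```python
SHORT_TTL = 3 * 60   # 3 minutes
LONG_TTL = 60 * 60   # 1 hour

# Hypothetical site sections and their TTLs in seconds.
TTL_BY_SECTION = {
    "breaking-news": SHORT_TTL,  # changes often
    "comments": SHORT_TTL,       # users are interacting here
    "about": LONG_TTL,           # general information
    "help": LONG_TTL,
}


def cache_control_for(section: str) -> str:
    """Build the Cache-Control value for a full HTML page in `section`.

    Unknown sections fall back to the short TTL so that a forgotten
    entry never leaves stale pages on the CDN for long.
    """
    ttl = TTL_BY_SECTION.get(section, SHORT_TTL)
    return f"public, max-age={ttl}"


if __name__ == "__main__":
    for name in ("breaking-news", "about", "something-new"):
        print(name, "->", cache_control_for(name))
```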
I highly recommend taking a look at this. On the last site where I used HTML caching, 92% of our pages were served directly from the CDN. Only about one of every ten requests hit our web servers, mostly for POSTs. That didn’t just save us money on bandwidth, it saved money that we would have otherwise invested in additional web servers to handle the traffic.
So that’s what I have time for today. I’m planning to post additional information on my site… some tips that I couldn’t fit into this presentation. I’ll take questions for the remainder of the time. And of course, please feel free to ask questions later today or tomorrow.
Take-aways:
- Review objects you are serving by CDN, make sure they are caching -- especially headers for dynamic objects
- Consider caching HTML pages -- talk to your vendor about it, let them help (find someone technical)
- Add redundancy to your CDN delivery, plan for an outage
- Look out for double paying for delivery.
Last tip -- keep up with your vendor:
- Review any documentation for juicy tips you missed the first time through.
- Ask questions… talk to someone other than a salesperson who can help you with implementation questions.
- Ask for stuff. If you don’t like something about the service, reporting, or features, ask them to add or change it (no matter how ridiculous)
- Re-commit. If you are pushing more traffic than you initially committed to, ask for a price break in exchange for committing to a longer term.
Thank you - hope it was useful. Slides available online, along with additional articles on my site. Some time for questions, but if you have anything that comes up after this, please don’t hesitate to find me afterward or send me email.
I’ll be talking about response headers, so here is how to look at them if you’re not used to doing so.
A second CDN might also be cheaper… top-tier CDNs cost more money. This is a case I worked on a year ago, serving video content. Popularity spikes for new content (humps, then a decline). From a business perspective, the eyeballs are on new content. Archives are less important. The same could go for paid subscribers vs. free/anonymous users… or premium vs. user-generated content. Videos with low viewership are served from the cheaper CDN, or from origin. Unpopular content and large files drop out of cache. Avoid double paying.
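The split by popularity can be as simple as choosing a hostname when you write out the video URL. A minimal sketch, assuming you already track recent view counts per video; the hostnames, the threshold of 1,000 daily views, and the function names are all hypothetical.

```python
PREMIUM_CDN = "cdn-premium.example.com"  # hypothetical top-tier CDN hostname
BUDGET_CDN = "cdn-budget.example.com"    # hypothetical cheaper CDN (or origin)


def delivery_host(daily_views: int, threshold: int = 1000) -> str:
    """Popular videos go to the top-tier CDN, the long tail to the cheaper one."""
    return PREMIUM_CDN if daily_views >= threshold else BUDGET_CDN


def video_url(path: str, daily_views: int) -> str:
    """Build the URL to embed in the page for a given video path."""
    return f"http://{delivery_host(daily_views)}{path}"


if __name__ == "__main__":
    print(video_url("/videos/new-release.flv", daily_views=50000))
    print(video_url("/videos/archive-2006.flv", daily_views=12))
```

The same switch could key off subscriber status or content type instead of view counts.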
If no secondary CDN, then origin is your failover. Make sure you have the capacity. Outage earlier this year, this graph is of one of our app servers (doubles as origin). Traffic more than doubled. Note sharp drop for 10 minutes… users getting confused and leaving the site.
Cloud storage: Super cheap option if you can handle pushing files into CDN storage. Both have CDNs in front. No contracts, pay as you go. Storage is $0.15 / GB, so ~300 MB of site images can be stored for about 5 cents. Site mostly running during an outage. Delivery prices vary (Amazon is 0.22 in Asia). Lots of tools available to help move files around. Can’t go wrong with these prices.