Making Facebook Faster
1. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
Sunday, September 27, 2009 1
2. Making Facebook faster
Frontend performance engineering
David Wei and Changhao Jiang
Velocity 2009
Jun 24, 2009 San Jose, CA
3. Agenda
1 Site speed matters
2 Performance monitoring
3 Static resource management
4 Ajaxification
5 Client side cache
5. Site speed matters: large scale
200 million users, more than 4 billion page views / day
▪ 10ms per page = more than 1 man-year per day
= more than 5 human lifetimes per year
Facebook cares about site speed. … -- so yes, we care about site speed.
At our scale, our 200 million users generate more than 4 billion page loads per day.
If we can speed up each page load by 10 ms, in aggregate we save our users more than 1 man-year of time per day; accumulated over a year, that’s more than 5 human lifetimes.
Site speed also affects our bottom line. Experiments show that if we reduce latency by 600 ms, the user click rate improves by more than 5%. We are currently running an in-depth experiment on the impact of latency.
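The aggregate figures above follow from simple arithmetic. The sketch below reproduces them from the slide's inputs (4 billion page views per day, 10 ms saved per page). The function name and the 80-year lifetime are assumptions made only for illustration.

```javascript
// Back-of-the-envelope check of the numbers on the slide.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Days of user time saved for every calendar day of traffic.
function daysSavedPerDay(pageViewsPerDay, savedMsPerPage) {
  return (pageViewsPerDay * savedMsPerPage) / MS_PER_DAY;
}

const savedDaysPerDay = daysSavedPerDay(4e9, 10);
const savedYearsPerDay = savedDaysPerDay / 365;
// Days saved per day equals years saved per year (both sides scale by 365).
const savedYearsPerYear = savedDaysPerDay;
// ASSUMPTION: one "human life" is taken as 80 years.
const lifetimesPerYear = savedYearsPerYear / 80;

console.log(savedYearsPerDay > 1, lifetimesPerYear > 5);
```

Under these assumptions the saving is roughly 1.3 man-years per day, which is consistent with the "more than" wording on the slide.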
6. Site speed matters: emerging
• Agile development
On the other hand, there are huge challenges for a site like Facebook in terms of site performance optimization. Here are a few major ones….
Move fast, no stable code base
Fast development: every week we release a new version of the site, with hundreds of code changes; tens of small code changes are pushed every day. So the code base is never stable and there is no time to stop for pure optimization.
7. Site speed matters: emerging
• Agile development
• Deep integration
Deep integration: each Facebook home page is customized for a particular user, with features developed by many teams. Some are applications by third-party developers and some are internal Facebook features, depending on which features and applications the user has adopted.
8. Site speed matters: emerging
• Agile development
• Deep integration
• Viral adoption
Viral adoption: it is very hard to predict whether a feature released today will be used by 1 million or 10 million users next week, so it is difficult to optimize beforehand. The infrastructure has to adapt to the growth of user adoption.
9. Site speed matters: emerging
• Agile development
• Deep integration
• Viral adoption
• Heavily interactive
… this talk, we will share our experience on how to make a site faster with these challenges
Heavy interaction: our pages have many dynamic features that rely on JavaScript. For example, the in-browser chat and the application dock provide a very convenient user experience, but it also takes a lot of JavaScript to run them.
10. Site speed matters: emerging
• Agile development
• Deep integration
• Viral adoption
• Heavily interactive
In summary, we have a lot of challenges.
These challenges are actually essential to making Facebook a paradise for people who want to build new things: you can write something cool tonight and push it out tomorrow to 200 million users. At the same time, they make site performance hard to predict and maintain.
In this talk, we will share our experience of optimizing front-end performance under these challenges.
11. Site speed: end-to-end latency experienced by
▪ From a user request to the presentation of the page at the browser, interactive:
▪ Network Transfer Time
▪ Server Generation Time
▪ Client Render Time
[Diagram: Browsers, Content Distribution Network (CDN), FB Server; labels: Render, NetTime, GenTime]
Before going into details, we define our problem domain.
We define end-to-end user latency as the time from when the user starts a page request to when the page is presented in the browser and interactive.
There are three components of latency in this process:
Network transfer time is the time to travel from the user's browser to the Facebook server and back;
Server generation time is the time spent on the Facebook servers;
Client render time is the time the browser spends parsing the HTML, loading JavaScript/CSS/images and rendering the contents.
12. Site speed: end-to-end latency experienced by
User latency = RenderTime + NetTime + GenTime
▪ RenderTime: ~50% of end-user latency
▪ NetTime: ~25% of end-user latency
▪ GenTime: ~25% of end-user latency
Looking at Facebook’s user latency, client-side render time is about 50% of the end-to-end latency; network time and server-side generation time are about 25% each.
In this talk, we focus on the biggest chunk: render time.
15. User-based measurement
What’s our speed?
▪ sampling 1/10000 page loads
[Timeline diagram; labels: Server, First bytes of HTML, Page Interactive, All content loaded, JS Report]
To make the site faster, the first question to ask is: what is our site speed?
There are usually two approaches: run in-house tests, or sample real users.
We did both and found the second approach much more helpful.
We learned some lessons from the first approach: our pages differ vastly between users, and Facebook employees are likely to be outliers because they tend to have many more features enabled than normal users and have installed many plugins, such as Firebug and IE developer tools. Even finding a “typical” user is hard, as the usage behavior of our users changes all the time.
Our approach is to sample our users. We run a JavaScript measurement on 1 in 10,000 page loads to measure the real speed. The red arrows are the events that we record.
This gives us a real picture of what site speed looks like for Facebook.
Btw, we load our JavaScript before our CSS, because the JavaScript files are downloaded in parallel, along with the CSS and images.
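The sampling idea described above can be sketched as follows. This is a minimal illustration, not Facebook's measurement code: the function names and event labels are hypothetical, and only the 1/10000 rate comes from the slide.

```javascript
// ASSUMPTION: names and event labels are illustrative only.
const SAMPLE_RATE = 1 / 10000;

// Decide once per page load whether this load is measured.
function shouldSample(random = Math.random, rate = SAMPLE_RATE) {
  return random() < rate;
}

// Records elapsed milliseconds for named events, relative to creation time.
function createPageTimer(now = Date.now) {
  const start = now();
  const marks = {};
  return {
    mark(event) {
      marks[event] = now() - start;
    },
    report() {
      return { ...marks };
    },
  };
}

// Usage: only sampled page loads pay the measurement cost.
if (shouldSample()) {
  const timer = createPageTimer();
  timer.mark('pageInteractive');
  timer.mark('allContentLoaded');
  console.log(timer.report());
}
```

In a browser, the object returned by `report()` would be sent back to the server for logging. Passing `random` and `now` as parameters keeps the sketch easy to test.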
The last thing I want to point out on this slide is that we load our JavaScript before our CSS, which violates the common best practice of putting CSS ahead of JS. However, we download most of our JavaScript in parallel, so putting JS at the top lets JS, CSS and images all download in parallel. Half a year ago we tested this and found it faster. We are running another set of experiments to see whether things have changed.
17. Cavalry: Day-to-day monitoring
What’s our speed?
▪ Collect gen time / network transfer time and render time
[Diagram labels: GenTime, Network Time, Browser onload time, Cavalry Logs, Daily site speed monitoring]
We combine the JS measurement with our server-side measurements of page generation time and network round-trip time, and put them into a database.
Now we can yell to the company: “Hey, the site is slower today!”
However, we still don’t know who made it slower. We are continuously launching different features every week, and it is hard to stop and test for performance.
18. Cavalry: Project-based analysis
Who made it faster / slower?
▪ Integrated with Launch System
[Diagram labels: GenTime, Network Time, Browser onload time, Launch System, Cavalry Logs, Daily site speed monitoring, Project-based regression detection]
1. The second step of our measurement is to hook the logs up to our launch system. For each measurement sample, we record which new features were launched in that page load.
2. When there is a regression, we can go over the samples and identify the feature launch that caused it.
3. This makes the corresponding team much more responsive to a regression.
4. Then there is still a question: “Why is it slow? How can I fix it?”
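One simple way to use such tagged samples is to compare render time with and without each launched feature. The sketch below assumes a hypothetical sample shape (`renderMs`, `features`) and a plain difference of means; the talk does not say how Cavalry actually attributes regressions.

```javascript
// ASSUMPTION: sample shape and the difference-of-means comparison are illustrative.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// For each feature, mean render time of samples with it minus samples without it.
function renderDeltaByFeature(samples) {
  const features = new Set(samples.flatMap((s) => s.features));
  const deltas = {};
  for (const feature of features) {
    const withFeature = samples
      .filter((s) => s.features.includes(feature))
      .map((s) => s.renderMs);
    const withoutFeature = samples
      .filter((s) => !s.features.includes(feature))
      .map((s) => s.renderMs);
    if (withoutFeature.length === 0) continue; // no control group to compare against
    deltas[feature] = mean(withFeature) - mean(withoutFeature);
  }
  return deltas;
}
```

A large positive delta for one feature points at the launch to investigate first. A real analysis would also need to account for sample sizes and for features that tend to launch together.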
19. Cavalry: Numeric metrics
Why are we fast / slow? How can I fix it?
▪ YSlow-like technical metrics
[Diagram labels: GenTime, Network Time, Browser onload time, Gate Keeper, Cavalry Logs, YSlow-like metrics, Daily site speed monitoring, Project-based regression detection, Regression analysis]
To answer the “why” question, YSlow is a good tool.
1. We instrument a subset of the YSlow metrics in our sampled page loads. We measure the # of images, # of DOM nodes, # of script tags, # of HTML bytes, # of CSS rules, etc. These metrics can indicate what causes a perf regression.
2. What is missing is a mapping from the YSlow-like metrics to actual time (msec).
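A few of the metrics listed above can be read directly from the DOM. The sketch below is an assumption about how such counters might be collected, not the instrumentation used in the talk; it takes the document as a parameter so it can run against a stand-in object outside a browser.

```javascript
// ASSUMPTION: metric names and this collection method are illustrative.
function collectPageMetrics(doc) {
  return {
    images: doc.getElementsByTagName('img').length,
    domNodes: doc.getElementsByTagName('*').length,
    scriptTags: doc.getElementsByTagName('script').length,
    // Character count of the serialized page; only approximates HTML bytes.
    htmlChars: doc.documentElement.outerHTML.length,
  };
}
```

In a browser this would be called as `collectPageMetrics(document)` on a sampled page load and the result logged next to the timing data.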
20. “WWW” in performance monitoring:
What? Who? Why?
▪ User-based measurement: unbiased, representative results
▪ Feature-launch integration: identify the regression
▪ Technical metrics: define actionable items for improvement
1. The missing part is prioritization: how much time, in ms, do we save if we reduce the # of CSS rules by 10%, versus moving the JS down to the bottom?
22. Why we need SR Management?
• Day 1: Some smart engineers start a project!
“Let’s write a new page with features A, B and C!”
<Print css tag for feature A>
<Print css tag for feature B>
<Print css tag for feature C>
<print HTML of feature A>
<print HTML of feature B>
<print HTML of feature C>
23. Why we need SR Management?
• Day 2: Some smart engineers run PageSpeed and think…
“A & B & C are always used; let’s package them together!”
<Print css tag for feature A>
<Print css tag for feature B>
<Print css tag for feature C>
<print HTML of feature A>
<print HTML of feature B>
<print HTML of feature C>
24. Why we need SR Management?
• Day 2: Awesome!
<Print css tag for feature A&B&C>
<print HTML of feature A>
<print HTML of feature B>
<print HTML of feature C>
…
25. Why we need SR Management?
• Day 3: feature C evolves…
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
If (users_signup_for_C()) { <print HTML of feature C>}
…
26. Why we need SR Management?
• Day 3:
A&B are always used, while C is not…
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
If (users_signup_for_C()) { <print HTML of feature C>}
…
27. Why we need SR Management?
• Day 4: feature C is deprecated
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
// no one uses C { <print HTML of feature C>}
…
28. Why we need SR Management?
• Day 4: we start to send unused bits
It is hard to remember we should remove C here.
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
// no one uses C { <print HTML of feature C>}
…
29. Why we need SR Management?
• One month later…
Thousands of dead CSS rules in the package.
<Print css tag for feature A & B & C & D & E & F & G…>
if (F is used) <print HTML of feature F>
<print HTML of feature G>
if (F is not used) { <print HTML of feature E>}
…
30. Static Resource Management @
Challenges:
• Deep Integration
• Viral Adoption
• Agile Development
Responses:
• Separate requirement declaration and delivery of static resources
• Requirement declaration: lives with HTML generation
• Delivery: Globally optimized
Deep integration: each page has many features.
Viral adoption: usage patterns change quickly.
Agile development: features change fast.
31. Haste: Static Resource Management
Separate Declaration from actual Delivery
• Back to Day 1:
require_static(A_css); <render HTML of feature A>
require_static(B_css); <render HTML of feature B>
require_static(C_css); <render HTML of feature C>
<deliver all required CSS>
<print all rendered HTML>
Requirement Declaration lives with HTML
Global Optimization on Delivery
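The separation of declaration from delivery can be sketched in a few lines. This is an illustrative re-creation in JavaScript, not Haste's real API: `createPage`, `requireStatic`, `flush` and the `packager` hook are hypothetical names, and the slide's `require_static` is the only identifier taken from the talk.

```javascript
// ASSUMPTION: a toy model of "declare next to the HTML, deliver in one place".
function createPage() {
  const required = new Set(); // declared resources, de-duplicated
  const html = [];
  return {
    // Declaration: called wherever a feature's HTML is generated.
    requireStatic(resource) {
      required.add(resource);
    },
    render(fragment) {
      html.push(fragment);
    },
    // Delivery: one place that sees every declaration and may repackage them.
    flush(packager = (resources) => resources) {
      const tags = packager([...required]).map(
        (href) => `<link rel="stylesheet" href="${href}">`
      );
      return [...tags, ...html].join('\n');
    },
  };
}

// A feature declares what it needs at the point where it renders.
function renderFeature(page, name) {
  page.requireStatic(`${name}.css`);
  page.render(`<div class="${name}">feature ${name}</div>`);
}
```

Because a feature only declares its CSS when it is actually rendered, a feature that is skipped or deleted stops contributing CSS without anyone having to edit a shared package by hand.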
32. Haste: Global Optimization
Online process:
require_static(A_css); <render HTML of feature A>
require_static(B_css); <render HTML of feature B>
require_static(C_css); <render HTML of feature C>
<deliver all required CSS>
<print all rendered HTML>
Offline analysis:
Usage Pattern logs
Clustering algorithms
“Optimal” packages
33. Haste: Trace-based Packaging
Nov 2008 => May 2009
Date      # of JS files  # of JS bytes  # of pkg at a home.php  # of bytes at a home.php
Nov 2008  461            4.4 MB         29                      629 KB
May 2009  729            5.9 MB         14                      560 KB
The # of JS files increased by about 60% and their total byte size by about 30%, yet the # of packages sent for home.php was halved and the bytes sent are about 10% less.
find | grep -v .svn | grep -v intern | grep .css$ -c
find | grep -v .svn | grep -v intern | grep .css$ | xargs cat > /tmp/dwei_2008
34. Haste: Trace-based Packaging
'js/careers/jobs.js',
'js/lib/ui/timeeditor.js',
'resume/js/resumepro.js',
'resume/js/resumesection.js'
Developers think of timeeditor.js as a library file; in fact, it is used in only one production page (careers).
On the other hand, it turns out that the “resume” functionality is almost always used in the careers page.
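To make the idea of trace-based packaging concrete, the sketch below groups files that are always requested by exactly the same page loads. This grouping rule is a deliberately simple stand-in: the talk only says "clustering algorithms" and does not specify which one, and `js/base.js` in the usage example is a hypothetical file name.

```javascript
// ASSUMPTION: "identical usage pattern" grouping stands in for the real clustering.
// traces: one array of resource names per logged page load.
function packageByUsage(traces) {
  const usage = new Map(); // resource -> indices of the page loads that used it
  traces.forEach((resources, index) => {
    for (const resource of new Set(resources)) {
      if (!usage.has(resource)) usage.set(resource, []);
      usage.get(resource).push(index);
    }
  });

  const packages = new Map(); // usage signature -> resources sharing it
  for (const [resource, hits] of usage) {
    const signature = hits.join(',');
    if (!packages.has(signature)) packages.set(signature, []);
    packages.get(signature).push(resource);
  }
  return [...packages.values()].map((files) => files.sort());
}

// Usage: timeeditor.js only ever appears together with the careers page script.
const examplePackages = packageByUsage([
  ['js/base.js', 'js/careers/jobs.js', 'js/lib/ui/timeeditor.js'],
  ['js/base.js'],
  ['js/base.js', 'js/careers/jobs.js', 'js/lib/ui/timeeditor.js'],
]);
console.log(examplePackages);
```

A production system would need a fuzzier similarity measure, since exact co-occurrence is rare in real traces, but the input (usage logs) and the output (packages) have the same shape.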
35. Haste: Trace-based Packaging
Date      # CSS files  # of CSS bytes  # of pkg at a home.php  # of bytes at a home.php
Nov 2008  487          1.7 MB          24                      69 KB
May 2009  706          1.9 MB          15                      64 KB
CSS is a similar story.
36. Haste: Trace-based Analysis
Potentials for image sprites too!
• Thousands of virtual gifts with static images, which to sprite?
The same trace-based analysis techniques can be used for image spriting too:
37. Haste: Trace-based Analysis
Potentials for image sprites too!
• The answer is…
The answer is…
In retrospect, this is pretty straightforward.
38. Haste: Trace-based Analysis
Adaptive Performance Optimization
• JS / CSS package optimization
• Guidance for image spriting
• Guidance of progressive rendering
Once we separate the declaration and delivery of static resources, we have lots of room for automatic optimization with trace analysis.
You can do automatic packaging, automatic spriting, and also automatic progressive rendering: you can look at the most frequently used resources and flush them out before generating the page.
48. How Quickling works?
1. User clicks a link or back/forward button
2. Quickling sends an ajax to server
3. Response arrives
4. Quickling blanks the content area
5. Download javascript/CSS
6. Show new content
49. LinkController
Intercept user clicks on links
▪ Dynamically attach a handler to all link clicks:
$('a').click(function() {
  // 'payload' is a JSON encoded response from the server
  $.get(this.href, function(payload) {
    // Dynamically load 'js', 'css' resources for this page.
    bootload(payload.bootload, function() {
      // Swap in the new page's content
      $('#content').html(payload.html);
      // Execute the onloadRegister'ed js code
      execute(payload.onload);
    });
  }, 'json');
  // Cancel the browser's default navigation so the page is not reloaded
  return false;
});
50. HistoryManager
Enable ‘Back/Forward’ buttons for AJAX requests
▪ Set target page URL as the fragment of the URL
▪ http://www.facebook.com/home.php
▪ http://www.facebook.com/home.php#/cjiang?ref=profile
▪ http://www.facebook.com/home.php#/friends/?ref=tn
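The fragment convention shown in these URLs can be sketched with the standard `URL` class. The helper names below are hypothetical and this is not the HistoryManager implementation; it only shows how a target page can be written into, and read back from, the fragment.

```javascript
// ASSUMPTION: helper names are illustrative; only the URL shapes come from the slide.

// Which page should be shown for a given browser URL?
function targetFromUrl(href) {
  const url = new URL(href);
  // A fragment starting with "/" carries the Ajax-loaded target page.
  if (url.hash.startsWith('#/')) return url.hash.slice(1);
  return url.pathname + url.search;
}

// Record an Ajax navigation by storing the target page in the fragment.
function urlForTarget(currentHref, target) {
  const url = new URL(currentHref);
  url.hash = target;
  return url.href;
}

console.log(targetFromUrl('http://www.facebook.com/home.php#/cjiang?ref=profile'));
```

Changing only the fragment does not trigger a page load, but it does create a browser history entry, which is what makes Back/Forward work for Ajax navigations.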
51. Bootloader
Load static resources via ‘script’, ‘link’ tag injection
function requestResource(type, source) {
  var h = document.getElementsByTagName('head')[0];
  switch (type) {
    case 'js':
      var script = document.createElement('script');
      script.src = source;
      script.type = 'text/javascript';
      h.appendChild(script);
      break;
    case 'css':
      var link = document.createElement('link');
      link.rel = "stylesheet";
      link.type = "text/css";
      link.media = "all";
      link.href = source;
      h.appendChild(link);
      break;
  }
}
52. Other details
▪ All pages now share a single global javascript scope:
▪ Explicitly reclaim resources or reset states before leaving a page
▪ Stub out setTimeout and setInterval
▪ All CSS rules will be accumulated
▪ Name-spacing CSS rules with page-specific information
▪ Busy indicator
▪ iframe transport
▪ Permanent link
▪ prelude inlined js code to redirect if necessary
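Because every page now lives in the same JavaScript scope, timers started by one page keep firing after the user has moved on unless they are tracked and cancelled. One way to stub out `setTimeout` and `setInterval` is sketched below; `createTimerScope` and its methods are hypothetical names, and the talk does not describe the actual mechanism.

```javascript
// ASSUMPTION: a per-page wrapper around the host's timer functions.
function createTimerScope(host = globalThis) {
  const timeouts = new Set();
  const intervals = new Set();
  return {
    setTimeout(fn, ms) {
      const id = host.setTimeout(() => {
        timeouts.delete(id); // a fired timeout no longer needs cleanup
        fn();
      }, ms);
      timeouts.add(id);
      return id;
    },
    setInterval(fn, ms) {
      const id = host.setInterval(fn, ms);
      intervals.add(id);
      return id;
    },
    pending() {
      return timeouts.size + intervals.size;
    },
    // Called right before the content area is swapped to another page.
    clearAll() {
      timeouts.forEach((id) => host.clearTimeout(id));
      intervals.forEach((id) => host.clearInterval(id));
      timeouts.clear();
      intervals.clear();
    },
  };
}
```

Page code would call the scope's `setTimeout` and `setInterval` instead of the global ones, and the navigation code would call `clearAll()` before showing the next page.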
53. Current status
▪ Turned on for Firefox and IE users (>90% of users)
▪ ~60% of page hits to the Facebook site are Quickling requests
56. PageCache
Cache user visited pages in browsers
▪ Motivation:
▪ A typical user session:
▪ home -> profile -> photo -> home -> notes -> home -> photo -> photo
▪ Some pages are likely to be revisited soon (temporal locality)
▪ Home page visited every 3 ~ 5 page views
▪ Back/Forward button
57. How PageCache works?
1. User clicks a link or back button
2. Quickling sends ajax to server
3. Response arrives
4. Quickling blanks the content area
5. Download javascript/CSS
6. Show new content
58. How PageCache works?
1. User clicks a link or back button
2. Quickling sends ajax to server
3. Response arrives
3.5 Save response in cache
4. Quickling blanks the content area
5. Download javascript/CSS
6. Show new content
60. How PageCache works?
1. User clicks a link or back button
2. Find Page in the cache
3. Response arrives
4. Quickling blanks the content area
5. Download javascript/CSS
6. Show new content
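The two flows above (save the response on a miss, reuse it on a hit) can be sketched as a small in-memory cache in front of the Ajax request. The names `createPageCache`, `load` and `fetchPage` are hypothetical; this is an illustration of the idea, not the PageCache code.

```javascript
// ASSUMPTION: fetchPage(url, callback) stands in for the Quickling Ajax request.
function createPageCache(fetchPage) {
  const cache = new Map(); // in-memory only, keyed by page URL
  return {
    // show(payload, fromCache) is called when the page content is available.
    load(url, show) {
      if (cache.has(url)) {
        show(cache.get(url), true); // hit: no request to the server
        return;
      }
      fetchPage(url, (payload) => {
        cache.set(url, payload); // step 3.5: save response in cache
        show(payload, false);
      });
    },
    invalidate(url) {
      cache.delete(url);
    },
  };
}
```

With a session such as home -> profile -> home, the second visit to home would be served from the cache without a server round trip.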
61. Cache consistency 1: Incremental updates
[Screenshot: cached version]
Provide functions that let programmers register a JavaScript function to be called right before a cached page is shown.
Used by the home page to refresh ads and fetch the latest stories.
62. Cache consistency 1: Incremental updates
[Screenshots: cached version, restored version]
63. Cache consistency 1: Incremental updates
Poll server for incremental updates via ajax calls.
▪ Allow registering javascript functions to be called right before cached page is shown.
▪ Used by home page to refresh ‘ads’, fetch latest stories
[Screenshots: cached version, restored version]
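The registration mechanism described above might look like the following sketch. `createRestoreHooks`, `onBeforeRestore` and `restore` are hypothetical names chosen for illustration; the talk does not show the real interface.

```javascript
// ASSUMPTION: hook names and the page object shape are illustrative.
function createRestoreHooks() {
  const hooks = [];
  return {
    // Page code registers work to do right before a cached page is shown.
    onBeforeRestore(fn) {
      hooks.push(fn);
    },
    // The cache calls this with the cached page before displaying it.
    restore(cachedPage) {
      hooks.forEach((fn) => fn(cachedPage));
      return cachedPage;
    },
  };
}
```

The home page could, for example, register one hook that requests fresh ads and another that polls the server for stories newer than the cached ones.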
64. Cache consistency 2: In-page writes
[Screenshot: cached version]
65. Cache consistency 2: In-page writes
[Screenshots: cached version, restored version]
66. Cache consistency 2: In-page writes
Record and replay
▪ Automatically record all state-changing operations in a cached page
▪ Automatically replay those operations when cached page is restored.
[Screenshots: cached version, restored version]
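Record and replay can be illustrated with operations modelled as functions over a page-state object. This is a simplified sketch under that assumption; `createRecorder` and the state shape are hypothetical, and the real system records operations automatically rather than through explicit calls.

```javascript
// ASSUMPTION: operations are plain functions that mutate a page-state object.
function createRecorder() {
  const operations = [];
  return {
    // Apply a state-changing operation to the live page and remember it.
    apply(state, operation) {
      operations.push(operation);
      operation(state);
    },
    // Re-run every recorded operation against a restored cached copy.
    replay(state) {
      operations.forEach((operation) => operation(state));
      return state;
    },
  };
}
```

If the user posts a comment on a page after it was cached, replaying the recorded operation on the restored copy makes the comment appear again without asking the server for the whole page.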
67. Cache consistency 3: Cross-page writes
[Screenshot: cached version]
68. Cache consistency 3: Cross-page writes
[Screenshots: cached version, state-changing op]
69. Cache consistency 3: Cross-page writes
[Screenshots: cached version, state-changing op, restored version]
70. Cache consistency 3: Cross-page writes
Server side invalidation
▪ Instrument the server-side database access API; whenever a write operation is detected, send a signal to the client to invalidate the cache.
[Screenshots: cached version, state-changing op, restored version]
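The invalidation flow can be sketched as a thin wrapper around the database API that remembers whether a request performed a write. All names here (`instrumentDb`, `takeInvalidationSignal`, `applyInvalidation`) are hypothetical, and a `Map` stands in for both the database and the client-side page cache.

```javascript
// ASSUMPTION: a Map plays the role of the database and of the client page cache.
function instrumentDb(db) {
  let wroteDuringRequest = false;
  return {
    read(key) {
      return db.get(key);
    },
    write(key, value) {
      wroteDuringRequest = true; // any write may make cached pages stale
      db.set(key, value);
    },
    // Called while building the response; resets the flag for the next request.
    takeInvalidationSignal() {
      const signal = wroteDuringRequest;
      wroteDuringRequest = false;
      return signal;
    },
  };
}

// Client side: drop cached pages when the server signals a write.
function applyInvalidation(pageCache, invalidate) {
  if (invalidate) pageCache.clear();
}
```

This coarse version clears the whole cache on any write. A finer-grained scheme could signal which pages are affected, at the cost of tracking which data each page depends on.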
71. Current status
▪ Deployed in production
▪ Only cache in memory
▪ Only turned on for home page
72. 20%
~20% savings on page hits to home page
73. Performance improvement
3X ~ 4X speedup in render time vs Quickling
75. Summary
▪ Performance monitoring: What, Who, and Why (“WWW”)
▪ Static resource management: Adaptive to fast evolution
▪ Ajaxify the website.
▪ Client side caching of user visited pages
Measurement: we need to answer three questions: what is the speed, who made it faster/slower, and why it is faster/slower.
Static resource management: needs to adapt to the fast evolution of code changes and user adoption.
Ajaxifying a website whose pages share a lot of common work within a user session saves redundant work and improves user-perceived performance.
Caching a user’s visited pages on the client side can reduce the servers’ overall load and improve user-perceived performance.