Web 2.0 Performance and Reliability: How to Run Large Web Apps
1. Artur Bergman
sky@crucially.net
• Wikia Inc
– We are hiring
– Community/Bizdev in Germany
– Engineers in Poland
– http://www.wikia.com/wiki/hiring
• O’Reilly Radar
– http://radar.oreilly.com/artur/
2. The value of operations
• Google
• Orkut
• Friendster
• Myspace
3. Benefits
• Users trust your brand
• They rely on you
• They spend more time on your site
• Bad operations waste R&D money
• Fixed amount of time + faster site =
more page views
5. Operations Engineering
• It is engineering
• Google terminology -
– Site Reliability Engineer
• Sure there are sysadmins too, people
managing NOCs and datacenters
• Provide career growth
6. Good Engineers
• Detail Oriented
• Aspire to be operational engineers
• Stubborn
• Can steer their inner ADD
– Interrupt driven
• Not the same as good developers
7. Danger signs
• Thinks operations is a path to
development engineering
– Fire them
• Want people dedicated to the task
• A good operations engineer should
spend some time in development
• A good development engineer MUST
spend some time in operations
9. Debugging
• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
– Yes the font is horrible
10. Rule 1:
Understand the system
• Complexity Kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it
11. Rule 3:
Quit thinking and look
• "It is a capital mistake to theorize before
one has data. Insensibly one begins to
twist facts to suit theories, instead of
theories to suit facts."
12. Rule 3:
Quit thinking and look
• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring
14. Monitoring
• Collects data
• Puts into databases
• Makes it available for you
• Active collection
• Passive interaction
15. Alerting
• Acts on monitoring data
• Severe alerts
– Active
– Needs action
• Passive alerts
– Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
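The split between severe and passive alerts, and the "do not cry wolf" rule, can be sketched roughly as below. This is a minimal illustration and not a description of any particular alerting tool: the `Check` fields, the `SLOW_SECONDS` threshold and the `Alerter` class are all hypothetical names. It assumes that only user-facing breakage should page someone, and that a notification is sent once per state change rather than once per sample.

```python
from dataclasses import dataclass

# Hypothetical threshold, chosen only for illustration.
SLOW_SECONDS = 2.0


@dataclass
class Check:
    name: str
    up: bool
    response_seconds: float
    user_facing: bool


def classify(check: Check) -> str:
    """Return 'severe', 'passive' or 'ok' for one monitoring sample."""
    broken = (not check.up) or check.response_seconds > SLOW_SECONDS
    if not broken:
        return "ok"
    # Only user-visible breakage needs action right now; the rest can wait.
    return "severe" if check.user_facing else "passive"


class Alerter:
    """Records one notification per state change, not one per sample."""

    def __init__(self):
        self.last_state = {}
        self.sent = []

    def observe(self, check: Check) -> None:
        state = classify(check)
        if self.last_state.get(check.name, "ok") != state:
            self.sent.append((check.name, state))
        self.last_state[check.name] = state
```

Feeding the same failing sample five times produces a single entry in `sent`, which is the point: repeated alerts for a known problem train people to ignore them.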
16. Wikia alerting strategy
• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time
19. External Monitoring
• Use one, tells you what your clients see
every x minutes
• Keynote
• Gomez
• Websitepulse (cheap - easy - I like
them; no annoying salesforce)
22. Cricket MRTG Cacti
• Impossible to configure
• You need to write tools to do it
• Especially Cacti
– Somewhat more pleasant than clawing out
your eyes
23. Ganglia
• We love ganglia
• Automatically graphs everything you
want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD
28. Custom Ganglia Gmetrics
• Write your own
gmetric --name='Oldest query' --type=int32 \
  --units='sec' --dmax=65 --value="$(echo 'show processlist' |
  mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' |
  head -2 | tail -1 | cut -f 6)"
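The same metric can be computed in a small script when the shell pipeline gets hard to read. The sketch below assumes the tab-separated output that the `mysql` client prints in batch mode for `show processlist`, with a header row containing `User`, `Command` and `Time`. The function names are hypothetical. Note one deliberate difference from the one-liner: it reports the largest `Time` among the remaining rows instead of the first remaining row.

```python
def oldest_query_seconds(processlist_tsv: str) -> int:
    """Largest Time value among non-sleeping, non-system threads."""
    lines = [line for line in processlist_tsv.splitlines() if line]
    header = lines[0].split("\t")
    col = {name: i for i, name in enumerate(header)}
    ages = [0]
    for line in lines[1:]:
        fields = line.split("\t")
        if fields[col["Command"]] == "Sleep":
            continue
        if fields[col["User"]] == "system user":
            continue
        age = fields[col["Time"]]
        if age.isdigit():  # skip NULL or otherwise non-numeric values
            ages.append(int(age))
    return max(ages)


def gmetric_argv(seconds: int) -> list:
    """Argument list for gmetric, using the same flags as the slide."""
    return [
        "gmetric",
        "--name=Oldest query",
        "--type=int32",
        "--units=sec",
        "--dmax=65",
        f"--value={seconds}",
    ]
```

The argument list can be handed to `subprocess.run` from a cron job or a loop; with `--dmax=65` the metric expires if the script stops reporting for about a minute.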
31. Something is wrong
• Don’t worry, data warehouse
32. tcpdump / wireshark
• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• Tcpdump / wireshark will tell you
– If your packets are lost, delayed or
corrupted
– Your windowing is wrong
33. Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most
likely
34. Rule 5:
Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE
FAILED TO IDENTIFY THE PROBLEM
35. Rule 6:
Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
– Good practice anyway
• Version control
36. Rule 9:
If you didn’t fix it, it ain’t fixed
• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)
39. Complexity kills
• Design against it
• Reuse components
• Define standards
• Have a few images that all machines
look like - reimage machines every now
and then for the heck of it.
– EC2 forces you to do this
40. MTBF
Mean Time Between Failures
• Actually mostly irrelevant
• Dealing with failure is more important
• Target the right uptime
– Complexity scales exponentially with
required uptime
• Don’t kid yourself, you don’t need 5
nines
41. MTTR
Mean Time To Recovery
• Important
• No one cares if you fail once a minute
– If you recover in 50 ms
• If you are down 1 minute a week, you
are still going to hit 4 nines (99.99%)
• Failures happen, plan how to deal with
them
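The "4 nines" figure is simple arithmetic, sketched below. The function names are hypothetical; the last function uses the standard steady-state relation availability = MTBF / (MTBF + MTTR), which is not stated on the slide but is the usual way to combine the two numbers.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def availability(downtime_minutes_per_week: float) -> float:
    """Fraction of the week the site is up."""
    return 1.0 - downtime_minutes_per_week / MINUTES_PER_WEEK


def allowed_downtime_minutes_per_week(target: float) -> float:
    """Downtime budget per week for a target such as 0.9999."""
    return (1.0 - target) * MINUTES_PER_WEEK


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Standard relation between MTBF, MTTR and availability (same units)."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    # One minute down per week is about 99.99008% uptime, just over four nines.
    print(f"{availability(1) * 100:.5f}%")
    # Five nines leaves roughly a tenth of a minute of downtime per week.
    print(f"{allowed_downtime_minutes_per_week(0.99999):.4f} minutes")
```

The budget for five nines is about six seconds a week, which is why the target uptime drives so much of the complexity.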
42. Problem found
• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical
staff
• One person specifically in command
• Sleep scheduling ( audit log important )
43. Post crisis
• Root cause analysis
– Just find out what went wrong
– And how to avoid it
– Or fix it faster next time if you can’t
• Keep track of your uptime
44. Automation
• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
– Unless you know what you are doing
45. Best practices
• Version control
• Gold images
• Centralised authentication
• Time Sync ( NTP )
• Central logging
• ( All of this applies for virtual machines
too!)
46. cfengine
• Standard automation tool
• Written in C
• Not much support
• Very good
• Very annoying
47. control:
site = ( mysite )
domain = ( mysite.country )
sysadm = ( mark )
netmask = ( 255.255.255.0 )
actionsequence =
(
mountall
mountinfo
addmounts
mountall
links
)
mountpattern = ( /$(site)/$(host) )
homepattern = ( u? )
48. Puppet
• New hip kid on the block
• Written in ruby
• Better support?
• Much nicer syntax
• Easier to extend
49. define yumrepo (enabled = true) {
  configfile { "/etc/yum.repos.d/$name.repo":
    mode => 644,
    source => "/yum/repos/$name.repo",
    ensure => $enabled ? {
      true => file,
      default => absent
    }
  }
}
50. cobbler
• Automatic PXE Installer
– Uses kickstart files
• Redhat Enterprise
• Centos
• Fedora
• Some support for debian
53. koan
• Client install tool
– Xen
– Or OS re-image
koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs
54. Your datacenter
• Keep it tidy
– Label things, keep cables as short as possible
– Have a switch in each rack
• If you are small without dedicated DC staff
you need
– Remote control power switches
– Remote console!
55. Virtualization
• Please use it
• Managing becomes much easier
• Power consumption
• Need a new test box
– The requestor can have it in minutes
56. Power consumption
• Maybe not as important in Europe
• 8 core machines are more efficient than
1 core
• But memcache uses 1 core and all RAM
• Get more RAM and virtualise
57. Our network admin boxes
• 1 Xen CPU for Vyatta
• 1 Xen CPU for LVS
• 1 Xen CPU for Squid - Carp
• 1 Xen CPU for Squid
• 1 Xen CPU for Monitoring
• 1 Xen CPU for network tasks
• We can have more of these and a loss of one
affects us less
59. LVS
• Linux Virtual Server
• Low level load balancer
• HA
• Fast
• Doesn’t inspire people to put things in
the only place that is hard to scale
60. Squid Carp
• Squids configured to hash the urls and
send them to specific backend
• Very little configuration done
• Logging over UDP - no disk IO
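Hashing URLs so that each one always goes to the same backend can be sketched as follows. This is a simplified illustration of the idea, not Squid's actual CARP hash function: it assumes a highest-score-wins scheme built on `hashlib.md5`, and the backend names are hypothetical.

```python
import hashlib


def pick_backend(url: str, backends: list) -> str:
    """Choose the backend with the highest hash score for this URL."""

    def score(backend: str) -> int:
        digest = hashlib.md5(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


if __name__ == "__main__":
    backends = ["squid1", "squid2", "squid3"]  # hypothetical names
    for url in ["/wiki/Main_Page", "/wiki/Help", "/images/logo.png"]:
        print(url, "->", pick_backend(url, backends))
```

Because every frontend computes the same scores, no shared state is needed, and removing one backend only moves the URLs that were mapped to it. Each object ends up cached on one backend instead of on all of them.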
61. Squid
• As a reverse web accelerator
• 90 % of our hits served from RAM in less than
1 ms
• Same as wikipedia
• We only use RAM cache ( unlike wikipedia)
• Cached per user
• If not cacheable - cache for a second to
reduce backend load
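Caching "uncacheable" responses for one second can be sketched as a tiny TTL cache in front of the backend. This is an illustration of the idea only, not how Squid implements it; `MicroCache` and its `fetch` callback are hypothetical names, and the clock is injectable so the behaviour can be tested without sleeping.

```python
import time


class MicroCache:
    """Serve a stored response for up to ttl_seconds before refetching."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch      # callable that hits the backend
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # key -> (stored_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value
```

With a one-second lifetime, a page requested a thousand times a second reaches the backend about once a second, while users see content that is at most a second stale.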
62. App servers
• 1 xen cpu for memcache ( 5 GB Ram)
• 1 xen cpu for squid ( 5GB Ram )
• 6 xen cpus for apache (6 GB Ram )
• More power efficient, less affected by
loss
• Applications can’t affect each other
64. Outsourcing
• As much as possible
• The younger you are as a company the
less risk
– When you have no users, you have no
value
• VCs don’t like having their money go
into Capex
65. What I want from Vendors
• They do what they tell me
• They do what I tell them
• No annoying up sells, no premium
services
– I know more about what you are selling
than you
67. Panther Express
• Fantastic Content Distribution Network
• Cheap, simple price list
– Take note, Akamai
• Cut delivery time to Europe by 70%
• We let our images be cached 1 second
to reduce load
68. EC2 and S3
• We save all our binlogs to S3
• We save database dumps to S3
• We have monitors running from EC2
• We plan to build a data warehouse
cluster on EC2
69. EC2 Requires Automation
• Machine is blank when you bring it up
• Download database dump from S3 and
replicate up - automatically
• Use puppet
• Amazon saves you hardware
headaches
– But complexity is still a problem