Web 2.0 Performance and Reliability: How to Run Large Web Apps

Artur Bergman
sky@crucially.net
• Wikia Inc
– We are hiring
– Community/Bizdev in Germany
– Engineers in Poland
– http://www.wikia.com/wiki/hiring
• O’Reilly Radar
– http://radar.oreilly.com/artur/

The value of operations
• Google
• Orkut
• Friendster
• Myspace

Benefits
• Users trust your brand
• They rely on you
• They spend more time on your site
• Bad operations wastes R&D money

• Fixed amount of time + faster site =
more page views

Stepchild of Engineering
• Product development
• Engineering
• Operations
– Sysadmins?
• Why?

Operations Engineering
• It is engineering
• Google terminology -
– Site Reliability Engineer
• Sure there are sysadmins too, people
managing NOCs and datacenters
• Provide career growth

Good Engineers
• Detail Oriented
• Aspire to be operational engineers
• Stubborn
• Can steer their inner ADD
– Interrupt driven
• Not the same as good developers

Danger signs
• Thinks operations is a path to
development engineering
– Fire them
• Want people dedicated to the task
• A good operations engineer should
spend some time in development
• A good development engineer MUST
spend some time in operations

Debugging
• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
– Yes the font is horrible

Rule 1:
Understand the system
• Complexity Kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it

Rule 3:
Quit thinking and look
• “It is a capital mistake to theorize before
one has data. Insensibly one begins to
twist facts to suit theories, instead of
theories to suit facts.”

Rule 3:
Quit thinking and look
• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring

My my, confusing term
• Monitoring
• Alerting
• Trending

Monitoring
• Collects data
• Puts into databases
• Makes it available for you
• Active collection
• Passive interaction

Alerting
• Acts on monitoring data
• Severe alerts
– Active
– Needs action
• Passive alerts
– Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
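
One way to avoid crying wolf is to raise a severe alert only after a check has failed several times in a row. The sketch below is a minimal Python illustration of that idea; the `FlapFilter` class and the threshold of 3 are assumptions for the example, not something from the talk or from any monitoring tool.

```python
class FlapFilter:
    """Raise a severe alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical value, tune per check
        self.failures = 0           # consecutive failures seen so far
        self.alerting = False       # True while an alert is open

    def record(self, ok):
        """Feed one check result; return 'alert', 'recover', or None."""
        if ok:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "recover"
            return None
        self.failures += 1
        if self.failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"
        return None


f = FlapFilter(threshold=3)
# A single blip is ignored; four failures in a row produce exactly one alert.
checks = [False, True, False, False, False, False, True]
events = [f.record(ok) for ok in checks]
print(events)
```

A single failed check never pages anyone, and a long outage pages once rather than on every check.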

Wikia alerting strategy
• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time

Trending
• Long term
• Capacity planning

Monitor Tools
• Nagios
• Cacti
• MRTG
• Hyperic
• Cricket
• Ganglia

External Monitoring
• Use one, tells you what your clients see
every x minutes
• Keynote
• Gomez
• Websitepulse (cheap - easy - I like
them; no annoying salesforce)

Nagios
• Alerting
• Hassle
• C CGI??
• Doesn’t
scale

Hyperic
• Most exciting open source tool
• Agent based - self configured
• Baseline alerting

Cricket MRTG Cacti
• Impossible to configure
• You need to write tools to do it
• Especially Cacti
– Somewhat more pleasant than clawing out
your eyes

Ganglia
• We love ganglia
• Automatically graphs everything you
want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD

http://ganglia.wikimedia.org/
• 270 hosts
• 880 CPU
• 2 clusters
• 1.2 TB of Memory

Something is wrong

• Don’t worry, data warehouse

tcpdump / Wireshark
• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• Tcpdump / Wireshark will tell you
– If your packets are lost, delayed or
corrupted
– If your windowing is wrong

Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most
likely

Rule 5:
Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE
FAILED TO IDENTIFY THE PROBLEM

Rule 6:
Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
– Good practice anyway
• Version control

Rule 9:
If you didn’t fix it, it ain’t fixed
• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)

Process
• You need a little
• Don’t worry

Complexity kills
• Design against it
• Reuse components
• Define standards
• Have a few images that all machines
look like - reimage machines every now
and then for the heck of it.
– EC2 forces you to do this

MTBF
Mean Time Between Failures
• Actually mostly irrelevant
• Dealing with failure is more important
• Target the right uptime
– Complexity scales exponentially with
required uptime
• Don’t kid yourself, you don’t need 5
nines

MTTR
Mean Time To Recovery
• Important
• No one cares if you fail once a minute
– If you recover in 50 ms
• If you are down 1 minute a week, you
are still going to hit 4 nines (99.99%)
• Failures happen, plan how to deal with
them
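
The "1 minute a week" figure can be checked with a few lines of arithmetic. This is a small Python sketch; the helper names `availability` and `downtime_budget` are invented for the example.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800


def availability(downtime_seconds, period_seconds):
    """Fraction of the period during which the site was up."""
    return 1.0 - downtime_seconds / period_seconds


def downtime_budget(nines, period_seconds):
    """Seconds of downtime allowed per period for a given number of nines."""
    return period_seconds * 10 ** (-nines)


one_minute_a_week = availability(60, WEEK_SECONDS)
print(f"1 minute down per week: {one_minute_a_week:.4%} available")
print(f"4 nines allows {downtime_budget(4, WEEK_SECONDS):.2f} s of downtime per week")
print(f"5 nines allows {downtime_budget(5, WEEK_SECONDS):.2f} s of downtime per week")
```

Four nines leaves a budget of just over a minute a week, while five nines leaves about six seconds, which is one way to see why the required uptime drives complexity.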

Problem found
• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non technical
staff
• One person specifically in command
• Sleep scheduling ( audit log important )

Post crisis
• Root cause analysis
– Just find out what went wrong
– And how to avoid it
– Or fix it faster next time if you can’t
• Keep track of your uptime

Automation
• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
– Unless you know what you are doing

Best practices
• Version control
• Gold images
• Centralised authentication
• Time Sync ( NTP )
• Central logging
• ( All of this applies for virtual machines
too!)

cfengine
• Standard automation tool
• Written in C
• Not much support
• Very good
• Very annoying

control:
   site = ( mysite )
   domain = ( mysite.country )
   sysadm = ( mark )
   netmask = ( 255.255.255.0 )
   actionsequence = ( mountall mountinfo addmounts mountall links )
   mountpattern = ( /$(site)/$(host) )
   homepattern = ( u? )

Puppet
• New hip kid on the block
• Written in ruby
• Better support?
• Much nicer syntax
• Easier to extend

define yumrepo(enabled = true) {
    configfile { "/etc/yum.repos.d/$name.repo":
        mode => 644,
        source => "/yum/repos/$name.repo",
        ensure => $enabled ? {
            true => file,
            default => absent
        }
    }
}

cobbler
• Automatic PXE Installer
– Uses kickstart files
• Red Hat Enterprise
• CentOS
• Fedora
• Some support for Debian

cobbler
cobbler system add \
  --name=xen8 \
  --mac=00:19:B9:EE:6D:0A \
  --ip=10.10.30.208 \
  --profile=Centos-5-x86_64 \
  --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'

koan
• Client install tool
– Xen
– Or OS re-image

koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs

Your datacenter
• Keep it tidy
– Label things, keep cables as short as possible
– Have a switch in each rack
• If you are small without dedicated DC staff
you need
– Remote control power switches
– Remote console!

Virtualization
• Please use it
• Managing becomes much easier
• Power consumption
• Need a new test box
– The requestor can have it in minutes

Power consumption
• Maybe not as important in Europe
• 8 core machines are more efficient than
1 core
• But memcache uses 1 core and all RAM
• Get more RAM and virtualise

Our network admin boxes
• 1 Xen CPU for Vyatta
• 1 Xen CPU for LVS
• 1 Xen CPU for Squid - Carp
• 1 Xen CPU for Squid
• 1 Xen CPU for Monitoring
• 1 Xen CPU for network tasks

• We can have more of these and a loss of one
affects us less

Vyatta
• Open source router
– Really like it
– No need to use Cisco

LVS
• Linux Virtual Server
• Low level load balancer
• HA
• Fast
• Doesn’t inspire people to put things in
the only place that is hard to scale

Squid Carp
• Squids configured to hash the URLs and
send them to a specific backend
• Very little configuration done
• Logging over UDP - no disk I/O
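
The idea of hashing URLs to pick a backend can be sketched in a few lines of Python. This is a simplified highest-score-wins hash in the spirit of CARP, not Squid's actual hash function; the backend names and the `pick_backend` helper are hypothetical.

```python
import hashlib


def pick_backend(url, backends):
    """Return the backend with the highest hash score for this URL."""
    def score(backend):
        digest = hashlib.sha256(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(backends, key=score)


backends = ["cache1", "cache2", "cache3"]  # hypothetical backend names
urls = [f"/wiki/Page_{i}" for i in range(100)]

before = {u: pick_backend(u, backends) for u in urls}
after = {u: pick_backend(u, backends[:-1]) for u in urls}  # cache3 removed
moved = [u for u in urls if before[u] != after[u]]

# Only URLs that lived on the removed backend need a new home.
print(all(before[u] == "cache3" for u in moved))
```

Because each URL always lands on the same backend, every object is cached once across the pool rather than once per cache, and losing a backend only reshuffles the URLs it was serving.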

Squid
• As a reverse web accelerator
• 90 % of our hits served from RAM in less than
1 ms
• Same as wikipedia
• We only use RAM cache ( unlike wikipedia)
• Cached per user
• If not cacheable - cache for a second to
reduce backend effect
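
Caching an uncacheable page for one second means that a burst of identical requests reaches the backend once. The following minimal Python sketch shows that effect; `MicroCache` and `slow_backend` are invented for illustration and are not how Squid is implemented.

```python
import time


class MicroCache:
    """Cache responses for a very short TTL so a burst of identical
    requests reaches the backend only once."""

    def __init__(self, fetch, ttl=1.0, clock=time.monotonic):
        self.fetch = fetch    # function that actually renders the page
        self.ttl = ttl        # seconds a response stays fresh
        self.clock = clock    # injectable so the cache can be tested
        self.entries = {}     # url -> (expires_at, response)

    def get(self, url):
        now = self.clock()
        hit = self.entries.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]
        response = self.fetch(url)
        self.entries[url] = (now + self.ttl, response)
        return response


backend_calls = []


def slow_backend(url):
    """Stand-in for an expensive page render."""
    backend_calls.append(url)
    return f"rendered {url}"


cache = MicroCache(slow_backend, ttl=1.0)
for _ in range(1000):
    cache.get("/wiki/Special:RecentChanges")
print(len(backend_calls))
```

The backend load for a hot page is then bounded by roughly one render per second, no matter how many users request it.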

App servers
• 1 xen cpu for memcache ( 5 GB Ram)
• 1 xen cpu for squid ( 5GB Ram )
• 6 xen cpus for apache (6 GB Ram )

• More power efficient, less affected by
loss
• Applications can’t affect each other

Databases
• Keep developers on short leash
• Report bad queries
• Fear object relational mappers

Outsourcing
• As much as possible
• The younger you are as a company the
less risk
– When you have no users, you have no
value
• VCs don’t like having their money go
into Capex

What I want from Vendors
• They do what they tell me they will do
• They do what I tell them

• No annoying up sells, no premium
services
– I know more about what you are selling
than you

Services we use
• Amazon EC2 and S3
• Panther-Express

Panther Express
• Fantastic Content Distribution Network
• Cheap, simple price list
– Take note, Akamai
• Cut delivery time to Europe by 70%
• We let our images be cached 1 second
to reduce load

EC2 and S3
• We save all our binlogs to S3
• We save database dumps to S3
• We have monitors running from EC2
• We plan to build a data warehouse
cluster on EC2

EC2 Requires Automation
• Machine is blank when you bring it up
• Download database dump from S3 and
replicate up - automatically
• Use puppet
• Amazon saves you hardware
headaches
– But complexity is still a problem
