The document discusses seven lessons learned from building high-performance, highly available systems. The lessons are: 1) Make assumptions explicit and keep challenging them, 2) Performance and high availability are not extra features, 3) Do not reinvent the wheel, but keep things simple, 4) Be wary of cargo-cult optimization, 5) High availability requires more than just redundancy, 6) Embrace diversity, and 7) Monitoring is essential but can be improved.
3. @EdMcBane
Lean Software Development and team coaching
Continuous Delivery, High availability, performance
Security-sensitive & high-uncertainty domains
4. @EdMcBane
The challenge
● Major European client
● Innovative service for the consumer market
● Large userbase (200K+ users)
● Very high request rate
● Low latency requirement (<< RTT)
14. @EdMcBane
SO_REUSEPORT
For TCP, SO_REUSEPORT allows multiple
listener sockets to be bound to the same
port.
Incoming connections are distributed across
the sockets bound to the same port
using a hash of the 4-tuple (source address,
source port, destination address, destination port).
With SO_REUSEPORT the distribution is
uniform.
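A minimal sketch of the idea in Python, assuming a Linux kernel that supports SO_REUSEPORT (3.9 or later). The helper names `make_reuseport_listener` and `open_listeners` are hypothetical and only illustrate that several listening sockets can share one port; in a real server each socket would typically be owned by its own worker process or thread.

```python
import socket


def make_reuseport_listener(host: str, port: int) -> socket.socket:
    """Create a TCP listener with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set the option before binding.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def open_listeners(count: int, host: str = "127.0.0.1", port: int = 0):
    """Open `count` listeners on one port (port 0 lets the kernel pick one)."""
    first = make_reuseport_listener(host, port)
    bound_port = first.getsockname()[1]
    rest = [make_reuseport_listener(host, bound_port) for _ in range(count - 1)]
    return [first] + rest


if __name__ == "__main__":
    listeners = open_listeners(4)
    print("all bound to port", listeners[0].getsockname()[1])
    for sock in listeners:
        sock.close()
```

Without the option, the second `bind()` on the same address and port would fail with "Address already in use".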
16. @EdMcBane
LESS(1) General Commands Manual LESS(1)
NAME
less - opposite of more
SYNOPSIS
less -?
less --help
less -V
less --version
less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~]
[-b space] [-h lines] [-j line] [-k keyfile]
[-{oO} logfile] [-p pattern] [-P prompt] [-t tag]
[-T tagsfile] [-x tab,...] [-y lines] [-[z] lines]
[-# shift] [+[+]cmd] [--] [filename]...
(See the OPTIONS section for alternate option syntax with long option
names.)
DESCRIPTION
LESS IS similar to MORE (1), but has many more features.
Less does not have to read the entire input file before starting, so
with large input files it starts up faster than text editors like vi
(1). Less uses termcap (or terminfo on some systems), so it can run on
Manual page less(1) line 1 (press h for help or q to quit) .
19. @EdMcBane
TCP_TW_RECYCLE
Enable fast recycling TIME-WAIT sockets.
Default value is 0. It should not be changed
without advice/request of technical experts.
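With it enabled, Linux will drop any segment from a
remote host whose TCP timestamp is not strictly bigger
than the latest timestamp recorded for that host.
Clients behind a NAT share one address but keep
independent timestamp clocks, so their segments get dropped.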
TCP_TW_RECYCLE + NAT = MADNESS
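The failure mode can be illustrated with a deliberately simplified toy model. This is not the kernel's actual logic: it only mimics the rule that a segment is dropped unless its timestamp is strictly bigger than the latest one recorded for the same remote address. The class name, the client labels, the address and the timestamp values are all made up for illustration. (The `net.ipv4.tcp_tw_recycle` sysctl itself was removed in Linux 4.12.)

```python
from dataclasses import dataclass, field


@dataclass
class PerHostTimestampCheck:
    """Toy model: remembers the latest TCP timestamp seen per remote IP."""
    latest: dict = field(default_factory=dict)

    def accept(self, remote_ip: str, tsval: int) -> bool:
        last = self.latest.get(remote_ip)
        if last is not None and tsval <= last:
            return False  # dropped: timestamp is not strictly bigger
        self.latest[remote_ip] = tsval
        return True


def simulate_nat():
    """Two clients behind one NAT address, each with its own timestamp clock."""
    server = PerHostTimestampCheck()
    nat_ip = "203.0.113.7"  # documentation-range address, hypothetical
    segments = [
        ("client_a", 5_000_000),
        ("client_b", 1_000),      # unrelated clock, far behind client_a
        ("client_a", 5_000_010),
        ("client_b", 1_010),
    ]
    return [(client, server.accept(nat_ip, ts)) for client, ts in segments]


if __name__ == "__main__":
    for client, accepted in simulate_nat():
        print(client, "accepted" if accepted else "DROPPED")
```

In this model client_b never gets through: the server sees a single remote host whose timestamps appear to jump backwards.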
27. @EdMcBane
...but be prepared to improvise
● In house experience
● Developers on call
● Drills (chaos monkeys)
Processes designed for ordinary times
are not resilient in a crisis and need to be changed.
32. @EdMcBane
No one size fits all
● “Monitor everything”, like “100% test coverage”,
is a nice slogan.
● Each environment requires a slightly different
solution
● Balance data availability, cost, and the
ability to keep the data actionable
34. @EdMcBane
We are doing logging wrong
● Unstructured
● Inconsistent
● Poor defaults
● Complex, obscure components
● A huge waste of computing power
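As one possible illustration of the "unstructured" and "inconsistent" points, here is a minimal sketch of structured logging with Python's standard `logging` module, emitting one JSON object per event. The field names (`event`, `request_id`, `latency_ms`), the `fields` key passed through `extra`, and the helper names are assumptions made for this example, not something taken from the talk.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured fields are passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream, name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger(out)
    log.info("request_served",
             extra={"fields": {"request_id": "r-123", "latency_ms": 4}})
    print(out.getvalue(), end="")
```

A shared identifier such as `request_id` is one way to let logs be joined later with metrics and alerts for the same request.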
35. @EdMcBane
We need a complete overview
● Logs
● Metrics
● Alerts
● Together, coherent, cross-referenced
36. @EdMcBane
“Human beings, who are almost unique in
having the ability to learn from the
experience of others, are also remarkable
for their apparent disinclination to do so.”
Douglas Adams