This document describes several nasty software bugs encountered by Jérôme Petazzoni while fixing code. It details a bug in Node.js's HTTP handling that caused files to be truncated when proxied through Hipache; using packet captures and logging statements, the issue was traced to an end event leaving a paused stream undrained. Other bugs covered include flaky ATA ribbon cables, a Docker startup failure caused by a missing controlling terminal, a Xen spinlock issue that crashed EC2 instances, and the deadly Therac-25 radiation therapy machine. Throughout, the value of debugging tools like tcpdump and strace, and of careful investigation, is emphasized.
6. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
7. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
8. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.
9. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.
I like bullet points!
10. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.
I like bullet points!
● And I carry a pager.
11. Introduction(s)
A pager is a device that wakes
you up, or tells you to stop
whatever you’re doing, so you
can fix other people’s bugs.
12. Introduction(s)
A pager is a device that wakes
you up, or tells you to stop
whatever you’re doing, so you
can fix other people’s bugs.
WE
HATESSS
THEMSS.
13. What about you?
● Do you write code?
● Does it sometimes have bugs?
● Do you fix them?
● Do you fix other people’s code too?
● Do you carry a pager?
● Do you love it?
14. Outline
● Let’s talk about some really nasty bugs
● How they were found, how they were fixed
● How to be prepared next time
● This is not about testing, TDD, etc.
(when the bugs are there, it’s too late anyway)
17. Context
● Hipache* is a reverse-proxy in Node.js
● Handles a bit of traffic
○ >100 req/s
○ >10K virtual hosts
○ >10K different containers
● Vhosts and containers change all the time
(more than once per minute)
*Hipache is Hipster’s Apache. Sorry.
18. The bug
It all starts with an angry customer.
“Sometimes, our application will crash,
because this 700 KB JSON file is truncated by
Hipache!”
What about Content-Length?
The client code should scream, but it doesn’t.
19. Let’s sniff some packets
Log into the load balancer (running Hipache)...
# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
interface: any
filter: (ip or ip6) and ( tcp port 80 )
match: /api/v1/download-all-the-things
####
T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]
GET /api/v1/download-all-the-things.json HTTP/1.0.
Host: angrycustomer.com
X-Forwarded-Port: 443.
X-Forwarded-For: ::ffff:24.13.146.16.
X-Forwarded-Proto: https.
...
20. Too much traffic, not
enough visibility!
# tcpdump -peni any -s0 -wdump tcp port 80
(Wait a bit)
^C
Transfer dump file
DEMO TIME!
22. What did we find out?
● Truncated files happen because a chunk
(probably exactly one) gets dropped.
But:
● Impossible to reproduce locally.
● Only the customer sees the problem.
TONIGHT, WE DINE IN CODE!
23. This is Node.js.
I have no idea what I’m doing.
● Warm up the debuggers!
25. This is Node.js.
I have no idea what I’m doing.
● Warm up the debuggers!
● … but Node.js is asynchronous,
callback-driven, spaghetti code
● Hmmmm, spaghetti
26. This is Node.js.
I have no idea what I’m doing.
● Plan B: PRINT ALL THE THINGS
27. You need a phrasebook!
● How do you say “printf”
in your language?
● How do you find where
a function comes from?
● How do you trace the
standard library?
28. Shotgun debugging
● Add console.log() statements everywhere:
○ in Hipache
○ in node-http-proxy
○ in node/lib/http.js
● For the last one (part of std lib), we need to:
○ replace require('http') with require('_http')
○ add our own _http.js to our node_modules
○ do the same to net.js (in "our" _http.js).
● Now analyze big stream of obscure events!
● Let There Be Light
29. Interlude about pauses
● With Node.js, you can pause a TCP stream.
(Node.js will stop reading from the socket.)
● Then whenever you are ready to continue,
you are supposed to send a resume event.
● Hipache does that: when a client is too slow,
it will pause the socket to the backend.
SO FAR, SO GOOD
30. What really happens
● There are two layers in Node: tcp and http.
● When the tcp layer reads the last chunk,
the backend closes the socket (it’s done).
● The tcp layer notices that the socket is now
closed, and emits an end event.
● The end event bubbles up to the http layer.
● The http layer finishes what it was doing,
without sending a resume.
● Node never reads the chunks in the kernel
buffers. They are lost, forever alone.
31. How do we fix this?
Pester Node.js folks
Catch that end event, and when it happens,
send a resume to the stream to drain it.
(Implementation detail: you only have the http
socket, and you need to listen for an event on
the tcp socket, so you need to do slightly dirty
things with the http socket. But eh, it works!)
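The shape of the fix can be sketched like this. `drainOnEnd` is a hypothetical helper (not the actual Hipache patch), assuming the underlying TCP socket is reachable via the .socket property, as on Node's http objects:

```javascript
// Hypothetical sketch of the fix: when the backend's TCP socket emits
// 'end' while the HTTP stream is paused, resume the stream so the
// chunks still sitting in the kernel buffers get read, not lost.
function drainOnEnd(httpStream) {
  const tcpSocket = httpStream.socket; // the "slightly dirty" part
  tcpSocket.on('end', () => {
    httpStream.resume(); // drain whatever is left
  });
}
```

With a listener like this in place, the end event can no longer strand unread data behind a pause.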
33. What did we learn?
When you can’t reproduce a bug at will,
record it in action (tcpdump) and dissect it
(wireshark).
Spraying code with print statements helps.
(But it’s better to use the logging framework!)
You don’t have to know Node.js to fix Node.js!
36. ATA ribbon cables
● Touch or move those cables:
the transfer speed changes
● SATA was introduced in 2003,
and (mostly) addresses the issue
● Vibration is still an issue, though
38. Bug:
It never works the first time
# docker run -t -i ubuntu echo hello world
2013/08/06 23:20:53 Error: Error starting container 06d642aae1a:
fork/exec /usr/bin/lxc-start: operation not permitted
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
40. Strace to the rescue!
Steps:
1. Boot the machine.
2. Find pid of process to analyze.
(ps | grep, pidof docker...)
3. strace -o log -f -p $PID
4. docker run -t -i ubuntu echo hello world
5. Ctrl-C the strace process.
6. Repeat steps 3-4-5, using a different log file.
Note: can also strace directly, e.g. "strace ls".
41. Let’s compare the log files
● Thousands and thousands of lines.
● Look for the error message in file A.
(e.g. “operation not permitted”)
● If lucky: it will reveal the issue.
● Otherwise, look what happens in file B.
● Other approach: start from the beginning or
the end, and try to find the point when things
started to diverge.
45. What does that mean?
● For some reason, the code wants file
descriptor 0 (stdin) to be a terminal.
● The first time we run, it fails, but in the
process, we acquire a terminal.
(UNIX 101: when you don’t have a controlling terminal and open a
file which is a terminal, it becomes your controlling terminal, unless
you open the file with flag O_NOCTTY)
● Next attempts are therefore successful.
46. … Really?
To confirm that this is indeed the bug:
● reproduce the issue
(start the process with "setsid", to detach
from controlling terminal)
● check the output of "ps"
(it shows controlling terminals)
#before
23083 ? Sl+ 0:12 ./docker -d -b br0
#after
23083 pts/6 Sl+ 0:12 ./docker -d -b br0
48. What did we learn?
You can attach to running processes.
● strace is awesome.
It traces syscalls.
● ltrace is awesome too.
It traces library calls.
● gdb is your friend.
(A very peculiar friend, but a friend nonetheless.)
50. “Errare humanum est,
perseverare autem
diabolicum”
“To err is human,
but to really foul things up,
you need a computer”
51. Really nasty (and sad) bug:
The Therac-25
● Radiotherapy machine
(shoots beams to cure cancer)
● Two modes:
○ low energy
(direct exposure)
○ high energy
(beam hits a special
target/filter first)
52. The problem
● In older versions of the machine,
a hardware interlock prevented the high
energy beam from shooting if the filter was
not in place.
● On the Therac-25, it’s in software.
● What could possibly go wrong?
53. What went wrong
● 6 people got radiation burns
● 3 people died
● … over the course of 3 years (1985 to 1987)
54. Konami Code of Death
On the keyboard, press:
(in less than 8 seconds)
X ↑ E [ENTER] B
...And the high energy beam shoots, unfiltered!
55. How could it happen?
● Race condition in the software.
● Never happened during tests:
○ the tests did not include “unusual sequences”
(which were not that unusual after all)
○ test operators were slower than real operators
56. Aggravating details
● Many engineering and institutional issues
○ No code review
○ No evaluation of possible failures
○ Undocumented error codes
○ No sensor feedback
● The machine had tons of “normal errors”
○ And operators learned to ignore them
● So the “real errors” were ignored
○ Just hit retry, same player shoot again!
60. Random crashes on EC2
● Pool of ~50 identical instances
● Same role (run 100s of containers)
● Sometimes, one of them would crash
○ Total crash
○ no SSH
○ no ping
○ no log
○ no nothing
● EC2 console won’t show anything
● Impossible to reproduce
61. Try a million things...
● Different kernel versions
● Different filesystems tunings
● Different security settings (GRSEC)
● Different memory settings (overcommit, OOM)
● Different instance sizes
● Different EBS volumes
● Different differences
● Nothing changed
62. And one fine day...
● One machine crashes very often
(every few days, sometimes few hours)
CLONE IT!
ONE MILLION TIMES!
63. A New Hope!
● Change everything (again!)
● Find nothing (again!)
● Do something crazy:
contact AWS support
● Repeat tests on “official” image (AMI)
(this required porting our stuff
from Ubuntu 10.04 to 12.04)
64. Happy ending
● Re-ran tests with official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image...
65. Happy ending
● Re-ran tests with official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image
“oh yeah it’s a known issue,
see that link.”
66. Happy ending
● Re-ran tests with official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image
“oh yeah it’s a known issue,
see that link.”
U SERIOUS?
67. I can explain!
● The bug only happens:
○ on workloads using spinlocks intensively
○ only on Xen VMs with many CPUs
● Spinlocks = actively spinning the CPU
● On VMs, you don’t want to hold the CPU
● Xen has a special implementation of spinlocks
When waking up CPUs waiting on a spinlock,
the code would only wake up the first one,
even if there were multiple CPUs waiting.
70. What did we learn?
We didn’t try all the combinations.
(Trying on HVM machines would have helped!)
AWS support can be helpful sometimes.
(This one was a surprise.)
Trying to debug a kernel issue without console
output is like trying to learn to read in the dark.
(Compare to local VM with serial output…)
72. Overall Conclusions
When facing a mystic bug from outer space:
● reproduce it at all costs!
● collect data with tcpdump, ngrep, wireshark,
strace, ltrace, gdb; and log files, obviously!
● don’t be afraid of uncharted places!
● document it, at least with a 2 AM ragetweet!
73. One last thing...
● Get all the help you can get!
● Your developers will rarely reproduce bugs
(Ain’t nobody got time for that)
● Your support team will
(They talk to your customers all the time)
● Help your support team to help your devs
● Bonus points if your support team fixes bugs