This document describes several nasty software bugs encountered by Jérôme Petazzoni while fixing code. It details a bug in Node.js's HTTP handling that caused files to be truncated when proxied through Hipache; using packet captures and logging statements, the issue was traced to an end event leaving a paused stream undrained. Other bugs covered include flaky ATA ribbon cables, a Docker startup failure caused by a missing controlling terminal, a Xen spinlock issue that crashed EC2 instances, and the deadly Therac-25 radiation therapy machine. Throughout, the value of debugging tools like tcpdump and strace, and of careful investigation, is emphasized.
6. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
7. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
8. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.
9. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.
I like bullet points!
10. Introduction(s)
● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.
I like bullet points!
● And I carry a pager.
11. Introduction(s)
A pager is a device that wakes
you up, or tells you to stop
whatever you’re doing, so you
can fix other people’s bugs.
12. Introduction(s)
A pager is a device that wakes
you up, or tells you to stop
whatever you’re doing, so you
can fix other people’s bugs.
WE
HATESSS
THEMSS.
13. What about you?
● Do you write code?
● Does it sometimes have bugs?
● Do you fix them?
● Do you fix other people’s code too?
● Do you carry a pager?
● Do you love it?
14. Outline
● Let’s talk about some really nasty bugs
● How they were found, how they were fixed
● How to be prepared next time
● This is not about testing, TDD, etc.
(when the bugs are there, it’s too late anyway)
17. Context
● Hipache* is a reverse-proxy in Node.js
● Handles a bit of traffic
○ >100 req/s
○ >10K virtual hosts
○ >10K different containers
● Vhosts and containers change all the time
(more than once per minute)
*Hipache is Hipster’s Apache. Sorry.
18. The bug
It all starts with an angry customer.
“Sometimes, our application will crash,
because this 700 KB JSON file is truncated by
Hipache!”
What about Content-Length?
The client code should scream, but it doesn’t.
19. Let’s sniff some packets
Log into the load balancer (running Hipache)...
# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
interface: any
filter: (ip or ip6) and ( tcp port 80 )
match: /api/v1/download-all-the-things
####
T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]
GET /api/v1/download-all-the-things.json HTTP/1.0.
Host: angrycustomer.com
X-Forwarded-Port: 443.
X-Forwarded-For: ::ffff:24.13.146.16.
X-Forwarded-Proto: https.
...
20. Too much traffic, not
enough visibility!
# tcpdump -peni any -s0 -wdump tcp port 80
(Wait a bit)
^C
Transfer dump file
DEMO TIME!
22. What did we find out?
● Truncated files happen because a chunk
(probably exactly one) gets dropped.
But:
● Impossible to reproduce locally.
● Only the customer sees the problem.
TONIGHT, WE DINE IN CODE!
23. This is Node.js.
I have no idea what I’m doing.
● Warm up the debuggers!
25. This is Node.js.
I have no idea what I’m doing.
● Warm up the debuggers!
● … but Node.js is asynchronous,
callback-driven, spaghetti code
● Hmmmm, spaghetti
26. This is Node.js.
I have no idea what I’m doing.
● Plan B: PRINT ALL THE THINGS
27. You need a phrasebook!
● How do you say “printf”
in your language?
● How do you find where
a function comes from?
● How do you trace the
standard library?
28. Shotgun debugging
● Add console.log() statements everywhere:
○ in Hipache
○ in node-http-proxy
○ in node/lib/http.js
● For the last one (part of std lib), we need to:
○ replace require('http') with require('_http')
○ add our own _http.js to our node_modules
○ do the same to net.js (in "our" _http.js).
● Now analyze big stream of obscure events!
● Let There Be Light
29. Interlude about pauses
● With Node.js, you can pause a TCP stream.
(Node.js will stop reading from the socket.)
● Then whenever you are ready to continue,
you are supposed to send a resume event.
● Hipache does that: when a client is too slow,
it will pause the socket to the backend.
SO FAR, SO GOOD
30. What really happens
● There are two layers in Node: tcp and http.
● When the tcp layer reads the last chunk,
the backend closes the socket (it’s done).
● The tcp layer notices that the socket is now
closed, and emits an end event.
● The end event bubbles up to the http layer.
● The http layer finishes what it was doing,
without sending a resume.
● Node never reads the chunks in the kernel
buffers. They are lost, forever alone.
31. How do we fix this?
Pester Node.js folks
Catch that end event, and when it happens,
send a resume to the stream to drain it.
(Implementation detail: you only have the http
socket, and you need to listen for an event on
the tcp socket, so you need to do slightly dirty
things with the http socket. But eh, it works!)
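The shape of the fix can be sketched like this. `drainOnEnd` is a hypothetical helper (not the actual Hipache patch), assuming the underlying TCP socket is reachable via the .socket property, as on Node's http objects:

```javascript
// Hypothetical sketch of the fix: when the backend's TCP socket emits
// 'end' while the HTTP stream is paused, resume the stream so the
// chunks still sitting in the kernel buffers get read, not lost.
function drainOnEnd(httpStream) {
  const tcpSocket = httpStream.socket; // the "slightly dirty" part
  tcpSocket.on('end', () => {
    httpStream.resume(); // drain whatever is left
  });
}
```

With a listener like this in place, the end event can no longer strand unread data behind a pause.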
33. What did we learn?
When you can’t reproduce a bug at will,
record it in action (tcpdump) and dissect it
(wireshark).
Spraying code with print statements helps.
(But it’s better to use the logging framework!)
You don’t have to know Node.js to fix Node.js!
36. ATA ribbon cables
● Touch or move those cables:
the transfer speed changes
● SATA was introduced in 2003,
and (mostly) addresses the issue
● Vibration is still an issue, though
38. Bug:
It never works the first time
# docker run -t -i ubuntu echo hello world
2013/08/06 23:20:53 Error: Error starting container 06d642aae1a:
fork/exec /usr/bin/lxc-start: operation not permitted
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
40. Strace to the rescue!
Steps:
1. Boot the machine.
2. Find pid of process to analyze.
(ps | grep, pidof docker...)
3. strace -o log -f -p $PID
4. docker run -t -i ubuntu echo hello world
5. Ctrl-C the strace process.
6. Repeat steps 3-4-5, using a different log file.
Note: can also strace directly, e.g. "strace ls".
41. Let’s compare the log files
● Thousands and thousands of lines.
● Look for the error message in file A.
(e.g. “operation not permitted”)
● If lucky: it will reveal the issue.
● Otherwise, look what happens in file B.
● Other approach: start from the beginning or
the end, and try to find the point when things
started to diverge.
45. What does that mean?
● For some reason, the code wants file
descriptor 0 (stdin) to be a terminal.
● The first time we run, it fails, but in the
process, we acquire a terminal.
(UNIX 101: when you don’t have a controlling terminal and open a
file which is a terminal, it becomes your controlling terminal, unless
you open the file with flag O_NOCTTY)
● Next attempts are therefore successful.
46. … Really?
To confirm that this is indeed the bug:
● reproduce the issue
(start the process with "setsid", to detach
from controlling terminal)
● check the output of "ps"
(it shows controlling terminals)
#before
23083 ? Sl+ 0:12 ./docker -d -b br0
#after
23083 pts/6 Sl+ 0:12 ./docker -d -b br0
48. What did we learn?
You can attach to running processes.
● strace is awesome.
It traces syscalls.
● ltrace is awesome too.
It traces library calls.
● gdb is your friend.
(A very peculiar friend, but a friend nonetheless.)
50. “Errare humanum est,
perseverare autem
diabolicum”
“To err is human,
but to really foul things up,
you need a computer”
51. Really nasty (and sad) bug:
The Therac-25
● Radiotherapy machine
(shoots beams to cure cancer)
● Two modes:
○ low energy
(direct exposure)
○ high energy
(beam hits a special
target/filter first)
52. The problem
● In older versions of the machine,
a hardware interlock prevented the high
energy beam from shooting if the filter was
not in place.
● On the Therac-25, it’s in software.
● What could possibly go wrong?
53. What went wrong
● 6 people got radiation burns
● 3 people died
● … over the course of 3 years (1985 to 1987)
54. Konami Code of Death
On the keyboard, press:
(in less than 8 seconds)
X ↑ E [ENTER] B
...And the high energy beam shoots, unfiltered!
55. How could it happen?
● Race condition in the software.
● Never happened during tests:
○ the tests did not include “unusual sequences”
(which were not that unusual after all)
○ test operators were slower than real operators
56. Aggravating details
● Many engineering and institutional issues
○ No code review
○ No evaluation of possible failures
○ Undocumented error codes
○ No sensor feedback
● The machine had tons of “normal errors”
○ And operators learned to ignore them
● So the “real errors” were ignored
○ Just hit retry, same player shoot again!
60. Random crashes on EC2
● Pool of ~50 identical instances
● Same role (run 100s of containers)
● Sometimes, one of them would crash
○ Total crash
○ no SSH
○ no ping
○ no log
○ no nothing
● EC2 console won’t show anything
● Impossible to reproduce
61. Try a million things...
● Different kernel versions
● Different filesystems tunings
● Different security settings (GRSEC)
● Different memory settings (overcommit, OOM)
● Different instance sizes
● Different EBS volumes
● Different differences
● Nothing changed
62. And one fine day...
● One machine crashes very often
(every few days, sometimes few hours)
CLONE IT!
ONE MILLION TIMES!
63. A New Hope!
● Change everything (again!)
● Find nothing (again!)
● Do something crazy:
contact AWS support
● Repeat tests on “official” image (AMI)
(this required porting our stuff
from Ubuntu 10.04 to 12.04)
64. Happy ending
● Re-ran tests with official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image...
65. Happy ending
● Re-ran tests with official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image
“oh yeah it’s a known issue,
see that link.”
66. Happy ending
● Re-ran tests with official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image
“oh yeah it’s a known issue,
see that link.”
U SERIOUS?
67. I can explain!
● The bug only happens:
○ on workloads using spinlocks intensively
○ only on Xen VMs with many CPUs
● Spinlocks = actively spinning the CPU
● On VMs, you don’t want to hold the CPU
● Xen has a special implementation of spinlocks
When waking up CPUs waiting on a spinlock,
the code would only wake up the first one,
even if there were multiple CPUs waiting.
70. What did we learn?
We didn’t try all the combinations.
(Trying on HVM machines would have helped!)
AWS support can be helpful sometimes.
(This one was a surprise.)
Trying to debug a kernel issue without console
output is like trying to learn to read in the dark.
(Compare to local VM with serial output…)
72. Overall Conclusions
When facing a mystic bug from outer space:
● reproduce it at all costs!
● collect data with tcpdump, ngrep, wireshark,
strace, ltrace, gdb; and log files, obviously!
● don’t be afraid of uncharted places!
● document it, at least with a 2 AM ragetweet!
73. One last thing...
● Get all the help you can get!
● Your developers will rarely reproduce bugs
(Ain’t nobody got time for that)
● Your support team will
(They talk to your customers all the time)
● Help your support team to help your devs
● Bonus points if your support team fixes bugs