Killer Bugs 
From Outer Space 
Jérôme Petazzoni — @jpetazzo 
LinuxCon — Chicago — 2014
Why this talk? 
Codito, ergo erro 
I code, therefore I make mistakes
Introduction(s) 
● Hi, I’m Jérôme. 
● Sometimes, I write code. 
● Sometimes, the code has bugs. 
● Sometimes, I fix the bugs in my code. 
● Sometimes, I fix the bugs in other people’s code. 
I like bullet points! 
● And I carry a pager.
Introduction(s) 
A pager is a device that wakes 
you up, or tells you to stop 
whatever you’re doing, so you 
can fix other people’s bugs. 
WE 
HATESSS 
THEMSS.
What about you? 
● Do you write code? 
● Does it sometimes have bugs? 
● Do you fix them? 
● Do you fix other people’s code too? 
● Do you carry a pager? 
● Do you love it?
Outline 
● Let’s talk about some really nasty bugs 
● How they were found, how they were fixed 
● How to be prepared next time 
● This is not about testing, TDD, etc. 
(when the bugs are there, it’s too late anyway)
Outline 
● Node.js 
● Harmless hardware bugs 
● Docker 
● Harmful hardware bugs 
● Linux
Node.js
Context 
● Hipache* is a reverse-proxy in Node.js 
● Handles a bit of traffic 
○ >100 req/s 
○ >10K virtual hosts 
○ >10K different containers 
● Vhosts and containers change all the time 
(more than once per minute)
*Hipache is Hipster’s Apache. Sorry.
The bug 
It all starts with an angry customer. 
“Sometimes, our application will crash, 
because this 700 KB JSON file is truncated by 
Hipache!” 
What about Content-Length? 
The client code should scream, but it doesn’t.
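A minimal sketch of the check the client could have made, assuming the response carries a Content-Length header. The function name body_is_complete is made up for this example; it is not part of Hipache or of the customer's code.

```python
def body_is_complete(headers: dict, body: bytes) -> bool:
    """Return True when the body length matches the declared Content-Length."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    declared = normalized.get("content-length")
    if declared is None:
        # No declared length (chunked or close-delimited body): nothing to compare.
        return True
    return int(declared) == len(body)


headers = {"Content-Length": "700000"}
print(body_is_complete(headers, b"x" * 700000))  # full body
print(body_is_complete(headers, b"x" * 650000))  # truncated body
```

A client that runs this kind of check fails loudly on a truncated download instead of crashing later on broken JSON.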
Let’s sniff some packets 
Log into the load balancer (running Hipache)... 
# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80 
interface: any 
filter: (ip or ip6) and ( tcp port 80 ) 
match: /api/v1/download-all-the-things 
#### 
T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP] 
GET /api/v1/download-all-the-things.json HTTP/1.0. 
Host: angrycustomer.com 
X-Forwarded-Port: 443. 
X-Forwarded-For: ::ffff:24.13.146.16. 
X-Forwarded-Proto: https. 
...
Too much traffic, not 
enough visibility! 
# tcpdump -peni any -s0 -wdump tcp port 80 
(Wait a bit) 
^C 
Transfer dump file 
DEMO TIME!
What did we find out? 
● Truncated files happen because a chunk 
(probably exactly one) gets dropped. 
But: 
● Impossible to reproduce locally. 
● Only the customer sees the problem. 
TONIGHT, WE DINE IN CODE!
This is Node.js. 
I have no idea what I’m doing. 
● Warm up the debuggers! 
● … but Node.js is asynchronous, 
callback-driven, spaghetti code 
● Hmmmm, spaghetti
This is Node.js. 
I have no idea what I’m doing. 
● Plan B: PRINT ALL THE THINGS
You need a phrasebook! 
● How do you say “printf” 
in your language? 
● How do you find where 
a function comes from? 
● How do you trace the 
standard library?
Shotgun debugging 
● Add console.log() statements everywhere: 
○ in Hipache 
○ in node-http-proxy 
○ in node/lib/http.js 
● For the last one (part of std lib), we need to: 
○ replace require('http') with require('_http')
○ add our own _http.js to our node_modules 
○ do the same to net.js (in “our” _http.js). 
● Now analyze big stream of obscure events! 
● Let There Be Light
Interlude about pauses 
● With Node.js, you can pause a TCP stream. 
(Node.js will stop reading from the socket.) 
● Then whenever you are ready to continue, 
you are supposed to send a resume event. 
● Hipache does that: when a client is too slow, 
it will pause the socket to the backend. 
SO FAR, SO GOOD
What really happens 
● There are two layers in Node: tcp and http. 
● When the tcp layer reads the last chunk, 
the backend closes the socket (it’s done). 
● The tcp layer notices that the socket is now 
closed, and emits an end event. 
● The end event bubbles up to the http layer. 
● The http layer finishes what it was doing, 
without sending a resume. 
● Node never reads the chunks in the kernel 
buffers. They are lost, forever alone.
How do we fix this? 
Pester Node.js folks 
Catch that end event, and when it happens, 
send a resume to the stream to drain it. 
(Implementation detail: you only have the http 
socket, and you need to listen for an event on 
the tcp socket, so you need to do slightly dirty 
things with the http socket. But eh, it works!)
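The lost chunk and the fix can be sketched with a toy model. This is Python, not Node.js, and it is not the actual patch: the class ProxiedStream, its methods, and the drain_on_end switch are all invented to show the shape of the problem.

```python
from collections import deque


class ProxiedStream:
    """Toy model of a backend socket read by a proxy (not Node.js code)."""

    def __init__(self, drain_on_end):
        self.kernel_buffer = deque()  # data received but not yet read
        self.paused = False
        self.drain_on_end = drain_on_end
        self.delivered = []  # what the client ends up receiving

    def read_available(self):
        # The proxy only reads from the socket while the stream is not paused.
        while self.kernel_buffer and not self.paused:
            self.delivered.append(self.kernel_buffer.popleft())

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self.read_available()

    def on_end(self):
        # The backend closed the socket. The fix: resume first, so whatever
        # is still buffered gets read before the response is finished.
        if self.drain_on_end:
            self.resume()
        return self.delivered


def run(drain_on_end):
    stream = ProxiedStream(drain_on_end)
    stream.kernel_buffer.append("chunk-1")
    stream.read_available()                 # client keeps up: chunk-1 is forwarded
    stream.pause()                          # client is too slow: stop reading
    stream.kernel_buffer.append("chunk-2")  # last chunk arrives while paused
    return stream.on_end()                  # backend closes right after


print(run(drain_on_end=False))  # ['chunk-1']
print(run(drain_on_end=True))   # ['chunk-1', 'chunk-2']
```

Without the drain, the last chunk stays in the buffer and never reaches the client, which matches the "exactly one chunk dropped" symptom.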
What did we learn? 
When you can’t reproduce a bug at will, 
record it in action (tcpdump) and dissect it 
(wireshark). 
Spraying code with print statements helps. 
(But it’s better to use the logging framework!) 
You don’t have to know Node.js to fix Node.js!
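For comparison with bare print statements, here is a small sketch using Python's standard logging module. The logger name "proxy.http" and the helper forward_chunk are hypothetical; the point is only that debug output gets a level and a source, and can be switched off without deleting lines.

```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(levelname)s %(name)s %(message)s")
log = logging.getLogger("proxy.http")  # hypothetical logger name


def forward_chunk(chunk: bytes, paused: bool) -> int:
    """Hypothetical helper: forward a chunk unless paused, logging what happens."""
    if paused:
        log.debug("stream paused, holding %d bytes", len(chunk))
        return 0
    log.debug("forwarding %d bytes", len(chunk))
    return len(chunk)


forward_chunk(b"hello", paused=False)
forward_chunk(b"world", paused=True)
```

Raising the level to logging.WARNING silences both messages once the bug is found.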
Harmless 
hardware bugs
Intel Pentium 
(insert appropriate ©™ where required) 
● Pentium FDIV bug (1994) 
○ errors at 4th decimal place 
○ fixed by replacing CPUs 
○ cost (for Intel): $475,000,000 
○ cost (for users): approx. $0 
● Pentium F00F bug (1997) 
○ using the wrong instruction 
hangs the machine 
○ fixed in software 
○ cost: ???
ATA ribbon cables 
● Touch or move those cables: 
the transfer speed changes 
● SATA was introduced in 2003, 
and (mostly) addresses the issue 
● Vibration is still an issue, though
Docker (because even when it’s not about Docker, it’s still about Docker)
Bug: 
It never works the first time 
# docker run -t -i ubuntu echo hello world 
2013/08/06 23:20:53 Error: Error starting container 06d642aae1a: 
fork/exec /usr/bin/lxc-start: operation not permitted 
# docker run -t -i ubuntu echo hello world 
hello world 
# docker run -t -i ubuntu echo hello world 
hello world 
# docker run -t -i ubuntu echo hello world 
hello world 
# docker run -t -i ubuntu echo hello world 
hello world
Strace to the rescue! 
Steps: 
1. Boot the machine. 
2. Find pid of process to analyze. 
(ps | grep, pidof docker...) 
3. strace -o log -f -p $PID 
4. docker run -t -i ubuntu echo hello world 
5. Ctrl-C the strace process. 
6. Repeat steps 3-4-5, using a different log file. 
Note: can also strace directly, e.g. “strace ls”.
Let’s compare the log files 
● Thousands and thousands of lines. 
● Look for the error message in file A. 
(e.g. “operation not permitted”) 
● If lucky: it will reveal the issue. 
● Otherwise, look at what happens in file B. 
● Other approach: start from the beginning or 
the end, and try to find the point when things 
started to diverge.
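Finding the point of divergence can be partly automated. The sketch below is a crude heuristic, not a real strace parser: it strips the PID prefix and replaces every number with N, so that PIDs and file descriptor numbers that differ between runs do not count as differences. The function names are made up for this example.

```python
import re
from itertools import zip_longest

# Matches "[pid 1331] " (terminal output) or "1331 " (strace -f -o output).
PID_PREFIX = re.compile(r"^(\[pid\s+\d+\]|\d+)\s+")


def normalize(line: str) -> str:
    """Drop the PID prefix and mask numbers that vary from run to run."""
    line = PID_PREFIX.sub("", line.strip())
    return re.sub(r"\d+", "N", line)


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing call, or None."""
    for index, (a, b) in enumerate(zip_longest(log_a, log_b, fillvalue="")):
        if normalize(a) != normalize(b):
            return index, a, b
    return None


first_run = [
    "[pid 1331] setsid() = 1331",
    "[pid 1331] dup2(10, 0) = 0",
    "[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)",
]
second_run = [
    "[pid 1414] setsid() = 1414",
    "[pid 1414] dup2(14, 0) = 0",
    "[pid 1414] ioctl(0, TIOCSCTTY) = 0",
]
print(first_divergence(first_run, second_run))
```

Masking all numbers also hides real differences (a different fd argument, for instance), so treat the result as a place to start reading, not as an answer.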
Specialized hardware helps 
● Now you have a good reason to ask your 
CFO about that dual 30” monitor setup!
Investigation results 
First time 
[pid 1331] setsid() = 1331 
[pid 1331] dup2(10, 0) = 0 
[pid 1331] dup2(10, 1) = 1 
[pid 1331] dup2(10, 2) = 2 
[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted) 
[pid 1331] write(12, "10000000", 8) = 8 
[pid 1331] _exit(253) = ? 
Second time (and every following attempt) 
[pid 1414] setsid() = 1414 
[pid 1414] dup2(14, 0) = 0 
[pid 1414] dup2(14, 1) = 1 
[pid 1414] dup2(14, 2) = 2 
[pid 1414] ioctl(0, TIOCSCTTY) = 0 
[pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>
What does that mean? 
● For some reason, the code wants file 
descriptor 0 (stdin) to be a terminal. 
● The first time we run, it fails, but in the 
process, we acquire a terminal. 
(UNIX 101: when a session leader without a controlling terminal opens a 
file which is a terminal, it becomes its controlling terminal, unless 
the file is opened with flag O_NOCTTY) 
● Next attempts are therefore successful.
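A small sketch of the O_NOCTTY rule, assuming a Unix system. The helper names open_terminal_flags and open_terminal are invented here; os.open and os.O_NOCTTY are the standard library calls.

```python
import os


def open_terminal_flags(become_controlling_tty: bool) -> int:
    """Flags for opening a terminal device, with or without O_NOCTTY."""
    flags = os.O_RDWR
    if not become_controlling_tty:
        # O_NOCTTY: this terminal must not become our controlling terminal,
        # even if we are a session leader that currently has none.
        flags |= os.O_NOCTTY
    return flags


def open_terminal(path: str, become_controlling_tty: bool = False) -> int:
    """Open a terminal device and return its file descriptor."""
    return os.open(path, open_terminal_flags(become_controlling_tty))
```

A daemon that opens terminals on behalf of other processes would normally pass O_NOCTTY, so that it never picks up a controlling terminal by accident.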
… Really? 
To confirm that this is indeed the bug: 
● reproduce the issue 
(start the process with “setsid”, to detach 
from controlling terminal) 
● check the output of “ps” 
(it shows controlling terminals) 
#before 
23083 ? Sl+ 0:12 ./docker -d -b br0 
#after 
23083 pts/6 Sl+ 0:12 ./docker -d -b br0
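The before/after check can be scripted. This sketch assumes the column layout shown above (PID, TTY, STAT, TIME, COMMAND); the function name controlling_tty is made up for this example.

```python
def controlling_tty(ps_line: str):
    """Return the TTY column of a ps line, or None when ps shows "?"."""
    # Assumed columns, as in the output above: PID TTY STAT TIME COMMAND...
    fields = ps_line.split(None, 4)
    tty = fields[1]
    return None if tty == "?" else tty


before = "23083 ? Sl+ 0:12 ./docker -d -b br0"
after = "23083 pts/6 Sl+ 0:12 ./docker -d -b br0"
print(controlling_tty(before), controlling_tty(after))
```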
V I C T O R Y
What did we learn? 
You can attach to running processes. 
● strace is awesome. 
It traces syscalls. 
● ltrace is awesome too. 
It traces library calls. 
● gdb is your friend. 
(A very peculiar friend, but a friend nonetheless.)
Harmful 
hardware bugs
“Errare humanum est, 
perseverare autem 
diabolicum” 
“To err is human, 
but to really foul things up, 
you need a computer”
Really nasty (and sad) bug: 
The Therac-25 
● Radiotherapy machine 
(shoots beams to cure cancer) 
● Two modes: 
○ low energy 
(direct exposure) 
○ high energy 
(beam hits a special 
target/filter first)
The problem 
● In older versions of the machine, 
a hardware interlock prevented the high 
energy beam from shooting if the filter was 
not in place. 
● On the Therac-25, it’s in software. 
● What could possibly go wrong?
What went wrong 
● 6 people got radiation burns 
● 3 people died 
● … over the course of 3 years (1985 to 1987)
Konami Code of Death 
On the keyboard, press: 
(in less than 8 seconds) 
X ↑ E [ENTER] B 
...And the high energy beam shoots, unfiltered!
How could it happen? 
● Race condition in the software. 
● Never happened during tests: 
○ the tests did not include “unusual sequences” 
(which were not that unusual after all) 
○ test operators were slower than real operators
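To show why slow test operators never hit the bug, here is a toy model of a timing window. It is not the Therac-25 software: the class, its fields, and its logic are invented, and only the 8-second figure comes from the slide above. The model assumes a setup task that reads the operator's input once and ignores edits made while setup is still running.

```python
SETUP_SECONDS = 8  # window taken from the slide; everything else is invented


class ToyMachine:
    """Toy model of a setup task that misses edits made during setup."""

    def __init__(self):
        self.requested_mode = None    # what the operator sees on screen
        self.configured_mode = None   # what the hardware was set up for
        self.setup_started_at = None

    def enter_mode(self, mode, now):
        self.requested_mode = mode
        setup_running = (
            self.setup_started_at is not None
            and now - self.setup_started_at < SETUP_SECONDS
        )
        if setup_running:
            # The race: an edit made while setup is running goes unnoticed.
            return
        self.configured_mode = mode
        self.setup_started_at = now

    def consistent(self):
        return self.requested_mode == self.configured_mode


def operator_session(edit_delay_seconds):
    machine = ToyMachine()
    machine.enter_mode("X", now=0)                   # wrong mode typed first
    machine.enter_mode("E", now=edit_delay_seconds)  # corrected afterwards
    return machine.consistent()


print(operator_session(edit_delay_seconds=3))   # fast operator
print(operator_session(edit_delay_seconds=10))  # slow test operator
```

In this model the fast operator leaves the screen and the hardware disagreeing, while the slow one never does, so a test plan run at a slow pace passes every time.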
Aggravating details 
● Many engineering and institutional issues 
○ No code review 
○ No evaluation of possible failures 
○ Undocumented error codes 
○ No sensor feedback 
● The machine had tons of “normal errors” 
○ And operators learned to ignore them 
● So the “real errors” were ignored 
○ Just hit retry, same player shoot again!
Let’s get back to weird 
Linux Kernel bugs
Linux Kernel 
and spinlocks and Xen and ...
Random crashes on EC2 
● Pool of ~50 identical instances 
● Same role (run 100s of containers) 
● Sometimes, one of them would crash 
○ Total crash 
○ no SSH 
○ no ping 
○ no log 
○ no nothing 
● EC2 console won’t show anything 
● Impossible to reproduce
Try a million things... 
● Different kernel versions 
● Different filesystem tunings 
● Different security settings (GRSEC) 
● Different memory settings (overcommit, OOM) 
● Different instance sizes 
● Different EBS volumes 
● Different differences 
● Nothing changed
And one fine day... 
● One machine crashes very often 
(every few days, sometimes every few hours) 
CLONE IT! 
ONE MILLION TIMES!
A New Hope! 
● Change everything (again!) 
● Find nothing (again!) 
● Do something crazy: 
contact AWS support 
● Repeat tests on “official” image (AMI) 
(this required porting our stuff 
from Ubuntu 10.04 to 12.04)
Happy ending 
● Re-ran tests with official image 
● Eventually got it to crash 
● Left it in crashed state 
● Support analyzed the image 
“oh yeah it’s a known issue, 
see that link.” 
U SERIOUS?
I can explain! 
● The bug only happens: 
○ on workloads using spinlocks intensively 
○ only on Xen VMs with many CPUs 
● Spinlocks = actively spinning the CPU 
● On VMs, you don’t want to hold the CPU 
● Xen has a special implementation of spinlocks 
When waking up CPUs waiting on a spinlock, 
the code would only wake up the first one, 
even if there were multiple CPUs waiting.
The patch (priceless) 
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c 
index d69cc6c..67bc7ba 100644 
--- a/arch/x86/xen/spinlock.c 
+++ b/arch/x86/xen/spinlock.c 
@@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl) 
         if (per_cpu(lock_spinners, cpu) == xl) { 
             ADD_STATS(released_slow_kicked, 1); 
             xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); 
-            break; 
         } 
     } 
 } 
--
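The effect of that one-line change can be modeled in a few lines. This is a Python model of the loop, not kernel code: the dictionary stands in for the per-CPU lock_spinners variable, and the function name and the stop_after_first switch are invented for this example.

```python
def kick_waiters(lock_spinners, lock, stop_after_first):
    """Return the CPUs that get an IPI when `lock` is released.

    lock_spinners maps a CPU number to the lock that CPU is waiting on.
    """
    kicked = []
    for cpu in sorted(lock_spinners):
        if lock_spinners[cpu] == lock:
            kicked.append(cpu)
            if stop_after_first:
                break  # behavior before the patch
    return kicked


spinners = {0: "A", 1: "B", 2: "A", 3: "A"}
print(kick_waiters(spinners, "A", stop_after_first=True))   # [0]
print(kick_waiters(spinners, "A", stop_after_first=False))  # [0, 2, 3]
```

With the break in place, CPUs 2 and 3 keep waiting for a wake-up that never comes, which only shows up with several CPUs contending on the same lock.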
What did we learn? 
We didn’t try all the combinations. 
(Trying on HVM machines would have helped!) 
AWS support can be helpful sometimes. 
(This one was a surprise.) 
Trying to debug a kernel issue without console 
output is like trying to learn to read in the dark. 
(Compare to local VM with serial output…)
Overall Conclusions 
When facing a mystic bug from outer space: 
● reproduce it at all costs! 
● collect data with tcpdump, ngrep, wireshark, 
strace, ltrace, gdb; and log files, obviously! 
● don’t be afraid of uncharted places! 
● document it, at least with a 2 AM ragetweet!
One last thing... 
● Get all the help you can get! 
● Your developers will rarely reproduce bugs 
(Ain’t nobody got time for that) 
● Your support team will 
(They talk to your customers all the time) 
● Help your support team to help your devs 
● Bonus points if your support team fixes bugs
Thank you! Questions?

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Killer Bugs From Outer Space

  • 1. Killer Bugs From Outer Space Jérôme Petazzoni — @jpetazzo LinuxCon — Chicago — 2014
  • 3. Why this talk? Codito, ergo erro I code, therefore I make mistakes
  • 4. Introduction(s) ● Hi, I’m Jérôme.
  • 5. Introduction(s) ● Hi, I’m Jérôme. ● Sometimes, I write code.
  • 6. Introduction(s) ● Hi, I’m Jérôme. ● Sometimes, I write code. ● Sometimes, the code has bugs.
  • 7. Introduction(s) ● Hi, I’m Jérôme. ● Sometimes, I write code. ● Sometimes, the code has bugs. ● Sometimes, I fix the bugs in my code.
  • 8. Introduction(s) ● Hi, I’m Jérôme. ● Sometimes, I write code. ● Sometimes, the code has bugs. ● Sometimes, I fix the bugs in my code. ● Sometimes, I fix the bugs in other people’s code.
  • 9. Introduction(s) ● Hi, I’m Jérôme. ● Sometimes, I write code. ● Sometimes, the code has bugs. ● Sometimes, I fix the bugs in my code. ● Sometimes, I fix the bugs in other people’s code. I like bullet points!
  • 10. Introduction(s) ● Hi, I’m Jérôme. ● Sometimes, I write code. ● Sometimes, the code has bugs. ● Sometimes, I fix the bugs in my code. ● Sometimes, I fix the bugs in other people’s code. I like bullet points! ● And I carry a pager.
  • 11. Introduction(s) A pager is a device that wakes you up, or tells you to stop whatever you’re doing, so you can fix other people’s bugs.
  • 12. Introduction(s) A pager is a device that wakes you up, or tells you to stop whatever you’re doing, so you can fix other people’s bugs. WE HATESSS THEMSS.
  • 13. What about you? ● Do you write code? ● Does it sometimes have bugs? ● Do you fix them? ● Do you fix other people’s code too? ● Do you carry a pager? ● Do you love it?
  • 14. Outline ● Let’s talk about some really nasty bugs ● How they were found, how they were fixed ● How to be prepared next time ● This is not about testing, TDD, etc. (when the bugs are there, it’s too late anyway)
  • 15. Outline ● Node.js ● Harmless hardware bugs ● Docker ● Harmful hardware bugs ● Linux
  • 17. Context ● Hipache* is a reverse-proxy in Node.js ● Handles a bit of traffic ○ >100 req/s ○ >10K virtual hosts ○ >10K different containers ● Vhosts and containers change all the time (more than 1 time per minute) *Hipache is Hipster’s Apache. Sorry.
  • 18. The bug It all starts with an angry customer. “Sometimes, our application will crash, because this 700 KB JSON file is truncated by Hipache!” What about Content-Length? The client code should scream, but it doesn’t.
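The Content-Length question can be made concrete. What follows is a minimal sketch in Python, not the customer's client code (which the talk does not show): a client-side check that rejects a body whose size disagrees with the advertised length. The function name `check_body_length` and the sample sizes are hypothetical.

```python
def check_body_length(headers: dict, body: bytes) -> None:
    """Raise if the body length does not match the Content-Length header."""
    declared = headers.get("Content-Length")
    if declared is None:
        # Nothing to compare against (e.g. chunked transfer encoding).
        return
    expected = int(declared)
    if len(body) != expected:
        raise ValueError(
            f"body length mismatch: got {len(body)} bytes, expected {expected}"
        )


# A complete response passes silently.
check_body_length({"Content-Length": "5"}, b"hello")

# A response that lost a chunk is rejected instead of being parsed as JSON.
try:
    check_body_length({"Content-Length": "700000"}, b"x" * 650000)
except ValueError as err:
    print(err)
```

A client doing this would have "screamed" on the first truncated download, which is the behavior the slide expected and did not observe.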
  • 19. Let’s sniff some packets Log into the load balancer (running Hipache)...
      # ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
      interface: any
      filter: (ip or ip6) and ( tcp port 80 )
      match: /api/v1/download-all-the-things
      ####
      T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]
      GET /api/v1/download-all-the-things.json HTTP/1.0.
      Host: angrycustomer.com
      X-Forwarded-Port: 443.
      X-Forwarded-For: ::ffff:24.13.146.16.
      X-Forwarded-Proto: https.
      ...
  • 20. Too much traffic, not enough visibility! # tcpdump -peni any -s0 -wdump tcp port 80 (Wait a bit) ^C Transfer dump file DEMO TIME!
  • 22. What did we find out? ● Truncated files happen because a chunk (probably exactly one) gets dropped. But: ● Impossible to reproduce locally. ● Only the customer sees the problem. TONIGHT, WE DINE IN CODE!
  • 23. This is Node.js. I have no idea what I’m doing. ● Warm up the debuggers!
  • 25. This is Node.js. I have no idea what I’m doing. ● Warm up the debuggers! ● … but Node.js is asynchronous, callback-driven, spaghetti code ● Hmmmm, spaghetti
  • 26. This is Node.js. I have no idea what I’m doing. ● Plan B: PRINT ALL THE THINGS
  • 27. You need a phrasebook! ● How do you say “printf” in your language? ● How do you find where a function comes from? ● How do you trace the standard library?
  • 28. Shotgun debugging ● Add console.log() statements everywhere: ○ in Hipache ○ in node-http-proxy ○ in node/lib/http.js ● For the last one (part of std lib), we need to: ○ replace require('http') with require('_http') ○ add our own _http.js to our node_modules ○ do the same to net.js (in “our” _http.js). ● Now analyze big stream of obscure events! ● Let There Be Light
  • 29. Interlude about pauses ● With Node.js, you can pause a TCP stream. (Node.js will stop reading from the socket.) ● Then whenever you are ready to continue, you are supposed to resume it (by calling resume() on the stream). ● Hipache does that: when a client is too slow, it will pause the socket to the backend. SO FAR, SO GOOD
  • 30. What really happens ● There are two layers in Node: tcp and http. ● When the tcp layer reads the last chunk, the backend closes the socket (it’s done). ● The tcp layer notices that the socket is now closed, and emits an end event. ● The end event bubbles up to the http layer. ● The http layer finishes what it was doing, without sending a resume. ● Node never reads the chunks in the kernel buffers. They are lost, forever alone.
  • 31. How do we fix this? Pester Node.js folks Catch that end event, and when it happens, send a resume to the stream to drain it. (Implementation detail: you only have the http socket, and you need to listen for an event on the tcp socket, so you need to do slightly dirty things with the http socket. But eh, it works!)
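The sequence described in the last three slides can be replayed in a toy model. This is not Node.js or Hipache code: it is a Python sketch that treats the kernel buffer as a queue, and the names `relay`, `pause_after` and `drain_on_end` are made up for the illustration.

```python
from collections import deque


def relay(chunks, pause_after, drain_on_end):
    """Toy model of a paused proxy stream; returns the chunks delivered.

    chunks       -- what the backend wrote before closing its socket
    pause_after  -- how many chunks are read before the slow client
                    causes the proxy to pause the backend socket
    drain_on_end -- True models the fix: resume and drain on 'end'
    """
    kernel_buffer = deque(chunks)   # data received but not yet read
    delivered = []

    # Normal operation: read until the proxy pauses the stream.
    for _ in range(min(pause_after, len(kernel_buffer))):
        delivered.append(kernel_buffer.popleft())

    # The backend has closed the socket, so 'end' fires while paused.
    if drain_on_end:
        # Fix: resume the stream so the remaining chunks are read.
        while kernel_buffer:
            delivered.append(kernel_buffer.popleft())
    # Without the fix, whatever is left in kernel_buffer is never read.
    return delivered


body = [b"chunk1", b"chunk2", b"chunk3"]
print(relay(body, pause_after=2, drain_on_end=False))  # last chunk missing
print(relay(body, pause_after=2, drain_on_end=True))   # complete body
```

The model only captures the ordering problem (an end that arrives while the stream is paused); the real fix had to reach the tcp socket through the http socket, as the slide notes.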
  • 33. What did we learn? When you can’t reproduce a bug at will, record it in action (tcpdump) and dissect it (wireshark). Spraying code with print statements helps. (But it’s better to use the logging framework!) You don’t have to know Node.js to fix Node.js!
  • 35. Intel Pentium (insert appropriate ©™ where required) ● Pentium FDIV bug (1994) ○ errors at 4th decimal place ○ fixed by replacing CPUs ○ cost (for Intel): $475,000,000 ○ cost (for users): approx. $0 ● Pentium F00F bug (1997) ○ using the wrong instruction hangs the machine ○ fixed in software ○ cost: ???
  • 36. ATA ribbon cables ● Touch or move those cables: the transfer speed changes ● SATA was introduced in 2003, and (mostly) addresses the issue ● Vibration is still an issue, though
  • 37. Docker (because even when it’s not about Docker, it’s still about Docker)
  • 38. Bug: It never works the first time
      # docker run -t -i ubuntu echo hello world
      2013/08/06 23:20:53 Error: Error starting container 06d642aae1a: fork/exec /usr/bin/lxc-start: operation not permitted
      # docker run -t -i ubuntu echo hello world
      hello world
      # docker run -t -i ubuntu echo hello world
      hello world
      # docker run -t -i ubuntu echo hello world
      hello world
      # docker run -t -i ubuntu echo hello world
      hello world
  • 40. Strace to the rescue! Steps: 1. Boot the machine. 2. Find pid of process to analyze. (ps | grep, pidof docker...) 3. strace -o log -f -p $PID 4. docker run -t -i ubuntu echo hello world 5. Ctrl-C the strace process. 6. Repeat steps 3-4-5, using a different log file. Note: can also strace directly, e.g. “strace ls”.
  • 41. Let’s compare the log files ● Thousands and thousands of lines. ● Look for the error message in file A. (e.g. “operation not permitted”) ● If lucky: it will reveal the issue. ● Otherwise, look what happens in file B. ● Other approach: start from the beginning or the end, and try to find the point when things started to diverge.
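Finding "the point when things started to diverge" can be partly automated. The sketch below is an assumption about how one might do it, not the tooling used in the talk: it blanks out numbers (PIDs, file descriptors, addresses) that legitimately differ between runs, then reports the first line where two logs disagree. The sample lines are copied from the talk's own investigation results; `normalize` and `first_divergence` are hypothetical helper names.

```python
import re
from itertools import zip_longest


def normalize(line: str) -> str:
    """Blank out numbers that are expected to differ between two runs."""
    line = re.sub(r"0x[0-9a-f]+", "ADDR", line)
    return re.sub(r"\d+", "N", line).strip()


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, or None."""
    for i, (a, b) in enumerate(zip_longest(log_a, log_b)):
        if a is None or b is None or normalize(a) != normalize(b):
            return i, a, b
    return None


run_a = [
    "[pid 1331] setsid() = 1331",
    "[pid 1331] dup2(10, 0) = 0",
    "[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)",
]
run_b = [
    "[pid 1414] setsid() = 1414",
    "[pid 1414] dup2(14, 0) = 0",
    "[pid 1414] ioctl(0, TIOCSCTTY) = 0",
]
print(first_divergence(run_a, run_b))
```

On real logs the two arguments can be open file objects. A line-by-line comparison is crude when several traced processes interleave their output, so it is a starting point for reading, not a replacement for it.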
  • 43. Specialized hardware helps ● Now you have a good reason to ask your CFO about that dual 30” monitor setup!
  • 44. Investigation results
      First time
      [pid 1331] setsid() = 1331
      [pid 1331] dup2(10, 0) = 0
      [pid 1331] dup2(10, 1) = 1
      [pid 1331] dup2(10, 2) = 2
      [pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)
      [pid 1331] write(12, "10000000", 8) = 8
      [pid 1331] _exit(253) = ?
      Second time (and every following attempt)
      [pid 1414] setsid() = 1414
      [pid 1414] dup2(14, 0) = 0
      [pid 1414] dup2(14, 1) = 1
      [pid 1414] dup2(14, 2) = 2
      [pid 1414] ioctl(0, TIOCSCTTY) = 0
      [pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>
  • 45. What does that mean? ● For some reason, the code wants file descriptor 0 (stdin) to be a terminal. ● The first time we run, it fails, but in the process, we acquire a terminal. (UNIX 101: when you don’t have a controlling terminal and open a file which is a terminal, it becomes your controlling terminal, unless you open the file with flag O_NOCTTY) ● Next attempts are therefore successful.
  • 46. … Really? To confirm that this is indeed the bug: ● reproduce the issue (start the process with “setsid”, to detach from controlling terminal) ● check the output of “ps” (it shows controlling terminals)
      #before
      23083 ? Sl+ 0:12 ./docker -d -b br0
      #after
      23083 pts/6 Sl+ 0:12 ./docker -d -b br0
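The "UNIX 101" rule about controlling terminals can be observed from a script as well as from “ps”. A minimal sketch, assuming a Linux or other POSIX system; it is not Docker's code, and both function names are hypothetical. It relies on /dev/tty referring to the caller's controlling terminal, so opening it fails when there is none.

```python
import os


def has_controlling_terminal() -> bool:
    """Return True if this process currently has a controlling terminal."""
    try:
        # Opening /dev/tty fails when the process has no controlling terminal.
        fd = os.open("/dev/tty", os.O_RDWR | os.O_NOCTTY)
    except OSError:
        return False
    os.close(fd)
    return True


def open_without_acquiring(path: str) -> int:
    """Open a terminal device without letting it become the controlling terminal."""
    # O_NOCTTY is the flag mentioned on the slide; without it, a session
    # leader with no controlling terminal would acquire this one.
    return os.open(path, os.O_RDWR | os.O_NOCTTY)


print("controlling terminal:", has_controlling_terminal())
```

Running the script normally and then under “setsid” should show the two states that the “ps” output above distinguishes with “pts/6” and “?”.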
  • 47. V I C T O R Y
  • 48. What did we learn? You can attach to running processes. ● strace is awesome. It traces syscalls. ● ltrace is awesome too. It traces library calls. ● gdb is your friend. (A very peculiar friend, but a friend nonetheless.)
  • 50. “Errare humanum est, perseverare autem diabolicum” “To err is human, but to really foul things up, you need a computer”
  • 51. Really nasty (and sad) bug: The Therac-25 ● Radiotherapy machine (shoots beams to cure cancer) ● Two modes: ○ low energy (direct exposure) ○ high energy (beam hits a special target/filter first)
  • 52. The problem ● In older versions of the machine, a hardware interlock prevented the high energy beam from shooting if the filter was not in place. ● On the Therac-25, it’s in software. ● What could possibly go wrong?
  • 53. What went wrong ● 6 people got radiation burns ● 3 people died ● … over the course of 3 years (1985 to 1987)
  • 54. Konami Code of Death On the keyboard, press: (in less than 8 seconds) X ↑ E [ENTER] B ...And the high energy beam shoots, unfiltered!
  • 55. How could it happen? ● Race condition in the software. ● Never happened during tests: ○ the tests did not include “unusual sequences” (which were not that unusual after all) ○ test operators were slower than real operators
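Why slow test operators never hit the bug can be shown with a deliberately simplified timing model. This is an illustration only, not a reconstruction of the Therac-25 software: it assumes that setup snapshots the requested mode when it starts and ignores edits until it finishes, while the screen always shows the latest entry. The 8-second window comes from the previous slide; everything else (function name, mode letters used as states) is an assumption.

```python
SETUP_SECONDS = 8  # window quoted on the previous slide


def hardware_matches_screen(edit_at_seconds, setup_seconds=SETUP_SECONDS):
    """Toy model of a timing-dependent bug.

    edit_at_seconds -- when the operator corrects the entry (None: no edit)
    Returns True when the hardware ends up in the mode shown on screen.
    """
    requested = "X"          # first entry, typed at t=0
    hardware = requested     # setup starts from this snapshot
    screen = requested

    if edit_at_seconds is not None:
        screen = "E"         # the operator corrects the entry
        if edit_at_seconds >= setup_seconds:
            # Setup had already finished: the edit triggers a fresh setup.
            hardware = "E"
        # Otherwise the edit lands mid-setup and is silently missed.
    return hardware == screen


print(hardware_matches_screen(20))  # slow test operator
print(hardware_matches_screen(3))   # fast real operator
```

In this model the outcome depends only on whether the edit falls inside the setup window, which is why tests run at a leisurely pace pass every time.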
  • 56. Aggravating details ● Many engineering and institutional issues ○ No code review ○ No evaluation of possible failures ○ Undocumented error codes ○ No sensor feedback ● The machine had tons of “normal errors” ○ And operators learned to ignore them ● So the “real errors” were ignored ○ Just hit retry, same player shoot again!
  • 57. Let’s get back to weird Linux Kernel bugs
  • 58. Linux Kernel and spinlocks and Xen and ...
  • 59. Let’s get back to weird Linux Kernel bugs
  • 60. Random crashes on EC2 ● Pool of ~50 identical instances ● Same role (run 100s of containers) ● Sometimes, one of them would crash ○ Total crash ○ no SSH ○ no ping ○ no log ○ no nothing ● EC2 console won’t show anything ● Impossible to reproduce
  • 61. Try a million things... ● Different kernel versions ● Different filesystem tunings ● Different security settings (GRSEC) ● Different memory settings (overcommit, OOM) ● Different instance sizes ● Different EBS volumes ● Different differences ● Nothing changed
  • 62. And one fine day... ● One machine crashes very often (every few days, sometimes every few hours) CLONE IT! ONE MILLION TIMES!
  • 63. A New Hope! ● Change everything (again!) ● Find nothing (again!) ● Do something crazy: contact AWS support ● Repeat tests on “official” image (AMI) (this required porting our stuff from Ubuntu 10.04 to 12.04)
  • 64. Happy ending ● Re-ran tests with official image ● Eventually got it to crash ● Left it in crashed state ● Support analyzed the image...
  • 65. Happy ending ● Re-ran tests with official image ● Eventually got it to crash ● Left it in crashed state ● Support analyzed the image “oh yeah it’s a known issue, see that link.”
  • 66. Happy ending ● Re-ran tests with official image ● Eventually got it to crash ● Left it in crashed state ● Support analyzed the image “oh yeah it’s a known issue, see that link.” U SERIOUS?
  • 67. I can explain! ● The bug only happens: ○ on workloads using spinlocks intensively ○ only on Xen VMs with many CPUs ● Spinlocks = actively spinning the CPU ● On VMs, you don’t want to hold the CPU ● Xen has a special implementation of spinlocks. When waking up CPUs waiting on a spinlock, the code would only wake up the first one, even if there were multiple CPUs waiting.
  • 68. The patch (priceless)
      diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
      index d69cc6c..67bc7ba 100644
      --- a/arch/x86/xen/spinlock.c
      +++ b/arch/x86/xen/spinlock.c
      @@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
                   if (per_cpu(lock_spinners, cpu) == xl) {
                           ADD_STATS(released_slow_kicked, 1);
                           xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
      -                    break;
                   }
           }
       }
      --
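The effect of that one-line patch can be seen in a small model of the wake-up loop. This is a Python sketch, not kernel code: `kicked_cpus` is a hypothetical function, the dictionary stands in for the per-CPU `lock_spinners` variable, and appending to a list stands in for `xen_send_IPI_one()`.

```python
def kicked_cpus(lock_spinners, lock, stop_after_first):
    """Return the CPUs that get a wake-up when `lock` is released.

    lock_spinners    -- maps each CPU number to the lock it waits on (or None)
    stop_after_first -- True models the loop with `break` (before the patch)
    """
    kicked = []
    for cpu in sorted(lock_spinners):
        if lock_spinners[cpu] == lock:
            kicked.append(cpu)       # stands in for xen_send_IPI_one()
            if stop_after_first:
                break
    return kicked


waiting = {0: None, 1: "L", 2: "L", 3: "L"}
print(kicked_cpus(waiting, "L", stop_after_first=True))   # only one waiter kicked
print(kicked_cpus(waiting, "L", stop_after_first=False))  # every waiter kicked
```

With the `break` in place, CPUs 2 and 3 in this example are never told that the lock was released, which matches the description on the previous slide.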
  • 70. What did we learn? We didn’t try all the combinations. (Trying on HVM machines would have helped!) AWS support can be helpful sometimes. (This one was a surprise.) Trying to debug a kernel issue without console output is like trying to learn to read in the dark. (Compare to local VM with serial output…)
  • 72. Overall Conclusions When facing a mystic bug from outer space: ● reproduce it at all costs! ● collect data with tcpdump, ngrep, wireshark, strace, ltrace, gdb; and log files, obviously! ● don’t be afraid of uncharted places! ● document it, at least with a 2 AM ragetweet!
  • 73. One last thing... ● Get all the help you can get! ● Your developers will rarely reproduce bugs (Ain’t nobody got time for that) ● Your support team will (They talk to your customers all the time) ● Help your support team to help your devs ● Bonus points if your support team fixes bugs