Experiences from Delphix debugging ZFS in production on Illumos and Linux, and an introduction to the SDB debugger and how it can be used to debug ZFS on Linux.
3. Background
Delphix
● Our product is an appliance/VM that we ship to customers
○ Either on the public cloud (e.g. AWS) or on-prem on top of a hypervisor (e.g. VMware)
● The core functionality of our product relies on ZFS
● Recently we switched our OS from Illumos to Linux
Question
“Can we maintain our existing debugging processes for issues in production with Linux?”
4. Debugging In Production
Main goal - Root-cause on first failure
For performance pathologies this means hopping onto the system to check monitoring logs for errors and
other clues, or examining the runtime behavior through tracing.
For severe failures (e.g. panics and deadlocks) we collect a crash dump (and potentially on-disk state with
ZDB if the VM is still running) for postmortem debugging.
Illumos was built with the above in mind so our processes were created accordingly.
We’ve had some success adjusting to Linux, but we found general support for postmortem debugging to
be lacking (crash dump generation, size of debug info, tools to analyze dumps, etc.).
6. Postmortem Debugging - The Alternative
Consider the alternative when a customer VM crashes.
The information is limited to:
● What the customer thought to mention
● What support thought to ask
● Random/unrelated logs and maybe a stack trace
In most cases the above is not enough and you need to iterate with the customer.
7. Postmortem Debugging
The act of debugging a program after it has crashed.
For OS kernels this is generally done by analyzing a crash dump generated at the time of the crash.
A crash dump is a file on disk containing all (or some) of the system’s in-memory and processor state at
the time of the crash, like kernel pages and CPU register values.
At Delphix crash-dumps are an essential part of our debugging procedures.
8. Postmortem Debugging
A correctly-generated crash dump is comprehensive and never lies.
It can also decouple the activity of root-causing the failure from the process of restoring the system.
15. Investigation Notes
● ZIO’s BP is not the same as original BP
● This is a NOPWRITE ZIO
● ZIO’s BP is not an embedded BP
Failure: What?
16. Aside: NOP-WRITE
● Performance optimization with space savings for snapshots
● ZFS compares checksums of incoming block vs block on disk
● If they match, nothing has changed and we can skip issuing a write I/O
● Common in frequently overwritten files with almost-identical data
○ E.g. full-backups of large random-access files
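The comparison step above can be sketched in Python. This is an illustrative model of the decision only, not actual ZFS code; ZFS performs it with a strong block checksum such as SHA-256, which SHA-256 stands in for here:

```python
import hashlib

def should_skip_write(incoming: bytes, on_disk: bytes) -> bool:
    """Illustrative model of the NOP-write check: if the checksum of the
    incoming block matches the checksum of the block already on disk,
    nothing has changed and the write I/O can be skipped."""
    return hashlib.sha256(incoming).digest() == hashlib.sha256(on_disk).digest()

# Full backup rewriting an unchanged block: the write is elided.
print(should_skip_write(b"block contents", b"block contents"))  # True
# Modified block: the write must be issued.
print(should_skip_write(b"new contents", b"old contents"))      # False
```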
17. Problem
We are issuing a NOP-WRITE but the BP differs from what’s on disk.
Notes
● The BP is not an embedded BP
Failure: Problem Summary
18. Failure: What now?
We understand the problem, just don’t know why it’s happening
Plan
1. Capture a crash dump for in-house analysis
2. Unblock customer by disabling NOP-WRITEs (System Recovery)
3. Figure out the problem from the crash-dump (Root-Cause Analysis)
4. Implement a fix and ship it to customers
20. Need to figure out the root-cause of the issue!
Have two choices
1. Start reading the ~12K lines of related code
2. Analyze the crash dump and ask targeted questions to home in on the culprit
Failure: OK, what now?
25. Failure: Path Forward
● Control flow in ZIO code is complex
● Stack traces from panics don’t tell you where it came from
● Generally the thread that issued the ZIO is not around anymore (async case)
○ Printing all the stack traces won’t do the trick
● We need to inspect the data of the actual ZIO
29. Failure: Notes
● Issuing a NOP-WRITE
○ ZIO’s BP differs from on-disk BP
○ The BP on-disk was freed the TXG before the current one
● ZIO’s BP is not an embedded BP
● This was a write-override ZIO (io_done is dbuf_write_override_done)
30. Failure: Where did the ZIO come from?
Only one place where io_done is set to that!
31. Failure: Where did the ZIO come from?
New Clue!
2 suspects for the origin of our ZIO
Can’t be dmu_buf_write_embedded()
Must be dmu_sync() !
34. Failure: Case Closed
Problem
ZFS issued a NOP-WRITE but the BP was different from the BP on disk
Root Cause
dmu_sync()’s check was incomplete, so ZFS did not disable NOP-writes for recently freed blocks
Fix
Add a check in dmu_sync() to see whether the block has been freed
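The corrected decision can be sketched as follows. This is an illustrative Python model of the logic, not the actual C fix in dmu_sync(); the function and parameter names are made up for clarity:

```python
def can_nop_write(checksums_match: bool, freed_recent_txg: bool) -> bool:
    """Illustrative model of the corrected check: even when the incoming
    checksum matches the on-disk BP, a block freed in a recent TXG means
    the on-disk BP no longer points at stable data, so the NOP-write
    must be abandoned and a real write issued."""
    return checksums_match and not freed_recent_txg

print(can_nop_write(True, False))  # True: safe to skip the write
print(can_nop_write(True, True))   # False: block was just freed, must write
```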
35. Postmortem Debugging - Recap
Crash dumps
● Allow you to examine processor and in-memory state at the time of the crash
● Bundled with ZDB output - all the state you’ll need for ZFS issues
● Decouple System Recovery from Root-Cause Analysis
37. SDB
● A postmortem and live debugger
● User experience similar to MDB
○ Ask any question by chaining a pipeline of commands (Unix Shell Style)
● Can be easily extended with Python
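A rough sketch of what extending such a pipeline-style debugger looks like. The base class and method names below are hypothetical stand-ins modeled on the design described above, not the exact sdb API:

```python
# Hypothetical sketch of a pipeline-style debugger command, modeled on
# SDB's design. Class and method names are assumptions, not the real API.
class Command:
    """Base class: a command receives objects from the previous stage
    and yields objects to the next one."""
    def call(self, objs):
        raise NotImplementedError

class Head(Command):
    """Pass through at most the first `n` objects (like shell `head`)."""
    def __init__(self, n):
        self.n = n
    def call(self, objs):
        for i, obj in enumerate(objs):
            if i >= self.n:
                break
            yield obj

def run_pipeline(source, commands):
    """Feed `source` through each command in turn, shell-pipe style."""
    objs = iter(source)
    for cmd in commands:
        objs = cmd.call(objs)
    return list(objs)

print(run_pipeline(range(10), [Head(3)]))  # [0, 1, 2]
```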
50. Memory usage - SPL caches
● Indicates if an SPL cache is backed by the Linux SLUB allocator
● Ordered by top offenders in active memory by default
● E.g. “arc_buf_hdr_t_full” is backed by a Linux cache called “taskstats”
51. Memory usage of B-Tree leaves?
316 KB used for B-Tree leaves. This is 8% of the underlying Linux cache.
Overall utilization of the underlying cache is 65%.
54. drgn
Developed by Omar Sandoval @ Facebook - https://drgn.readthedocs.io/en/latest/index.html
A C library wrapped by a Python layer that lets you use Python to introspect live systems and crash
dumps.
● Python API and Object model
● Fast start-up and command execution
● Still young and lacks certain features (e.g. function args) but promising
● Small but growing community that is open to patches
Writing Python in the REPL to debug can be cumbersome
55. SDB
A Python layer that leverages the drgn API to provide a debugging experience similar to MDB.
Can be extended in Python with new commands using:
1. The drgn API to query info from the debugging target
2. Pre-made constructs that allow them to receive and pass objects through a pipe
Point (2) allows for pipelines that are more powerful than what we had in MDB
(e.g. we pass objects with C type info through the pipe, instead of plain pointer/integer values)
56. SDB - Recap
● Debugger for live-systems and crash dumps
● Leverages drgn for introspecting its target and provides a shell-like interface (e.g. pipes)
● Can be easily extended in Python with new commands that
○ Walk complex data structures
○ Aggregate, filter, and pretty-print data
● A user can ask almost any question that can be answered given the available state
● Great for debugging ZFS on Linux!
61. Future Work
SDB repo: https://github.com/delphix/sdb/tree/master
● More commands (help us at the hackathon tomorrow!)
● Tutorials for writing new commands
● Proper parser code & Test Suite
● Support for modules loaded at runtime
● Out of the box support for ztest core dumps
SDB & OpenZFS
● Discuss potentially moving ZFS-related commands into a module under contrib/
● Discuss potentially enabling crash dumps in VMs performing automated testing on Github
68. Criteria
A good kernel debugger for our use-case:
1. Can access all available state in a live-system or a crash dump
2. Presents data in a precise and readable format
3. Is easily extensible
4. Doesn’t get in your way
69. Access to Everything
Should be able to at least:
1. Print all available stack traces together with their function arguments
2. Allow access to any available region in memory
3. Be able to walk complex data structures efficiently
70. Meaningful Output
The debugger should be able to:
● Present the same data in multiple ways
○ Each representation emphasizing the answer to a different question
● Output insightful reports drawing info from multiple data sources
75. Extensibility
Developers should be able to extend the debugger, preferably without recompiling it.
Examples:
● MDB supports modules written in C
● GDB can be extended in Python, either by scripts or on the fly during a session
76. Doesn’t get in your way!
GDB without Python (old versions):
● A prompt with a laundry-list of commands
● Your questions were limited to what the debugger is programmed to answer
GDB with Python:
● You can ask anything you want, as long as you are willing to type code in the Python REPL
● … but now your focus shifts more to programming (and whitespace…) than to debugging
MDB:
● Pipes - sweet spot in the middle and familiar
● Ask any question you have by chaining a pipeline of commands (Unix Shell Style)
77. MDB Pipes Example
Print all segments of all the loaded metaslabs in ZFS:
Question: How many of them are of length 800?
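The metaslab pipeline itself was shown as a screenshot. As a generic illustration of the style, the classic pipeline below (from kernel debugging with mdb -k, as documented in the Solaris Modular Debugger Guide) finds every process named sched, walks its threads, and prints each thread's stack:

```
> ::pgrep sched | ::walk thread | ::findstack
```

Each stage consumes the values produced by the previous one, so arbitrary questions can be composed without writing new debugger code.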
79. GDB & KGDB
GDB
● Users recognize it and Python extensibility is a plus
● Not applicable - it doesn’t work with live kernels and kernel crash dumps anymore
KGDB
● GDB but for the kernel!
● Not available in most distros - requires you to recompile the kernel to enable it
● Not applicable for Delphix - requires a second machine to introspect the first one
80. crash(8)
The SVR4 utility re-written as a layer that understands the Linux kernel and has GDB 7.6 embedded in it
to provide a familiar experience.
Works out-of-the-box with live systems, crash dumps, and even hypervisor snapshots (ESX, KVM, etc.)
Unfortunately
● Not easily extensible
○ A C API exists but it’s lacking
○ Enabling Python in the embedded GDB doesn’t work properly due to its architecture
● Development seems to be in maintenance mode
81. crash-python
Developed by Jeff Mahoney @ SUSE - https://github.com/jeffmahoney/crash-python
A patched version of GDB that reads kernel crash dumps by leveraging libkdumpfile.
Well-designed but:
● GDB patch has been in the mailing list for years with no updates
○ The downstream patch is maintained by one person
● Doesn’t work with live-systems
○ Enabling this requires more GDB patches
82. drgn
Developed by Omar Sandoval @ Facebook - https://drgn.readthedocs.io/en/latest/index.html
A C library wrapped by a Python layer that lets you use Python to introspect live systems and crash
dumps.
● Python API and Object model well-designed
● Fast start-up and command execution
● Still young and lacks certain features (e.g. function args) but promising
● Small but growing community that is open to patches
Checks most of our boxes but writing Python in the REPL to debug can be cumbersome
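For example, even a simple question like "what tasks exist on the system?" takes a few lines of Python at the drgn REPL. The session below is for illustration; `prog` is the target object the drgn CLI provides, and the task-listing helper comes from drgn's bundled Linux helpers:

```
>>> from drgn.helpers.linux.pid import for_each_task
>>> sorted({task.comm.string_().decode() for task in for_each_task(prog)})[:3]
```

This works, but composing set comprehensions and helper imports by hand for every question is the cumbersomeness that motivates a shell-like layer on top.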
83. SDB
A Python layer that leverages the drgn API to provide a debugging experience similar to MDB.
Provides a set of primitive commands that can be chained together in a pipeline.
Can be extended in Python with new commands using:
1. The drgn API to query info from the debugging target
2. Pre-made constructs that allow them to receive and pass objects through a pipe
Point (2) allows for pipelines that are more powerful than what we had in MDB
(e.g. we pass objects with C type info through the pipe, instead of plain pointer/integer values)
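The benefit of point (2) can be sketched with plain Python generators. The `Zio` type and its fields below are illustrative stand-ins, not real ZFS structures; the point is that downstream stages can filter on typed fields instead of dereferencing raw pointers by hand:

```python
from dataclasses import dataclass

# Illustrative stand-in for a typed object flowing through the pipe.
@dataclass
class Zio:
    addr: int
    io_type: str

def zios(objs):
    # Producer stage: in SDB this data would come from the target via drgn.
    yield from objs

def filter_writes(objs):
    # Downstream stage: because objects carry type info, we can filter
    # on a field directly rather than interpreting raw addresses.
    for z in objs:
        if z.io_type == "write":
            yield z

pipeline = filter_writes(zios([Zio(0x1000, "read"), Zio(0x2000, "write")]))
print([hex(z.addr) for z in pipeline])  # ['0x2000']
```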