Debugging ZFS
From Illumos to Linux
Serapheim Dimitropoulos | Delphix
Background
Delphix
● Our product is an appliance/VM that we ship to customers
○ Either on the public cloud (e.g. AWS) or on-prem on top of a hypervisor (e.g. VMware)
● The core functionality of our product relies on ZFS
● Recently we switched our OS from Illumos to Linux
Question
“Can we maintain our existing debugging processes for issues in production with Linux?”
Debugging In Production
Main goal - Root-cause on first failure
For performance pathologies this means hopping onto the system to check monitoring logs for errors and
other clues, or examining runtime behavior through tracing.
For severe failures (e.g. panics and deadlocks) we collect a crash dump (and potentially on-disk state with
ZDB, if the VM is still running) for postmortem debugging.
Illumos was built with the above in mind, so our processes were created accordingly.
We’ve had some success adjusting to Linux, but we found general support for postmortem debugging to
be lacking (crash dump generation, size of debug info, tools to analyze dumps, etc.).
Postmortem Debugging
Postmortem Debugging - The Alternative
Consider the alternative when a customer VM crashes.
The information is limited to:
● What the customer thought to mention
● What support thought to ask
● Random/unrelated logs and maybe a stack trace
In most cases the above is not enough and you need to iterate with the customer.
Postmortem Debugging
The act of debugging a program after it has crashed.
For OS kernels this is generally done by analyzing a crash dump generated at the time of the crash.
A crash dump is a file on disk containing all (or some) of the system’s in-memory and processor state at
the time of the crash, like kernel pages and CPU register values.
At Delphix, crash dumps are an essential part of our debugging procedures.
Postmortem Debugging
A correctly-generated crash dump is comprehensive and never lies.
It can also decouple the activity of root-causing the failure from the process of restoring the system.
A Real-World Example
Failure: A Panic!
Console Log:
Think of the dmesg(1) kernel ring buffer in Linux ...
Failure: Where?
Failure: What?
Investigation Notes
● ZIO’s BP is not the same as the original BP
● This is a NOPWRITE ZIO
● ZIO’s BP is not an embedded BP
Aside: NOP-WRITE
● Performance optimization with space savings for snapshots
● ZFS compares the checksum of the incoming block vs the block on disk
● If they match, nothing has changed and we can skip issuing a write I/O
● Common in frequently overwritten files with almost-identical data
○ E.g. full-backups of large random-access files
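The decision described above can be sketched as a toy model in Python. This is illustrative only; the real logic lives in ZFS’s C code (zio_nop_write() in zio.c) and checks more than a checksum match:

```python
from dataclasses import dataclass

@dataclass
class Block:
    checksum: int      # strong checksum of the block contents
    compression: str   # compression algorithm used

def can_nop_write(incoming: Block, on_disk: Block) -> bool:
    """Toy model of the NOP-write decision: if the incoming block's
    checksum (and compression) match the block already on disk, the
    data is unchanged and the write I/O can be skipped."""
    return (incoming.checksum == on_disk.checksum
            and incoming.compression == on_disk.compression)

old = Block(checksum=0xdeadbeef, compression="lz4")
same = Block(checksum=0xdeadbeef, compression="lz4")
changed = Block(checksum=0xfeedface, compression="lz4")

print(can_nop_write(same, old))     # True: unchanged data, skip the write
print(can_nop_write(changed, old))  # False: data changed, must write
```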
Problem
We are issuing a NOP-WRITE but the BP differs from what’s on disk.
Notes
● The BP is not an embedded BP
Failure: Problem Summary
Failure: What now?
We understand the problem, just don’t know why it’s happening
Plan
1. Capture a crash dump for in-house analysis
2. Unblock customer by disabling NOP-WRITEs (System Recovery)
3. Figure out the problem from the crash-dump (Root-Cause Analysis)
4. Implement a fix and ship it to customers
Failure: System Recovery
Need to figure out the root-cause of the issue!
Have two choices
1. Start reading the ~12K lines of related code
2. Analyze the crash dump and ask targeted questions to home in on the culprit
Failure: OK, what now?
Failure: Path Forward
● Control flow in ZIO code is complex
● Stack traces from panics don’t tell you where it came from
● Generally the thread that issued the ZIO is not around anymore (async case)
○ Printing all the stack traces won’t do the trick
● We need to inspect the data of the actual ZIO
Failure: Inspecting the ZIO
zio_t pointer
Failure: Examining the BPs
BP of ZIO
BP on disk
Current TXG
Failure: Notes
● Issuing a NOP-WRITE
○ ZIO’s BP differs from on-disk BP
○ The BP on-disk was freed the TXG before the current one
● ZIO’s BP is not an embedded BP
● This was a write-override ZIO (io_done is dbuf_write_override_done)
Failure: Where did the ZIO come from?
Only one place where io_done is set to that!
New Clue!
2 suspects for the origin of our ZIO
Can’t be dmu_buf_write_embedded()
Must be dmu_sync() !
Failure: Examination of dmu_sync()
Failure: The Fix
Failure: Case Closed
Problem
ZFS issued a NOP-WRITE but the BP was different from the BP on disk
Root Cause
dmu_sync()’s check was incomplete: ZFS would not disable NOP-writes for recently freed blocks
Fix
Add check in dmu_sync() to see if block has been freed
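The shape of the fix can be sketched in a few lines. This is a hedged toy model, not the actual C change to dmu_sync(); the block-id/freed-TXG bookkeeping here is invented for illustration:

```python
def nop_write_allowed(incoming_cksum, ondisk_cksum, blkid,
                      freed_in_txg, current_txg):
    """Toy model of the fix: even if checksums match, refuse the
    NOP-write when the on-disk block was freed too recently (its
    free has not yet safely synced out)."""
    recently_freed = (blkid in freed_in_txg
                      and freed_in_txg[blkid] >= current_txg - 1)
    if recently_freed:
        return False  # the extra check the fix adds
    return incoming_cksum == ondisk_cksum

freed = {42: 99}  # block id 42 was freed in TXG 99
print(nop_write_allowed(0xabc, 0xabc, 7, freed, 100))   # True: never freed
print(nop_write_allowed(0xabc, 0xabc, 42, freed, 100))  # False: freed last TXG
```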
Postmortem Debugging - Recap
Crash dumps
● Allow you to examine processor and in-memory state at the time of the crash
● Bundled with ZDB output - all the state you’ll need for ZFS issues
● Decouple System Recovery from Root-Cause Analysis
SDB - The Slick Debugger
SDB
● A postmortem and live debugger
● User experience similar to MDB
○ Ask any question by chaining a pipeline of commands (Unix Shell Style)
● Can be easily extended with Python
Debugging ZFS with SDB
What’s going on in the system?
What’s going on in ZFS?
Threads issuing ZFS IOCTLs?
Examining Data Structures
Examining Data Structures
Figures out that we are passing
an AVL tree and walks the
structure appropriately in-order
Command either pretty-prints or pipes
all the spa_t structures depending on
where it is in a pipeline
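The in-order AVL walk described above can be illustrated with a toy binary search tree in Python (the real walker operates on the kernel’s AVL tree structures through drgn; this sketch only shows the traversal idea):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def walk_inorder(node):
    """What an SDB-style walker does conceptually for an AVL tree:
    recurse left, yield the node, recurse right, so entries come
    out in sorted order."""
    if node is None:
        return
    yield from walk_inorder(node.left)
    yield node.value
    yield from walk_inorder(node.right)

tree = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5)))
print(list(walk_inorder(tree)))  # [1, 2, 3, 4, 5, 6]
```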
TXG of rpool?
How many metaslabs are loaded in rpool?
Unflushed Allocation Segments in rpool
… above offset 0xa000b600?
Pretty-Printing BPs
Any I/O in the system at time of crash?
Memory usage - SPL caches
Indicates if SPL cache is backed by
the Linux SLUB allocator
Ordered by top offenders in active
memory by default
E.g. “arc_buf_hdr_t_full” is backed by
a Linux cache called “taskstats”
Memory usage of B-Tree leaves?
316 KB used for B-Tree leaves. This is
8% of the underlying Linux cache.
Overall cache utilization for
underlying cache is 65%.
Memory usage of metaslab_t structs?
How does SDB work?
drgn
Developed by Omar Sandoval @ Facebook - https://drgn.readthedocs.io/en/latest/index.html
A C library wrapped by a Python layer allowing the use of Python to introspect live-systems and crash
dumps.
● Python API and Object model are well-designed
● Fast start-up and command execution
● Still young and lacks certain features (e.g. function args) but promising
● Small but growing community that is open to patches
Writing Python in the REPL to debug can be cumbersome
SDB
A Python layer that leverages the drgn API to provide a debugging experience similar to MDB.
Can be extended in Python with new commands using:
1. The drgn API to query info from the debugging target
2. Pre-made constructs that allow them to receive and pass objects through a pipe
Point (2) allows for pipelines that are more powerful than what we had in MDB
(e.g. we pass objects with C type info instead of plain pointer/integer values through the pipe)
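The typed-object pipeline idea can be sketched with plain Python generators. The command names (spa, filter, member) are loosely modeled on SDB commands, but the implementations and the Spa stand-in type here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Spa:            # stand-in for a C `spa_t` with type info attached
    name: str
    metaslabs_loaded: int

def spa(_input):
    # Head of the pipeline: emit every pool object (hard-coded here).
    yield Spa("rpool", 12)
    yield Spa("tank", 48)

def filter_cmd(objs, pred):
    # Each stage receives typed objects and yields typed objects.
    for o in objs:
        if pred(o):
            yield o

def member(objs, attr):
    # Project a struct member out of each object flowing through.
    for o in objs:
        yield getattr(o, attr)

# Equivalent of: spa | filter 'metaslabs_loaded > 20' | member name
pipeline = member(filter_cmd(spa(None), lambda s: s.metaslabs_loaded > 20),
                  "name")
print(list(pipeline))  # ['tank']
```

Because each stage sees real objects with type information, downstream commands can dispatch on type rather than guessing what a raw pointer points at.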
SDB - Recap
● Debugger for live-systems and crash dumps
● Leverages drgn for introspecting its target and provides a shell-like interface (e.g. pipes)
● Can be easily extended in Python with new commands that
○ Walk complex data structures
○ Aggregate, filter, and pretty-print data
● A user can ask almost any question that can be answered given the available state
● Great for debugging ZFS on Linux!
Questions?
Resources
Thank you for your time!
Future Work
Future Work
SDB repo: https://github.com/delphix/sdb/tree/master
● More commands (help us at the hackathon tomorrow!)
● Tutorials for writing new commands
● Proper parser code & Test Suite
● Support for modules loaded at runtime
● Out of the box support for ztest core dumps
SDB & OpenZFS
● Discuss potentially moving ZFS-related commands into a module under contrib/
● Discuss potentially enabling crash dumps in VMs performing automated testing on GitHub
Implementing SDB Commands
Example - Hello, World!
Example - count
Example - list_t walker
Example - zfs_dbgmsg
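The original slides showed the command code on screen; as a stand-in, here is a hypothetical sketch of what a minimal SDB-style "Hello, World!" command could look like. The base-class name and hooks are invented for illustration (see the SDB repo for the actual API):

```python
class Command:
    """Hypothetical stand-in for SDB's command base class."""
    names = []
    def _call(self, objs):
        raise NotImplementedError

class HelloWorld(Command):
    """sdb> hello  ->  prints a greeting and passes its input through."""
    names = ["hello"]
    def _call(self, objs):
        print("Hello, World!")
        yield from objs   # pass objects through unchanged, pipe-style

cmd = HelloWorld()
out = list(cmd._call(iter([1, 2, 3])))
print(out)  # [1, 2, 3] -- with "Hello, World!" printed first
```

Even a trivial command follows the pipe contract: consume an iterator of objects, yield an iterator of objects.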
Debugger Criteria
Criteria
A good kernel debugger for our use-case:
1. Can access all available state in a live-system or a crash dump
2. Presents data in a precise and readable format
3. Is easily extensible
4. Doesn’t get in your way
Access to Everything
Should be able to at least:
1. Print all available stack traces together with their function arguments
2. Allow access to any available region in memory
3. Be able to walk complex data structures efficiently
Meaningful Output
The debugger should be able to:
● Present the same data in multiple ways
○ Each representation emphasizing the answer to a different question
● Output insightful reports drawing info from multiple data sources
Meaningful Output
Meaningful Output
This is ok.
Meaningful Output
This is better!
Meaningful Output
Extensibility
Developers should be able to extend the debugger, preferably without recompiling it.
Examples:
● MDB supports modules written in C
● GDB can be extended in Python, either by scripts or on the fly during a session
Doesn’t get in your way!
GDB without Python (old versions):
● A prompt with a laundry-list of commands
● Your questions were limited to what the debugger is programmed to answer
GDB with Python:
● You can ask anything you want, as long as you are willing to type code in the Python REPL
● … but now your focus shifts to programming (and whitespace…) instead of debugging
MDB:
● Pipes - sweet spot in the middle and familiar
● Ask any question you have by chaining a pipeline of commands (Unix Shell Style)
MDB Pipes Example
Print all segments of all the loaded metaslabs in ZFS:
Question: How many of them are of length 800?
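The screenshot of the MDB pipeline is not reproduced here, but its effect, filtering a stream of segments and counting the matches, can be mimicked in a few lines of Python (illustrative only; the segment lengths are made up):

```python
# Stand-in for the stream of segment lengths the MDB pipeline would
# produce by walking every loaded metaslab.
segments = [800, 512, 800, 256, 800, 800]

# Equivalent of piping the stream through a filter and a counter.
count = sum(1 for length in segments if length == 800)
print(count)  # 4
```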
Kernel Debuggers (in Linux)
GDB & KGDB
GDB
● Users recognize it and Python extensibility is a plus
● Not applicable - it no longer works with live kernels and kernel crash dumps
KGDB
● GDB but for the kernel!
● Not available in most distros - requires you to recompile the kernel to enable it
● Not applicable for Delphix - requires a second machine to introspect the first one
crash(8)
The SVR4 utility re-written as a layer that understands the Linux kernel and has GDB 7.6 embedded in it
to provide a familiar experience.
Works out-of-the-box with live systems, crash dumps, and even hypervisor snapshots (ESX, KVM, etc..)
Unfortunately
● Not easily extensible
○ C API exists but it’s lacking
○ Enabling Python in the embedded GDB doesn’t work properly due to its architecture
● Development seems to be in maintenance mode
crash-python
Developed by Jeff Mahoney @ SUSE - https://github.com/jeffmahoney/crash-python
A patched version of GDB that reads kernel crash dumps by leveraging libkdumpfile.
Well-designed but:
● GDB patch has been in the mailing list for years with no updates
○ Downstream patch maintained by one person
● Doesn’t work with live-systems
○ Enabling this requires more GDB patches
drgn
Developed by Omar Sandoval @ Facebook - https://drgn.readthedocs.io/en/latest/index.html
A C library wrapped by a Python layer allowing the use of Python to introspect live-systems and crash
dumps.
● Python API and Object model well-designed
● Fast start-up and command execution
● Still young and lacks certain features (e.g. function args) but promising
● Small but growing community that is open to patches
Checks most of our boxes but writing Python in the REPL to debug can be cumbersome
SDB
A Python layer that leverages the drgn API to provide a debugging experience similar to MDB.
Provides a set of primitive commands that can be chained together in a pipeline.
Can be extended in Python with new commands using:
1. The drgn API to query info from the debugging target
2. Pre-made constructs that allow them to receive and pass objects through a pipe
Point (2) allows for pipelines that are more powerful than what we had in MDB
(e.g. we pass objects with C type info instead of plain pointer/integer values through the pipe)
Questions?
Resources
Thank you for your time!

More Related Content

What's hot

Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
ice799
 
Debug tutorial
Debug tutorialDebug tutorial
Debug tutorial
Defri N
 
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
Peter Hlavaty
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Peter Hlavaty
 

What's hot (20)

Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and future
 
Practical Windows Kernel Exploitation
Practical Windows Kernel ExploitationPractical Windows Kernel Exploitation
Practical Windows Kernel Exploitation
 
Kernel Recipes 2019 - Formal modeling made easy
Kernel Recipes 2019 - Formal modeling made easyKernel Recipes 2019 - Formal modeling made easy
Kernel Recipes 2019 - Formal modeling made easy
 
Driver Debugging Basics
Driver Debugging BasicsDriver Debugging Basics
Driver Debugging Basics
 
Kernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at FacebookKernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at Facebook
 
Back to the CORE
Back to the COREBack to the CORE
Back to the CORE
 
DeathNote of Microsoft Windows Kernel
DeathNote of Microsoft Windows KernelDeathNote of Microsoft Windows Kernel
DeathNote of Microsoft Windows Kernel
 
Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)
 
.NET Debugging Workshop
.NET Debugging Workshop.NET Debugging Workshop
.NET Debugging Workshop
 
Improving the ZFS Userland-Kernel API with Channel Programs - BSDCAN 2017 - M...
Improving the ZFS Userland-Kernel API with Channel Programs - BSDCAN 2017 - M...Improving the ZFS Userland-Kernel API with Channel Programs - BSDCAN 2017 - M...
Improving the ZFS Userland-Kernel API with Channel Programs - BSDCAN 2017 - M...
 
Power of linked list
Power of linked listPower of linked list
Power of linked list
 
Advanced windows debugging
Advanced windows debuggingAdvanced windows debugging
Advanced windows debugging
 
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
 
Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
 
Debug tutorial
Debug tutorialDebug tutorial
Debug tutorial
 
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
 
Attack on the Core
Attack on the CoreAttack on the Core
Attack on the Core
 
Rainbow Over the Windows: More Colors Than You Could Expect
Rainbow Over the Windows: More Colors Than You Could ExpectRainbow Over the Windows: More Colors Than You Could Expect
Rainbow Over the Windows: More Colors Than You Could Expect
 
Fuzzing the Media Framework in Android
Fuzzing the Media Framework in AndroidFuzzing the Media Framework in Android
Fuzzing the Media Framework in Android
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
 

Similar to Debugging ZFS: From Illumos to Linux

Introductiontoasp netwindbgdebugging-100506045407-phpapp01
Introductiontoasp netwindbgdebugging-100506045407-phpapp01Introductiontoasp netwindbgdebugging-100506045407-phpapp01
Introductiontoasp netwindbgdebugging-100506045407-phpapp01
Camilo Alvarez Rivera
 
A Gentle Introduction to Docker and Containers
A Gentle Introduction to Docker and ContainersA Gentle Introduction to Docker and Containers
A Gentle Introduction to Docker and Containers
Docker, Inc.
 
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Docker, Inc.
 

Similar to Debugging ZFS: From Illumos to Linux (20)

12 tricks to avoid hackers breaks your CI / CD
12 tricks to avoid hackers breaks your  CI / CD12 tricks to avoid hackers breaks your  CI / CD
12 tricks to avoid hackers breaks your CI / CD
 
Introductiontoasp netwindbgdebugging-100506045407-phpapp01
Introductiontoasp netwindbgdebugging-100506045407-phpapp01Introductiontoasp netwindbgdebugging-100506045407-phpapp01
Introductiontoasp netwindbgdebugging-100506045407-phpapp01
 
.Net Debugging Techniques
.Net Debugging Techniques.Net Debugging Techniques
.Net Debugging Techniques
 
.NET Debugging Tips and Techniques
.NET Debugging Tips and Techniques.NET Debugging Tips and Techniques
.NET Debugging Tips and Techniques
 
Surge2012
Surge2012Surge2012
Surge2012
 
Road to sbt 1.0 paved with server
Road to sbt 1.0   paved with serverRoad to sbt 1.0   paved with server
Road to sbt 1.0 paved with server
 
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
 
A Gentle Introduction to Docker and Containers
A Gentle Introduction to Docker and ContainersA Gentle Introduction to Docker and Containers
A Gentle Introduction to Docker and Containers
 
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
Lightweight Virtualization with Linux Containers and Docker | YaC 2013Lightweight Virtualization with Linux Containers and Docker | YaC 2013
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
 
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
It's always sunny with OpenJ9
It's always sunny with OpenJ9It's always sunny with OpenJ9
It's always sunny with OpenJ9
 
Vxcon 2016
Vxcon 2016Vxcon 2016
Vxcon 2016
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
Sonatype DevSecOps Leadership forum 2020
Sonatype DevSecOps Leadership forum 2020Sonatype DevSecOps Leadership forum 2020
Sonatype DevSecOps Leadership forum 2020
 
Real-World Docker: 10 Things We've Learned
Real-World Docker: 10 Things We've Learned  Real-World Docker: 10 Things We've Learned
Real-World Docker: 10 Things We've Learned
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
 
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
 
Demo 0.9.4
Demo 0.9.4Demo 0.9.4
Demo 0.9.4
 
C# Production Debugging Made Easy
 C# Production Debugging Made Easy C# Production Debugging Made Easy
C# Production Debugging Made Easy
 

Recently uploaded

%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 

Recently uploaded (20)

%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 

Debugging ZFS: From Illumos to Linux

  • 1. Debugging ZFS From Illumos to Linux Serapheim Dimitropoulos | Delphix
  • 3. Background Delphix ● Our product is an appliance/VM that we ship to customers ○ Either on the public cloud (e.g. AWS) or on-prem on top of a hypervisor (e.g. VMware) ● The core functionality of our product relies on ZFS ● Recently we switched our OS from Illumos to Linux Question “Can we maintain our existing debugging processes for issues in production with Linux?”
  • 4. Debugging In Production Main goal - Root-cause on first failure For performance pathologies this means hoping on the system to check any monitoring logs for errors and other clues or examining the runtime behavior through tracing. For severe failures (e.g. panics and deadlocks) we collect a crash dump (and potentially on disk state with ZDB if the VM was still running) for postmortem debugging. Illumos was built with the above in mind so our processes were created accordingly. We’ve had some success adjusting to Linux but we found general support for postmortem debugging to be lacking. (crash dump generation, size of debug info, tools to analyze dumps, etc..)
  • 6. Postmortem Debugging - The Alternative Consider the alternative when a customer VM crashes. The information is limited to: ● What the customer thought to mention ● What support thought to ask ● Random/unrelated logs and maybe a stack trace In most cases the above is not enough and you need to iterate with the customer.
  • 7. Postmortem Debugging The act of debugging a program after it has crashed. For OS kernels this is generally done by analyzing a crash dump generated at the time of the crash. A crash dump is a file on disk containing all (or some) of the system’s in-memory and processor state at the time of the crash, like kernel pages and CPU register values. At Delphix crash-dumps are an essential part of our debugging procedures.
  • 8. Postmortem Debugging A correctly-generated crash dump is comprehensive and never lies. It can also decouple the activity of root-causing the failure from the process of restoring the system.
  • 10. Failure: A Panic! Console Log: Think dmesg(1) kernel ring buffer in Linux ...
  • 13. Investigation Notes ● ZIO’s BP is not the same as original BP Failure: What?
  • 14. Investigation Notes ● ZIO’s BP is not the same as original BP ● This is a NOPWRITE ZIO Failure: What?
  • 15. Investigation Notes ● ZIO’s BP is not the same as original BP ● This is a NOPWRITE ZIO ● ZIO’s BP is not an embedded BP Failure: What?
  • 16. Aside: NOP-WRITE ● Performance optimization with space savings for snapshots ● ZFS compares checksums of incoming block vs block on disk ● If they match, nothing has changed and we can skip issuing a write I/O ● Common in frequently overwritten files with almost-identical data ○ E.g. full-backups of large random-access files
  • 17. Problem We are issuing a NOP-WRITE but BP differs from what’s on disk. Notes ● The BP is not an embedded BP Failure: Problem Summary
  • 18. Failure: What now? We understand the problem, just don’t know why it’s happening Plan 1. Capture a crash dump for in-house analysis 2. Unblock customer by disabling NOP-WRITEs (System Recovery) 3. Figure out the problem from the crash-dump (Root-Cause Analysis) 4. Implement a fix and ship it to customers
  • 20. Need to figure out the root-cause of the issue! Have two choices 1. Start reading the ~12K lines of related code 2. Analyze the crash dump and make targeted questions towards the culprit Failure: OK, what now?
  • 22. Failure: Path Forward ● Control flow in ZIO code is complex
  • 23. ● Control flow in ZIO code is complex ● Stack traces from panics don’t tell you where it came from Failure: Path Forward
  • 24. ● Control flow in ZIO code is complex ● Stack traces from panics don’t tell you where it came from ● Generally the thread that issued the ZIO is not around anymore (async case) ○ Printing all the stack traces won’t do the trick Failure: Path Forward
  • 25. Failure: Path Forward ● Control flow in ZIO code is complex ● Stack traces from panics don’t tell you where it came from ● Generally the thread that issued the ZIO is not around anymore (async case) ○ Printing all the stack traces won’t do the trick ● We need to inspect the data of the actual ZIO
  • 26. Failure: Inspecting the ZIO zio_t pointer
  • 27. Failure: Examining the BPs BP of ZIO BP on disk
  • 28. Failure: Examining the BPs BP on disk Current TXG
  • 29. Failure: Notes ● Issuing a NOP-WRITE ○ ZIO’s BP differs from on-disk BP ○ The BP on-disk was freed the TXG before the current one ● ZIO’s BP is not an embedded BP ● This was a write-override ZIO (io_done is dbuf_write_override_done)
  • 30. Failure: Where did the ZIO come from? Only one place where io_done is set to that!
  • 31. Failure: Where did the ZIO come from? New Clue! 2 suspects for the origin of our ZIO Can’t be dmu_buf_write_embedded() Must be dmu_sync() !
  • 34. Failure: Case Closed Problem ZFS issued a NOP-WRITE but the BP was different from the BP on disk Root Cause dmu_sync()’s check wasn’t complete and ZFS wouldn’t disable nop-writes for recently freed blocks Fix Add check in dmu_sync() to see if block has been freed
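The shape of the fix can be modeled in a few lines of Python. This is a loose model, not the actual dmu_sync() code: the names are invented, and the idea that a free stays pending for a small window of TXGs is a simplification.

```python
# Assumed for illustration: a freed block's free can still be pending
# for roughly this many TXGs, during which its on-disk BP is stale.
TXG_DEFER_WINDOW = 2

def nopwrite_allowed(checksums_match: bool, freed_txg, current_txg: int) -> bool:
    """Model of the corrected check: matching checksums alone are not
    enough; if the on-disk block was freed recently, its BP is stale
    and the nop-write optimization must be disabled."""
    if not checksums_match:
        return False
    if freed_txg is not None and current_txg - freed_txg <= TXG_DEFER_WINDOW:
        return False  # the fix: block was freed too recently
    return True

print(nopwrite_allowed(True, freed_txg=None, current_txg=100))  # True
print(nopwrite_allowed(True, freed_txg=99, current_txg=100))    # False
```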
  • 35. Postmortem Debugging - Recap Crash dumps ● Allow you to examine processor and in-memory state at the time of the crash ● Bundled with ZDB output - all the state you’ll need for ZFS issues ● Decouple System Recovery from Root-Cause Analysis
  • 36. SDB - The Slick Debugger
  • 37. SDB ● A postmortem and live debugger ● User experience similar to MDB ○ Ask any question by chaining a pipeline of commands (Unix Shell Style) ● Can be easily extended with Python
  • 39. What’s going on in the system?
  • 40. What’s going on in ZFS?
  • 43. Examining Data Structures Figures out that we are passing an AVL tree and walks the structure appropriately in-order
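A minimal model of such a walker, with a plain binary tree standing in for the kernel's AVL tree (`Node` and `walk_inorder` are invented names; the real debugger walks the C structure via type info):

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Node:
    value: int
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None

def walk_inorder(node: Optional[Node]) -> Iterator[int]:
    """Yield tree members in sorted (in-order) sequence, the same
    order an AVL walker in the debugger would produce."""
    if node is not None:
        yield from walk_inorder(node.left)
        yield node.value
        yield from walk_inorder(node.right)

tree = Node(2, Node(1), Node(3))
print(list(walk_inorder(tree)))  # [1, 2, 3]
```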
  • 44. Command either pretty-prints or pipes all the spa_t structures depending on where it is in a pipeline
  • 46. How many metaslabs are loaded in rpool?
  • 47. Unflushed Allocation Segments in rpool … above offset 0xa000b600?
  • 49. Any I/O in the system at time of crash?
  • 50. Memory usage - SPL caches Indicates if SPL cache is backed by the Linux SLUB allocator Ordered by top offenders in active memory by default E.g. “arc_buf_hdr_t_full” is backed by a Linux cache called “taskstats”
  • 51. Memory usage of B-Tree leaves? 316 KB used for B-Tree leaves. This is 8% of the underlying Linux cache. Overall cache utilization for underlying cache is 65%.
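The percentages in a report like this reduce to simple arithmetic over the cache counters. A sketch with invented names and illustrative numbers (chosen to reproduce the slide's 316 KB / 8% / 65%, not taken from a real system):

```python
def cache_report(obj_size: int, n_objs: int,
                 cache_total: int, cache_used: int):
    """Return (bytes used by this consumer, its share of the cache,
    overall cache utilization) from raw cache counters."""
    used = obj_size * n_objs
    return used, used / cache_total, cache_used / cache_total

used, share, util = cache_report(obj_size=512, n_objs=632,
                                 cache_total=4_000_000,
                                 cache_used=2_600_000)
print(used // 1024, "KB")       # 316 KB
print(round(share * 100), "%")  # 8 %
print(round(util * 100), "%")   # 65 %
```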
  • 52. Memory usage of metaslab_t structs?
  • 53. How does SDB work?
  • 54. drgn Developed by Omar Sandoval @ Facebook - https://drgn.readthedocs.io/en/latest/index.html A C library wrapped by a Python layer allowing the use of Python to introspect live-systems and crash dumps. ● Well-designed Python API and object model ● Fast start-up and command execution ● Still young and lacks certain features (e.g. function args) but promising ● Small but growing community that is open to patches Writing Python in the REPL to debug can be cumbersome
  • 55. SDB A Python layer that leverages the drgn API to provide a debugging experience similar to MDB. Can be extended in Python with new commands using: 1. The drgn API to query info from the debugging target 2. Pre-made constructs that allow them to receive and pass objects through a pipe Point (2) allows for pipelines that are more powerful than what we had in MDB (e.g. we pass objects with C type info vs plain pointers/integers values through the pipe)
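A toy model of that pipe mechanism: each command consumes and yields a stream of typed objects. Real SDB commands subclass sdb.Command and carry drgn type information; the names below (`Spa`, `spa`, `member`, `run`) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

@dataclass
class Spa:
    name: str
    loaded_metaslabs: int

POOLS = [Spa("rpool", 12), Spa("data", 48)]

def spa(_objs: Iterable) -> Iterator[Spa]:
    """Source command: emit every pool, like `spa` at the head of a pipeline."""
    yield from POOLS

def member(field: str) -> Callable:
    """Pipe command: project a field from each incoming object."""
    def cmd(objs: Iterable) -> Iterator:
        for obj in objs:
            yield getattr(obj, field)
    return cmd

def run(source: Callable, *cmds: Callable) -> List:
    """Chain commands left to right, shell-pipeline style."""
    stream = source(iter(()))
    for cmd in cmds:
        stream = cmd(stream)
    return list(stream)

print(run(spa, member("name")))  # ['rpool', 'data']
```

Because real objects (not plain integers) flow through the pipe, a downstream command can inspect the incoming object's type and adapt, which is what makes these pipelines stronger than MDB's.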
  • 56. SDB - Recap ● Debugger for live-systems and crash dumps ● Leverages drgn for introspecting its target and provides a shell-like interface (e.g. pipes) ● Can be easily extended in Python with new commands that ○ Walk complex data structures ○ Aggregate, filter, and pretty-print data ● A user can ask almost any question that can be answered given the available state ● Great for debugging ZFS on Linux!
  • 59. Thank you for your time!
  • 61. Future Work SDB repo: https://github.com/delphix/sdb/tree/master ● More commands (help us at the hackathon tomorrow!) ● Tutorials for writing new commands ● Proper parser code & Test Suite ● Support for modules loaded at runtime ● Out of the box support for ztest core dumps SDB & OpenZFS ● Discuss potentially moving ZFS-related commands in a module under contrib/ ● Discuss potentially enabling crash dumps in VMs performing automated testing on Github
  • 68. Criteria A good kernel debugger for our use-case: 1. Can access all available state in a live-system or a crash dump 2. Presents data in a precise and readable format 3. Is easily extensible 4. Doesn’t get in your way
  • 69. Access to Everything Should be able to at least: 1. Print all available stack traces together with their function arguments 2. Allow access to any available region in memory 3. Be able to walk complex data structures efficiently
  • 70. Meaningful Output The debugger should be able to: ● Present the same data in multiple ways ○ Each representation emphasizing the answer to a different question ● Output insightful reports drawing info from multiple data sources
  • 75. Extensibility Developers should be able to extend the debugger, preferably without recompiling it. Examples: ● MDB supports modules written in C ● GDB can be extended in Python, either by scripts or on the fly during a session
  • 76. Doesn’t get in your way! GDB without Python (old versions): ● A prompt with a laundry-list of commands ● Your questions were limited to what the debugger is programmed to answer GDB with Python: ● You can ask anything you want, as long as you are willing to type code in the Python REPL ● … but now your focus is more on programming (and whitespace…) than on debugging MDB: ● Pipes - sweet spot in the middle and familiar ● Ask any question you have by chaining a pipeline of commands (Unix Shell Style)
  • 77. MDB Pipes Example Question: How many of them are of length 800? Print all segments of all the loaded metaslabs in ZFS:
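In Python terms the question reduces to a filter plus a count over the stream of segments produced by the walk. The segment values below are invented, and the slide's "800" is taken as a hexadecimal segment size for illustration:

```python
# Each loaded metaslab contributes a stream of (start, size) segments;
# this is the pipeline equivalent of `... | filter size == 0x800 | count`.
segments = [(0x0, 0x800), (0x1000, 0x400), (0x2000, 0x800), (0x3000, 0x200)]
matches = sum(1 for _start, size in segments if size == 0x800)
print(matches)  # 2
```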
  • 79. GDB & KGDB GDB ● Users recognize it and Python extensibility is a plus ● Not applicable - it no longer works with live kernels and kernel crash dumps KGDB ● GDB but for the kernel! ● Not available in most distros - requires you to recompile the kernel to enable it ● Not applicable for Delphix - requires a second machine to introspect the first one
  • 80. crash(8) The SVR4 utility re-written as a layer that understands the Linux kernel and has GDB 7.6 embedded in it to provide a familiar experience. Works out-of-the-box with live systems, crash dumps, and even hypervisor snapshots (ESX, KVM, etc..) Unfortunately ● Not easily extensible ○ C API exists but it’s lacking ○ Enabling Python in the embedded GDB doesn’t work properly due to its architecture ● Development seems to be in maintenance mode
  • 81. crash-python Developed by Jeff Mahoney @ SUSE - https://github.com/jeffmahoney/crash-python A patched version of GDB that reads kernel crash dumps by leveraging libkdumpfile. Well-designed but: ● GDB patch has been in the mailing list for years with no updates ○ Downstream patch maintained by 1 person ● Doesn’t work with live-systems ○ Enabling this requires more GDB patches
  • 82. drgn Developed by Omar Sandoval @ Facebook - https://drgn.readthedocs.io/en/latest/index.html A C library wrapped by a Python layer allowing the use of Python to introspect live-systems and crash dumps. ● Python API and Object model well-designed ● Fast start-up and command execution ● Still young and lacks certain features (e.g. function args) but promising ● Small but growing community that is open to patches Checks most of our boxes but writing Python in the REPL to debug can be cumbersome
  • 83. SDB A Python layer that leverages the drgn API to provide a debugging experience similar to MDB. Provides a set of primitive commands that can be chained together in a pipeline. Can be extended in Python with new commands using: 1. The drgn API to query info from the debugging target 2. Pre-made constructs that allow them to receive and pass objects through a pipe Point (2) allows for pipelines that are more powerful than what we had in MDB (e.g. we pass objects with C type info vs plain pointers/integers values through the pipe)
  • 86. Thank you for your time!