USENIX LISA09


   ZFS Tutorial
Richard.Elling@RichardElling.com
Agenda
Overview
Foundations
Pooled Storage Layer
Transactional Object Layer
Commands
  zpool
  zfs
Sharing
Properties
Performance
Troubleshooting
Wrap


                               2
Ground Rules
No religious discussion
No licensing discussion
No “future of <company>” discussion
No zones/containers/jails discussion
No “when is it going to be in Solaris 10” discussion... ok maybe a few...




                                                               3
History
Announced September 14, 2004
Integration history
   SXCE b27 (November 2005)
   FreeBSD (April 2007)
   Mac OSX Leopard
       Preview shown, but removed from Snow Leopard
       Disappointed community reforming as the zfs-macos google group
         (Oct 2009)
  OpenSolaris 2008.05
  Solaris 10 6/06 (June 2006)
  Linux FUSE (summer 2006)
  greenBytes ZFS+ (September 2008)
More than 45 patents, contributed to the CDDL Patents Common


                                                               4
Brief List of Features
Features:
   Future-proof
   Cutting-edge data integrity
   High performance
   Simplified administration
   Eliminates need for volume managers
   Reduced costs
   Compatibility with POSIX file system & block devices
   Self-healing

Marketing claims (2 drink minimum):
   “No silent data corruption ever”
   “Mind-boggling scalability”
   “Breathtaking speed”
   “Near zero administration”
   “Radical new architecture”
   “Greatly simplifies support issues”
   “RAIDZ saves money”

                                                          5
ZFS Design Goals
Figure out why storage has gotten so complicated
Blow away 20+ years of obsolete assumptions
Gotta replace UFS
Design an integrated system from scratch
End the suffering




                                                   6
Limits

2^48 — Number of entries in any individual directory
2^56 — Number of attributes of a file [1]
2^56 — Number of files in a directory [1]
16 EiB (2^64 bytes) — Maximum size of a file system
16 EiB — Maximum size of a single file
16 EiB — Maximum size of any attribute
2^64 — Number of devices in any pool
2^64 — Number of pools in a system
2^64 — Number of file systems in a pool
2^64 — Number of snapshots of any file system
256 ZiB (2^78 bytes) — Maximum size of any pool
[1] actually constrained to 2^48 for the number of files in a ZFS file system


                                                                       7
Sidetrack: Understanding Builds
Build is often referenced when speaking of feature/bug integration
Short-hand notation: b#
OpenSolaris and SXCE are based on NV (Nevada)
   SXCE will soon end
   OpenSolaris carries forward
ZFS development is done for NV
  Bi-weekly build cycle
  Schedule at http://opensolaris.org/os/community/on/schedule/
ZFS is ported to Solaris 10 and other OSes




                                                             8
Foundations


              9
Overhead View of a Pool

[Diagram: a pool holds configuration information and datasets – file systems and volumes]
                                                 10
Layer View

[Layer diagram: consumers (raw, swap, dump, iSCSI, ... | ZFS, NFS, CIFS, ... | pNFS, Lustre, ...)
 → ZFS Volume Emulator (Zvol) and ZFS POSIX Layer (ZPL)
 → Transactional Object Layer
 → Pooled Storage Layer
 → Block Device Driver
 → HDD, SSD, iSCSI, ...]
                                                                        11
Source Code Structure
[Diagram:
 User: file system consumers, device consumers, GUI (via JNI), management (via libzfs)
 Kernel:
   Interface Layer – ZPL, ZVol, /dev/zfs
   Transactional Object Layer – ZIL, ZAP, Traversal, DMU, DSL
   Pooled Storage Layer – ARC, ZIO, VDEV, Configuration]
                                                                    12
Acronyms
ARC – Adaptive Replacement Cache
DMU – Data Management Unit
DSL – Dataset and Snapshot Layer
JNI – Java Native Interface
ZPL – ZFS POSIX Layer (traditional file system interface)
VDEV – Virtual Device layer
ZAP – ZFS Attribute Processor
ZIL – ZFS Intent Log
ZIO – ZFS I/O layer
Zvol – ZFS volume (raw/cooked block device interface)




                                                            13
nvlists
name=value pairs
libnvpair(3LIB)
Allows ZFS capabilities to change without changing the physical on-
   disk format
Data stored is XDR encoded
A good thing, used often




                                                            14
Versioning
Features can be added and identified by nvlist entries
Changes in pool or dataset versions do not change the physical on-disk
  format (!)
   they do change nvlist parameters
Older versions can be used
   might see warning messages, but they are harmless
Available versions and features can be easily viewed
   zpool upgrade -v
   zfs upgrade -v
Online references
   zpool: www.opensolaris.org/os/community/zfs/version/N
   zfs: www.opensolaris.org/os/community/zfs/version/zpl/N

    Don't confuse zpool and zfs versions
                                                             15
zpool versions
VER   DESCRIPTION
---   --------------------------------------------------------
 1    Initial ZFS version
 2    Ditto blocks (replicated metadata)
 3    Hot spares and double parity RAID-Z
 4    zpool history
 5    Compression using the gzip algorithm
 6    bootfs pool property
 7    Separate intent log devices
 8    Delegated administration
 9    refquota and refreservation properties
 10   Cache devices
 11   Improved scrub performance
 12   Snapshot properties
 13   snapused property
 14   passthrough-x aclinherit support
 15   user/group space accounting
 16   stmf property support
 17   Triple-parity RAID-Z
 18   snapshot user holds
 19   Log device removal
                                                                 16
zfs versions
VER   DESCRIPTION
---   --------------------------------------------------------
 1    Initial ZFS filesystem version
 2    Enhanced directory entries
 3    Case insensitive and File system unique identifier (FUID)
 4    userquota, groupquota properties




                                                                  17
Copy on Write
1. Initial block tree     2. COW some data




3. COW metadata         4. Update Uberblocks & free




                                             18
COW Notes
COW works on blocks, not files
ZFS reserves 32 MBytes or 1/64 of
  pool size
   COWs need some free space to
      remove files
   need space for ZIL
For fixed-record size workloads
  “fragmentation” and “poor
  performance” can occur if the
  recordsize is not matched
Spatial distribution is good fodder for
  performance speculation
   affects HDDs
   moot for SSDs


                                               19
Pooled Storage Layer

[Layer diagram repeated – this section covers the Pooled Storage Layer]
                                                                        20
vdevs – Virtual Devices
                  Logical vdevs

                        root vdev



       top-level vdev                      top-level vdev
         children[0]                         children[1]
           mirror                              mirror




   vdev             vdev               vdev             vdev
type=disk        type=disk          type=disk        type=disk
children[0]      children[1]        children[0]      children[1]


              Physical or leaf vdevs

                                                                   21
vdev Labels
vdev labels != disk labels
Four 256 kByte labels written to every physical vdev
Two-stage update process
   write label0 & label2
   flush cache & check for errors
   write label1 & label3
   flush cache & check for errors
Label layout on each leaf vdev: label0 at offset 0, label1 at 256k, boot block from 512k to 4M,
label2 at N-512k and label3 at N-256k, where N is the vdev size (aligned to 256k)
Within each 256k label: 8k blank, 8k boot header, name=value pairs (16k–128k), and a
128-slot uberblock array (128k–256k)
                                                                            22
Observing Labels
# zdb -l /dev/rdsk/c0t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    name='rpool'
    state=0
    txg=13152
    pool_guid=17111649328928073943
    hostid=8781271
    hostname=''
    top_guid=11960061581853893368
    guid=11960061581853893368
    vdev_tree
        type='disk'
        id=0
        guid=11960061581853893368
        path='/dev/dsk/c0t0d0s0'
        devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a'
        phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a'
        whole_disk=0
        metaslab_array=24
        metaslab_shift=30
        ashift=9
        asize=157945167872
        is_log=0

                                                                23
To fsck or not to fsck
fsck was created to fix known inconsistencies in file system metadata
  UFS is not transactional
  metadata inconsistencies must be reconciled
  does NOT repair data – how could it?
ZFS doesn't need fsck, as-is
   all on-disk changes are transactional
   COW means previously existing, consistent metadata is not
      overwritten
   ZFS can repair itself
      metadata is at least dual-redundant
      data can also be redundant
Reality check – this does not mean that ZFS is not susceptible to
  corruption
   nor is any other file system
                                                             24
VDEV


       25
Dynamic Striping
   RAID-0
     −  SNIA definition: fixed-length sequences of virtual disk data
        addresses are mapped to sequences of member disk
        addresses in a regular rotating pattern
   Dynamic Stripe
     −   Data is dynamically mapped to member disks
     −   No fixed-length sequences
     −   Allocate up to ~1 MByte/vdev before changing vdev
     −   vdevs can be different size
     −   Good combination of the concatenation feature with RAID-0
         performance




                                                               26
Dynamic Striping

RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes




ZFS Dynamic Stripe recordsize = 128 kBytes




             Total write size = 2816 kBytes

                                                       27
Mirroring
   Straightforward: put N copies of the data on N vdevs
   Unlike RAID-1
     −    No 1:1 mapping at the block level
     −    vdev labels are still at beginning and end
     −    vdevs can be of different size
              effective space is that of smallest vdev
   Arbitration: ZFS does not blindly trust either side of mirror
     −    Most recent, correct view of data wins
     −    Checksums validate data




                                                                    28
Mirroring




   29
Dynamic vdev Replacement
    zpool replace poolname vdev [vdev]
     Today, the replacing vdev must be the same size or larger
       −   Before b117: as measured by blocks
       −   After b117: as measured by metaslabs
    Replacing all vdevs in a top-level vdev with larger vdevs results in
     top-level vdev resizing
    Policy controlled by zpool autoexpand property



    [Diagram: a mirror whose disks are replaced one at a time with larger disks (10G → 15G → 20G)
     grows from a 10G mirror to a 15G mirror to a 20G mirror – effective size is always that of
     the smallest disk]
                                                                    30
RAIDZ
   RAID-5
     −  Parity check data is distributed across the RAID array's disks
     −  Must read/modify/write when data is smaller than stripe width
   RAIDZ
     −    Dynamic data placement
     −    Parity added as needed
     −    Writes are full-stripe writes
     −    No read/modify/write (write hole)
   Arbitration: ZFS does not blindly trust any device
     −     Does not rely on disk reporting read error
     −     Checksums validate data
     −     If checksum fails, read parity

         Space used is dependent on how used
                                                              31
RAID-5 vs RAIDZ

         DiskA   DiskB   DiskC   DiskD   DiskE
         D0:0    D0:1    D0:2    D0:3     P0
RAID-5    P1     D1:0    D1:1    D1:2    D1:3
         D2:3     P2     D2:0    D2:1    D2:2
         D3:2    D3:3     P3     D3:0    D3:1


         DiskA   DiskB   DiskC   DiskD   DiskE
          P0     D0:0    D0:1    D0:2    D0:3
RAIDZ     P1     D1:0    D1:1    P2:0    D2:0
         D2:1    D2:2    D2:3    P2:1    D2:4
         D2:5    Gap      P3     D3:0    D3:1

                                             32
RAID-5 Write Hole
   Occurs when data to be written is smaller than stripe size
    Must read the untouched columns to recalculate the parity, or
     read/modify/write the parity
   Read/modify/write is risky for consistency
     −    Multiple disks
     −    Reading independently
     −    Writing independently
     −    System failure before all writes are complete to media could
          result in data loss
   Effects can be hidden from host using RAID array with nonvolatile
    write cache, but extra I/O cannot be hidden from disks




                                                                 33
RAIDZ2 and RAIDZ3
   RAIDZ2 = double parity RAIDZ
   RAIDZ3 = triple parity RAIDZ
   Sorta like RAID-6
      −    Parity 1: XOR
      −    Parity 2: another Reed-Solomon syndrome
      −    Parity 3: yet another Reed-Solomon syndrome
   Arbitration: ZFS does not blindly trust any device
     −   Does not rely on disk reporting read error
     −   Checksums validate data
     −   If data not valid, read parity
     −   If data still not valid, read other parity


             Space used is dependent on how used
                                                         34
Evaluating Data Retention

   MTTDL = Mean Time To Data Loss
   Note: MTBF is not constant in the real world, but assuming it is keeps the math simple
   MTTDL[1] is a simple MTTDL model
   No parity (single vdev, striping, RAID-0)
     −    MTTDL[1] = MTBF / N
   Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
     −    MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
   Double Parity (3-way mirror, RAIDZ2, RAID-6)
     −    MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
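A rough worked example, using assumed numbers purely for illustration: MTBF = 1,000,000 hours,
N = 10 disks, MTTR = 24 hours
   No parity:     MTTDL[1] = 1,000,000 / 10 = 100,000 hours (~11 years)
   Single parity: MTTDL[1] = 1,000,000^2 / (10 * 9 * 24) ≈ 4.6 x 10^8 hours (~53,000 years)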




                                                                35
Another MTTDL Model
   MTTDL[1] model doesn't take unrecoverable reads into account
   But unrecoverable reads (UER) are becoming the dominant failure
    mode
     −  UER specified as errors per bits read
     −  More bits = higher probability of loss per vdev
   MTTDL[2] model considers UER




                                                            36
Why Worry about UER?

   Richard's study
     −   3,684 hosts with 12,204 LUNs
     −   11.5% of all LUNs reported read errors
   Bairavasundaram et al., FAST08
    www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
     −   1.53M LUNs over 41 months
     −   RAID reconstruction discovers 8% of checksum mismatches
     −   4% of disks studied developed checksum errors over 17 months




                                                              37
MTTDL[2] Model

   Probability that a reconstruction will fail
     −  Precon_fail = (N-1) * size / UER
   Model doesn't work for non-parity schemes (single vdev, striping,
    RAID-0)
   Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
     −   MTTDL[2] = MTBF / (N * Precon_fail)
   Double Parity (3-way mirror, RAIDZ2, RAID-6)
      −    MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)




                                                                 38
Practical View of MTTDL[1]




                     39
MTTDL Models: Mirror




                40
MTTDL Models: RAIDZ2




               41
Ditto Blocks
Recall that each blkptr_t contains 3 DVAs
Dataset property used to indicate how many copies (aka ditto blocks)
  of data is desired
   Write all copies
   Read any copy
   Recover corrupted read from a copy
Not a replacement for mirroring
Easier to describe in pictures...

      copies parameter        Data copies      Metadata copies
      copies=1 (default)             1                 2
      copies=2                       2                 3
      copies=3                       3                 3
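For example, extra copies can be requested per dataset (dataset name is hypothetical):
    # zfs set copies=2 mypool/important
    # zfs get copies mypool/important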

                                                            42
Copies in Pictures




            43
Copies in Pictures




            44
ZIO – ZFS I/O Layer


                  45
ZIO Framework
All physical disk I/O goes through ZIO Framework
Translates DVAs into Logical Block Address (LBA) on leaf vdevs
   Keeps free space maps (spacemap)
   If contiguous space is not available:
      Allocate smaller blocks (the gang)
      Allocate gang block, pointing to the gang
Implemented as multi-stage pipeline
  Allows extensions to be added fairly easily
Handles I/O errors




                                                          46
SpaceMap from Space




              47
ZIO Write Pipeline
ZIO write pipeline (gang activity elided, for clarity):
  open → compress (if savings > 12.5%) → encrypt → generate checksum → allocate DVA
       → vdev I/O: start / done / assess → done
                                                                     48
ZIO Read Pipeline
ZIO read pipeline (gang activity elided, for clarity):
  open → vdev I/O: start / done / assess → verify checksum → decrypt → decompress → done
                                                              49
VDEV – Virtual Device Subsytem
Where mirrors, RAIDZ, and RAIDZ2 are implemented
   Surprisingly few lines of code needed to implement RAID
Leaf vdev (physical device) I/O management
   Number of outstanding iops
   Read-ahead cache
Priority scheduling

   Name          Priority
   NOW                0
   SYNC_READ          0
   SYNC_WRITE         0
   FREE               0
   CACHE_FILL         0
   LOG_WRITE          0
   ASYNC_READ         4
   ASYNC_WRITE        4
   RESILVER          10
   SCRUB             20
                                                  50
ARC – Adaptive
Replacement Cache


                51
Object Cache
UFS uses page cache managed by the virtual memory system
ZFS does not use the page cache, except for mmap'ed files
ZFS uses an Adaptive Replacement Cache (ARC)
ARC used by DMU to cache DVA data objects
Only one ARC per system, but caching policy can be changed on a
  per-dataset basis
Seems to work much better than page cache ever did for UFS




                                                            52
Traditional Cache
Works well when data being accessed was recently added
Doesn't work so well when frequently accessed data is evicted



       [Diagram: a traditional cache – misses cause an insert at the MRU end, the oldest entry
        is evicted from the LRU end; dynamic caches change size by either not evicting or
        aggressively evicting]


                                                              53
ARC – Adaptive Replacement Cache
[Diagram: two caches – a “recent” cache for entries accessed once and a “frequent” cache for
 entries accessed more than once; misses insert into the recent cache, hits promote entries into
 the frequent cache; each cache evicts from its LRU end, and dynamic resizing must choose the
 best cache to evict from (shrink)]

                                                                    54
ZFS ARC – Adaptive Replacement Cache with Locked Pages
[Diagram: the same two-cache structure, with ZFS-specific notes – locked pages cannot be evicted,
 and promotion on a hit depends on a 62 ms window]
ZFS ARC handles mixed-size pages
                                                              55
ARC Directory
Each ARC directory entry contains arc_buf_hdr structs
   Info about the entry
   Pointer to the entry
Directory entries are small, ~200 bytes each
ZFS block size is dynamic, 512 bytes – 128 kBytes
Disks are large
Suppose we use a Seagate LP 2 TByte disk for the L2ARC
   Disk has 3,907,029,168 512 byte sectors, guaranteed
   Workload uses 8 kByte fixed record size
   RAM needed for arc_buf_hdr entries
      Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes
Don't underestimate the RAM needed for large L2ARCs


                                                               56
L2ARC – Level 2 ARC
ARC evictions are sent to the cache vdev
ARC directory remains in memory
Works well when the cache vdev is optimized for fast reads
   lower latency than pool disks
   inexpensive way to “increase memory”
Content considered volatile, no ZFS data protection allowed
Monitor usage with zpool iostat
[Diagram: evicted data flows from the ARC to one or more “cache” vdevs]



                                                   57
ARC Tips
In general, it seems to work well for most workloads
ARC size will vary, based on usage
    Default max is 3/4 of memory or memory - 1 GByte
    Min is 64 MB
    Metadata capped at 1/4 of max ARC size
Internals tracked by kstats in Solaris
  Use memory_throttle_count to observe pressure to evict
Can limit at boot time
   Solaris – set zfs:zfs_arc_max in /etc/system
Performance
   Prior to b107, L2ARC fill rate was limited to 8 MBytes/s
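For example, the /etc/system setting mentioned above might look like this on Solaris (the 1 GByte
cap is an assumption – size it to your workload; takes effect at the next boot):
    set zfs:zfs_arc_max = 0x40000000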



    L2ARC keeps its directory in kernel memory
                                                              58
Transactional Object
       Layer


                   59
Source Code Structure (repeated)
[Diagram repeated from earlier: the Transactional Object Layer (ZIL, ZAP, Traversal, DMU, DSL)
 sits between the Interface Layer (ZPL, ZVol, /dev/zfs) and the Pooled Storage Layer (ARC, ZIO,
 VDEV, Configuration)]
                                                                       60
DMU – Data Management Layer
Datasets issue transactions to the DMU
Transaction-based object model
Transactions are
  Atomic
  Grouped (txg = transaction group)
Responsible for on-disk data
ZFS Attribute Processor (ZAP)
Dataset and Snapshot Layer (DSL)
ZFS Intent Log (ZIL)




                                         61
Transaction Engine
Manages physical I/O
Transactions grouped into transaction group (txg)
   txg updates
   All-or-nothing
   Commit interval
      Older versions: 5 seconds
      Now: 30 seconds max, dynamically scale based on time required to
        commit txg
Delay committing data to physical storage
   Improves performance
   A bad thing for sync workloads – hence the ZFS Intent Log (ZIL)


    30 second delay can impact failure detection time

                                                              62
ZIL – ZFS Intent Log
DMU is transactional, and likes to group I/O into transactions for later
  commits, but still needs to handle “write it now” desire of sync
  writers
   NFS
   Databases
ZIL recordsize inflation can occur for some workloads
   May cause larger than expected actual I/O for sync workloads
   Oracle redo logs
   Can tune zfs_immediate_write_sz, but after b122 use logbias
     property instead
Never read, except at import (e.g. reboot), when transactions may need
  to be rolled forward




                                                               63
Separate Logs (slogs)
ZIL competes with pool for iops
   Applications will wait for sync writes to be on nonvolatile media
   Very noticeable on HDD JBODs
Put ZIL on separate vdev, outside of pool
   ZIL writes tend to be sequential
   No competition with pool for IOPS
   Downside: slog device required to be operational at import
   b125 adds slog device removal support
   Size of separate log < size of RAM (duh)
10x or more performance improvements possible
  Use write-optimized SSD or non-volatile write cache on RAID array
Use zilstat to observe ZIL activity
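A sketch of adding and watching a separate log, assuming c4t0d0 is a write-optimized SSD (device
and pool names are hypothetical):
    # zpool add mypool log c4t0d0
    # zpool iostat -v mypool 10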


                                                              64
Synchronous Write Destination
                   Without separate log
  Sync I/O size > zfs_immediate_write_sz?            ZIL Destination
  no                                                 ZIL log
  yes                                                bypass to pool

                    With separate log
  Sync I/O size > zfs_immediate_write_sz?  logbias?                  ZIL Destination
  no                                       —                         log device
  yes                                      prior to logbias (b122)   log device
  yes                                      latency (default)         log device
  yes                                      throughput                bypass to pool

      Default zfs_immediate_write_sz = 32 kBytes
                                                             65
Disabling the ZIL
Rule 0: Don’t disable the ZIL
If you love your data, do not disable the ZIL
You can find references to this as a way to speed up ZFS
   NFS workloads
   “tar -x” benchmarks
Golden Rule: Don’t disable the ZIL
Can set via mdb, but need to remount the file system under test
Friends don’t let friends disable the ZIL
Solaris - can set in /etc/system

  *** TEMPORARY disable ZIL for non-production use
  *** disabled by <your name> on <date>
  set zfs:zil_disable=1


Nostradamus wrote, “disabling the ZIL will lead to the apocalypse”

                                                             66
DSL – Dataset and
 Snapshot Layer


                    67
Copy on Write (repeated)
  1. Initial block tree     2. COW some data
  3. COW metadata           4. Update Uberblocks & free
                                              68
zfs snapshot
Create a read-only, point-in-time window into the dataset (file system
  or Zvol)
Computationally free, because of COW architecture
Very handy feature
   Patching/upgrades
   Basis for Time Slider
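Typical usage, with hypothetical dataset names (-r snapshots all descendants too):
    # zfs snapshot mypool/home/relling@20091101
    # zfs snapshot -r mypool@before-patching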




                                                              69
Snapshot
                                       Current tree root
  Snapshot tree root




Create a snapshot by not freeing COWed blocks
Snapshot creation is fast and easy
Number of snapshots determined by use – no hardwired limit
Recursive snapshots also possible

                                                           70
Clones
Snapshots are read-only
Clones are read-write based upon a snapshot
Child depends on parent
  Cannot destroy parent without destroying all children
  Can promote children to be parents
Good ideas
   OS upgrades
   Change control
   Replication
      zones
      virtual disks
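A sketch of clone and promote, with hypothetical names:
    # zfs snapshot mypool/zone1@golden
    # zfs clone mypool/zone1@golden mypool/zone2
    # zfs promote mypool/zone2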




                                                           71
zfs clone
Create a read-write file system from a read-only snapshot
Used extensively for OpenSolaris upgrades


[Diagram: an OS rev1 file system is snapshotted; the snapshot is cloned; the clone is upgraded to
 OS rev2; the boot manager can then boot either OS rev1 or OS rev2]

    Origin snapshot cannot be destroyed, if clone exists
                                                            72
zfs rollback


[Diagram: rpool/ROOT/b104 and its snapshot rpool/ROOT/b104@today; a rollback returns the file
 system to the state captured by the snapshot]
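Example, assuming the snapshot shown above exists (add -r to also destroy any newer snapshots):
    # zfs rollback rpool/ROOT/b104@today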




                                                        73
Commands


           74
zpool(1m)

[Layer diagram repeated from the Foundations section]
                                                                        75
Dataset & Snapshot Layer
Object
   Allocated storage
   dnode describes collection of blocks
Object Set
   Group of related objects
Dataset
   Snapmap: snapshot relationships
   Space usage
Dataset directory
   Childmap: dataset relationships
   Properties
[Diagram: a Dataset Directory (childmap, properties) contains Datasets (snapmap), each pointing
 to an Object Set made up of Objects]
                                                            76
zpool create
zpool create poolname vdev-configuration
   vdev-configuration examples
      mirror c0t0d0 c3t6d0
      mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
      mirror disk1s0 disk2s0 cache disk4s0 log disk5
      raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
Solaris
  Additional checks to see if disk/slice overlaps or is currently in use
  Whole disks are given EFI labels
Can set initial pool or dataset properties
By default, creates a file system with the same name
   poolname pool → /poolname file system

    People get confused by a file system with the same
    name as the pool
                                                               77
zpool add
Adds a device to the pool as a top-level vdev
zpool add poolname vdev-configuration
vdev-configuration can be any combination also used for zpool create
Complains if the added vdev-configuration would cause a different data
  protection scheme than is already in use – use “-f” to override
Good idea: try with “-n” flag first – will show final configuration without
  actually performing the add
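For example (device and pool names hypothetical), dry-run first, then add:
    # zpool add -n mypool mirror c2t0d0 c2t1d0
    # zpool add mypool mirror c2t0d0 c2t1d0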




    Do not add a device which is in use as a quorum device
                                                                 78
zpool remove
Remove a top-level vdev from the pool
zpool remove poolname vdev
Today, you can only remove the following vdevs:
   cache
   hot spare
   separate log (b124)
An RFE is open to allow removal of other top-level vdevs
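For example, assuming c3t0d0 is a cache device or hot spare in the pool (names hypothetical):
    # zpool remove mypool c3t0d0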




   Don't confuse “remove” with “detach”
                                                           79
zpool attach
Attach a vdev as a mirror to an existing vdev
zpool attach poolname existing-vdev vdev
Attaching vdev must be the same size or larger than the existing vdev
Note: today, not available for RAIDZ, RAIDZ2, or RAIDZ3 vdevs
                 vdev Configurations
                 ok   simple vdev → mirror
                 ok   mirror
                 ok   log → mirrored log
                 no   RAIDZ
                 no   RAIDZ2
                 no   RAIDZ3
     “Same size” literally means the same number of blocks until b117.
     Beware that many “same size” disks have different number of
     available blocks.
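A sketch, assuming c0t0d0 is an existing simple vdev and c0t1d0 is at least as large (names
hypothetical); zpool status shows the resulting resilver:
    # zpool attach mypool c0t0d0 c0t1d0
    # zpool status mypool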
                                                            80
zpool import
Import a pool and mount all mountable datasets
Import a specific pool
   zpool import poolname
   zpool import GUID
Scan LUNs for pools which may be imported
   zpool import
Can set options, such as alternate root directory or other properties
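Examples (pool name and alternate root are hypothetical):
    # zpool import
    # zpool import -R /a mypool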




    Beware of zpool.cache interactions

    Beware of artifacts, especially partial artifacts
                                                              81
zpool history
  Show history of changes made to the pool

# zpool history rpool
History for 'rpool':
2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o
cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
2009-03-04.07:29:47 zfs set canmount=noauto rpool
2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
2009-03-04.07:29:51 zfs set canmount=on rpool
2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
2009-03-04.07:29:51 zfs create rpool/export/home
2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
2009-03-04.00:21:42 zpool export rpool
2009-03-04.08:47:08 zpool set bootfs=rpool rpool
2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/
snv_b108
...


                                                                82
zpool status
 Shows the status of the current pools, including their configuration
 Important troubleshooting step

# zpool status
…
  pool: zwimming
  state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
         still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
         pool will no longer be accessible on older software versions.
  scrub: none requested
config:
        NAME            STATE    READ WRITE CKSUM
        zwimming        ONLINE      0     0     0
          mirror        ONLINE      0     0     0
             c0t2d0s0   ONLINE      0     0     0
             c0t0d0s7   ONLINE      0     0     0
errors: No known data errors



      Understanding status output error messages can be tricky
                                                                  83
zpool iostat
Show pool physical I/O activity, in an iostat-like manner
Solaris: fsstat will show I/O activity looking into a ZFS file system
Especially useful for showing slog activity

  # zpool iostat -v
                   capacity          operations       bandwidth
  pool           used avail         read write       read write
  ------------ ----- -----         ----- -----      ----- -----
  rpool         16.5G   131G           0      0     1.16K 2.80K
    c0t0d0s0    16.5G   131G           0      0     1.16K 2.80K
  ------------ ----- -----         ----- -----      ----- -----
  zwimming       135G 14.4G            0      5     2.09K 27.3K
    mirror       135G 14.4G            0      5     2.09K 27.3K
      c0t2d0s0      -       -          0      3     1.25K 27.5K
      c0t0d0s7      -       -          0      2     1.27K 27.5K
  ------------ ----- -----         ----- -----      ----- -----




    Unlike iostat, does not show latency
                                                                 84
zfs(1m)

[Layer diagram repeated from the Foundations section]
                                                                        85
zfs create, destroy
By default, a file system with the same name as the pool is created by
  zpool create
Name format is: pool/name[/name ...]
File system
    zfs create fs-name
    zfs destroy fs-name
Zvol
    zfs create -V size vol-name
    zfs destroy vol-name
Parameters can be set at create time
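Examples with hypothetical names:
    # zfs create -o compression=on mypool/home
    # zfs create -V 10g mypool/vol0
    # zfs destroy mypool/vol0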




                                                             86
zfs list
List mounted datasets
Old versions: listed everything
After b108: do not list snapshots
   See zpool listsnapshots property
Examples
   zfs list
   zfs list -t snapshot
   zfs list -H -o name




                                       87
zfs send, receive
Send
  send a snapshot to stdout
  data is decompressed
Receive
  receive a snapshot from stdin
  receiving file system parameters apply (compression, etc.)
Can incrementally send snapshots in time order
Handy way to replicate dataset snapshots
Only method for replicating dataset properties, except quotas
NOT a replacement for traditional backup solutions
   All-or-nothing design per snapshot
   In general, does not send files (!)
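A sketch of snapshot replication (names and remote host are hypothetical):
    # zfs send mypool/home@snap1 | zfs receive backup/home
    # zfs send -i snap1 mypool/home@snap2 | ssh host zfs receive backup/home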

   Send streams from b35 (or older) no longer supported after b89
                                                            88
Sharing


          89
Sharing
zfs share dataset
Type of sharing set by parameters
  shareiscsi = [on | off]
  sharenfs = [on | off | options]
  sharesmb = [on | off | options]
Shortcut to manage sharing
  Uses external services (nfsd, iscsi target, smbshare, etc)
  Importing pool will also share
May vary by OS
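For example (dataset names hypothetical; sharesmb syntax applies to the Solaris CIFS service):
    # zfs set sharenfs=on mypool/export/home
    # zfs set sharesmb=name=home mypool/export/home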




                                                               90
NFS
ZFS file systems work as expected
   use ACLs based on NFSv4 ACLs
Parallel NFS, aka pNFS, aka NFSv4.1
   Still a work-in-progress
   http://opensolaris.org/os/project/nfsv41/
   zfs create -t pnfsdata mypnfsdata

[Diagram: a pNFS client talks to a pNFS metadata server and to pNFS data servers; each data
 server stores a pnfsdata dataset in its own pool]
                                                             91
CIFS
UID mapping
casesensitivity parameter
  Good idea, set when file system is created
  zfs create -o casesensitivity=insensitive mypool/Shared
Shadow Copies for Shared Folders (VSS) supported
   CIFS clients cannot create shadow copies remotely (yet)




              CIFS features vary by OS, Samba, etc.


                                                            92
iSCSI
SCSI over IP
Block-level protocol
Uses Zvols as storage
Solaris has 2 iSCSI target implementations
   shareiscsi enables old, user-land iSCSI target
   To use COMSTAR, enable using itadm(1m)
   b116 more closely integrates COMSTAR (zpool version 16)
iSCSI performance hiccup
   Prior to b107, iSCSI over Zvols didn’t properly handle sync writes
   b107-b113, iSCSI over Zvols made all writes sync (read: slow)
       Workaround: enable the write cache (“write cache enable”) in the
        iSCSI target, see CR6770534
       OpenSolaris 2009.06 is b111
    b114, write cache enable works automatically for iSCSI over Zvols
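A sketch using the older shareiscsi target (names and size hypothetical):
    # zfs create -V 20g mypool/vol0
    # zfs set shareiscsi=on mypool/vol0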
                                                                93
Properties


             94
Properties
Properties are stored in an nvlist
By default, are inherited
Some properties are common to all datasets, but a specific dataset
  type may have additional properties
Easily set or retrieved via scripts
In general, properties affect future file system activity




    zpool get doesn't script as nicely as zfs get




                                                            95
User-defined Properties
Names
   Must include colon ':'
   Can contain lower case alphanumerics or “+” “.” “_”
   Max length = 256 characters
   By convention, module:property
      com.sun:auto-snapshot
Values
   Max length = 1024 characters
Examples
   com.sun:auto-snapshot=true
   com.richardelling:important_files=true




                                                         96
set & get properties
Set
   zfs set compression=on export/home/relling
Get
  zfs get compression export/home/relling
Reset to inherited value
  zfs inherit compression export/home/relling
Clear user-defined parameter
   zfs inherit com.sun:auto-snapshot export/home/
     relling




                                                97
Pool Properties
Property      Change? Brief Description
altroot                  Alternate root directory (ala chroot)
autoexpand               Policy for expanding when vdev size
                         changes
autoreplace              vdev replacement policy
available     readonly   Available storage space
bootfs                   Default bootable dataset for root pool
cachefile                Cache file to use other than /etc/zfs/
                         zpool.cache
capacity      readonly   Percent of pool space used

delegation               Master pool delegation switch
failmode                 Catastrophic pool failure policy

                                                                  98
More Pool Properties
Property        Change?    Brief Description
guid            readonly   Unique identifier
health          readonly   Current health of the pool
listsnapshots              zfs list policy
size            readonly   Total size of pool
used            readonly   Amount of space used
version         readonly   Current on-disk version




                                                        99
Common Dataset Properties
Property        Change?    Brief Description
available       readonly   Space available to dataset & children
checksum                   Checksum algorithm
compression                Compression algorithm
compressratio   readonly   Compression ratio – logical
                           size:referenced physical
copies                     Number of copies of user data
creation        readonly   Dataset creation time
logbias                    Separate log write policy
origin          readonly   For clones, origin snapshot
primarycache               ARC caching policy
readonly                   Is dataset in readonly mode?
referenced      readonly   Size of data accessible by this dataset
                                                           100
More Common Dataset Properties
Property         Change? Brief Description
refreservation              Max space guaranteed to a dataset,
                            excluding descendants (snapshots &
                            clones)
reservation                 Minimum space guaranteed to
                            dataset, including descendants
secondarycache              L2ARC caching policy
type             readonly   Type of dataset (filesystem,
                            snapshot, volume)




                                                           101
More Common Dataset Properties
Property            Change? Brief Description
used                readonly    Sum of usedby* (see below)
usedbychildren      readonly    Space used by descendants
usedbydataset       readonly    Space used by dataset
usedbyrefreservation readonly   Space used by a refreservation for
                                this dataset
usedbysnapshots     readonly    Space used by all snapshots of this
                                dataset
zoned               readonly    Is dataset added to non-global zone
                                (Solaris)




                                                             102
Volume Dataset Properties
Property        Change? Brief Description
shareiscsi                 iSCSI service (not COMSTAR)
volblocksize    creation   fixed block size
volsize                    Implicit quota
zoned           readonly   Set if dataset delegated to non-global
                           zone (Solaris)




                                                       103
File System Properties
Property        Change?    Brief Description
aclinherit                 ACL inheritance policy, when files or
                           directories are created
aclmode                    ACL modification policy, when chmod is
                           used
atime                      Disable access time metadata updates
canmount                   Mount policy
casesensitivity creation   Filename matching algorithm (CIFS client
                           feature)
devices                    Device opening policy for dataset
exec                       File execution policy for dataset
mounted         readonly   Is file system currently mounted?


                                                               104
More File System Properties
Property      Change? Brief Description
nbmand        export/    File system should be mounted with non-
              import     blocking mandatory locks (CIFS client
                         feature)
normalization creation   Unicode normalization of file names for
                         matching
quota                    Max space dataset and descendants can
                         consume
recordsize               Suggested maximum block size for files
refquota                 Max space dataset can consume, not
                         including descendants
setuid                   setuid mode policy
sharenfs                 NFS sharing options
sharesmb                 File system shared via CIFS
                                                            105
File System Properties
Property   Change? Brief Description
snapdir               Controls whether .zfs directory is hidden
utf8only   creation   UTF-8 character file name policy
vscan                 Virus scan enabled
xattr                 Extended attributes policy




                                                          106
More Goodies...


                  107
Dataset Space Accounting
 used = usedbydataset + usedbychildren + usedbysnapshots +
   usedbyrefreservation
 Lazy updates, may not be correct until txg commits
 ls and du will show size of allocated files which includes all copies of a
    file
 Shorthand report available


$ zfs list -o space
NAME                  AVAIL    USED   USEDSNAP   USEDDS   USEDREFRESERV    USEDCHILD
rpool                  126G   18.3G          0    35.5K               0        18.3G
rpool/ROOT             126G   15.3G          0      18K               0        15.3G
rpool/ROOT/snv_106     126G   86.1M          0    86.1M               0            0
rpool/ROOT/snv_b108    126G   15.2G      5.89G    9.28G               0            0
rpool/dump             126G   1.00G          0    1.00G               0            0
rpool/export           126G     37K          0      19K               0          18K
rpool/export/home      126G     18K          0      18K               0            0
rpool/swap             128G      2G          0     193M           1.81G            0




                                                                          108
zfs vs zpool Space Accounting
zfs list != zpool list
zfs list shows space used by the dataset plus space for internal
  accounting
zpool list shows physical space available to the pool
For simple pools and mirrors, they are nearly the same
For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space
  available for parity


   Users will be confused about reported space available




                                                           109
Accessing Snapshots
By default, snapshots are accessible in .zfs directory
Visibility of .zfs directory is tunable via snapdir property
   Don't really want find to find the .zfs directory
Windows CIFS clients can see snapshots as Shadow Copies for
  Shared Folders (VSS)


    # zfs snapshot rpool/export/home/relling@20090415
    # ls -a /export/home/relling
    …
    .Xsession
    .xsession-errors
    # ls /export/home/relling/.zfs
    shares    snapshot
    # ls /export/home/relling/.zfs/snapshot
    20090415
    # ls /export/home/relling/.zfs/snapshot/20090415
    Desktop Documents Downloads Public




                                                               110
Time-based Resilvering
Block pointers contain the birth txg number
Resilvering begins with the oldest blocks first
An interrupted resilver still results in a valid file system view
[Diagram: a block tree whose block pointers carry birth txg values 27, 68, and 73]

                                                               111
Time Slider - Automatic Snapshots
Underpinnings for Solaris feature similar to OSX's Time Machine
SMF service for managing snapshots
SMF properties used to specify policies: frequency (interval) and number to keep
Creates cron jobs
GUI tool makes it easy to select individual file systems
Tip: take additional snapshots for important milestones to avoid automatic
   snapshot deletion

Service Name                    Interval (default)    Keep (default)
auto-snapshot:frequent          15 minutes            4
auto-snapshot:hourly            1 hour                24
auto-snapshot:daily             1 day                 31
auto-snapshot:weekly            7 days                4
auto-snapshot:monthly           1 month               12
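For example, individual schedules can be toggled with SMF (assuming the standard auto-snapshot
service instances):
    # svcadm enable auto-snapshot:frequent
    # svcadm disable auto-snapshot:monthly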

                                                                       112
Nautilus
File system views which can go back in time




                                                 113
ACL – Access Control List
Based on NFSv4 ACLs
Similar to Windows NT ACLs
Works well with CIFS services
Supports ACL inheritance
Change using chmod
View using ls
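For example, on Solaris (file and user names hypothetical):
    # chmod A+user:relling:read_data/write_data:allow file.txt
    # ls -v file.txt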




                                    114
Checksums for Data
DVA contains 256 bits for checksum
Checksum is in the parent, not in the block itself
Types
   none
   fletcher2: truncated 2nd order Fletcher-like algorithm (default prior
      to b114)
   fletcher4: 4th order Fletcher-like algorithm (default, starting b114)
   SHA-256
There are open proposals for better algorithms
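For example, to use SHA-256 on a dataset (name hypothetical):
    # zfs set checksum=sha256 mypool/finance
    # zfs get checksum mypool/finance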




                                                               115
Checksum Use
Pool               Algorithm          Notes
Uberblock          SHA-256            self-checksummed
Metadata           fletcher4
Labels             SHA-256
Gang block         SHA-256            self-checksummed


Dataset            Algorithm              Notes
Metadata           fletcher4
Data               fletcher4 (default) zfs checksum parameter
ZIL log            fletcher2              self-checksummed
Send stream        fletcher4

Note: fletcher2 was the default for data prior to b114
Note: ZIL log has additional checking beyond the checksum

                                                             116
Compression
Builtin
   lzjb, Lempel-Ziv by Jeff Bonwick
   gzip, levels 1-9
Extensible
  new compressors can be added
  backwards compatibility issues
Uses taskqs to take advantage of multi-processor systems
Do you have a better compressor in mind?
   http://richardelling.blogspot.com/2009/08/justifying-new-
      compression-algorithms.html
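For example (dataset name hypothetical); compressratio shows the effect:
    # zfs set compression=gzip-6 mypool/docs
    # zfs get compression,compressratio mypool/docs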



    Cannot boot from gzip compressed root (RFE is open)

                                                               117
Encryption
Placeholder – details TBD
http://opensolaris.org/os/project/zfs-crypto
Complicated by:
   Block pointer rewrites
   Deduplication




                                                    118
Quotas
File system quotas
  quota includes descendants (snapshots, clones)
  refquota does not include descendants
User and group quotas
   b114, Solaris 10 10/09 (patch 141444-03 or 141445-03)
   Works like refquota, descendants don't count
   Not inherited
   zfs userspace and groupspace subcommands show quotas
      Users can only see their own and group quota, but can delegate
   Managed like properties
      [user|group]quota@[UID|username|SID name|SID number]
      not visible via zfs get all
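A sketch of setting and checking a user quota, post-b114 (names hypothetical):
    # zfs set userquota@relling=10g mypool/home
    # zfs userspace mypool/home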



                                                               119
zpool.cache
Old way
  mount /
  read /etc/[v]fstab
  mount file systems
ZFS
  import pool(s)
  find mountable datasets and mount them
/etc/zfs/zpool.cache is a cache of pools to be imported at boot time
   No scanning of all available LUNs for pools to import
   Binary: dump contents with zdb -C
   cachefile property permits selecting an alternate zpool.cache
      Useful for OS installers
      Useful for clusters, where you don't want a booting node to
       automatically import a pool
       Not persistent (!)
                                                                   120
Mounting ZFS File Systems
By default, mountable file systems are mounted when the pool is
  imported
   Controlled by canmount policy (not inherited)
      on – (default) file system is mountable
      off – file system is not mountable
          if you want children to be mountable, but not the parent
      noauto – file system must be explicitly mounted (boot environment)
Can zfs set mountpoint=legacy to use /etc/vfstab
By default, cannot mount on top of non-empty directory
   Can override explicitly using zfs mount -O or legacy mountpoint
Mount properties are persistent, use zfs mount -o for temporary
  changes
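Examples (dataset names hypothetical):
    # zfs set mountpoint=legacy mypool/data
    # zfs mount -o ro mypool/scratch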

    Imports are done in parallel, beware of mountpoint races
    prior to b104
                                                                     121
recordsize
Dynamic
   Max 128 kBytes
   Min 512 Bytes
   Power of 2
For most workloads, don't worry about it
For fixed-size record workloads, set recordsize to match the application's record size
    Databases
    iSCSI Zvols serving NTFS or ext3 (use 4 KB)
File systems or Zvols
zfs set recordsize=8k dataset
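For Zvols the analogous knob is volblocksize, fixed at creation time; a sketch with hypothetical names and sizes:
   # zfs set recordsize=8k mypool/db             file system, may be changed later
   # zfs create -b 4k -V 100G mypool/ntfsvol     Zvol block size set at creation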




                                                       122
Delegated Administration
Fine-grained control
   users or groups of users
   subcommands, parameters, or sets
Similar to Solaris' Role Based Access Control (RBAC)
Enable/disable at the pool level
   zpool set delegation=on mypool (default)
Allow/unallow at the dataset level
    zfs allow relling snapshot mypool/relling
    zfs allow @backupusers snapshot,send mypool/sw
    zfs allow mypool/relling
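Delegations are revoked with zfs unallow, for example (names from above):
    # zfs unallow relling snapshot mypool/relling
    # zfs allow mypool/relling                    verify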




                                                       123
Delegatable Subcommands
allow                 receive
clone                 rename
create                rollback
destroy               send
groupquota            share
groupused             snapshot
mount                 userquota
promote               userused




                                  124
Delegatable Parameters
aclinherit         nbmand           sharesmb
aclmode            normalization    snapdir
atime              quota            userprop
canmount           readonly         utf8only
casesensitivity    recordsize       version
checksum           refquota         volsize
compression        refreservation   vscan
copies             reservation      xattr
devices            setuid           zoned
exec               shareiscsi
mountpoint         sharenfs




                                               125
Browser User Interface
Solaris 10 – WebConsole
Nexenta
OpenStorage




                                      126
Solaris WebConsole




             127
Solaris WebConsole




             128
Nexenta




www.nexenta.com/corp/images/stories/pdfs/nexentastor%20briefing%206%2030%20final%20june%2029%2009.pdf

                                                                                                129
OpenStorage




      130
Solaris Swap and Dump
Swap
  Solaris does not have automatic swap resizing
  Swap as a separate dataset
  Swap device is raw, with a refreservation
  Blocksize matched to pagesize: 8 kB SPARC, 4 kB x86
  Don't really need or want snapshots or clones
  Can resize while online, manually
Dump
  Only used during crash dump
  Preallocated
  No refreservation
  Checksum off
  Compression off (dumps are already compressed)
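A minimal sketch of setting up ZFS swap and dump Zvols on Solaris (sizes hypothetical):
   # zfs create -V 2G -b 8k rpool/swap           use -b 4k on x86
   # swap -a /dev/zvol/dsk/rpool/swap
   # zfs create -V 2G rpool/dump
   # dumpadm -d /dev/zvol/dsk/rpool/dump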

                                                        131
Performance



              132
General Comments
In general, performs well out of the box
Standard performance improvement techniques apply
Lots of DTrace knowledge available
Typical areas of concern:
   ZIL
         check with zilstat, improve with slogs
   COW “fragmentation”
         check iostat, improve with L2ARC
   Memory consumption
         check with arcstat
         set primarycache property
         can be capped
         can compete with large page aware apps
   Compression, or lack thereof
                                                    133
ZIL Performance : NFS
Big performance increases demonstrated
  especially with SSDs
  for RAID arrays with nonvolatile RAM cache, not so much
NFS servers
   32kByte threshold (zfs_immediate_write_sz) also corresponds to
     NFSv3 write size
      May cause more work than needed
      See CR6686887




                                                         134
ZIL Performance : Databases
The logbias property can be set on a dataset to control threshold for
  writing to pool when a slog is used
   logbias=latency (default) all writes go to slog
   logbias=throughput, writes > zfs_immediate_write_sz go to pool
   Settable on-the-fly
      Consider changing policy during database loads
Can have different sync policies for logs and data
   Oracle: separate latency-sensitive redo log traffic from data file traffic
      Redo logs: logbias=latency
      Indexes: logbias=latency
      Data files: logbias=throughput
   MySQL with InnoDB
      logbias=latency
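For example (dataset names hypothetical):
   # zfs set logbias=latency mypool/oracle/redo
   # zfs set logbias=throughput mypool/oracle/data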


                                                              135
More ZIL Performance : Databases
I/O size inflation
   Once a file grows to use a block size, it will keep that block size
       Block size is capped by recordsize
       recordsize is a power of 2: 512 bytes, 1 KB, 2 KB, 4 KB, ... 128 KB
   Can be inefficient if the workload is sync and writes variable sized
     data
Oracle performance work: Roch reports 40% improvement for JBOD
  (HDD) + separate log (SSD) with:

File system or Zvol Role    recordsize            logbias
data files                  8 KB                  throughput
redo logs                   128 KB (default)      latency (default)
indices                     8-32 KB?              latency (default)


                                                                  136
vdev Cache
vdev cache occurs at the SPA level
   readahead
   10 MBytes per vdev
   only caches metadata (b70 or later)
Stats collected as Solaris kstats



   # kstat -n vdev_cache_stats
   module: zfs                                     instance: 0
   name:   vdev_cache_stats                        class:    misc
           crtime                                  38.83342625
           delegations                             14030
           hits                                    105169
           misses                                  59452
           snaptime                                4564628.18130739


                      Hit rate = 59%, not bad...


                                                                      137
Intelligent Prefetching
Intelligent file-level prefetching occurs at the DMU level
Feeds the ARC
In a nutshell, prefetch hits cause more prefetching
  Read a block, prefetch a block
  If we used the prefetched block, read 2 more blocks
  Up to 256 blocks
Recognizes strided reads
   2 sequential reads of same length and a fixed distance will be
     coalesced
Fetches backwards
Seems to work pretty well, as-is, for most workloads




                                                             138
Unintelligent Prefetch?
Some workloads don't do so well with intelligent prefetch
   CR6859997, zfs caching performance problem, fixed in NV b124
Look for time spent in zfetch_* functions using lockstat
   lockstat -I sleep 10
Easy to disable in mdb for testing on Solaris
  echo zfs_prefetch_disable/W0t1 | mdb -kw
Re-enable with
   echo zfs_prefetch_disable/W0t0 | mdb -kw
Set via /etc/system
   set zfs:zfs_prefetch_disable = 1




                                                            139
I/O Queues
By default, for devices which can support multiple I/Os, up to 35 I/Os
  are queued to each vdev
   Tunable with zfs_vdev_max_pending, set to 10 with:
   echo zfs_vdev_max_pending/W0t10 | mdb -kw
Implies that more vdevs is better
  Consider avoiding RAID array with a single, large LUN
ZFS I/O scheduler loses control once iops are queued
  CR6471212 proposes reserved slots for high-priority iops
May need to match queues for the entire data path
  zfs_vdev_max_pending
   Fibre channel, SCSI, SAS, SATA driver
   RAID array controller
Fast disks → small queues, slow disks → larger queues
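For a persistent setting, the /etc/system form follows the same pattern as the other zfs tunables shown earlier (verify the tunable name on your release):
   set zfs:zfs_vdev_max_pending = 10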

                                                             140
COW Penalty
COW can negatively affect workloads which have updates and
  sequential reads
   Initial writes will be sequential
   Updates (writes) will cause seeks to read data
Lots of people seem to worry a lot about this
Only affects HDDs
Very difficult to speculate about the impact on real-world apps
   Large sequential scans of randomly-updated data hurt on any file system
   Reads are cached in many places in the data path
   Databases can COW, too
Sysbench benchmark used to test on MySQL w/InnoDB engine
   One hour read/write test
   select count(*)
   repeat, for a week
                                                             141
COW Penalty




            Performance seems to level at about 25% penalty

Results compliments of Allan Packer & Neelakanth Nadgir
http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf
                                                                             142
About Disks...
 Disks still the most important performance bottleneck
    Modern processors are multi-core
    Default checksums and compression are computationally efficient

 Disk        Size     RPM      Max Size    Average Rotational    Average Seek
                               (GBytes)      Latency (ms)            (ms)
HDD          2.5”    5,400        500            5.5                  11
HDD          3.5”    5,900       2,000           5.1                  16
HDD          3.5”    7,200       1,500           4.2               8 - 8.5
HDD          2.5”   10,000        300             3                4.2 - 4.6
HDD          2.5”   15,000        146             2                3.2 - 3.5
SSD (w)      2.5”     N/A          73             0               0.02 - 0.15
SSD (r)      2.5”     N/A         500             0               0.02 - 0.15
                                                               143
DirectIO
UFS forcedirectio option brought the early 1980s design of UFS up to
  the 1990s
ZFS designed to run on modern multiprocessors
Databases or applications which manage their data cache may benefit
  by disabling file system caching
Expect L2ARC to improve random reads (secondarycache)
Prefetch disabled by primarycache=none|metadata

  UFS DirectIO                      ZFS
  Unbuffered I/O                    primarycache=metadata
                                    primarycache=none
  Concurrency                       Available at inception
  Improved Async I/O code path      Available at inception
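For example, for a database that manages its own cache (dataset name hypothetical):
   # zfs set primarycache=metadata mypool/db     keep only metadata in the ARC
   # zfs set secondarycache=all mypool/db        still allow L2ARC for random reads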

                                                           144
Hybrid Storage Pool

   All three device classes sit under the SPA:

                     separate log device       Main Pool      L2ARC cache device
                     (write-optimized SSD)     (HDDs)         (read-optimized SSD)

   Size (GBytes)     < 1 GByte                 large          big
   Cost              write iops/$              size/$         size/$
   Performance       low-latency writes        -              low-latency reads
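A sketch of building a hybrid pool (device names hypothetical):
   # zpool add mypool log c3t0d0                 write-optimized SSD as separate log
   # zpool add mypool cache c4t0d0               read-optimized SSD as L2ARC cache device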

                                                              145
RAID-Z Bandwidth
Traditional RAID-Z had a “mind the gap” feature
Impacts possible bandwidth
Mirrors could show higher bandwidth
Now RAID-Z shows better bandwidth, when channel bandwidth is the
  constrained resource




     Implementation caused spurious errors for b118-b123
                                                           146
Troubleshooting


                  147
Checking Status
zpool status
zpool status -v
Solaris
   fmadm faulty
   fmdump
  fmdump -ev or fmdump -eV
  format or rmformat




                                       148
                                     Copy on Write
  1. Initial block tree                2. COW some data




  3. COW metadata                   4. Update Uberblocks & free




What if the uberblock is updated prior to leaves?
                                                          149
What if flush is ignored?
Some devices ignore cache flush commands (!)
   Virtualization default=ignore flush: VirtualBox, others?
   Some USB/Firewire to IDE/SATA converters
Problem: uberblock could be updated before leaves
Symptom: can’t import pool, uberblock points to random data
Affected systems
   Many OSes and file systems
   Laptops - rarely because of battery
   Enterprise-class systems - rarely because of power redundancy and
     solid design
   Desktops - more frequently
Solution (pending further automation)
   Check integrity of recent transaction groups
   If damaged, rollback to older uberblock
   Today, can do this by hand, but process is tedious
                                                              150
Can't Import Pool?
Check device paths with zpool import
   Be aware of /etc/zfs/zpool.cache
  May need the zpool import -d <directory> option
  “phantom paths”?
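For example, to scan a non-default device directory (path hypothetical):
   # zpool import -d /dev/dsk mypool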
Check for 4 labels
  zdb -l /dev/dsk/c0t0d0s0




     Beware of device short names: c0d0 != c0d0s0
                                                    151
Slow Pool Import?
Case: zvols with snapshots
Symptom: reboot or zpool import is really slllooooowwwwwww...
Cause: inefficient iteration over all zvols when creating entries in
  /dev/zvol/dsk
Cure: CR6761786 integrated in b125




                                                               152
File System Mounts B0rken?
Prevention
  Avoid complex hierarchies (KISS)
  Be aware of legacy mounts
  Be aware of alternate boot environments (Solaris)
Check mountpoint properties
  zfs list -o name,mountpoint
Shared file systems
   Be aware of inherited shares
   Some clients do not mirror mount (Linux)
   NFS version differences?
   Check name services




                                                      153
Can't Boot?
Check if BIOS/OBP supports booting from device
Make sure LUN has SMI label, not EFI
   Common mistake when mirroring root
   OK: zpool attach rpool c0t0d0s0 c0t1d0s7
   Not OK: zpool attach rpool c0t0d0s0 c0t1d0
installboot?
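After attaching a root mirror, boot blocks must be installed on the new device by hand; a sketch (device names hypothetical):
   SPARC: # installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t1d0s0
   x86:   # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0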
grub issues
   Boot environments usually handled by grub
   Check grub menu.lst
Know how to do a failsafe boot
Be aware of LiveCD import
Be aware of zpool.cache interactions


                                                     154
Future Plans
Announced enhancements in the pipeline from
  Kernel Conference Australia, July 15-17 2009
   Encryption
   Deduplication
   Block pointer rewrite
   Shadow migration
   More performance tweaks
      New block allocator
      Pipeline improvements
         Raw scrub
         Scrub prefetch
         Just in time decompression or decryption
      Native iSCSI (COMSTAR)
      Zero copy I/O
      Parallel device open
                                                           155
More Future Plans
Snapshot holds (b124)
Access-based enumeration (b125)
Multiple mount protection
Separate log offlining (b125) (removal later)




                                                156
Now you know...
ZFS structure: pools, datasets
Data redundancy: mirrors, RAIDZ, copies
Data verification: checksums
Data replication: snapshots, clones, send, receive
Hybrid storage: separate logs, cache devices, ARC
Security: allow, deny, encryption
Resource management: quotas, references, I/O scheduler
Performance: latency, COW, zilstat, arcstat, logbias, recordsize
Troubleshooting: FMA, zdb, importance of cache flushes




                                                             157
It's a wrap!



      Thank You!
       Questions?
Richard.Elling@RichardElling.com



                                   158

Más contenido relacionado

La actualidad más candente

Linux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emptionLinux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emptionHemanth Venkatesh
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Guerrilla
 
Memory management in Linux
Memory management in LinuxMemory management in Linux
Memory management in LinuxRaghu Udiyar
 
X / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural OverviewX / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural OverviewMoriyoshi Koizumi
 
Memory management in Linux kernel
Memory management in Linux kernelMemory management in Linux kernel
Memory management in Linux kernelVadim Nikitin
 
Practical Occlusion Culling on PS3
Practical Occlusion Culling on PS3Practical Occlusion Culling on PS3
Practical Occlusion Culling on PS3Guerrilla
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
ALSS14: Xen Project Automotive Hypervisor (Demo)
ALSS14: Xen Project Automotive Hypervisor (Demo)ALSS14: Xen Project Automotive Hypervisor (Demo)
ALSS14: Xen Project Automotive Hypervisor (Demo)The Linux Foundation
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningRenaldas Zioma
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems confluent
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloadedmistercteam
 
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Johan Andersson
 
Disk and File System Management in Linux
Disk and File System Management in LinuxDisk and File System Management in Linux
Disk and File System Management in LinuxHenry Osborne
 
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3Electronic Arts / DICE
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Graham Wihlidal
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocatorsHao-Ran Liu
 
Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)Tiago Sousa
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 

La actualidad más candente (20)

Linux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emptionLinux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emption
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
 
Memory management in Linux
Memory management in LinuxMemory management in Linux
Memory management in Linux
 
X / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural OverviewX / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural Overview
 
Memory management in Linux kernel
Memory management in Linux kernelMemory management in Linux kernel
Memory management in Linux kernel
 
Practical Occlusion Culling on PS3
Practical Occlusion Culling on PS3Practical Occlusion Culling on PS3
Practical Occlusion Culling on PS3
 
ZFS in 30 minutes
ZFS in 30 minutesZFS in 30 minutes
ZFS in 30 minutes
 
why we need ext4
why we need ext4why we need ext4
why we need ext4
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
ALSS14: Xen Project Automotive Hypervisor (Demo)
ALSS14: Xen Project Automotive Hypervisor (Demo)ALSS14: Xen Project Automotive Hypervisor (Demo)
ALSS14: Xen Project Automotive Hypervisor (Demo)
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
 
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
 
Disk and File System Management in Linux
Disk and File System Management in LinuxDisk and File System Management in Linux
Disk and File System Management in Linux
 
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
 
Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 

Destacado

An Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusickAn Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusickeurobsdcon
 
ZFS and FreeBSD Jails
ZFS and FreeBSD JailsZFS and FreeBSD Jails
ZFS and FreeBSD Jailsapeiron
 
SmartOS ZFS Architecture
SmartOS ZFS ArchitectureSmartOS ZFS Architecture
SmartOS ZFS ArchitectureBill Pijewski
 
ZFS Tutorial LISA 2011
ZFS Tutorial LISA 2011ZFS Tutorial LISA 2011
ZFS Tutorial LISA 2011Richard Elling
 
Введение в технологии FC и FCoE для сетевых инженеров.
 Введение в технологии FC и FCoE для сетевых инженеров.  Введение в технологии FC и FCoE для сетевых инженеров.
Введение в технологии FC и FCoE для сетевых инженеров. Cisco Russia
 
Glusterfs 구성제안 및_운영가이드_v2.0
Glusterfs 구성제안 및_운영가이드_v2.0Glusterfs 구성제안 및_운영가이드_v2.0
Glusterfs 구성제안 및_운영가이드_v2.0sprdd
 

Destacado (7)

An Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusickAn Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusick
 
ZFS
ZFSZFS
ZFS
 
ZFS and FreeBSD Jails
ZFS and FreeBSD JailsZFS and FreeBSD Jails
ZFS and FreeBSD Jails
 
SmartOS ZFS Architecture
SmartOS ZFS ArchitectureSmartOS ZFS Architecture
SmartOS ZFS Architecture
 
ZFS Tutorial LISA 2011
ZFS Tutorial LISA 2011ZFS Tutorial LISA 2011
ZFS Tutorial LISA 2011
 
Введение в технологии FC и FCoE для сетевых инженеров.
 Введение в технологии FC и FCoE для сетевых инженеров.  Введение в технологии FC и FCoE для сетевых инженеров.
Введение в технологии FC и FCoE для сетевых инженеров.
 
Glusterfs 구성제안 및_운영가이드_v2.0
Glusterfs 구성제안 및_운영가이드_v2.0Glusterfs 구성제안 및_운영가이드_v2.0
Glusterfs 구성제안 및_운영가이드_v2.0
 

Similar a ZFS Tutorial USENIX LISA09 Conference

ZFS Tutorial USENIX June 2009
ZFS  Tutorial  USENIX June 2009ZFS  Tutorial  USENIX June 2009
ZFS Tutorial USENIX June 2009Richard Elling
 
Zettabyte File Storage System
Zettabyte File Storage SystemZettabyte File Storage System
Zettabyte File Storage SystemAmdocs
 
Zettabyte File Storage System
Zettabyte File Storage SystemZettabyte File Storage System
Zettabyte File Storage SystemAmdocs
 
Architecture of the Upcoming OrangeFS v3 Distributed Parallel File System
Architecture of the Upcoming OrangeFS v3 Distributed Parallel File SystemArchitecture of the Upcoming OrangeFS v3 Distributed Parallel File System
Architecture of the Upcoming OrangeFS v3 Distributed Parallel File SystemAll Things Open
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfsRami Jebara
 
pnfs status
pnfs statuspnfs status
pnfs statusbergwolf
 
New Oracle Infrastructure2
New Oracle Infrastructure2New Oracle Infrastructure2
New Oracle Infrastructure2markleeuw
 
ZFS for Databases
ZFS for DatabasesZFS for Databases
ZFS for Databasesahl0003
 
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...NETWAYS
 

Similar a ZFS Tutorial USENIX LISA09 Conference (20)

ZFS Tutorial USENIX June 2009
ZFS  Tutorial  USENIX June 2009ZFS  Tutorial  USENIX June 2009
ZFS Tutorial USENIX June 2009
 
Tlf2014
Tlf2014Tlf2014
Tlf2014
 
Zettabyte File Storage System
Zettabyte File Storage SystemZettabyte File Storage System
Zettabyte File Storage System
 
Zettabyte File Storage System
Zettabyte File Storage SystemZettabyte File Storage System
Zettabyte File Storage System
 
Posscon2013
Posscon2013Posscon2013
Posscon2013
 
Flourish16
Flourish16Flourish16
Flourish16
 
Nycbsdcon14
Nycbsdcon14Nycbsdcon14
Nycbsdcon14
 
Introduction to OpenSolaris 2008.11
Introduction to OpenSolaris 2008.11Introduction to OpenSolaris 2008.11
Introduction to OpenSolaris 2008.11
 
Asiabsdcon14
Asiabsdcon14Asiabsdcon14
Asiabsdcon14
 
Scale2014
Scale2014Scale2014
Scale2014
 
Sweden11
Sweden11Sweden11
Sweden11
 
Architecture of the Upcoming OrangeFS v3 Distributed Parallel File System
Architecture of the Upcoming OrangeFS v3 Distributed Parallel File SystemArchitecture of the Upcoming OrangeFS v3 Distributed Parallel File System
Architecture of the Upcoming OrangeFS v3 Distributed Parallel File System
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfs
 
Fsoss2011
Fsoss2011Fsoss2011
Fsoss2011
 
Olf2013
Olf2013Olf2013
Olf2013
 
pnfs status
pnfs statuspnfs status
pnfs status
 
New Oracle Infrastructure2
New Oracle Infrastructure2New Oracle Infrastructure2
New Oracle Infrastructure2
 
ZFS for Databases
ZFS for DatabasesZFS for Databases
ZFS for Databases
 
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
 
Fossetcon14
Fossetcon14Fossetcon14
Fossetcon14
 

Último

Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 

Último (20)

20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 

ZFS Tutorial USENIX LISA09 Conference

  • 1. USENIX LISA09 ZFS Tutorial Richard.Elling@RichardElling.com
  • 2. Agenda Overview Foundations Pooled Storage Layer Transactional Object Layer Commands zpool zfs Sharing Properties Performance Troubleshooting Wrap 2
  • 3. Ground Rules No religilous discussion No licensing discussion No “future of <company>” discussion No zones/containers/jails discussion No “when is it going to be in Solaris 10” discussion... ok maybe a few... 3
  • 4. History Announced September 14, 2004 Integration history SXCE b27 (November 2005) FreeBSD (April 2007) Mac OSX Leopard Preview shown, but removed from Snow Leopard Disappointed community reforming as the zfs-macos google group (Oct 2009) OpenSolaris 2008.05 Solaris 10 6/06 (June 2006) Linux FUSE (summer 2006) greenBytes ZFS+ (September 2008) More than 45 patents, contributed to the CDDL Patents Common 4
  • 5. Brief List of Features Future-proof “No silent data corruption ever” Cutting-edge data integrity “Mind-boggling scalability” High performance “Breathtaking speed” Simplified administration “Near zero administration” Eliminates need for volume “Radical new architecture” managers “Greatly simplifies support Reduced costs issues” Compatibility with POSIX file “RAIDZ saves money” system & block devices Self-healing Marketing: 2 drink minimum 5
  • 6. ZFS Design Goals Figure out why storage has gotten so complicated Blow away 20+ years of obsolete assumptions Gotta replace UFS Design an integrated system from scratch End the suffering 6
  • 7. Limits 248 — Number of entries in any individual directory 256 — Number of attributes of a file [1] 256 — Number of files in a directory [1] 16 EiB (264 bytes) — Maximum size of a file system 16 EiB — Maximum size of a single file 16 EiB — Maximum size of any attribute 264 — Number of devices in any pool 264 — Number of pools in a system 264 — Number of file systems in a pool 264 — Number of snapshots of any file system 256 ZiB (278 bytes) — Maximum size of any pool [1] actually constrained to 248 for the number of files in a ZFS file system 7
  • 8. Sidetrack: Understanding Builds Build is often referenced when speaking of feature/bug integration Short-hand notation: b# OpenSolaris and SXCE are based on NV SXCE will soon end OpenSolaris carries forward ZFS development done for NV Bi-weekly build cycle Schedule at http://opensolaris.org/os/community/on/schedule/ ZFS is ported to Solaris 10 and other OSes 8
  • 10. Overhead View of a Pool Pool File System Configuration Information Volume File System Volume Dataset 10
  • 11. Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 11
  • 12. Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration 12
  • 13. Acronyms ARC – Adaptive Replacement Cache DMU – Data Management Unit DSL – Dataset and Snapshot Layer JNI – Java Native InterfaceZPL – ZFS POSIX Layer (traditional file system interface) VDEV – Virtual Device layer ZAP – ZFS Attribute Processor ZIL – ZFS Intent Log ZIO – ZFS I/O layer Zvol – ZFS volume (raw/cooked block device interface) 13
  • 14. nvlists name=value pairs libnvpair(3LIB) Allows ZFS capabilities to change without changing the physical on- disk format Data stored is XDR encoded A good thing, used often 14
  • 15. Versioning Features can be added and identified by nvlist entries Change in pool or dataset versions do not change physical on-disk format (!) does change nvlist parameters Older-versions can be used might see warning messages, but harmless Available versions and features can be easily viewed zpool upgrade -v zfs upgrade -v Online references zpool: www.opensolaris.org/os/community/zfs/version/N zfs: www.opensolaris.org/os/community/zfs/version/zpl/N Don't confuse zpool and zfs versions 15
  • 16. zpool versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit support 15 user/group space accounting 16 stmf property support 17 Triple-parity RAID-Z 18 snapshot user holds 19 Log device removal 16
  • 17. zfs versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS filesystem version 2 Enhanced directory entries 3 Case insensitive and File system unique identifier (FUID) 4 userquota, groupquota properties 17
  • 18. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free 18
  • 19. COW Notes COW works on blocks, not files ZFS reserves 32 MBytes or 1/64 of pool size COWs need some free space to remove files need space for ZIL For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched Spatial distribution is good fodder for performance speculation affects HDDs moot for SSDs 19
  • 20. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 20
  • 21. vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type=disk type=disk type=disk type=disk children[0] children[1] children[0] children[1] Physical or leaf vdevs 21
  • 22. vdev Labels vdev labels != disk labels Four 256 kByte labels written to every physical vdev Two-stage update process write label0 & label2 flush cache & check for errors write label1 & label3 flush cache & check for errors N = 256k * (size % 256k) 0 256k 512k 4M N-512k N-256k N label0 label1 boot block label2 label3 Blank Boot Name=Value ... header Pairs 128-slot Uberblock Array 0 8k 16k 128k 256k 22
  • 23. Observing Labels # zdb -l /dev/rdsk/c0t0d0s0 -------------------------------------------- LABEL 0 -------------------------------------------- version=14 name='rpool' state=0 txg=13152 pool_guid=17111649328928073943 hostid=8781271 hostname='' top_guid=11960061581853893368 guid=11960061581853893368 vdev_tree type='disk' id=0 guid=11960061581853893368 path='/dev/dsk/c0t0d0s0' devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a' phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a' whole_disk=0 metaslab_array=24 metaslab_shift=30 ashift=9 asize=157945167872 is_log=0 23
  • 24. To fsck or not to fsck fsck was created to fix known inconsistencies in file system metadata UFS is not transactional metadata inconsistencies must be reconciled does NOT repair data – how could it? ZFS doesn't need fsck, as-is all on-disk changes are transactional COW means previously existing, consistent metadata is not overwritten ZFS can repair itself metadata is at least dual-redundant data can also be redundant Reality check – this does not mean that ZFS is not susceptible to corruption nor is any other file system 24
  • 25. VDEV 25
  • 26. Dynamic Striping  RAID-0 − SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern  Dynamic Stripe − Data is dynamically mapped to member disks − No fixed-length sequences − Allocate up to ~1 MByte/vdev before changing vdev − vdevs can be different size − Good combination of the concatenation feature with RAID-0 performance 26
  • 27. Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes 27
  • 28. Mirroring  Straightforward: put N copies of the data on N vdevs  Unlike RAID-1 − No 1:1 mapping at the block level − vdev labels are still at beginning and end − vdevs can be of different size  effective space is that of smallest vdev  Arbitration: ZFS does not blindly trust either side of mirror − Most recent, correct view of data wins − Checksums validate data 28
  • 29. Mirroring 29
  • 30. Dynamic vdev Replacement  zpool replace poolname vdev [vdev]  Today, replacing vdev must be same size or larger − Before b117: as measured by blocks − After b117: as measured by metaslabs  Replacing all vdevs in a top-level vdev with larger vdevs results in top-level vdev resizing  Policy controlled by zpool autoexpand property 15G 10G 10G 15G 10G 20G 15G 20G 10G 15G 20G 20G 20G 20G 10G 10G Mirror 10G Mirror 15G Mirror 15G Mirror 20G Mirror 30
  • 31. RAIDZ  RAID-5 − Parity check data is distributed across the RAID array's disks − Must read/modify/write when data is smaller than stripe width  RAIDZ − Dynamic data placement − Parity added as needed − Writes are full-stripe writes − No read/modify/write (write hole)  Arbitration: ZFS does not blindly trust any device − Does not rely on disk reporting read error − Checksums validate data − If checksum fails, read parity Space used is dependent on how used 31
  • 32. RAID-5 vs RAIDZ DiskA DiskB DiskC DiskD DiskE D0:0 D0:1 D0:2 D0:3 P0 RAID-5 P1 D1:0 D1:1 D1:2 D1:3 D2:3 P2 D2:0 D2:1 D2:2 D3:2 D3:3 P3 D3:0 D3:1 DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 RAIDZ P1 D1:0 D1:1 P2:0 D2:0 D2:1 D2:2 D2:3 P2:1 D2:4 D2:5 Gap P3 D3:0 D3:1 32
  • 33. RAID-5 Write Hole  Occurs when data to be written is smaller than stripe size  Must read unallocated columns to recalculate the parity or the parity must be read/modify/write  Read/modify/write is risky for consistency − Multiple disks − Reading independently − Writing independently − System failure before all writes are complete to media could result in data loss  Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disks 33
  • 34. RAIDZ2 and RAIDZ3  RAIDZ2 = double parity RAIDZ  RAIDZ3 = triple parity RAIDZ  Sorta like RAID-6 − Parity 1: XOR − Parity 2: another Reed-Soloman syndrome − Parity 3: yet another Reed-Soloman syndrome  Arbitration: ZFS does not blindly trust any device − Does not rely on disk reporting read error − Checksums validate data − If data not valid, read parity − If data still not valid, read other parity Space used is dependent on how used 34
  • 35. Evaluating Data Retention  MTTDL = Mean Time To Data Loss  Note: MTBF is not constant in the real world, but keeps math simple  MTTDL[1] is a simple MTTDL model  No parity (single vdev, striping, RAID-0) − MTTDL[1] = MTBF / N  Single Parity (mirror, RAIDZ, RAID-1, RAID-5) − MTTDL[1] = MTBF2 / (N * (N-1) * MTTR)  Double Parity (3-way mirror, RAIDZ2, RAID-6) − MTTDL[1] = MTBF3 / (N * (N-1) * (N-2) * MTTR2) 35
  • 36. Another MTTDL Model  MTTDL[1] model doesn't take into account unrecoverable read  But unrecoverable reads (UER) are becoming the dominant failure mode − UER specifed as errors per bits read − More bits = higher probability of loss per vdev  MTTDL[2] model considers UER 36
  • 37. Why Worry about UER?  Richard's study − 3,684 hosts with 12,204 LUNs − 11.5% of all LUNs reported read errors  Bairavasundaram et.al. FAST08 www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf − 1.53M LUNs over 41 months − RAID reconstruction discovers 8% of checksum mismatches − 4% of disks studies developed checksum errors over 17 months 37
  • 38. MTTDL[2] Model  Probability that a reconstruction will fail − Precon_fail = (N-1) * size / UER  Model doesn't work for non-parity schemes (single vdev, striping, RAID-0)  Single Parity (mirror, RAIDZ, RAID-1, RAID-5) − MTTDL[2] = MTBF / (N * Precon_fail)  Double Parity (3-way mirror, RAIDZ2, RAID-6) − MTTDL[2] = MTBF2/ (N * (N-1) * MTTR * Precon_fail) 38
  • 39. Practical View of MTTDL[1] 39
  • 42. Ditto Blocks Recall that each blkptr_t contains 3 DVAs Dataset property used to indicate how many copies (aka ditto blocks) of data is desired Write all copies Read any copy Recover corrupted read from a copy Not a replacement for mirroring Easier to describe in pictures... copies parameter Data copies Metadata copies copies=1 (default) 1 2 copies=2 2 3 copies=3 3 3 42
  • 45. ZIO – ZFS I/O Layer 45
  • 46. ZIO Framework All physical disk I/O goes through ZIO Framework Translates DVAs into Logical Block Address (LBA) on leaf vdevs Keeps free space maps (spacemap) If contiguous space is not available: Allocate smaller blocks (the gang) Allocate gang block, pointing to the gang Implemented as multi-stage pipeline Allows extensions to be added fairly easily Handles I/O errors 46
  • 48. ZIO Write Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open compress if savings > 12.5% encrypt generate allocate start start start done done done assess assess assess done Gang activity elided, for clarity 48
  • 49. ZIO Read Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open start start start done done done assess assess assess verify decrypt decompress done Gang activity elided, for clarity 49
  • 50. VDEV – Virtual Device Subsytem Where mirrors, RAIDZ, and RAIDZ2 are implemented Name Priority Surprisingly few lines of code NOW 0 needed to implement RAID SYNC_READ 0 Leaf vdev (physical device) I/O SYNC_WRITE 0 management FREE 0 Number of outstanding iops CACHE_FILL 0 Read-ahead cache LOG_WRITE 0 Priority scheduling ASYNC_READ 4 ASYNC_WRITE 4 RESILVER 10 SCRUB 20 50
  • 52. Object Cache UFS uses page cache managed by the virtual memory system ZFS does not use the page cache, except for mmap'ed files ZFS uses a Adaptive Replacement Cache (ARC) ARC used by DMU to cache DVA data objects Only one ARC per system, but caching policy can be changed on a per-dataset basis Seems to work much better than page cache ever did for UFS 52
  • 53. Traditional Cache Works well when data being accessed was recently added Doesn't work so well when frequently accessed data is evicted Misses cause insert MRU Dynamic caches can change Cache size size by either not evicting or aggressively evicting LRU Evict the oldest 53
  • 54. ARC – Adaptive Replacement Cache Evict the oldest single-use entry LRU Recent Cache Miss MRU Evictions and dynamic MRU size resizing needs to choose best Hit cache to evict (shrink) Frequent Cache LRU Evict the oldest multiple accessed entry 54
  • 55. ZFS ARC – Adaptive Replacement Cache with Locked Pages Evict the oldest single-use entry Cannot evict LRU locked pages! Recent Cache Miss MRU MRU Hit size Frequent If hit occurs Cache within 62 ms LRU Evict the oldest multiple accessed entry ZFS ARC handles mixed-size pages 55
  • 56. ARC Directory Each ARC directory entry contains arc_buf_hdr structs Info about the entry Pointer to the entry Directory entries have size, ~200 bytes ZFS block size is dynamic, 512 bytes – 128 kBytes Disks are large Suppose we use a Seagate LP 2 TByte disk for the L2ARC Disk has 3,907,029,168 512 byte sectors, guaranteed Workload uses 8 kByte fixed record size RAM needed for arc_buf_hdr entries Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes Don't underestimate the RAM needed for large L2ARCs 56
  • 57. L2ARC – Level 2 ARC ARC evictions are sent to cache vdev ARC directory remains in memory Works well when cache vdev is optimized for fast reads ARC lower latency than pool disks inexpensive way to “increase memory” Content considered volatile, no ZFS data evicted protection allowed data Monitor usage with zpool iostat “cache” “cache” “cache” vdev vdev vdev 57
  • 58. ARC Tips In general, it seems to work well for most workloads ARC size will vary, based on usage Default max is 3/4 of memory or memory - 1 GByte Min is 64 MB Metadata capped at 1/4 of max ARC size Internals tracked by kstats in Solaris Use memory_throttle_count to observe pressure to evict Can limit at boot time Solaris – set zfs:zfs_arc_max in /etc/system Performance Prior to b107, L2ARC fill rate was limited to 8 MBytes/s L2ARC keeps its directory in kernel memory 58
  • 60. flash Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration 60
  • 61. DMU – Data Management Layer Datasets issue transactions to the DMU Transactional based object model Transactions are Atomic Grouped (txg = transaction group) Responsible for on-disk data ZFS Attribute Processor (ZAP) Dataset and Snapshot Layer (DSL) ZFS Intent Log (ZIL) 61
  • 62. Transaction Engine Manages physical I/O Transactions grouped into transaction group (txg) txg updates All-or-nothing Commit interval Older versions: 5 seconds Now: 30 seconds max, dynamically scale based on time required to commit txg Delay committing data to physical storage Improves performance A bad thing for sync workloads – hence the ZFS Intent Log (ZIL) 30 second delay can impact failure detection time 62
  • 63. ZIL – ZFS Intent Log DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers NFS Databases ZIL recordsize inflation can occur for some workloads May cause larger than expected actual I/O for sync workloads Oracle redo logs Can tune zfs_immediate_write_sz, but after b122 use logbias property instead Never read, except at import (eg reboot), when transactions may need to be rolled forward 63
  • 64. Separate Logs (slogs) ZIL competes with pool for iops Applications will wait for sync writes to be on nonvolatile media Very noticeable on HDD JBODs Put ZIL on separate vdev, outside of pool ZIL writes tend to be sequential No competition with pool for IOPS Downside: slog device required to be operational at import b125 adds slog device removal support Size of separate log < than size of RAM (duh) 10x or more performance improvements possible Use write-optimized SSD or non-volatile write cache on RAID array Use zilstat to observe ZIL activity 64
  • 65. Synchronous Write Destination Without separate log Sync I/O size > zfs_immediate_write_sz ? ZIL Destination no ZIL log yes bypass to pool With separate log Sync I/O size > zfs_immediate_write_sz ? logbias? ZIL Destination no log device yes prior to logbias (b122) log device latency (default) log device throughput bypass to pool + Default zfs_immediate_write_sz = 32 kBytes 65
  • 66. Disabling the ZIL Rule 0: Don’t disable the ZIL If you love your data, do not disable the ZIL You can find references to this as a way to speed up ZFS NFS workloads “tar -x” benchmarks Golden Rule: Don’t disable the ZIL Can set via mdb, but need to remount the file system under test Friends don’t let friends disable the ZIL Solaris - can set in /etc/system *** TEMPORARY disable ZIL for non-production use *** disabled by <your name> on <date> set zfs:zil_disable=1 Nostradamus wrote, “disabling the ZIL will lead to the apocalypse” 66
  • 67. DSL – Dataset and Snapshot Layer 67
  • 68. flash Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free 68
  • 69. zfs snapshot Create a read-only, point-in-time window into the dataset (file system or Zvol) Computationally free, because of COW architecture Very handy feature Patching/upgrades Basis for Time Slider 69
  • 70. Snapshot Current tree root Snapshot tree root Create a snapshot by not free'ing COWed blocks Snapshot creation is fast and easy Number of snapshots determined by use – no hardwired limit Recursive snapshots also possible 70
  • 71. Clones Snapshots are read-only Clones are read-write based upon a snapshot Child depends on parent Cannot destroy parent without destroying all children Can promote children to be parents Good ideas OS upgrades Change control Replication zones virtual disks 71
  • 72. zfs clone Create a read-write file system from a read-only snapshot Used extensively for OpenSolaris upgrades OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 snapshot snapshot snapshot OS rev1 upgrade OS rev2 clone boot manager Origin snapshot cannot be destroyed, if clone exists 72
  • 73. zfs rollback OS b104 OS b104 rpool/ROOT/b104 rpool/ROOT/b104 OS b104 OS b104 snapshot rollback snapshot rpool/ROOT/b104@today rpool/ROOT/b104@today 73
  • 74. Commands 74
  • 75. zpool(1m) raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 75
• 76. Dataset & Snapshot Layer
Object
  Allocated storage
  dnode describes collection of blocks
Object Set
  Group of related objects
Dataset
  Snapmap: snapshot relationships
  Space usage
Dataset directory
  Childmap: dataset relationships
  Properties
76
• 77. zpool create
zpool create poolname vdev-configuration
vdev-configuration examples
  mirror c0t0d0 c3t6d0
  mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
  mirror disk1s0 disk2s0 cache disk4s0 log disk5
  raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
Solaris
  Additional checks to see if disk/slice overlaps or is currently in use
  Whole disks are given EFI labels
Can set initial pool or dataset properties
By default, creates a file system with the same name
  poolname pool → /poolname file system
  People get confused by a file system with same name as the pool
77
• 78. zpool add
Adds a device to the pool as a top-level vdev
zpool add poolname vdev-configuration
  vdev-configuration can be any combination also used for zpool create
Complains if the added vdev-configuration would cause a different data protection scheme than is already in use – use “-f” to override
Good idea: try with “-n” flag first – will show final configuration without actually performing the add
Do not add a device which is in use as a quorum device
78
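A minimal sketch, with hypothetical names (dry-run first, then add):
zpool add -n mypool mirror c5t0d0 c6t0d0    (show the resulting configuration only)
zpool add mypool mirror c5t0d0 c6t0d0       (actually add the new top-level mirror)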
• 79. zpool remove
Remove a top-level vdev from the pool
zpool remove poolname vdev
Today, you can only remove the following vdevs:
  cache
  hot spare
  separate log (b124)
An RFE is open to allow removal of other top-level vdevs
Don't confuse “remove” with “detach”
79
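A minimal sketch, with hypothetical names:
zpool remove mypool c6t0d0    (remove a cache device, hot spare, or separate log by its device name)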
• 80. zpool attach
Attach a vdev as a mirror to an existing vdev
zpool attach poolname existing-vdev vdev
Attaching vdev must be the same size or larger than the existing vdev
Note: today, not available for RAIDZ, RAIDZ2, or RAIDZ3 vdevs
  vdev configurations
    ok: simple vdev → mirror
    ok: mirror
    ok: log → mirrored log
    no: RAIDZ, RAIDZ2, RAIDZ3
“Same size” literally means the same number of blocks until b117. Beware that many “same size” disks have a different number of available blocks.
80
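A minimal sketch, with hypothetical names (attach turns a single disk into a mirror; detach reverses it):
zpool attach mypool c0t0d0 c1t0d0    (mirror existing vdev c0t0d0 with new disk c1t0d0)
zpool detach mypool c1t0d0           (later, detach one side back to a single vdev)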
• 81. zpool import
Import a pool and mount all mountable datasets
Import a specific pool
  zpool import poolname
  zpool import GUID
Scan LUNs for pools which may be imported
  zpool import
Can set options, such as alternate root directory or other properties
Beware of zpool.cache interactions
Beware of artifacts, especially partial artifacts
81
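A minimal sketch, with hypothetical names and paths:
zpool import                       (scan for importable pools)
zpool import -d /dev/dsk mypool    (look for devices in a specific directory)
zpool import -R /a mypool          (import under an alternate root, e.g. from a LiveCD)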
• 82. zpool history
Show history of changes made to the pool
# zpool history rpool
History for 'rpool':
2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
2009-03-04.07:29:47 zfs set canmount=noauto rpool
2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
2009-03-04.07:29:51 zfs set canmount=on rpool
2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
2009-03-04.07:29:51 zfs create rpool/export/home
2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
2009-03-04.00:21:42 zpool export rpool
2009-03-04.08:47:08 zpool set bootfs=rpool rpool
2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108
...
82
• 83. zpool status
Shows the status of the current pools, including their configuration
Important troubleshooting step
# zpool status
…
  pool: zwimming
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions.
 scrub: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        zwimming      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t2d0s0  ONLINE       0     0     0
            c0t0d0s7  ONLINE       0     0     0
errors: No known data errors
Understanding status output error messages can be tricky
83
• 84. zpool iostat
Show pool physical I/O activity, in an iostat-like manner
Solaris: fsstat will show I/O activity looking into a ZFS file system
Especially useful for showing slog activity
# zpool iostat -v
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
rpool         16.5G   131G      0      0  1.16K  2.80K
  c0t0d0s0    16.5G   131G      0      0  1.16K  2.80K
------------  -----  -----  -----  -----  -----  -----
zwimming       135G  14.4G      0      5  2.09K  27.3K
  mirror       135G  14.4G      0      5  2.09K  27.3K
    c0t2d0s0      -      -      0      3  1.25K  27.5K
    c0t0d0s7      -      -      0      2  1.27K  27.5K
------------  -----  -----  -----  -----  -----  -----
Unlike iostat, does not show latency
84
• 85. zfs(1m)
(layer diagram: zfs(1m) manages datasets through the ZFS POSIX Layer and Zvol interfaces, which sit on the Transactional Object Layer and Pooled Storage Layer, above the block device drivers for HDD, SSD, iSCSI, ...)
85
• 86. zfs create, destroy
By default, a file system with the same name as the pool is created by zpool create
Name format is: pool/name[/name ...]
File system
  zfs create fs-name
  zfs destroy fs-name
Zvol
  zfs create -V size vol-name
  zfs destroy vol-name
Parameters can be set at create time
86
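A minimal sketch, with hypothetical names:
zfs create mypool/home                           (file system)
zfs create -o compression=on mypool/home/docs    (set a property at create time)
zfs create -V 10g mypool/vol1                    (10 GByte Zvol)
zfs destroy mypool/home/docs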
  • 87. zfs list List mounted datasets Old versions: listed everything After b108: do not list snapshots See zpool listsnapshots property Examples zfs list zfs list -t snapshot zfs list -H -o name 87
• 88. zfs send, receive
Send
  send a snapshot to stdout
  data is decompressed
Receive
  receive a snapshot from stdin
  receiving file system parameters apply (compression, et al.)
Can incrementally send snapshots in time order
Handy way to replicate dataset snapshots
  Only method for replicating dataset properties, except quotas
NOT a replacement for traditional backup solutions
  All-or-nothing design per snapshot
  In general, does not send files (!)
Send streams from b35 (or older) no longer supported after b89
88
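A minimal sketch, with hypothetical names, replicating snapshots to another pool over ssh:
zfs send mypool/home@snap1 | ssh host2 zfs receive backup/home              (full stream)
zfs send -i @snap1 mypool/home@snap2 | ssh host2 zfs receive backup/home    (incremental)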
  • 89. Sharing 89
• 90. Sharing
zfs share dataset
Type of sharing set by parameters
  shareiscsi = [on | off]
  sharenfs = [on | off | options]
  sharesmb = [on | off | options]
Shortcut to manage sharing
  Uses external services (nfsd, iscsi target, smbshare, etc)
  Importing pool will also share
  May vary by OS
90
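A minimal sketch, with hypothetical dataset names and share options:
zfs set sharenfs=on mypool/export/home            (NFS share with default options)
zfs set sharenfs=ro,anon=0 mypool/export/distro   (NFS share options string)
zfs set sharesmb=name=docs mypool/export/docs     (CIFS share with a specific share name)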
• 91. NFS
ZFS file systems work as expected
  use ACLs based on NFSv4 ACLs
Parallel NFS, aka pNFS, aka NFSv4.1
  Still a work-in-progress
  http://opensolaris.org/os/project/nfsv41/
  zfs create -t pnfsdata mypnfsdata
(diagram: a pNFS client talks to a pNFS metadata server and to pNFS data servers, each data server holding a pnfsdata dataset in its pool)
91
  • 92. CIFS UID mapping casesensitivity parameter Good idea, set when file system is created zfs create -o casesensitivity=insensitive mypool/Shared Shadow Copies for Shared Folders (VSS) supported CIFS clients cannot create shadow remotely (yet) CIFS features vary by OS, Samba, etc. 92
• 93. iSCSI
SCSI over IP
  Block-level protocol
  Uses Zvols as storage
Solaris has 2 iSCSI target implementations
  shareiscsi enables the old, user-land iSCSI target
  To use COMSTAR, enable using itadm(1m)
  b116 more closely integrates COMSTAR (zpool version 16)
iSCSI over Zvol performance hiccup
  Prior to b107, iSCSI over Zvols didn’t properly handle sync writes
  b107-b113, iSCSI over Zvols made all writes sync (read: slow)
    Workaround: set write cache enable in the iSCSI target, see CR6770534
    OpenSolaris 2009.06 is b111
  b114, write cache enable works automatically
93
  • 95. Properties Properties are stored in an nvlist By default, are inherited Some properties are common to all datasets, but a specific dataset type may have additional properties Easily set or retrieved via scripts In general, properties affect future file system activity zpool get doesn't script as nicely as zfs get 95
• 96. User-defined Properties
Names
  Must include colon ':'
  Can contain lower case alphanumerics or “+” “.” “_”
  Max length = 256 characters
  By convention, module:property
    com.sun:auto-snapshot
Values
  Max length = 1024 characters
Examples
  com.sun:auto-snapshot=true
  com.richardelling:important_files=true
96
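A minimal sketch, with a hypothetical dataset and property name:
zfs set com.example:backup_policy=daily mypool/home    (set a user-defined property)
zfs get com.example:backup_policy mypool/home          (read it back)
zfs inherit com.example:backup_policy mypool/home      (clear it)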
• 97. set & get properties
Set
  zfs set compression=on export/home/relling
Get
  zfs get compression export/home/relling
Reset to inherited value
  zfs inherit compression export/home/relling
Clear user-defined parameter
  zfs inherit com.sun:auto-snapshot export/home/relling
97
• 98. Pool Properties
Property      Change?    Brief Description
altroot                  Alternate root directory (ala chroot)
autoexpand               Policy for expanding when vdev size changes
autoreplace              vdev replacement policy
available     readonly   Available storage space
bootfs                   Default bootable dataset for root pool
cachefile                Cache file to use other than /etc/zfs/zpool.cache
capacity      readonly   Percent of pool space used
delegation               Master pool delegation switch
failmode                 Catastrophic pool failure policy
98
• 99. More Pool Properties
Property        Change?    Brief Description
guid            readonly   Unique identifier
health          readonly   Current health of the pool
listsnapshots              zfs list policy
size            readonly   Total size of pool
used            readonly   Amount of space used
version         readonly   Current on-disk version
99
• 100. Common Dataset Properties
Property        Change?    Brief Description
available       readonly   Space available to dataset & children
checksum                   Checksum algorithm
compression                Compression algorithm
compressratio   readonly   Compression ratio – logical size:referenced physical
copies                     Number of copies of user data
creation        readonly   Dataset creation time
logbias                    Separate log write policy
origin          readonly   For clones, origin snapshot
primarycache               ARC caching policy
readonly                   Is dataset in readonly mode?
referenced      readonly   Size of data accessible by this dataset
100
• 101. More Common Dataset Properties
Property         Change?    Brief Description
refreservation              Max space guaranteed to a dataset, excluding descendants (snapshots & clones)
reservation                 Minimum space guaranteed to dataset, including descendants
secondarycache              L2ARC caching policy
type             readonly   Type of dataset (filesystem, snapshot, volume)
101
• 102. More Common Dataset Properties
Property               Change?    Brief Description
used                   readonly   Sum of usedby* (see below)
usedbychildren         readonly   Space used by descendants
usedbydataset          readonly   Space used by dataset
usedbyrefreservation   readonly   Space used by a refreservation for this dataset
usedbysnapshots        readonly   Space used by all snapshots of this dataset
zoned                  readonly   Is dataset added to non-global zone (Solaris)
102
• 103. Volume Dataset Properties
Property       Change?    Brief Description
shareiscsi                iSCSI service (not COMSTAR)
volblocksize   creation   fixed block size
volsize                   Implicit quota
zoned          readonly   Set if dataset delegated to non-global zone (Solaris)
103
• 104. File System Properties
Property          Change?    Brief Description
aclinherit                   ACL inheritance policy, when files or directories are created
aclmode                      ACL modification policy, when chmod is used
atime                        Disable access time metadata updates
canmount                     Mount policy
casesensitivity   creation   Filename matching algorithm (CIFS client feature)
devices                      Device opening policy for dataset
exec                         File execution policy for dataset
mounted           readonly   Is file system currently mounted?
104
• 105. More File System Properties
Property        Change?         Brief Description
nbmand          export/import   File system should be mounted with non-blocking mandatory locks (CIFS client feature)
normalization   creation        Unicode normalization of file names for matching
quota                           Max space dataset and descendants can consume
recordsize                      Suggested maximum block size for files
refquota                        Max space dataset can consume, not including descendants
setuid                          setuid mode policy
sharenfs                        NFS sharing options
sharesmb                        File system shared with CIFS
105
• 106. File System Properties
Property   Change?    Brief Description
snapdir               Controls whether .zfs directory is hidden
utf8only   creation   UTF-8 character file name policy
vscan                 Virus scan enabled
xattr                 Extended attributes policy
106
• 108. Dataset Space Accounting
used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation
Lazy updates, may not be correct until txg commits
ls and du will show size of allocated files which includes all copies of a file
Shorthand report available
$ zfs list -o space
NAME                 AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                 126G  18.3G         0   35.5K              0      18.3G
rpool/ROOT            126G  15.3G         0     18K              0      15.3G
rpool/ROOT/snv_106    126G  86.1M         0   86.1M              0          0
rpool/ROOT/snv_b108   126G  15.2G     5.89G   9.28G              0          0
rpool/dump            126G  1.00G         0   1.00G              0          0
rpool/export          126G    37K         0     19K              0        18K
rpool/export/home     126G    18K         0     18K              0          0
rpool/swap            128G     2G         0    193M          1.81G         0
108
  • 109. zfs vs zpool Space Accounting zfs list != zpool list zfs list shows space used by the dataset plus space for internal accounting zpool list shows physical space available to the pool For simple pools and mirrors, they are nearly the same For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space available for parity Users will be confused about reported space available 109
• 110. Accessing Snapshots
By default, snapshots are accessible in .zfs directory
Visibility of .zfs directory is tunable via snapdir property
  Don't really want find to find the .zfs directory
Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS)
# zfs snapshot rpool/export/home/relling@20090415
# ls -a /export/home/relling
…
.Xsession
.xsession-errors
# ls /export/home/relling/.zfs
shares    snapshot
# ls /export/home/relling/.zfs/snapshot
20090415
# ls /export/home/relling/.zfs/snapshot/20090415
Desktop    Documents    Downloads    Public
110
• 111. Time-based Resilvering
Block pointers contain birth txg number
Resilvering begins with oldest blocks first
Interrupted resilver will still result in a valid file system view
(diagram: block tree with blocks labeled by birth txg = 27, 68, and 73)
111
• 112. Time Slider - Automatic Snapshots
Underpinnings for Solaris feature similar to OSX's Time Machine
SMF service for managing snapshots
  SMF properties used to specify policies: frequency (interval) and number to keep
  Creates cron jobs
GUI tool makes it easy to select individual file systems
Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion
Service Name             Interval (default)   Keep (default)
auto-snapshot:frequent   15 minutes           4
auto-snapshot:hourly     1 hour               24
auto-snapshot:daily      1 day                31
auto-snapshot:weekly     7 days               4
auto-snapshot:monthly    1 month              12
112
  • 113. Nautilus File system views which can go back in time 113
  • 114. ACL – Access Control List Based on NFSv4 ACLs Similar to Windows NT ACLs Works well with CIFS services Supports ACL inheritance Change using chmod View using ls 114
  • 115. Checksums for Data DVA contains 256 bits for checksum Checksum is in the parent, not in the block itself Types none fletcher2: truncated 2nd order Fletcher-like algorithm (default prior to b114) fletcher4: 4th order Fletcher-like algorithm (default, starting b114) SHA-256 There are open proposals for better algorithms 115
• 116. Checksum Use
Pool          Algorithm             Notes
Uberblock     SHA-256               self-checksummed
Metadata      fletcher4
Labels        SHA-256
Gang block    SHA-256               self-checksummed
Dataset       Algorithm             Notes
Metadata      fletcher4
Data          fletcher4 (default)   zfs checksum parameter
ZIL log       fletcher2             self-checksummed
Send stream   fletcher4
Note: fletcher2 was the default for data prior to b114
Note: ZIL log has additional checking beyond the checksum
116
• 117. Compression
Builtin
  lzjb, Lempel-Ziv by Jeff Bonwick
  gzip, levels 1-9
Extensible
  new compressors can be added
  backwards compatibility issues
Uses taskqs to take advantage of multi-processor systems
Do you have a better compressor in mind?
  http://richardelling.blogspot.com/2009/08/justifying-new-compression-algorithms.html
Cannot boot from gzip compressed root (RFE is open)
117
  • 118. Encryption Placeholder – details TBD http://opensolaris.org/os/project/zfs-crypto Complicated by: Block pointer rewrites Deduplication 118
• 119. Quotas
File system quotas
  quota includes descendants (snapshots, clones)
  refquota does not include descendants
User and group quotas
  b114, Solaris 10 10/09 (patch 141444-03 or 141445-03)
  Works like refquota, descendants don't count
  Not inherited
  zfs userspace and groupspace subcommands show quotas
  Users can only see their own and group quota, but can delegate
  Managed like properties
    [user|group]quota@[UID|username|SID name|SID number]
    not visible via zfs get all
119
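A minimal sketch, with hypothetical names:
zfs set quota=100g mypool/home             (dataset + descendants)
zfs set refquota=80g mypool/home           (dataset only)
zfs set userquota@alice=10g mypool/home    (per-user quota, b114 or later)
zfs userspace mypool/home                  (report per-user usage and quotas)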
• 120. zpool.cache
Old way: mount /, read /etc/[v]fstab, mount file systems
ZFS: import pool(s), find mountable datasets and mount them
/etc/zfs/zpool.cache is a cache of pools to be imported at boot time
  No scanning of all available LUNs for pools to import
  Binary: dump contents with zdb -C
cachefile property permits selecting an alternate zpool.cache
  Useful for OS installers
  Useful for clusters, where you don't want a booting node to automatically import a pool
  Not persistent (!)
120
• 121. Mounting ZFS File Systems
By default, mountable file systems are mounted when the pool is imported
Controlled by canmount policy (not inherited)
  on – (default) file system is mountable
  off – file system is not mountable; use if you want children to be mountable, but not the parent
  noauto – file system must be explicitly mounted (boot environment)
Can zfs set mountpoint=legacy to use /etc/vfstab
By default, cannot mount on top of a non-empty directory
  Can override explicitly using zfs mount -O or a legacy mountpoint
Mount properties are persistent; use zfs mount -o for temporary changes
Imports are done in parallel, beware of mountpoint races prior to b104
121
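A minimal sketch of switching a dataset to legacy mounting, with hypothetical names:
zfs set mountpoint=legacy mypool/data
then add an entry to /etc/vfstab, e.g.:
mypool/data - /data zfs - yes -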
  • 122. recordsize Dynamic Max 128 kBytes Min 512 Bytes Power of 2 For most workloads, don't worry about it For fixed size workloads, can set to match workloads Databases iSCSI Zvols serving NTFS or ext3 (use 4 KB) File systems or Zvols zfs set recordsize=8k dataset 122
  • 123. Delegated Administration Fine grain control users or groups of users subcommands, parameters, or sets Similar to Solaris' Role Based Access Control (RBAC) Enable/disable at the pool level zpool set delegation=on mypool (default) Allow/unallow at the dataset level zfs allow relling snapshot mypool/relling zfs allow @backupusers snapshot,send mypool/sw zfs allow mypool/relling 123
• 124. Delegatable Subcommands
allow, clone, create, destroy, groupquota, groupused, mount, promote, receive, rename, rollback, send, share, snapshot, userquota, userused
124
• 125. Delegatable Parameters
aclinherit, aclmode, atime, canmount, casesensitivity, checksum, compression, copies, devices, exec, mountpoint, nbmand, normalization, quota, readonly, recordsize, refquota, refreservation, reservation, setuid, shareiscsi, sharenfs, sharesmb, snapdir, userprop, utf8only, version, volsize, vscan, xattr, zoned
125
  • 126. Browser User Interface Solaris 10 – WebConsole Nexenta OpenStorage 126
  • 130. OpenStorage 130
• 131. Solaris Swap and Dump
Swap
  Solaris does not have automatic swap resizing
  Swap as a separate dataset
  Swap device is raw, with a refreservation
  Blocksize matched to pagesize: 8 kB SPARC, 4 kB x86
  Don't really need or want snapshots or clones
  Can resize while online, manually
Dump
  Only used during crash dump
  Preallocated
  No refreservation
  Checksum off
  Compression off (dumps are already compressed)
131
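A minimal sketch of manually growing a swap Zvol (names follow the rpool layout shown earlier; the size is hypothetical):
swap -d /dev/zvol/dsk/rpool/swap    (remove the swap device from use)
zfs set volsize=4g rpool/swap       (resize the Zvol)
swap -a /dev/zvol/dsk/rpool/swap    (add it back)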
  • 132. Performance 132
• 133. General Comments
In general, performs well out of the box
Standard performance improvement techniques apply
Lots of DTrace knowledge available
Typical areas of concern:
  ZIL – check with zilstat, improve with slogs
  COW “fragmentation” – check iostat, improve with L2ARC
  Memory consumption – check with arcstat, set primarycache property, can be capped, can compete with large page aware apps
  Compression, or lack thereof
133
• 134. ZIL Performance : NFS
Big performance increases demonstrated
  especially with SSDs
  for RAID arrays with nonvolatile RAM cache, not so much
NFS servers
  32 kByte threshold (zfs_immediate_write_sz) also corresponds to NFSv3 write size
  May cause more work than needed
  See CR6686887
134
• 135. ZIL Performance : Databases
The logbias property can be set on a dataset to control the threshold for writing to the pool when a slog is used
  logbias=latency (default): all writes go to the slog
  logbias=throughput: writes > zfs_immediate_write_sz go to the pool
Settable on-the-fly
  Consider changing policy during database loads
Can have different sync policies for logs and data
  Oracle: separate latency-sensitive redo log traffic from data
    Redo logs: logbias=latency
    Indexes: logbias=latency
    Data files: logbias=throughput
  MySQL with InnoDB
    logbias=latency
135
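A minimal sketch, with hypothetical dataset names for an Oracle layout:
zfs set logbias=latency dbpool/redo       (redo logs)
zfs set logbias=latency dbpool/index      (indexes)
zfs set logbias=throughput dbpool/data    (data files)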
• 136. More ZIL Performance : Databases
I/O size inflation
  Once a file grows to use a block size, it will keep that block size
  Block size is capped by recordsize
  recordsize is a power of 2: 512 bytes, 1 KB, 2 KB, 4 KB, ... 128 KB
  Can be inefficient if the workload is sync and writes variable sized data
Oracle performance work: Roch reports 40% improvement for JBOD (HDD) + separate log (SSD) with:
File system or Zvol Role   recordsize         logbias
data files                 8 KB               throughput
redo logs                  128 KB (default)   latency (default)
indices                    8-32 KB?           latency (default)
136
• 137. vdev Cache
vdev cache occurs at the SPA level
  readahead
  10 MBytes per vdev
  only caches metadata (b70 or later)
Stats collected as Solaris kstats
# kstat -n vdev_cache_stats
module: zfs                    instance: 0
name:   vdev_cache_stats       class:    misc
        crtime                 38.83342625
        delegations            14030
        hits                   105169
        misses                 59452
        snaptime               4564628.18130739
Hit rate = 59%, not bad...
137
  • 138. Intelligent Prefetching Intelligent file-level prefetching occurs at the DMU level Feeds the ARC In a nutshell, prefetch hits cause more prefetching Read a block, prefetch a block If we used the prefetched block, read 2 more blocks Up to 256 blocks Recognizes strided reads 2 sequential reads of same length and a fixed distance will be coalesced Fetches backwards Seems to work pretty well, as-is, for most workloads 138
  • 139. Unintelligent Prefetch? Some workloads don't do so well with intelligent prefetch CR6859997, zfs caching performance problem, fixed in NV b124 Look for time spent in zfetch_* functions using lockstat lockstat -I sleep 10 Easy to disable in mdb for testing on Solaris echo zfs_prefetch_disable/W0t1 | mdb -kw Re-enable with echo zfs_prefetch_disable/W0t0 | mdb -kw Set via /etc/system set zfs:zfs_prefetch_disable = 1 139
  • 140. I/O Queues By default, for devices which can support multiple I/Os, up to 35 I/Os are queued to each vdev Tunable with zfs_vdev_max_pending, set to 10 with: echo zfs_vdev_max_pending/W0t10 | mdb -kw Implies that more vdevs is better Consider avoiding RAID array with a single, large LUN ZFS I/O scheduler loses control once iops are queued CR6471212 proposes reserved slots for high-priority iops May need to match queues for the entire data path zfs_vdev_max_pending Fibre channel, SCSI, SAS, SATA driver RAID array controller Fast disks → small queues, slow disks → larger queues 140
  • 141. COW Penalty COW can negatively affect workloads which have updates and sequential reads Initial writes will be sequential Updates (writes) will cause seeks to read data Lots of people seem to worry a lot about this Only affects HDDs Very difficult to speculate about the impact on real-world apps Large sequential scans of random data hurt anyway Reads are cached in many places in the data path Databases can COW, too Sysbench benchmark used to test on MySQL w/InnoDB engine One hour read/write test select count(*) repeat, for a week 141
  • 142. COW Penalty Performance seems to level at about 25% penalty Results compliments of Allan Packer & Neelakanth Nadgir http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf 142
• 143. About Disks...
Disks still the most important performance bottleneck
  Modern processors are multi-core
  Default checksums and compression are computationally efficient
Disk      Size   RPM      Max Size (GBytes)   Average Rotational Latency (ms)   Average Seek (ms)
HDD       2.5”   5,400    500                 5.5                               11
HDD       3.5”   5,900    2,000               5.1                               16
HDD       3.5”   7,200    1,500               4.2                               8 - 8.5
HDD       2.5”   10,000   300                 3                                 4.2 - 4.6
HDD       2.5”   15,000   146                 2                                 3.2 - 3.5
SSD (w)   2.5”   N/A      73                  0                                 0.02 - 0.15
SSD (r)   2.5”   N/A      500                 0                                 0.02 - 0.15
143
• 144. DirectIO
UFS forcedirectio option brought the early 1980s design of UFS up to the 1990s
ZFS designed to run on modern multiprocessors
Databases or applications which manage their data cache may benefit by disabling file system caching
Expect L2ARC to improve random reads (secondarycache)
Prefetch disabled by primarycache=none|metadata
UFS DirectIO          ZFS
Unbuffered I/O        primarycache=metadata, primarycache=none
Concurrency           Available at inception, improved
Async I/O code path   Available at inception
144
• 145. Hybrid Storage Pool
SPA components: separate log device, main pool, L2ARC cache device
                separate log device            Main Pool   L2ARC cache device
Device          Write-optimized device (SSD)   HDDs        Read-optimized device (SSD)
Size (GBytes)   < 1 GByte                      large       big
Cost            write iops/$                   size/$      size/$
Performance     low-latency writes             -           low-latency reads
145
  • 146. RAID-Z Bandwidth Traditional RAID-Z had a “mind the gap” feature Impacts possible bandwidth Mirrors could show higher bandwidth Now RAID-Z shows better bandwidth, when channel bandwidth is the constrained resource Implementation caused spurious errors for b118-b123 146
• 148. Checking Status
zpool status
zpool status -v
Solaris
  fmadm faulty
  fmdump
  fmdump -ev or fmdump -eV
  format or rmformat
148
• 149. Copy on Write
1. Initial block tree
2. COW some data
3. COW metadata
4. Update Uberblocks & free
What if the uberblock is updated prior to leaves?
149
• 150. What if flush is ignored?
Some devices ignore cache flush commands (!)
  Virtualization default=ignore flush: VirtualBox, others?
  Some USB/Firewire to IDE/SATA converters
Problem: uberblock could be updated before leaves
Symptom: can’t import pool, uberblock points to random data
Affected systems
  Many OSes and file systems
  Laptops - rarely because of battery
  Enterprise-class systems - rarely because of power redundancy and solid design
  Desktops - more frequently
Solution (pending further automation)
  Check integrity of recent transaction groups
  If damaged, rollback to older uberblock
  Today, can do this by hand, but process is tedious
150
• 151. Can't Import Pool?
Check device paths with zpool import
Be aware of /etc/zfs/zpool.cache
  May need the zpool import -d directory option
“phantom paths”?
Check for 4 labels
  zdb -l /dev/dsk/c0t0d0s0
Beware of device short names: c0d0 != c0d0s0
151
  • 152. Slow Pool Import? Case: zvols with snapshots Symptom: reboot or zpool import is really slllooooowwwwwww... Cause: inefficient incrementing over all zvols creating entries in /dev/zvol/dsk Cure: CR6761786 integrated in b125 152
• 153. File System Mounts B0rken?
Prevention
  Avoid complex hierarchies (KISS)
  Be aware of legacy mounts
  Be aware of alternate boot environments (Solaris)
Check mountpoint properties
  zfs list -o name,mountpoint
Shared file systems
  Be aware of inherited shares
  Some clients do not mirror mount (Linux)
  NFS version differences?
  Check name services
153
  • 154. Can't Boot? Check if BIOS/OBP supports booting from device Make sure LUN has SMI label, not EFI Common mistake when mirroring root OK: zpool attach rpool c0t0d0s0 c0t1d0s7 Not OK: zpool attach rpool c0t0d0s0 c0t1d0 installboot? grub issues Boot environments usually handled by grub Check grub menu.lst Know how to do a failsafe boot Be aware of LiveCD import Be aware of zpool.cache interactions 154
• 155. Future Plans
Announced enhancements in the pipeline, from Kernel Conference Australia, July 15-17 2009
  Encryption
  Deduplication
  Block pointer rewrite
  Shadow migration
  More performance tweaks
    New block allocator
    Pipeline improvements
    Raw scrub
    Scrub prefetch
    Just-in-time decompression or decryption
  Native iSCSI (COMSTAR)
  Zero copy I/O
  Parallel device open
155
  • 156. More Future Plans Snapshot holds (b124) Access-based enumeration (b125) Multiple mount protection Separate log offlining (b125) (removal later) 156
  • 157. Now you know... ZFS structure: pools, datasets Data redundancy: mirrors, RAIDZ, copies Data verification: checksums Data replication: snapshots, clones, send, receive Hybrid storage: separate logs, cache devices, ARC Security: allow, deny, encryption Resource management: quotas, references, I/O scheduler Performance: latency, COW, zilstat, arcstat, logbias, recordsize Troubleshooting: FMA, zdb, importance of cache flushes 157
• 158. It's a wrap! Thank You! Questions? Richard.Elling@RichardElling.com 158