ZFS: A File System for
   Modern Hardware
Richard.Elling@RichardElling.com
     Richard@Nexenta.com




      USENIX LISA’11 Conference
      December 2011
Agenda
  •     Overview
  •     Foundations
  •     Pooled Storage Layer
  •     Transactional Object Layer
  •     ZFS commands
  •     Sharing
  •     Properties
  •     Other goodies
  •     Performance
  •     Troubleshooting



ZFS Tutorial                    USENIX LISA’11   2
ZFS History
• Announced September 14, 2004
• Integration history
     ✦         SXCE b27 (November 2005)
     ✦         FreeBSD (April 2007)
     ✦         Mac OS X Leopard
               ✤   Preview shown, but removed from Snow Leopard
               ✤   Disappointed community reforming as the zfs-macos google group
                   (Oct 2009)
     ✦         OpenSolaris 2008.05
     ✦         Solaris 10 6/06 (June 2006)
     ✦         Linux FUSE (summer 2006)
     ✦         greenBytes ZFS+ (September 2008)
     ✦         Linux native port funded by the US DOE (2010)
• More than 45 patents, contributed to the CDDL Patents Common
ZFS Tutorial                              USENIX LISA’11                            3
ZFS Design Goals
  •     Figure out why storage has gotten so complicated
  •     Blow away 20+ years of obsolete assumptions
  •     Gotta replace UFS
  •     Design an integrated system from scratch




                  End the suffering


ZFS Tutorial                    USENIX LISA’11             4
Limits

•   2^48 — Number of entries in any individual directory
•   2^56 — Number of attributes of a file [*]
•   2^56 — Number of files in a directory [*]
•   16 EiB (2^64 bytes) — Maximum size of a file system
•   16 EiB — Maximum size of a single file
•   16 EiB — Maximum size of any attribute
•   2^64 — Number of devices in any pool
•   2^64 — Number of pools in a system
•   2^64 — Number of file systems in a pool
•   2^64 — Number of snapshots of any file system
•   256 ZiB (2^78 bytes) — Maximum size of any pool

    [*] actually constrained to 2^48 for the number of files in a ZFS file system


ZFS Tutorial                                      USENIX LISA’11                 5
Understanding Builds
  • Build is often referenced when speaking of feature/bug integration
  • Short-hand notation: b###
  • Distributions derived from Solaris NV (Nevada)
         ✦     NexentaStor
         ✦     Nexenta Core Platform
         ✦     SmartOS
         ✦     Solaris 11 (nee OpenSolaris)
         ✦     OpenIndiana
         ✦     StormOS
         ✦     BelleniX
         ✦     SchilliX
         ✦     MilaX
  • OpenSolaris builds
         ✦     Binary builds died at b134
         ✦     Source releases continued through b147
  • illumos stepping up to fill void left by OpenSolaris’ demise
ZFS Tutorial                              USENIX LISA’11                 6
Community Links
  • Community links
         ✦     nexenta.org
         ✦     nexentastor.org
         ✦     freebsd.org
         ✦     zfsonlinux.org
         ✦     zfs-fuse.net
         ✦     groups.google.com/group/zfs-macos
  • ZFS Community
         ✦     hub.opensolaris.org/bin/view/Community+Group+zfs/
  • IRC channels at irc.freenode.net
         ✦     #zfs




ZFS Tutorial                         USENIX LISA’11                7
ZFS Foundations




       8
Overhead View of a Pool


                                   Pool
                                                     File System
               Configuration
                Information


                                                           Volume
                                  File System


                                               Volume
                        Dataset




ZFS Tutorial                        USENIX LISA’11                  9
Hybrid Storage Pool

                                      Adaptive Replacement Cache (ARC)

                           Separate intent log      Main pool                 Level 2 ARC

     Device                write-optimized SSD      HDDs                      read-optimized SSD
     Size (GBytes)         1 - 10 GByte             large                     big
     Cost                  write iops/$             size/$                    size/$
     Use                   sync writes              persistent storage       read cache
     Performance           low-latency writes       secondary optimization   low-latency reads
     Need more speed?      stripe                   more, faster devices     stripe
ZFS Tutorial                                  USENIX LISA’11                            10
Layer View


       raw         swap dump iSCSI   ??      ZFS    NFS CIFS        ??


         ZFS Volume Emulator (Zvol)           ZFS POSIX Layer (ZPL)           pNFS Lustre   ??

                                      Transactional Object Layer

                                          Pooled Storage Layer

                                          Block Device Driver


                            HDD           SSD             iSCSI          ??




November 8, 2010                                   USENIX LISA’10                                11
Source Code Structure

                                   File system                              Mgmt
                                                         Device
                                   Consumer             Consumer
                                                                            libzfs

                   Interface
                                      ZPL                     ZVol         /dev/zfs
                   Layer


                   Transactional     ZIL          ZAP                      Traversal
                   Object
                   Layer                    DMU                      DSL


                                                       ARC
                   Pooled
                   Storage                              ZIO
                   Layer
                                                      VDEV                 Configuration


November 8, 2010                                 USENIX LISA’10                            12
Acronyms
  •     ARC – Adaptive Replacement Cache
  •     DMU – Data Management Unit
  •     DSL – Dataset and Snapshot Layer
  •     JNI – Java Native Interface
  •     ZPL – ZFS POSIX Layer (traditional file system interface)
  •     VDEV – Virtual Device
  •     ZAP – ZFS Attribute Processor
  •     ZIL – ZFS Intent Log
  •     ZIO – ZFS I/O layer
  •     Zvol – ZFS volume (raw/cooked block device interface)




ZFS Tutorial                     USENIX LISA’11                    13
NexentaStor Rosetta Stone


               NexentaStor              OpenSolaris/ZFS
                 Volume                       Storage pool
                  ZVol                          Volume
                 Folder                       File system




ZFS Tutorial                 USENIX LISA’11                  14
nvlists
  • name=value pairs
  • libnvpair(3LIB)
  • Allows ZFS capabilities to change without changing the
    physical on-disk format
  • Data stored is XDR encoded
  • A good thing, used often
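  A quick way to see an nvlist in practice is to dump a pool’s cached
  configuration with zdb (pool name is illustrative); the output is the
  name=value tree describing the pool’s vdev configuration:

        # zdb -C tank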




ZFS Tutorial                 USENIX LISA’11                  15
Versioning
  • Features can be added and identified by nvlist entries
  • Changes in pool or dataset versions do not change the physical on-
        disk format (!)
         ✦     they do change nvlist parameters
  • Older versions can be used
         ✦     might see warning messages, but harmless
  • Available versions and features can be easily viewed
         ✦     zpool upgrade -v
         ✦     zfs upgrade -v
  • Online references (broken?)
         ✦     zpool: hub.opensolaris.org/bin/view/Community+Group+zfs/N
         ✦     zfs: hub.opensolaris.org/bin/view/Community+Group+zfs/N-1

                             Don't confuse zpool and zfs versions
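  For example, to check versions without upgrading anything (pool and dataset
  names are illustrative):

        # zpool get version tank
        # zfs get version tank/home
        # zpool upgrade -v

  Running zpool upgrade or zfs upgrade with no arguments lists anything still
  at an older version.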
ZFS Tutorial                            USENIX LISA’11                     16
zpool Versions

               VER   DESCRIPTION
               ---   ------------------------------------------------
                1    Initial ZFS version
                2    Ditto blocks (replicated metadata)
                3    Hot spares and double parity RAID-Z
                4    zpool history
                5    Compression using the gzip algorithm
                6    bootfs pool property
                7    Separate intent log devices
                8    Delegated administration
                9    refquota and refreservation properties
                10   Cache devices
                11   Improved scrub performance
                12   Snapshot properties
                13   snapused property
                14   passthrough-x aclinherit support



                             Continued...




ZFS Tutorial                                USENIX LISA’11              17
More zpool Versions

               VER   DESCRIPTION
               ---   ------------------------------------------------
                15   user/group space accounting
                16   stmf property support
                17   Triple-parity RAID-Z
                18   snapshot user holds
                19   Log device removal
                20   Compression using zle (zero-length encoding)
                21   Deduplication
                22   Received properties
                23   Slim ZIL
                24   System attributes
                25   Improved scrub stats
                26   Improved snapshot deletion performance
                27   Improved snapshot creation performance
                28   Multiple vdev replacements




                               For Solaris 10, version 21 is “reserved”
ZFS Tutorial                                USENIX LISA’11                18
zfs Versions

               VER DESCRIPTION
               ----------------------------------------------
                1   Initial ZFS filesystem version
                2   Enhanced directory entries
                3   Case insensitive and File system unique
                    identifier (FUID)
                4   userquota, groupquota properties
                5   System attributes




ZFS Tutorial                 USENIX LISA’11                     19
Copy on Write

               1. Initial block tree                    2. COW some data




               3. COW metadata                  4. Update Uberblocks & free




ZFS Tutorial                           USENIX LISA’11                         20
COW Notes

• COW works on blocks, not files
• ZFS reserves 32 MBytes or 1/64 of
  pool size
  ✦     COWs need some free space to
        remove files
  ✦     need space for ZIL
• For fixed-record size workloads
  “fragmentation” and “poor
  performance” can occur if the
  recordsize is not matched
• Spatial distribution is good fodder for
  performance speculation
  ✦     affects HDDs
  ✦     moot for SSDs
   ZFS Tutorial                  USENIX LISA’11   21
To fsck or not to fsck
  • fsck was created to fix known inconsistencies in file system
        metadata
         ✦     UFS is not transactional
         ✦     metadata inconsistencies must be reconciled
         ✦     does NOT repair data – how could it?
  • ZFS doesn't need fsck, as-is
         ✦     all on-disk changes are transactional
         ✦     COW means previously existing, consistent metadata is not
               overwritten
         ✦     ZFS can repair itself
               ✤   metadata is at least dual-redundant
               ✤   data can also be redundant
  • Reality check – this does not mean that ZFS is not susceptible to
        corruption
         ✦     nor is any other file system
ZFS Tutorial                               USENIX LISA’11                  22
Pooled Storage Layer


       raw         swap dump iSCSI   ??      ZFS    NFS CIFS        ??


         ZFS Volume Emulator (Zvol)           ZFS POSIX Layer (ZPL)           pNFS Lustre   ??

                                      Transactional Object Layer

                                          Pooled Storage Layer

                                          Block Device Driver


                            HDD           SSD             iSCSI          ??




November 8, 2010                                   USENIX LISA’10                                23
vdevs – Virtual Devices

                                         Logical vdevs

                                                root vdev



                        top-level vdev                             top-level vdev
                          children[0]                                children[1]
                            mirror                                     mirror




                   vdev               vdev                      vdev              vdev
               type = disk        type = disk               type = disk       type = disk
               children[0]        children[0]               children[0]       children[0]


                                  Physical or leaf vdevs


ZFS Tutorial                                            USENIX LISA’11                      24
vdev Labels
 • vdev labels != disk labels
 • Four 256 kByte labels written to every physical vdev
 • Two-stage update process
          ✦    write label0 & label2
          ✦    flush cache & check for errors
          ✦    write label1 & label3
      ✦    flush cache & check for errors                                        N = 256k * floor(size / 256k)
                                                                                 M = 128k / MAX(1k, sector size)
      0             256k      512k               4M                             N-512k N-256k                  N

           label0        label1     boot block                                        label2          label3


                                                                                                ...
                       Boot             Name=Value
           Blank
                      header               Pairs                       M-slot Uberblock Array

       0            8k            16k                 128k                                              256k
                                                                                                      25

ZFS Tutorial                                          USENIX LISA’11
Observing Labels
# zdb -l /dev/rdsk/c0t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    name='rpool'
    state=0
    txg=13152
    pool_guid=17111649328928073943
    hostid=8781271
    hostname=''
    top_guid=11960061581853893368
    guid=11960061581853893368
    vdev_tree
        type='disk'
        id=0
        guid=11960061581853893368
        path='/dev/dsk/c0t0d0s0'
        devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a'
        phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a'
        whole_disk=0
        metaslab_array=24
        metaslab_shift=30
        ashift=9
        asize=157945167872
        is_log=0
 ZFS Tutorial                     USENIX LISA’11                         26
Uberblocks
  • Sized based on minimum device block size
  • Stored in 128-entry circular queue
  • Only one uberblock is active at any time
         ✦     highest transaction group number
         ✦     correct SHA-256 checksum
  • Stored in machine's native format
         ✦     A magic number is used to determine endian format when
               imported
  • Contains pointer to Meta Object Set (MOS)
               Device Block Size    Uberblock Size       Queue Entries
                512 Bytes,1 KB          1 KB                 128
                     2 KB               2 KB                  64
                     4 KB               4 KB                  32
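  The currently active uberblock can be dumped with zdb (example assumes the
  root pool is named rpool):

        # zdb -u rpool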
ZFS Tutorial                            USENIX LISA’11                   27
About Sizes
  •     Sizes are dynamic
  •     LSIZE = logical size
  •     PSIZE = physical size after compression
  •     ASIZE = allocated size including:
         ✦     physical size
         ✦     raidz parity
         ✦     gang blocks




                               Old notions of size reporting confuse people
ZFS Tutorial                              USENIX LISA’11                      28
VDEV



ZFS Tutorial    USENIX LISA’11   29
Dynamic Striping
  • RAID-0
         ✦     SNIA definition: fixed-length sequences of virtual disk data
               addresses are mapped to sequences of member disk addresses
               in a regular rotating pattern
  • Dynamic Stripe
         ✦     Data is dynamically mapped to member disks
         ✦     No fixed-length sequences
         ✦     Allocate up to ~1 MByte/vdev before changing vdev
         ✦     vdevs can be different size
         ✦     Good combination of the concatenation feature with RAID-0
               performance




ZFS Tutorial                         USENIX LISA’11                         30
Dynamic Striping


                    RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes



384 kBytes


                    ZFS Dynamic Stripe recordsize = 128 kBytes




                                 Total write size = 2816 kBytes



     ZFS Tutorial                        USENIX LISA’11                          31
Mirroring
  • Straightforward: put N copies of the data on N vdevs
  • Unlike RAID-1
         ✦     No 1:1 mapping at the block level
         ✦     vdev labels are still at beginning and end
         ✦     vdevs can be of different size
               ✤   effective space is that of smallest vdev
  • Arbitration: ZFS does not blindly trust either side of mirror
         ✦     Most recent, correct view of data wins
         ✦     Checksums validate data




ZFS Tutorial                               USENIX LISA’11           32
Dynamic vdev Replacement

  • zpool replace poolname vdev [vdev]
  • Today, replacing vdev must be same size or larger
         ✦     NexentaStor 2 ‒ as measured by blocks
         ✦     NexentaStor 3 ‒ as measured by metaslabs
  • Replacing all vdevs in a top-level vdev with larger vdevs results
    in top-level vdev resizing
  • Expansion policy controlled by:
         ✦     NexentaStor 2 ‒ resize on import
         ✦     NexentaStor 3 ‒ zpool autoexpand property


     (figure: a mirror of a 10G and a 15G disk provides 10G; as the disks are
      replaced with 20G drives the mirror capacity grows from 10G to 15G and,
      once every disk in the mirror is 20G, to 20G)
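  For example, replacing a smaller disk and letting the top-level vdev grow
  once all of its disks are larger (NexentaStor 3 / recent ZFS; device names
  are placeholders):

        # zpool set autoexpand=on tank
        # zpool replace tank c0t1d0 c0t5d0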


ZFS Tutorial                             USENIX LISA’11                                  33
RAIDZ
  • RAID-5
         ✦     Parity check data is distributed across the RAID array's disks
         ✦     Must read/modify/write when data is smaller than stripe width
  • RAIDZ
         ✦     Dynamic data placement
         ✦     Parity added as needed
         ✦     Writes are full-stripe writes
         ✦     No read/modify/write (write hole)
  • Arbitration: ZFS does not blindly trust any device
         ✦     Does not rely on disk reporting read error
         ✦     Checksums validate data
         ✦     If checksum fails, read parity

                               Space used is dependent on how used
ZFS Tutorial                            USENIX LISA’11                          34
RAID-5 vs RAIDZ

               DiskA   DiskB      DiskC         DiskD   DiskE
               D0:0    D0:1       D0:2          D0:3     P0
   RAID-5       P1     D1:0       D1:1          D1:2    D1:3
               D2:3     P2        D2:0          D2:1    D2:2
               D3:2    D3:3        P3           D3:0    D3:1


               DiskA   DiskB      DiskC         DiskD   DiskE
                P0     D0:0       D0:1          D0:2    D0:3
   RAIDZ        P1     D1:0       D1:1          P2:0    D2:0
               D2:1    D2:2       D2:3          P2:1    D2:4
               D2:5    Gap         P3           D3:0




ZFS Tutorial                   USENIX LISA’11                   35
RAIDZ and Block Size

    If block size >> N * sector size, space consumption is like RAID-5
       If block size = sector size, space consumption is like mirroring

 PSIZE=2KB
ASIZE=2.5KB           DiskA      DiskB       DiskC        DiskD   DiskE
                       P0        D0:0        D0:1         D0:2    D0:3
                       P1        D1:0        D1:1         P2:0    D2:0
 PSIZE=1KB            D2:1       D2:2        D2:3         P2:1    D2:4
ASIZE=1.5KB           D2:5       Gap          P3          D3:0


                  PSIZE=3KB                                PSIZE=512 bytes
               ASIZE=4KB + Gap                                ASIZE=1KB

                           Sector size = 512 bytes

                              Sector size can impact space savings
ZFS Tutorial                             USENIX LISA’11                      36
RAID-5 Write Hole
  • Occurs when data to be written is smaller than stripe size
  • Must read unallocated columns to recalculate the parity or the
    parity must be read/modify/write
  • Read/modify/write is risky for consistency
         ✦     Multiple disks
         ✦     Reading independently
         ✦     Writing independently
         ✦     System failure before all writes are complete to media could
               result in data loss
  • Effects can be hidden from host using RAID array with
        nonvolatile write cache, but extra I/O cannot be hidden from
        disks


ZFS Tutorial                           USENIX LISA’11                         37
RAIDZ2 and RAIDZ3
  • RAIDZ2 = double parity RAIDZ
  • RAIDZ3 = triple parity RAIDZ
  • Sorta like RAID-6
         ✦     Parity 1: XOR
         ✦     Parity 2: another Reed-Solomon syndrome
         ✦     Parity 3: yet another Reed-Solomon syndrome
  • Arbitration: ZFS does not blindly trust any device
         ✦     Does not rely on disk reporting read error
         ✦     Checksums validate data
         ✦     If data not valid, read parity
         ✦     If data still not valid, read other parity


                              Space used is dependent on how used
ZFS Tutorial                           USENIX LISA’11               38
Evaluating Data Retention
  • MTTDL = Mean Time To Data Loss
  • Note: MTBF is not constant in the real world, but keeps math
    simple
  • MTTDL[1] is a simple MTTDL model
  • No parity (single vdev, striping, RAID-0)
         ✦     MTTDL[1] = MTBF / N
  • Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
         ✦     MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
  • Double Parity (3-way mirror, RAIDZ2, RAID-6)
         ✦     MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
  • Triple Parity (4-way mirror, RAIDZ3)
         ✦     MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
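  A rough worked example (all numbers here are illustrative assumptions, not
  vendor specifications): for a 10-disk single-parity set with MTBF = 100,000
  hours and MTTR = 24 hours,

        MTTDL[1] = 100,000^2 / (10 * 9 * 24) ≈ 4.6 million hours ≈ 530 years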


ZFS Tutorial                         USENIX LISA’11                     39
Another MTTDL Model
  • MTTDL[1] model doesn't take into account unrecoverable
    read
  • But unrecoverable reads (UER) are becoming the dominant
    failure mode
         ✦     UER specified as errors per bits read
         ✦     More bits = higher probability of loss per vdev
  • MTTDL[2] model considers UER




ZFS Tutorial                           USENIX LISA’11            40
Why Worry about UER?
  • Richard's study
         ✦     3,684 hosts with 12,204 LUNs
         ✦     11.5% of all LUNs reported read errors
  • Bairavasundaram et al., FAST ’08
        www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
         ✦     1.53M LUNs over 41 months
         ✦     RAID reconstruction discovers 8% of checksum mismatches
         ✦     “For some drive models as many as 4% of drives develop
               checksum mismatches during the 17 months examined”




ZFS Tutorial                          USENIX LISA’11                     41
Why Worry about UER?
  • RAID array study




ZFS Tutorial           USENIX LISA’11   42
Why Worry about UER?
  • RAID array study




               Unrecoverable               Disk Disappeared
                  Reads                       “disk pull”



                           “Disk pull” tests aren’t very useful
ZFS Tutorial                        USENIX LISA’11                43
MTTDL[2] Model
  • Probability that a reconstruction will fail
         ✦     Precon_fail = (N-1) * size / UER
  • Model doesn't work for non-parity schemes
         ✦     single vdev, striping, RAID-0
  • Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
         ✦     MTTDL[2] = MTBF / (N * Precon_fail)
  • Double Parity (3-way mirror, RAIDZ2, RAID-6)
         ✦     MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
  • Triple Parity (4-way mirror, RAIDZ3)
         ✦     MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 *
               Precon_fail)




ZFS Tutorial                            USENIX LISA’11              44
Practical View of MTTDL[1]




ZFS Tutorial              USENIX LISA’11    45
MTTDL[1] Comparison




ZFS Tutorial          USENIX LISA’11   46
MTTDL Models: Mirror




                    Spares are not always better...
ZFS Tutorial              USENIX LISA’11              47
MTTDL Models: RAIDZ2




ZFS Tutorial           USENIX LISA’11   48
Space, Dependability, and Performance




ZFS Tutorial                USENIX LISA’11     49
Dependability Use Case
  •     Customer has 15+ TB of read-mostly data
  •     16-slot, 3.5” drive chassis
  •     2 TB HDDs
  •     Option 1: one raidz2 set
         ✦     24 TB available space
               ✤   12 data
               ✤   2 parity
               ✤   2 hot spares, 48 hour disk replacement time
         ✦     MTTDL[1] = 1,790,000 years
  • Option 2: two raidz2 sets
         ✦     24 TB available space total (12 TB per set)
               ✤   6 data
               ✤   2 parity
               ✤   no hot spares
         ✦     MTTDL[1] = 7,450,000 years
ZFS Tutorial                             USENIX LISA’11          50
Ditto Blocks
  • Recall that each blkptr_t contains 3 DVAs
  • Dataset property used to indicate how many copies (aka ditto
        blocks) of data is desired
         ✦     Write all copies
         ✦     Read any copy
         ✦     Recover corrupted read from a copy
  • Not a replacement for mirroring
         ✦     For single disk, can handle data loss on approximately 1/8
               contiguous space
  • Easier to describe in pictures...
                 copies parameter     Data copies       Metadata copies
                 copies=1 (default)         1                 2
                     copies=2               2                 3
                     copies=3               3                 3
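  Copies are set per dataset and, like most ZFS properties, affect only newly
  written data (dataset name is illustrative):

        # zfs set copies=2 tank/home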
ZFS Tutorial                           USENIX LISA’11                       51
Copies in Pictures




November 8, 2010         USENIX LISA’10   52
Copies in Pictures




ZFS Tutorial         USENIX LISA’11   53
When Good Data Goes Bad

               Traditional file system does a bad read and cannot tell.
               If it’s a metadata block, the FS panics or a disk rebuild is needed.
               Otherwise, bad data is returned to the application.




ZFS Tutorial                         USENIX LISA’11               54
Checksum Verification

                     ZFS verifies checksums for every read
               Repairs data when possible (mirror, raidz, copies>1)




           Read bad data      Read good data         Repair bad data

ZFS Tutorial                        USENIX LISA’11                     55
ZIO - ZFS I/O Layer




         56
ZIO Framework
  • All physical disk I/O goes through ZIO Framework
  • Translates DVAs into Logical Block Address (LBA) on leaf
        vdevs
         ✦     Keeps free space maps (spacemap)
         ✦     If contiguous space is not available:
               ✤   Allocate smaller blocks (the gang)
               ✤   Allocate gang block, pointing to the gang
  • Implemented as multi-stage pipeline
         ✦     Allows extensions to be added fairly easily
  • Handles I/O errors




ZFS Tutorial                              USENIX LISA’11       57
ZIO Write Pipeline

          ZIO State   Compression     Checksum              DVA    vdev I/O

                open
                  → compress (only if savings > 12.5%)
                  → generate checksum
                  → allocate DVA
                  → vdev I/O: start → done → assess (repeated for each leaf vdev)
                done




                               Gang and deduplication activity elided, for clarity
ZFS Tutorial                               USENIX LISA’11                            58
ZIO Read Pipeline


               ZIO State   Compression    Checksum            DVA      vdev I/O

                 open
                   → vdev I/O: start → done → assess (repeated for each leaf vdev)
                   → verify checksum
                   → decompress (if needed)
                 done




                                 Gang and deduplication activity elided, for clarity
ZFS Tutorial                                 USENIX LISA’11                            59
VDEV – Virtual Device Subsytem
  • Where mirrors, RAIDZ, and RAIDZ2 are implemented
         ✦     Surprisingly few lines of code needed to implement RAID
  • Leaf vdev (physical device) I/O management
         ✦     Number of outstanding iops
         ✦     Read-ahead cache
  • Priority scheduling

                   Name          Priority
                   NOW               0
                   SYNC_READ         0
                   SYNC_WRITE        0
                   FREE              0
                   CACHE_FILL        0
                   LOG_WRITE         0
                   ASYNC_READ        4
                   ASYNC_WRITE       4
                   RESILVER         10
                   SCRUB            20




ZFS Tutorial                            USENIX LISA’11                            60
ARC - Adaptive
Replacement Cache



        61
Object Cache
  • UFS uses page cache managed by the virtual memory system
  • ZFS does not use the page cache, except for mmap'ed files
  • ZFS uses an Adaptive Replacement Cache (ARC)
  • ARC used by DMU to cache DVA data objects
  • Only one ARC per system, but caching policy can be changed
    on a per-dataset basis
  • Seems to work much better than page cache ever did for UFS




ZFS Tutorial                USENIX LISA’11                   62
Traditional Cache
  • Works well when data being accessed was recently added
  • Doesn't work so well when frequently accessed data is evicted

               Misses cause insert



                      MRU

                                                    Dynamic caches can change
                     Cache           size           size by either not evicting
                                                    or aggressively evicting

                      LRU


                 Evict the oldest




ZFS Tutorial                                USENIX LISA’11                        63
ARC – Adaptive Replacement Cache


                 Evict the oldest single-use entry


                                LRU
                               Recent
                               Cache
                    Miss
                               MRU
                                                            Evictions and dynamic
                               MFU                   size   resizing needs to choose best
                     Hit
                                                            cache to evict (shrink)
                              Frequent
                               Cache

                                LFU


               Evict the oldest multiple accessed entry




ZFS Tutorial                                   USENIX LISA’11                               64
ARC with Locked Pages

                               Evict the oldest single-use entry


                Cannot evict                 LRU
               locked pages!
                                            Recent
                                            Cache
                                 Miss
                                            MRU
                                            MFU                size
                                  Hit
                                           Frequent
               If hit occurs                Cache
               within 62 ms
                                             LFU


                          Evict the oldest multiple accessed entry




                                ZFS ARC handles mixed-size pages

ZFS Tutorial                                USENIX LISA’11            65
L2ARC – Level 2 ARC
  • Data soon to be evicted from the ARC is added
        to a queue to be sent to cache vdev
         ✦     Another thread sends queue to cache vdev              ARC
         ✦     Data is copied to the cache vdev with a throttle
                                                                  data soon to
               to limit bandwidth consumption                      be evicted
         ✦     Under heavy memory pressure, not all evictions
               will arrive in the cache vdev
  • ARC directory remains in memory
  • Good idea - optimize cache vdev for fast reads
         ✦     lower latency than pool disks
         ✦     inexpensive way to “increase memory”
                                                                    cache
  • Content considered volatile, no raid needed
  • Monitor usage with zpool iostat and ARC kstats

ZFS Tutorial                           USENIX LISA’11                       66
ARC Directory
  • Each ARC directory entry contains arc_buf_hdr structs
         ✦     Info about the entry
         ✦     Pointer to the entry
  •     Directory entries have size, ~200 bytes
  •     ZFS block size is dynamic, sector size to 128 kBytes
  •     Disks are large
  •     Suppose we use a Seagate LP 2 TByte disk for the L2ARC
         ✦     Disk has 3,907,029,168 512 byte sectors, guaranteed
         ✦     Workload uses 8 kByte fixed record size
         ✦     RAM needed for arc_buf_hdr entries
               ✤   Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes
  • Don't underestimate the RAM needed for large L2ARCs

ZFS Tutorial                            USENIX LISA’11                      67
ARC Tips
  • In general, it seems to work well for most workloads
  • ARC size will vary, based on usage
         ✦     Default target max is 7/8 of physical memory or (memory - 1
               GByte)
         ✦     Target min is 64 MB
         ✦     Metadata capped at 1/4 of max ARC size
  • Dynamic size can be reduced when:
         ✦     page scanner is running
               ✤   freemem < lotsfree + needfree + desfree
         ✦     swapfs does not have enough space so that anonymous
               reservations can succeed
               ✤   availrmem < swapfs_minfree + swapfs_reserve + desfree
         ✦     [x86 only] kernel heap space more than 75% full
  • Can limit at boot time
ZFS Tutorial                             USENIX LISA’11                      68
Observing ARC
  • ARC statistics stored in kstats
  • kstat -n arcstats
  • Interesting statistics:
         ✦     size = current ARC size
         ✦     p = size of MFU cache
         ✦     c = target ARC size
         ✦     c_max = maximum target ARC size
         ✦     c_min = minimum target ARC size
         ✦     l2_hdr_size = space used in ARC by L2ARC
         ✦     l2_size = size of data in L2ARC
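  Individual statistics can also be pulled directly with the Solaris kstat
  module:instance:name:statistic syntax:

        # kstat -p zfs:0:arcstats:size
        # kstat -p zfs:0:arcstats:c_max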




ZFS Tutorial                         USENIX LISA’11       69
General Status - ARC




ZFS Tutorial           USENIX LISA’11   70
More ARC Tips
  • Performance
         ✦     Prior to b107, L2ARC fill rate was limited to 8 MB/sec
         ✦     After b107, cold L2ARC fill rate increases to 16 MB/sec
  • Internals tracked by kstats in Solaris
         ✦     Use memory_throttle_count to observe pressure to
               evict
  • Dedup Table (DDT) also uses ARC
         ✦     lots of dedup objects need lots of RAM
         ✦     field reports that L2ARC can help with dedup




                           L2ARC keeps its directory in kernel memory
ZFS Tutorial                            USENIX LISA’11                  71
Transactional
Object Layer



      72
flash
                               Source Code Structure

                                     File system                              Mgmt
                                                           Device
                                     Consumer             Consumer
                                                                              libzfs

                     Interface
                                        ZPL                     ZVol         /dev/zfs
                     Layer


                     Transactional     ZIL          ZAP                      Traversal
                     Object
                     Layer                    DMU                      DSL


                                                         ARC
                     Pooled
                     Storage                              ZIO
                     Layer
                                                        VDEV                 Configuration


  November 8, 2010                                 USENIX LISA’10                            73
Transaction Engine
  • Manages physical I/O
  • Transactions grouped into transaction group (txg)
         ✦     txg updates
         ✦     All-or-nothing
         ✦     Commit interval
               ✤   Older versions: 5 seconds
               ✤   Less old versions: 30 seconds
               ✤   b143 and later: 5 seconds
  • Delay committing data to physical storage
         ✦     Improves performance
         ✦     A bad thing for sync workload performance – hence the ZFS
               Intent Log (ZIL)


                          30 second delay can impact failure detection time
ZFS Tutorial                                USENIX LISA’11                    74
ZIL – ZFS Intent Log
  • DMU is transactional, and likes to group I/O into transactions
        for later commits, but still needs to handle “write it now”
        desire of sync writers
         ✦     NFS
         ✦     Databases
  • ZIL recordsize inflation can occur for some workloads
         ✦     May cause larger than expected actual I/O for sync workloads
         ✦     Oracle redo logs
         ✦     No slog: can tune zfs_immediate_write_sz,
               zvol_immediate_write_sz
         ✦     With slog: use logbias property instead
  • Never read, except at import (eg reboot), when transactions
        may need to be rolled forward

ZFS Tutorial                          USENIX LISA’11                          75
Separate Logs (slogs)
  • ZIL competes with pool for IOPS
         ✦     Applications wait for sync writes to be on nonvolatile media
         ✦     Very noticeable on HDD JBODs
  • Put ZIL on separate vdev, outside of pool
         ✦     ZIL writes tend to be sequential
         ✦     No competition with pool for IOPS
         ✦     Downside: slog device required to be operational at import
         ✦     NexentaStor 3 allows slog device removal
         ✦     Size of separate log < size of RAM (duh)
  • 10x or more performance improvements possible
         ✦     Nonvolatile RAM card
         ✦     Write-optimized SSD
         ✦     Nonvolatile write cache on RAID array
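  Example of adding a mirrored slog to an existing pool (device names are
  placeholders; mirroring the log protects in-flight sync writes if one log
  device fails):

        # zpool add tank log mirror c4t0d0 c5t0d0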

ZFS Tutorial                           USENIX LISA’11                         76
zilstat
  • http://www.richardelling.com/Home/scripts-and-programs-1/
    zilstat
  • Integrated into NexentaStor 3.0.3
         ✦     nmc: show performance zil




ZFS Tutorial                    USENIX LISA’11                  77
Synchronous Write Destination

                      Without separate log
                    Sync I/O size >
                                                        ZIL Destination
               zfs_immediate_write_sz ?
                         no                                    ZIL log
                         yes                                bypass to pool


                        With separate log
                           logbias?                 ZIL Destination
                       latency (default)                    log device
                          throughput                 bypass to pool


                          Default zfs_immediate_write_sz = 32 kBytes
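  With a separate log, logbias is set per dataset; for example, steering large
  Oracle redo log writes straight to the pool (dataset name is illustrative):

        # zfs set logbias=throughput tank/oraredo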
ZFS Tutorial                               USENIX LISA’11                    78
ZIL Synchronicity Project
  • All-or-nothing policies don’t work well, in general
  • ZIL Synchronicity project proposed by Robert Milkowski
         ✦     http://milek.blogspot.com
  • Adds new sync property to datasets
  • Arrived in b140
                sync Parameter                           Behaviour
                                    Policy follows previous design: write
               standard (default)
                                    immediate size and separate logs
                    always          All writes become synchronous (slow)
                   disabled         Synchronous write requests are ignored
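  The sync property is set and inspected per dataset (dataset name is
  illustrative):

        # zfs get sync tank/nfs
        # zfs set sync=standard tank/nfs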




ZFS Tutorial                            USENIX LISA’11                       79
Disabling the ZIL

  •     Preferred method: change dataset sync property
  •     Rule 0: Don’t disable the ZIL
  •     If you love your data, do not disable the ZIL
  •     You can find references to this as a way to speed up ZFS
         ✦     NFS workloads
         ✦     “tar -x” benchmarks
  •     Golden Rule: Don’t disable the ZIL
  •     Can set via mdb, but need to remount the file system
  •     Friends don’t let friends disable the ZIL
  •     Older Solaris - can set in /etc/system
  •     NexentaStor has checkbox for disabling ZIL
  •     Nostradamus wrote, “disabling the ZIL will lead to the
        apocalypse”
ZFS Tutorial                         USENIX LISA’11               80
DSL - Dataset and
 Snapshot Layer



        81
Dataset & Snapshot Layer
  • Object
         ✦     Allocated storage
         ✦     dnode describes collection of blocks
  • Object Set
          ✦     Group of related objects
  • Dataset
         ✦     Snapmap: snapshot relationships
         ✦     Space usage
  • Dataset directory
         ✦     Childmap: dataset relationships
         ✦     Properties

                (figure: a dataset directory holds properties, a childmap, and a dataset;
                 the dataset points to an object set of objects and to a snapmap)




ZFS Tutorial                           USENIX LISA’11                             82
flash
                                        Copy on Write

                      1. Initial block tree                    2. COW some data




                      3. COW metadata                  4. Update Uberblocks & free




       ZFS Tutorial                           USENIX LISA’11                          83
zfs snapshot
  • Create a read-only, point-in-time window into the dataset (file
    system or Zvol)
  • Computationally free, because of COW architecture
  • Very handy feature
         ✦     Patching/upgrades
  • Basis for time-related snapshot interfaces
         ✦     Solaris Time Slider
         ✦     NexentaStor Delorean Plugin
         ✦     NexentaStor Virtual Machine Data Center
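  For example (dataset names are illustrative); -r snapshots an entire dataset
  subtree atomically:

        # zfs snapshot -r tank/home@before-patching
        # zfs list -t snapshot -r tank/home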




ZFS Tutorial                         USENIX LISA’11              84
Snapshot
  •     Create a snapshot by not free'ing COWed blocks
  •     Snapshot creation is fast and easy
  •     Number of snapshots determined by use – no hardwired limit
  •     Recursive snapshots also possible


               Snapshot tree                      Current tree
                   root                              root




ZFS Tutorial                     USENIX LISA’11                  85
auto-snap service




ZFS Tutorial         USENIX LISA’11   86
Clones
  • Snapshots are read-only
  • Clones are read-write based upon a snapshot
  • Child depends on parent
         ✦     Cannot destroy parent without destroying all children
         ✦     Can promote children to be parents
  • Good ideas
         ✦     OS upgrades
         ✦     Change control
         ✦     Replication
               ✤   zones
               ✤   virtual disks




ZFS Tutorial                           USENIX LISA’11                  87
zfs clone
  • Create a read-write file system from a read-only snapshot
  • Solaris boot environment administration
       Install   Checkpoint   Clone                             Checkpoint

       OS rev1     OS rev1     OS rev1                OS rev1      OS rev1

                    rootfs-    rootfs-                rootfs-      rootfs-
                     nmu-       nmu-                   nmu-         nmu-
                      001        001                    001          001


                                          patch/
                              OS rev1                OS rev1      OS rev1
                                         upgrade
                               clone                  clone        clone
                                                                   rootfs-
                                                                    nmu-
                                                                     002
                                                                             grubboot
                                                                             manager


                   Origin snapshot cannot be destroyed, if clone exists
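  A minimal sketch (names are illustrative): clone a snapshot, then promote the
  clone so the original file system can later be destroyed:

        # zfs snapshot tank/ws@stable
        # zfs clone tank/ws@stable tank/ws-test
        # zfs promote tank/ws-test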


ZFS Tutorial                                   USENIX LISA’11                           88
Deduplication




      89
What is Deduplication?
  • A $2.1 Billion feature
  • 2009 buzzword of the year
  • Technique for improving storage space efficiency
         ✦     Trades big I/Os for small I/Os
         ✦     Does not eliminate I/O
  • Implementation styles
         ✦     offline or post processing
               ✤   data written to nonvolatile storage
               ✤   process comes along later and dedupes data
               ✤   example: tape archive dedup
         ✦     inline
               ✤   data is deduped as it is being allocated to nonvolatile storage
               ✤   example: ZFS


ZFS Tutorial                              USENIX LISA’11                             90
Dedup how-to
  • Given a bunch of data
  • Find data that is duplicated
  • Build a lookup table of references to data
  • Replace duplicate data with a pointer to the entry in the
    lookup table
  • Granularity
         ✦     file
         ✦     block
         ✦     byte




ZFS Tutorial                  USENIX LISA’11                    91
Dedup in ZFS
  • Leverage block-level checksums
         ✦     Identify blocks which might be duplicates
         ✦     Variable block size is ok
  • Synchronous implementation
         ✦     Data is deduped as it is being written
  • Scalable design
         ✦     No reference count limits
  • Works with existing features
         ✦     compression
         ✦     copies
         ✦     scrub
         ✦     resilver
  • Implemented in ZIO pipeline
ZFS Tutorial                           USENIX LISA’11      92
Deduplication Table (DDT)
  • Internal implementation
         ✦     Adelson-Velsky and Landis (AVL) tree
         ✦     Typical table entry ~270 bytes
               ✤   checksum
               ✤   logical size
               ✤   physical size
               ✤   references
         ✦     Table entry size increases as the number of references
               increases




ZFS Tutorial                           USENIX LISA’11                   93
Reference Counts




                Eggs courtesy of Richard’s chickens
ZFS Tutorial                  USENIX LISA’11          94
Reference Counts
  • Problem: loss of the referenced data affects all referrers
  • Solution: make additional copies of referred data based upon a
        threshold count of referrers
         ✦     leverage copies (ditto blocks)
         ✦     pool-level threshold for automatically adding ditto copies
               ✤   set via dedupditto pool property
                # zpool set dedupditto=50 zwimming

               ✤   add 2nd copy when dedupditto references (50) reached
               ✤   add 3rd copy when dedupditto2 references (2500) reached




ZFS Tutorial                            USENIX LISA’11                       95
Verification

                    write()
                       ↓
                    compress
                       ↓
                    checksum
                       ↓
                 DDT entry lookup
                       ↓
                 DDT match? ── no ──────────────→ new entry
                       │ yes
                    verify? ── no ──────────────→ add reference
                       │ yes
                    read data
                       ↓
                 data match? ── yes ────────────→ add reference
                       │ no
                    new entry


ZFS Tutorial                                USENIX LISA’11                   96
Enabling Dedup
  • Set dedup property for each dataset to be deduped
  • Remember: properties are inherited
  • Remember: only applies to newly written data

                  dedup             checksum              verify?
                    on
                                     SHA256                 no
                  sha256
                 on,verify
                                     SHA256                yes
               sha256,verify




                      Fletcher is considered too weak, without verify
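  For example (dataset name is illustrative); the pool-wide savings show up in
  the dedupratio property:

        # zfs set dedup=sha256,verify tank/vm
        # zpool get dedupratio tank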
ZFS Tutorial                             USENIX LISA’11                 97
Dedup Accounting
  • ...and you thought compression accounting was hard...
  • Remember: dedup works at pool level
         ✦     dataset-level accounting doesn’t see other datasets
         ✦     pool-level accounting is always correct

zfs list
           NAME           USED       AVAIL      REFER             MOUNTPOINT
           bar           7.56G        449G        22K             /bar
           bar/ws        7.56G        449G      7.56G             /bar/ws
           dozer         7.60G        455G        22K             /dozer
           dozer/ws      7.56G        455G      7.56G             /dozer/ws
           tank          4.31G        456G        22K             /tank
           tank/ws       4.27G        456G      4.27G             /tank/ws

zpool list
           NAME       SIZE   ALLOC       FREE    CAP         DEDUP        HEALTH   ALTROOT
           bar        464G   7.56G       456G     1%         1.00x        ONLINE   -
           dozer      464G   1.43G       463G     0%         5.92x        ONLINE   -
           tank       464G    957M       463G     0%         5.39x        ONLINE   -



                                                    Data courtesy of the ZFS team
 ZFS Tutorial                                       USENIX LISA’11                              98
DDT Histogram

      # zdb -DD tank
  DDT-sha256-zap-duplicate: 110173 entries, size 295 on disk, 153 in core
  DDT-sha256-zap-unique: 302 entries, size 42194 on disk, 52827 in core

  DDT histogram (aggregated over all DDTs):

  bucket               allocated                        referenced
  ______ ___________________________        ___________________________
  refcnt blocks LSIZE PSIZE DSIZE           blocks LSIZE PSIZE DSIZE
  ------ ------ ----- ----- -----           ------ ----- ----- -----
       1    302 7.26M 4.24M 4.24M              302 7.26M 4.24M 4.24M
       2   103K 1.12G    712M    712M         216K 2.64G 1.62G 1.62G
       4  3.11K 30.0M 17.1M 17.1M            14.5K   168M 95.2M 95.2M
       8    503 11.6M 6.16M 6.16M            4.83K   129M 68.9M 68.9M
      16    100 4.22M 1.92M 1.92M            2.14K   101M 45.8M 45.8M




                                   Data courtesy of the ZFS team
ZFS Tutorial                                      USENIX LISA’11              99
DDT Histogram

$ zdb -DD zwimming
DDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in core
DDT-sha256-zap-unique: 52369639 entries, size 284 on disk, 159 in core

DDT histogram (aggregated over all DDTs):

bucket                      allocated                           referenced
______           ______________________________       ______________________________
refcnt           blocks   LSIZE   PSIZE   DSIZE       blocks   LSIZE   PSIZE   DSIZE
------           ------   -----   -----   -----       ------   -----   -----   -----
     1            49.9M   25.0G   25.0G   25.0G        49.9M   25.0G   25.0G   25.0G
     2            16.7K   8.33M   8.33M   8.33M        33.5K   16.7M   16.7M   16.7M
     4              610    305K    305K    305K        3.33K   1.66M   1.66M   1.66M
     8              661    330K    330K    330K        6.67K   3.34M   3.34M   3.34M
    16              242    121K    121K    121K        5.34K   2.67M   2.67M   2.67M
    32              131   65.5K   65.5K   65.5K        5.54K   2.77M   2.77M   2.77M
    64              897    448K    448K    448K          84K     42M     42M     42M
   128              125   62.5K   62.5K   62.5K        18.0K   8.99M   8.99M   8.99M
    8K                1     512     512     512        12.5K   6.27M   6.27M   6.27M
 Total            50.0M   25.0G   25.0G   25.0G        50.1M   25.1G   25.1G   25.1G

dedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00



  ZFS Tutorial                             USENIX LISA’11                              100
Over-the-wire Dedup
  • Dedup is also possible over the send/receive pipe
         ✦     Blocks with no checksum are considered duplicates (no verify
               option)
         ✦     First copy sent as usual
         ✦     Subsequent copies sent by reference
  • Independent of dedup status of originating pool
         ✦     Receiving pool knows about blocks which have already arrived
  • Can be a win for dedupable data, especially over slow wires
  • Remember: send/receive version rules still apply

               # zfs send -DR zwimming/stuff




ZFS Tutorial                          USENIX LISA’11                          101
Dedup Performance
  • Dedup can save space and bandwidth
  • Dedup increases latency
         ✦     Caching data improves latency
         ✦     More memory → more data cached
         ✦     Cache performance hierarchy
               ✤   RAM: fastest
               ✤   L2ARC on SSD: slower
               ✤   Pool HDD: dreadfully slow
  • ARC is currently not deduped
  • Difficult to predict
         ✦     Dependent variable: number of blocks
         ✦     Estimate 270 bytes per unique block
         ✦     Example:
               ✤   50M blocks * 270 bytes/block = 13.5 GBytes
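
       A back-of-the-envelope sketch using the zdb DDT summary shown earlier
       (pool name and the ~270 bytes/entry rule of thumb as above):

                # zdb -DD zwimming
                DDT-sha256-zap-duplicate: 110173 entries, ...
                DDT-sha256-zap-unique: 302 entries, ...
                # (110173 + 302) entries * 270 bytes/entry ≈ 30 MBytes of RAM for this DDT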
ZFS Tutorial                            USENIX LISA’11          102
Deduplication Use Cases

                    Data type                    Dedupe     Compression
                Home directories                    ✔✔         ✔✔

                 Internet content                       ✔       ✔

                 Media and video                    ✔✔          ✔

                   Life sciences                        ✘      ✔✔

               Oil and Gas (seismic)                    ✘      ✔✔

                 Virtual machines                   ✔✔          ✘

                     Archive                    ✔✔✔✔            ✔




ZFS Tutorial                           USENIX LISA’11                     103
zpool Command




      104
Pooled Storage Layer


       raw         swap dump iSCSI   ??      ZFS    NFS CIFS        ??


         ZFS Volume Emulator (Zvol)           ZFS POSIX Layer (ZPL)           pNFS Lustre   ??

                                      Transactional Object Layer

                                          Pooled Storage Layer

                                          Block Device Driver


                            HDD           SSD             iSCSI          ??




November 8, 2010                                   USENIX LISA’10                                105
zpool create
  • zpool create poolname vdev-configuration
  • nmc: setup volume create
    ✦ vdev-configuration examples

               ✤   mirror c0t0d0 c3t6d0
               ✤   mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
               ✤   mirror disk1s0 disk2s0 cache disk4s0 log disk5
               ✤   raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
  • Solaris
         ✦     Additional checks for disk/slice overlaps or in use
         ✦     Whole disks are given EFI labels
  • Can set initial pool or dataset properties
  • By default, creates a file system with the same name
         ✦     poolname pool → /poolname file system
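
         A minimal sketch (pool and disk names are hypothetical):

                # zpool create tank mirror c0t0d0 c3t6d0
                # zfs list tank     # shows the /tank file system created by default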

                       People get confused by a file system with same name as the pool
ZFS Tutorial                                USENIX LISA’11                              106
zpool destroy
  • Destroy the pool and all datasets therein
  • zpool destroy poolname
         ✦     Can (try to) force with “-f”
         ✦     There is no “are you sure?” prompt – if you weren't sure, you
               would not have typed “destroy”
  • nmc: destroy volume volumename
         ✦     nmc prompts for confirmation, by default




                          zpool destroy is destructive... really! Use with caution!
ZFS Tutorial                               USENIX LISA’11                             107
zpool add
  • Adds a device to the pool as a top-level vdev
  • Does NOT add columns to a raidz set
  • Does NOT attach a mirror – use zpool attach instead
  • zpool add poolname vdev-configuration
     ✦ vdev-configuration can be any combination also used for zpool create
     ✦ Complains if the added vdev-configuration would cause a different data
       protection scheme than is already in use
               ✤   use “-f” to override
         ✦     Good idea: try with “-n” flag first
               ✤   will show final configuration without actually performing the add
  • nmc: setup            volume volumename grow
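
       For example, a dry run of growing a mirrored pool (pool and disk names
       are hypothetical):

                # zpool add -n tank mirror c5t0d0 c6t0d0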



                        Do not add a device which is in use as a cluster quorum device
ZFS Tutorial                               USENIX LISA’11                                108
zpool remove
  • Remove a top-level vdev from the pool
  • zpool remove poolname vdev
  • nmc: setup volume volumename remove-lun
  • Today, you can only remove the following vdevs:
    ✦ cache

    ✦ hot spare

    ✦ separate log (b124, NexentaStor 3.0)




                 Don't confuse “remove” with “detach”
ZFS Tutorial                    USENIX LISA’11          109
zpool attach
  • Attach a vdev as a mirror to an existing vdev
  • zpool attach poolname existing-vdev vdev
  • nmc: setup volume volumename attach-lun
  • Attaching vdev must be the same size or larger than the
    existing vdev

                       vdev Configurations
                ok        simple vdev → mirror
                ok                mirror
                ok         log → mirrored log
                no                RAIDZ
                no               RAIDZ2
                no               RAIDZ3
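
       Example of converting a simple vdev into a two-way mirror (pool and
       disk names are hypothetical):

                # zpool attach tank c0t0d0 c1t0d0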

ZFS Tutorial                  USENIX LISA’11                  110
zpool detach
  •     Detach a vdev from a mirror
  •     zpool detach poolname vdev
  •     nmc: setup volume volumename detach-lun
  •     Detaching a vdev that is resilvering waits until resilvering is complete




ZFS Tutorial                       USENIX LISA’11                    111
zpool replace
  • Replaces an existing vdev with a new vdev
  • zpool replace poolname existing-vdev vdev
  • nmc: setup volume volumename replace-lun
  • Effectively, a shorthand for “zpool attach” followed by “zpool
    detach”
  • Attaching vdev must be the same size or larger than the
    existing vdev
  • Works for any top-level vdev-configuration, including RAIDZ
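
       Example (pool and disk names are hypothetical):

                # zpool replace tank c0t2d0 c4t2d0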




               “Same size” literally means the same number of blocks until b117.
                Many “same size” disks have a different number of available blocks.
ZFS Tutorial                         USENIX LISA’11                                 112
zpool import
  • Import a pool and mount all mountable datasets
  • Import a specific pool
         ✦     zpool import poolname
         ✦     zpool import GUID
         ✦     nmc: setup volume import
  • Scan LUNs for pools which may be imported
         ✦     zpool import
  • Can set options, such as alternate root directory or other
        properties
         ✦     alternate root directory important for rpool or syspool
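
       For example, importing a pool under an alternate root directory (pool
       name hypothetical):

                # zpool import -R /a rpool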


                       Beware of zpool.cache interactions

                       Beware of artifacts, especially partial artifacts
ZFS Tutorial                                 USENIX LISA’11                113
zpool export
  • Unmount datasets and export the pool
  • zpool export poolname
  • nmc: setup volume volumename export
  • Removes pool entry from zpool.cache
         ✦     useful when unimported pools remain in zpool.cache




ZFS Tutorial                          USENIX LISA’11                114
zpool upgrade

• Display current versions
     ✦     zpool upgrade
• View available upgrade versions, with features, but don't
    actually upgrade
     ✦     zpool upgrade -v
• Upgrade pool to latest version
     ✦     zpool upgrade poolname
     ✦     nmc: setup volume volumename version-
           upgrade
• Upgrade pool to specific version
     ✦     zpool upgrade -V version poolname

                   Once you upgrade, there is no downgrade

                   Beware of grub and rollback issues
ZFS Tutorial                       USENIX LISA’11             115
zpool history
  • Show history of changes made to the pool
  • nmc and Solaris use same command
# zpool history rpool
History for 'rpool':
2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o
cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
2009-03-04.07:29:47 zfs set canmount=noauto rpool
2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
2009-03-04.07:29:51 zfs set canmount=on rpool
2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
2009-03-04.07:29:51 zfs create rpool/export/home
2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
2009-03-04.00:21:42 zpool export rpool
2009-03-04.08:47:08 zpool set bootfs=rpool rpool
2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108
...


ZFS Tutorial                         USENIX LISA’11                             116
zpool status
  • Shows the status of the current pools, including their
    configuration
  • Important troubleshooting step
  • nmc and Solaris use same command
       # zpool status
       …
         pool: zwimming
         state: ONLINE
       status: The pool is formatted using an older on-disk format. The pool can
                still be used, but some features are unavailable.
       action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
                pool will no longer be accessible on older software versions.
         scrub: none requested
       config:
               NAME            STATE     READ WRITE CKSUM
               zwimming        ONLINE       0     0     0
                 mirror        ONLINE       0     0     0
                    c0t2d0s0   ONLINE       0     0     0
                    c0t0d0s7   ONLINE       0     0     0
       errors: No known data errors




                        Understanding status output error messages can be tricky
ZFS Tutorial                                  USENIX LISA’11                       117
zpool clear
  •     Clears device errors
  •     Clears device error counters
  •     Starts any resilvering, as needed
  •     Improves sysadmin sanity and reduces sweating
  • zpool clear poolname
  • nmc: setup volume volumename clear-errors




ZFS Tutorial                    USENIX LISA’11          118
zpool iostat
  • Show pool physical I/O activity, in an iostat-like manner
  • Solaris: fsstat shows I/O activity as seen at the ZFS file system layer
  • Especially useful for showing slog activity
  • nmc and Solaris use same command
          # zpool iostat -v
                              capacity       operations       bandwidth
          pool              used avail      read write       read write
          ------------     ----- -----     ----- -----      ----- -----
          rpool            16.5G   131G        0      0     1.16K 2.80K
            c0t0d0s0       16.5G   131G        0      0     1.16K 2.80K
          ------------     ----- -----     ----- -----      ----- -----
          zwimming          135G 14.4G         0      5     2.09K 27.3K
            mirror          135G 14.4G         0      5     2.09K 27.3K
               c0t2d0s0        -       -       0      3     1.25K 27.5K
               c0t0d0s7        -       -       0      2     1.27K 27.5K
          ------------     ----- -----     ----- -----      ----- -----



                          Unlike iostat, does not show latency
ZFS Tutorial                               USENIX LISA’11                 119
zpool scrub
  • Manually starts scrub
         ✦     zpool scrub poolname
  • Scrubbing performed in background
  • Use zpool status to track scrub progress
  • Stop scrub
         ✦     zpool scrub -s poolname
  • How often to scrub?
         ✦     Depends on level of paranoia
         ✦     Once per month seems reasonable
         ✦     After a repair or recovery procedure
  • NexentaStor's auto-scrub feature easily manages scrubs and
    schedules
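
       A hypothetical sketch of scheduling a monthly scrub with cron (pool
       name and zpool path are assumptions):

                # run at 02:00 on the 1st of each month
                0 2 1 * * /usr/sbin/zpool scrub tank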

                         Estimated scrub completion time improves over time
ZFS Tutorial                             USENIX LISA’11                       120
auto-scrub service




ZFS Tutorial         USENIX LISA’11   121
zfs Command




     122
Dataset Management


       raw         swap dump iSCSI   ??      ZFS    NFS CIFS        ??


         ZFS Volume Emulator (Zvol)           ZFS POSIX Layer (ZPL)           pNFS Lustre   ??

                                      Transactional Object Layer

                                          Pooled Storage Layer

                                          Block Device Driver


                            HDD           SSD             iSCSI          ??




November 8, 2010                                   USENIX LISA’10                                123
zfs create, destroy
  • By default, a file system with the same name as the pool is
    created by zpool create
  • Dataset name format is: pool/name[/name ...]
  • File system / folder
         ✦     zfs create dataset-name
         ✦     nmc: create folder
         ✦     zfs destroy dataset-name
         ✦     nmc: destroy folder
  • Zvol
         ✦     zfs create -V size dataset-name
         ✦     nmc: create zvol
         ✦     zfs destroy dataset-name
         ✦     nmc: destroy zvol
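
       Examples (dataset names and zvol size are hypothetical):

                # zfs create tank/home                # file system
                # zfs create -V 10G tank/vol1         # 10 GByte zvol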

ZFS Tutorial                   USENIX LISA’11                    124
zfs mount, unmount
  • Note: mount point is a file system parameter
         ✦     zfs get mountpoint fs-name
  • Rarely used subcommand (!)
  • Display mounted file systems
         ✦     zfs mount
  • Mount a file system
         ✦     zfs mount fs-name
         ✦     zfs mount -a
  • Unmount (not umount)
         ✦     zfs unmount fs-name
         ✦     zfs unmount -a




ZFS Tutorial                   USENIX LISA’11     125
zfs list
  • List mounted datasets
  • NexentaStor 2: listed everything
  • NexentaStor 3: does not list snapshots by default
         ✦     See zpool listsnapshots property
  • Examples
         ✦     zfs list
         ✦     zfs list -t snapshot
         ✦     zfs list -H -o name




ZFS Tutorial                          USENIX LISA’11   126
Replication Services


              [Figure: Recovery Point Objective vs. system I/O performance
               (slower on the left, faster on the right)]

                 Days      Traditional backup (NDMP)
                 Hours     Auto-Tier (rsync)
                           Auto-Sync (ZFS send/receive)
                 Seconds   Auto-CDP / AVS (SNDR), mirror, application-level replication

    ZFS Tutorial                            USENIX LISA’11                         127
zfs send, receive
  • Send
         ✦     send a snapshot to stdout
         ✦     data is decompressed
  • Receive
         ✦     receive a snapshot from stdin
         ✦     receiving file system parameters apply (compression, etc.)
  • Can incrementally send snapshots in time order
  • Handy way to replicate dataset snapshots
  • NexentaStor
         ✦     simplifies management
         ✦     manages snapshots and send/receive to remote systems
  • Only method for replicating dataset properties, except quotas
  • NOT a replacement for traditional backup solutions
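
       A minimal replication sketch (snapshot, host, and dataset names are
       hypothetical): a full send first, then incremental sends in time order:

                # zfs snapshot zwimming/stuff@mon
                # zfs send zwimming/stuff@mon | ssh backuphost zfs receive backup/stuff
                # zfs snapshot zwimming/stuff@tue
                # zfs send -i @mon zwimming/stuff@tue | ssh backuphost zfs receive backup/stuff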
ZFS Tutorial                           USENIX LISA’11                       128
auto-sync Service




ZFS Tutorial         USENIX LISA’11   129
zfs upgrade

• Display current versions
     ✦         zfs upgrade
• View available upgrade versions, with features, but don't
    actually upgrade
     ✦         zfs upgrade -v
• Upgrade dataset to latest version
     ✦         zfs upgrade dataset
• Upgrade dataset to specific version
     ✦         zfs upgrade -V version dataset
• NexentaStor: not needed until 3.0
                         You can upgrade, there is no downgrade

                         Beware of grub and rollback issues
ZFS Tutorial                          USENIX LISA’11              130
Sharing




   131
Sharing
  • zfs share dataset
  • Type of sharing set by parameters
    ✦ shareiscsi = [on | off]

    ✦ sharenfs = [on | off | options]

    ✦ sharesmb = [on | off | options]

  • Shortcut to manage sharing
    ✦ Uses external services (nfsd, iscsi target, smbshare, etc)

    ✦ Importing pool will also share

    ✦ Implementation is OS-specific

               ✤   sharesmb uses in-kernel SMB server for Solaris-derived OSes
               ✤   sharesmb uses Samba for FreeBSD
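
       For example (dataset name hypothetical), sharing is enabled by setting
       the property, and everything shareable can be re-shared with zfs share:

                # zfs set sharenfs=on tank/home
                # zfs set sharesmb=on tank/home
                # zfs share -a        # share all datasets with sharing enabled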




ZFS Tutorial                             USENIX LISA’11                          132
Properties




    133
Properties

  • Properties are stored in an nvlist
  • By default, are inherited
  • Some properties are common to all datasets, but a specific
    dataset type may have additional properties
  • Easily set or retrieved via scripts
  • In general, properties affect future file system activity




                 zpool get doesn't script as nicely as zfs get
ZFS Tutorial                     USENIX LISA’11                  134
Getting Properties

• zpool get all poolname
• nmc: show volume volumename property
  propertyname
• zpool get propertyname poolname

• zfs get all dataset-name
• nmc: show folder foldername property
• nmc: show zvol zvolname property
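
   A script-friendly sketch (dataset name hypothetical): -H drops headers and
   -o selects columns:

        $ zfs get -H -o name,value compression tank/home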




ZFS Tutorial         USENIX LISA’11      135
Setting Properties

• zpool set propertyname=value poolname
• nmc: setup volume volumename property
      propertyname

• zfs set propertyname=value dataset-name
• nmc: setup folder foldername property
      propertyname
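
   Examples (pool and dataset names are hypothetical):

        # zpool set autoreplace=on tank
        # zfs set compression=on tank/home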




ZFS Tutorial          USENIX LISA’11        136
User-defined Properties
  • Names
         ✦     Must include colon ':'
         ✦     Can contain lower case alphanumerics or “+” “.” “_”
         ✦     Max length = 256 characters
         ✦     By convention, module:property
               ✤   com.sun:auto-snapshot
  • Values
         ✦     Max length = 1024 characters
  • Examples
         ✦     com.sun:auto-snapshot=true
         ✦     com.richardelling:important_files=true
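
       User properties are set and queried with the same commands (dataset
       name hypothetical):

                # zfs set com.sun:auto-snapshot=false tank/swap
                # zfs get com.sun:auto-snapshot tank/swap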




ZFS Tutorial                               USENIX LISA’11            137
Clearing Properties
  • Reset to inherited value
         ✦     zfs inherit compression export/home/relling
  • Clear user-defined parameter
         ✦     zfs inherit com.sun:auto-snapshot export/
               home/relling
  • NexentaStor doesn’t offer method in nmc




ZFS Tutorial                   USENIX LISA’11              138
Pool Properties

      Property     Change?                   Brief Description
        altroot              Alternate root directory (ala chroot)
   autoexpand                Policy for expanding when vdev size changes
   autoreplace               vdev replacement policy
       available   readonly Available storage space
         bootfs              Default bootable dataset for root pool
       cachefile              Cache file to use other than /etc/zfs/zpool.cache
       capacity    readonly Percent of pool space used
     delegation              Master pool delegation switch
       failmode              Catastrophic pool failure policy



ZFS Tutorial                           USENIX LISA’11                      139
More Pool Properties


               Property   Change?                    Brief Description
                 guid     readonly Unique identifier
                health    readonly Current health of the pool
        listsnapshots               zfs list policy
                 size     readonly Total size of pool
                used      readonly Amount of space used
               version    readonly Current on-disk version




ZFS Tutorial                              USENIX LISA’11                 140
Common Dataset Properties

               Property    Change?                       Brief Description
               available   readonly   Space available to dataset & children
               checksum               Checksum algorithm
          compression                 Compression algorithm
        compressratio      readonly Compression ratio – logical size : referenced physical size
                copies                Number of copies of user data
               creation    readonly   Dataset creation time
                dedup                 Deduplication policy
                logbias               Separate log write policy
               mlslabel               Multilayer security label
                origin     readonly   For clones, origin snapshot


ZFS Tutorial                            USENIX LISA’11                        141
More Common Dataset Properties


                 Property          Change?                 Brief Description
            primarycache                      ARC caching policy
            readonly                          Is dataset in readonly mode?
            referenced        readonly        Size of data accessible by this dataset
            refreservation                    Minimum space guaranteed to a dataset, excluding
                                              descendants (snapshots & clones)
            reservation                       Minimum space guaranteed to dataset, including
                                              descendants
            secondarycache                    L2ARC caching policy
            sync                              Synchronous write policy
            type              readonly        Type of dataset (filesystem, snapshot, volume)

ZFS Tutorial                             USENIX LISA’11                           142
More Common Dataset Properties


                 Property               Change?               Brief Description
            used                   readonly   Sum of usedby* (see below)
            usedbychildren         readonly   Space used by descendants
            usedbydataset          readonly   Space used by dataset
            usedbyrefreservation   readonly   Space used by a refreservation for this dataset
            usedbysnapshots        readonly   Space used by all snapshots of this dataset
            zoned                  readonly   Is dataset added to non-global zone (Solaris)




ZFS Tutorial                               USENIX LISA’11                        143
Volume Dataset Properties


                 Property       Change?                Brief Description
            shareiscsi                     iSCSI service (not COMSTAR)
            volblocksize   creation        Fixed block size
            volsize                        Implicit quota
            zoned          readonly        Set if dataset delegated to non-global zone (Solaris)




ZFS Tutorial                        USENIX LISA’11                                 144
File System Properties

                 Property          Change?                  Brief Description
            aclinherit                       ACL inheritance policy, when files or directories
                                             are created
            aclmode                          ACL modification policy, when chmod is used
            atime                            Disable access time metadata updates
            canmount                         Mount policy
            casesensitivity   creation       Filename matching algorithm (CIFS client feature)
            devices                          Device opening policy for dataset
            exec                             File execution policy for dataset
            mounted           readonly       Is file system currently mounted?



ZFS Tutorial                                USENIX LISA’11                             145
More File System Properties

      Property        Change?                          Brief Description
      nbmand          export/import   File system should be mounted with non-blocking
                                      mandatory locks (CIFS client feature)
      normalization   creation        Unicode normalization of file names for matching
      quota                           Max space dataset and descendants can consume
      recordsize                      Suggested maximum block size for files
      refquota                        Max space dataset can consume, not including
                                      descendants
      setuid                          setuid mode policy
      sharenfs                        NFS sharing options
      sharesmb                        File system shared with CIFS



ZFS Tutorial                          USENIX LISA’11                           146
File System Properties


               Property   Change?                    Brief Description
               snapdir               Controls whether .zfs directory is hidden
               utf8only   creation       UTF-8 character file name policy
                vscan                               Virus scan enabled
                xattr                        Extended attributes policy




ZFS Tutorial                             USENIX LISA’11                          147
Forking Properties

                                Pool Properties
       Release              Property                          Brief Description
        illumos             comment            Human-readable comment field

                             Dataset Properties
                Release          Property                       Brief Description
               Solaris 11       encryption                     Dataset encryption
       Delphix/illumos            clones                       Clone descendants
       Delphix/illumos            refratio       Compression ratio for references
               Solaris 11          share           Combines sharenfs & sharesmb
               Solaris 11         shadow                          Shadow copy
   NexentaOS/illumos              worm                           WORM feature
       Delphix/illumos          written         Amount of data written since last snapshot
ZFS Tutorial                                 USENIX LISA’11                          148
More Goodies




     149
Dataset Space Accounting
    • used = usedbydataset + usedbychildren + usedbysnapshots +
      usedbyrefreservation
    • Lazy updates, may not be correct until txg commits
    • ls and du will show size of allocated files which includes all
      copies of a file
    • Shorthand report available

$ zfs list -o space
NAME                  AVAIL    USED   USEDSNAP         USEDDS   USEDREFRESERV   USEDCHILD
rpool                  126G   18.3G          0          35.5K               0       18.3G
rpool/ROOT             126G   15.3G          0            18K               0       15.3G
rpool/ROOT/snv_106     126G   86.1M          0          86.1M               0           0
rpool/ROOT/snv_b108    126G   15.2G      5.89G          9.28G               0           0
rpool/dump             126G   1.00G          0          1.00G               0           0
rpool/export           126G     37K          0            19K               0         18K
rpool/export/home      126G     18K          0            18K               0           0
rpool/swap             128G      2G          0           193M           1.81G           0



  ZFS Tutorial                        USENIX LISA’11                                   150
Pool Space Accounting
  • Pool space accounting changed in b128, along with
    deduplication
  • Compression, deduplication, and raidz complicate pool
    accounting (the numbers are correct, the interpretation is
    suspect)
  • Capacity planning for remaining free space can be challenging

      $ zpool list zwimming
      NAME     SIZE ALLOC   FREE     CAP      DEDUP   HEALTH   ALTROOT
      zwimming 100G 43.9G 56.1G      43%      1.00x   ONLINE   -




ZFS Tutorial                       USENIX LISA’11                        151
zfs vs zpool Space Accounting
  • zfs list != zpool list
  • zfs list shows space used by the dataset plus space for
    internal accounting
  • zpool list shows physical space available to the pool
  • For simple pools and mirrors, they are nearly the same
  • For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space
    available for parity




                  Users will be confused about reported space available

ZFS Tutorial                         USENIX LISA’11                       152
NexentaStor Snapshot Services




ZFS Tutorial               USENIX LISA’11      153
Accessing Snapshots
  • By default, snapshots are accessible in .zfs directory
  • Visibility of .zfs directory is tunable via snapdir property
         ✦     Don't really want find to find the .zfs directory
  • Windows CIFS clients can see snapshots as Shadow Copies
        for Shared Folders (VSS)

                # zfs snapshot rpool/export/home/relling@20090415
                # ls -a /export/home/relling
                …
                .Xsession
                .xsession-errors
                # ls /export/home/relling/.zfs
                shares    snapshot
                # ls /export/home/relling/.zfs/snapshot
                20090415
                # ls /export/home/relling/.zfs/snapshot/20090415
                Desktop Documents Downloads Public




ZFS Tutorial                           USENIX LISA’11               154
Time Slider - Automatic Snapshots
  • Solaris feature similar to OSX's Time Machine
  • SMF service for managing snapshots
  • SMF properties used to specify policies: frequency (interval)
    and number to keep
  • Creates cron jobs
  • GUI tool makes it easy to select individual file systems
  • Tip: take additional snapshots for important milestones to
    avoid automatic snapshot deletion
                     Service Name          Interval (default)   Keep (default)
                 auto-snapshot:frequent         15 minutes            4
                 auto-snapshot:hourly              1 hour            24
                  auto-snapshot:daily                1 day           31
                 auto-snapshot:weekly               7 days            4
                 auto-snapshot:monthly            1 month            12
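
       A sketch of managing these services with SMF (the abbreviated FMRIs and
       this workflow assume the OpenSolaris implementation):

                # svcs -a | grep auto-snapshot         # list the snapshot services
                # svcadm enable auto-snapshot:daily
                # svcadm disable auto-snapshot:monthly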
ZFS Tutorial                              USENIX LISA’11                         155
Nautilus
  • File system views which can go back in time




ZFS Tutorial                 USENIX LISA’11       156
Resilver & Scrub
  • Can be read IOPS bound
  • Resilver can also be bandwidth bound to the resilvering device
  • Both work at lower I/O scheduling priority than normal
    work, but that may not matter for read IOPS bound devices
  • Dueling RFEs:
         ✦     Resilver should go faster
         ✦     Resilver should go slower
               ✤   Integrated in b140




ZFS Tutorial                            USENIX LISA’11          157
Time-based Resilvering
  • Block pointers contain birth txg number
  • Resilvering begins with oldest blocks first
  • Interrupted resilver will still result in a valid file system view

       [Figure: block-pointer tree with blocks labeled by birth txg = 27, 68, 73]




ZFS Tutorial                    USENIX LISA’11                               158
ACL – Access Control List
  •     Based on NFSv4 ACLs
  •     Similar to Windows NT ACLs
  •     Works well with CIFS services
  •     Supports ACL inheritance
  •     Change using chmod
  •     View using ls
  •     Some changes in b146 to make behaviour more consistent




ZFS Tutorial                   USENIX LISA’11                    159
Checksums for Data
  • DVA contains 256 bits for checksum
  • Checksum is in the parent, not in the block itself
  • Types
         ✦     none
         ✦     fletcher2: truncated 2nd order Fletcher-like algorithm
         ✦     fletcher4: 4th order Fletcher-like algorithm
         ✦     SHA-256
  • There are open proposals for better algorithms
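
       For example, selecting a stronger checksum for a dataset (name
       hypothetical); the new setting applies to blocks written afterwards:

                # zfs set checksum=sha256 tank/important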




ZFS Tutorial                           USENIX LISA’11                  160
Checksum Use

                  Pool          Algorithm                        Notes
               Uberblock        SHA-256                     self-checksummed
                Metadata        fletcher4
                 Labels         SHA-256
               Gang block       SHA-256                     self-checksummed

                 Dataset        Algorithm                       Notes
                 Metadata       fletcher4
                 Data           fletcher2 → fletcher4 (b114)    zfs checksum parameter
                 ZIL log        fletcher2 → fletcher4 (b135)    self-checksummed
                 Send stream    fletcher4
                 Note: ZIL log has additional checking beyond the checksum
ZFS Tutorial                               USENIX LISA’11                       161
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a

More Related Content

What's hot

Page reclaim
Page reclaimPage reclaim
Page reclaimsiburu
 
Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링JANGWONSEO4
 
ZFS Workshop
ZFS WorkshopZFS Workshop
ZFS WorkshopAPNIC
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Altinity Ltd
 
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanPerformance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanCeph Community
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
 
Evolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesEvolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesMydbops
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
 
Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...HostedbyConfluent
 
Startup Snapshot in Node.js
Startup Snapshot in Node.jsStartup Snapshot in Node.js
Startup Snapshot in Node.jsIgalia
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoDatabricks
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayAltinity Ltd
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Databricks
 
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, CloudflareClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, CloudflareAltinity Ltd
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike ArchitecturePeter Milne
 

What's hot (20)

Page reclaim
Page reclaimPage reclaim
Page reclaim
 
Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링
 
ZFS Workshop
ZFS WorkshopZFS Workshop
ZFS Workshop
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
 
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanPerformance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
 
HDFS Overview
HDFS OverviewHDFS Overview
HDFS Overview
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Evolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesEvolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best Practices
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides
 
Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...
 
Designing data intensive applications
Designing data intensive applicationsDesigning data intensive applications
Designing data intensive applications
 
Startup Snapshot in Node.js
Startup Snapshot in Node.jsStartup Snapshot in Node.js
Startup Snapshot in Node.js
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
 
Storage
StorageStorage
Storage
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, CloudflareClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike Architecture
 

Similar to USENIX LISA11 Tutorial: ZFS a

ZFS Tutorial USENIX June 2009
ZFS  Tutorial  USENIX June 2009ZFS  Tutorial  USENIX June 2009
ZFS Tutorial USENIX June 2009Richard Elling
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualizationSisimon Soman
 
Storage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkStorage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkSisimon Soman
 
Vm13 vnx mixed workloads
Vm13 vnx mixed workloadsVm13 vnx mixed workloads
Vm13 vnx mixed workloadspittmantony
 
ZFS for Databases
ZFS for DatabasesZFS for Databases
ZFS for Databasesahl0003
 
제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustreTommy Lee
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfsRami Jebara
 
Benefits of NexentaStor 3.0 in a Virtualized Enviroment
Benefits of NexentaStor 3.0 in a Virtualized EnviromentBenefits of NexentaStor 3.0 in a Virtualized Enviroment
Benefits of NexentaStor 3.0 in a Virtualized Enviromentcloudcampghent
 
Lustre+ZFS:Reliable/Scalable Storage
Lustre+ZFS:Reliable/Scalable StorageLustre+ZFS:Reliable/Scalable Storage
Lustre+ZFS:Reliable/Scalable StorageElizabeth Ciabattari
 
Sharing experience implementing Direct NFS
Sharing experience implementing Direct NFSSharing experience implementing Direct NFS
Sharing experience implementing Direct NFSYury Velikanov
 
Gluster fs buero20_presentation
Gluster fs buero20_presentationGluster fs buero20_presentation
Gluster fs buero20_presentationMartin Alfke
 
Cloud storage slides
Cloud storage slidesCloud storage slides
Cloud storage slidesEvan Powell
 
pnfs status
pnfs statuspnfs status
pnfs statusbergwolf
 

Similar to USENIX LISA11 Tutorial: ZFS a (20)

ZFS Tutorial USENIX June 2009
ZFS  Tutorial  USENIX June 2009ZFS  Tutorial  USENIX June 2009
ZFS Tutorial USENIX June 2009
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualization
 
Storage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talkStorage virtualization citrix blr wide tech talk
Storage virtualization citrix blr wide tech talk
 
Vm13 vnx mixed workloads
Vm13 vnx mixed workloadsVm13 vnx mixed workloads
Vm13 vnx mixed workloads
 
ZFS for Databases
ZFS for DatabasesZFS for Databases
ZFS for Databases
 
제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre
 
OpenZFS at LinuxCon
OpenZFS at LinuxConOpenZFS at LinuxCon
OpenZFS at LinuxCon
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfs
 
Benefits of NexentaStor 3.0 in a Virtualized Enviroment
Benefits of NexentaStor 3.0 in a Virtualized EnviromentBenefits of NexentaStor 3.0 in a Virtualized Enviroment
Benefits of NexentaStor 3.0 in a Virtualized Enviroment
 
Lustre+ZFS:Reliable/Scalable Storage
Lustre+ZFS:Reliable/Scalable StorageLustre+ZFS:Reliable/Scalable Storage
Lustre+ZFS:Reliable/Scalable Storage
 
Inexpensive storage
Inexpensive storageInexpensive storage
Inexpensive storage
 
Sharing experience implementing Direct NFS
Sharing experience implementing Direct NFSSharing experience implementing Direct NFS
Sharing experience implementing Direct NFS
 
Gluster fs buero20_presentation
Gluster fs buero20_presentationGluster fs buero20_presentation
Gluster fs buero20_presentation
 
Pnfs
PnfsPnfs
Pnfs
 
SoNAS
SoNASSoNAS
SoNAS
 
Posscon2013
Posscon2013Posscon2013
Posscon2013
 
AFS introduction
AFS introductionAFS introduction
AFS introduction
 
Cloud storage slides
Cloud storage slidesCloud storage slides
Cloud storage slides
 
Zfs intro v2
Zfs intro v2Zfs intro v2
Zfs intro v2
 
pnfs status
pnfs statuspnfs status
pnfs status
 

Recently uploaded

COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataSafe Software
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 

Recently uploaded (20)

COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 

USENIX LISA11 Tutorial: ZFS a

  • 9. Overhead View of a Pool Pool File System Configuration Information Volume File System Volume Dataset ZFS Tutorial USENIX LISA’11 9
  • 10. Hybrid Storage Pool Adaptive Replacement Cache (ARC) in main memory plus three classes of pool devices: • separate intent log – write-optimized device (SSD); size 1 - 10 GByte; cost measured in write iops/$; used for sync writes; optimizes write latency; need more speed? stripe • Main Pool – HDDs; size large; cost measured in size/$; used for persistent storage; performance optimization is secondary; need more speed? more, faster devices • Level 2 ARC – read-optimized device (SSD); size big; cost measured in size/$; used as a read cache; optimizes read latency; need more speed? stripe ZFS Tutorial USENIX LISA’11 10
  • 11. Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? November 8, 2010 USENIX LISA’10 11
  • 12. Source Code Structure File system Mgmt Device Consumer Consumer libzfs Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration November 8, 2010 USENIX LISA’10 12
  • 13. Acronyms • ARC – Adaptive Replacement Cache • DMU – Data Management Unit • DSL – Dataset and Snapshot Layer • JNI – Java Native Interface • ZPL – ZFS POSIX Layer (traditional file system interface) • VDEV – Virtual Device • ZAP – ZFS Attribute Processor • ZIL – ZFS Intent Log • ZIO – ZFS I/O layer • Zvol – ZFS volume (raw/cooked block device interface) ZFS Tutorial USENIX LISA’11 13
  • 14. NexentaStor Rosetta Stone NexentaStor OpenSolaris/ZFS Volume Storage pool ZVol Volume Folder File system ZFS Tutorial USENIX LISA’11 14
  • 15. nvlists • name=value pairs • libnvpair(3LIB) • Allows ZFS capabilities to change without changing the physical on-disk format • Data stored is XDR encoded • A good thing, used often ZFS Tutorial USENIX LISA’11 15
  • 16. Versioning • Features can be added and identified by nvlist entries • Changes in pool or dataset versions do not change the physical on-disk format (!) ✦ does change nvlist parameters • Older versions can be used ✦ might see warning messages, but harmless • Available versions and features can be easily viewed ✦ zpool upgrade -v ✦ zfs upgrade -v • Online references (broken?) ✦ zpool: hub.opensolaris.org/bin/view/Community+Group+zfs/N ✦ zfs: hub.opensolaris.org/bin/view/Community+Group+zfs/N-1 Don't confuse zpool and zfs versions ZFS Tutorial USENIX LISA’11 16
  • 17. zpool Versions VER DESCRIPTION --- ------------------------------------------------ 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit support Continued... ZFS Tutorial USENIX LISA’11 17
  • 18. More zpool Versions VER DESCRIPTION --- ------------------------------------------------ 15 user/group space accounting 16 stmf property support 17 Triple-parity RAID-Z 18 snapshot user holds 19 Log device removal 20 Compression using zle (zero-length encoding) 21 Deduplication 22 Received properties 23 Slim ZIL 24 System attributes 25 Improved scrub stats 26 Improved snapshot deletion performance 27 Improved snapshot creation performance 28 Multiple vdev replacements For Solaris 10, version 21 is “reserved” ZFS Tutorial USENIX LISA’11 18
  • 19. zfs Versions VER DESCRIPTION ---------------------------------------------- 1 Initial ZFS filesystem version 2 Enhanced directory entries 3 Case insensitive and File system unique identifier (FUID) 4 userquota, groupquota properties 5 System attributes ZFS Tutorial USENIX LISA’11 19
  • 20. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free ZFS Tutorial USENIX LISA’11 20
  • 21. COW Notes • COW works on blocks, not files • ZFS reserves 32 MBytes or 1/64 of pool size ✦ COWs need some free space to remove files ✦ need space for ZIL • For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched • Spatial distribution is good fodder for performance speculation ✦ affects HDDs ✦ moot for SSDs ZFS Tutorial USENIX LISA’11 21
  • 22. To fsck or not to fsck • fsck was created to fix known inconsistencies in file system metadata ✦ UFS is not transactional ✦ metadata inconsistencies must be reconciled ✦ does NOT repair data – how could it? • ZFS doesn't need fsck, as-is ✦ all on-disk changes are transactional ✦ COW means previously existing, consistent metadata is not overwritten ✦ ZFS can repair itself ✤ metadata is at least dual-redundant ✤ data can also be redundant • Reality check – this does not mean that ZFS is not susceptible to corruption ✦ nor is any other file system ZFS Tutorial USENIX LISA’11 22
  • 23. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? November 8, 2010 USENIX LISA’10 23
  • 24. vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type = disk type = disk type = disk type = disk children[0] children[0] children[0] children[0] Physical or leaf vdevs ZFS Tutorial USENIX LISA’11 24
  • 25. vdev Labels • vdev labels != disk labels • Four 256 kByte labels written to every physical vdev • Two-stage update process ✦ write label0 & label2 ✦ flush cache & check for errors ✦ write label1 & label3 ✦ flush cache & check for errors • Layout (N = device size rounded down to a 256k multiple, M = 128k / MAX(1k, sector size)): label0 at 0, label1 at 256k, boot block from 512k to 4M, label2 at N-512k, label3 at N-256k • Each label contains: blank (0 - 8k), boot header (8k - 16k), name=value pairs (16k - 128k), M-slot uberblock array (128k - 256k) ZFS Tutorial USENIX LISA’11 25
  • 26. Observing Labels # zdb -l /dev/rdsk/c0t0d0s0 -------------------------------------------- LABEL 0 -------------------------------------------- version=14 name='rpool' state=0 txg=13152 pool_guid=17111649328928073943 hostid=8781271 hostname='' top_guid=11960061581853893368 guid=11960061581853893368 vdev_tree type='disk' id=0 guid=11960061581853893368 path='/dev/dsk/c0t0d0s0' devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a' phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a' whole_disk=0 metaslab_array=24 metaslab_shift=30 ashift=9 asize=157945167872 is_log=0 ZFS Tutorial USENIX LISA’11 26
  • 27. Uberblocks • Sized based on minimum device block size • Stored in 128-entry circular queue • Only one uberblock is active at any time ✦ highest transaction group number ✦ correct SHA-256 checksum • Stored in machine's native format ✦ A magic number is used to determine endian format when imported • Contains pointer to Meta Object Set (MOS) • Device block size → uberblock size → queue entries: 512 Bytes or 1 KB → 1 KB → 128; 2 KB → 2 KB → 64; 4 KB → 4 KB → 32 ZFS Tutorial USENIX LISA’11 27
  • 28. About Sizes • Sizes are dynamic • LSIZE = logical size • PSIZE = physical size after compression • ASIZE = allocated size including: ✦ physical size ✦ raidz parity ✦ gang blocks Old notions of size reporting confuse people ZFS Tutorial USENIX LISA’11 28
  • 29. VDEV ZFS Tutorial USENIX LISA’11 29
  • 30. Dynamic Striping • RAID-0 ✦ SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern • Dynamic Stripe ✦ Data is dynamically mapped to member disks ✦ No fixed-length sequences ✦ Allocate up to ~1 MByte/vdev before changing vdev ✦ vdevs can be different size ✦ Good combination of the concatenation feature with RAID-0 performance ZFS Tutorial USENIX LISA’11 30
  • 31. Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes 384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes ZFS Tutorial USENIX LISA’11 31
  • 32. Mirroring • Straightforward: put N copies of the data on N vdevs • Unlike RAID-1 ✦ No 1:1 mapping at the block level ✦ vdev labels are still at beginning and end ✦ vdevs can be of different size ✤ effective space is that of smallest vdev • Arbitration: ZFS does not blindly trust either side of mirror ✦ Most recent, correct view of data wins ✦ Checksums validate data ZFS Tutorial USENIX LISA’11 32
  • 33. Dynamic vdev Replacement • zpool replace poolname vdev [vdev] • Today, replacing vdev must be same size or larger ✦ NexentaStor 2 ‒ as measured by blocks ✦ NexentaStor 3 ‒ as measured by metaslabs • Replacing all vdevs in a top-level vdev with larger vdevs results in top-level vdev resizing • Expansion policy controlled by: ✦ NexentaStor 2 ‒ resize on import ✦ NexentaStor 3 ‒ zpool autoexpand property 15G 10G 10G 15G 10G 20G 15G 20G 10G 15G 20G 20G 20G 20G 10G 10G Mirror 10G Mirror 15G Mirror 15G Mirror 20G Mirror ZFS Tutorial USENIX LISA’11 33
  • 34. RAIDZ • RAID-5 ✦ Parity check data is distributed across the RAID array's disks ✦ Must read/modify/write when data is smaller than stripe width • RAIDZ ✦ Dynamic data placement ✦ Parity added as needed ✦ Writes are full-stripe writes ✦ No read/modify/write (write hole) • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If checksum fails, read parity Space used is dependent on how used ZFS Tutorial USENIX LISA’11 34
  • 35. RAID-5 vs RAIDZ DiskA DiskB DiskC DiskD DiskE D0:0 D0:1 D0:2 D0:3 P0 RAID-5 P1 D1:0 D1:1 D1:2 D1:3 D2:3 P2 D2:0 D2:1 D2:2 D3:2 D3:3 P3 D3:0 D3:1 DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 RAIDZ P1 D1:0 D1:1 P2:0 D2:0 D2:1 D2:2 D2:3 P2:1 D2:4 D2:5 Gap P3 D3:0 ZFS Tutorial USENIX LISA’11 35
  • 36. RAIDZ and Block Size If block size >> N * sector size, space consumption is like RAID-5 If block size = sector size, space consumption is like mirroring PSIZE=2KB ASIZE=2.5KB DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 P1 D1:0 D1:1 P2:0 D2:0 PSIZE=1KB D2:1 D2:2 D2:3 P2:1 D2:4 ASIZE=1.5KB D2:5 Gap P3 D3:0 PSIZE=3KB PSIZE=512 bytes ASIZE=4KB + Gap ASIZE=1KB Sector size = 512 bytes Sector size can impact space savings ZFS Tutorial USENIX LISA’11 36
  • 37. RAID-5 Write Hole • Occurs when data to be written is smaller than stripe size • Must read unallocated columns to recalculate the parity or the parity must be read/modify/write • Read/modify/write is risky for consistency ✦ Multiple disks ✦ Reading independently ✦ Writing independently ✦ System failure before all writes are complete to media could result in data loss • Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disks ZFS Tutorial USENIX LISA’11 37
  • 38. RAIDZ2 and RAIDZ3 • RAIDZ2 = double parity RAIDZ • RAIDZ3 = triple parity RAIDZ • Sorta like RAID-6 ✦ Parity 1: XOR ✦ Parity 2: another Reed-Solomon syndrome ✦ Parity 3: yet another Reed-Solomon syndrome • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If data not valid, read parity ✦ If data still not valid, read other parity Space used is dependent on how used ZFS Tutorial USENIX LISA’11 38
  • 39. Evaluating Data Retention • MTTDL = Mean Time To Data Loss • Note: MTBF is not constant in the real world, but keeps math simple • MTTDL[1] is a simple MTTDL model • No parity (single vdev, striping, RAID-0) ✦ MTTDL[1] = MTBF / N • Single Parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[1] = MTBF² / (N * (N-1) * MTTR) • Double Parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[1] = MTBF³ / (N * (N-1) * (N-2) * MTTR²) • Triple Parity (4-way mirror, RAIDZ3) ✦ MTTDL[1] = MTBF⁴ / (N * (N-1) * (N-2) * (N-3) * MTTR³) ZFS Tutorial USENIX LISA’11 39
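  A quick worked example (the numbers here are hypothetical, not from the slides): for an 8-disk single-parity RAIDZ with MTBF = 1,000,000 hours and MTTR = 24 hours,
  MTTDL[1] = MTBF² / (N * (N-1) * MTTR) = (10⁶)² / (8 * 7 * 24) ≈ 7.4 * 10⁸ hours ≈ 85,000 years
  Because MTTR appears in the denominator, halving the repair time (hot spares, faster resilver) roughly doubles MTTDL[1] for single parity.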
  • 40. Another MTTDL Model • MTTDL[1] model doesn't take unrecoverable reads into account • But unrecoverable reads (UER) are becoming the dominant failure mode ✦ UER specified as errors per bits read ✦ More bits = higher probability of loss per vdev • MTTDL[2] model considers UER ZFS Tutorial USENIX LISA’11 40
  • 41. Why Worry about UER? • Richard's study ✦ 3,684 hosts with 12,204 LUNs ✦ 11.5% of all LUNs reported read errors • Bairavasundaram et al., FAST08 www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf ✦ 1.53M LUNs over 41 months ✦ RAID reconstruction discovers 8% of checksum mismatches ✦ “For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined” ZFS Tutorial USENIX LISA’11 41
  • 42. Why Worry about UER? • RAID array study ZFS Tutorial USENIX LISA’11 42
  • 43. Why Worry about UER? • RAID array study (chart comparing unrecoverable reads against disk disappearance, i.e. “disk pull”) • “Disk pull” tests aren’t very useful ZFS Tutorial USENIX LISA’11 43
  • 44. MTTDL[2] Model • Probability that a reconstruction will fail ✦ Precon_fail = (N-1) * size / UER • Model doesn't work for non-parity schemes ✦ single vdev, striping, RAID-0 • Single Parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[2] = MTBF / (N * Precon_fail) • Double Parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[2] = MTBF² / (N * (N-1) * MTTR * Precon_fail) • Triple Parity (4-way mirror, RAIDZ3) ✦ MTTDL[2] = MTBF³ / (N * (N-1) * (N-2) * MTTR² * Precon_fail) ZFS Tutorial USENIX LISA’11 44
  • 45. Practical View of MTTDL[1] ZFS Tutorial USENIX LISA’11 45
  • 46. MTTDL[1] Comparison ZFS Tutorial USENIX LISA’11 46
  • 47. MTTDL Models: Mirror Spares are not always better... ZFS Tutorial USENIX LISA’11 47
  • 48. MTTDL Models: RAIDZ2 ZFS Tutorial USENIX LISA’11 48
  • 49. Space, Dependability, and Performance ZFS Tutorial USENIX LISA’11 49
  • 50. Dependability Use Case • Customer has 15+ TB of read-mostly data • 16-slot, 3.5” drive chassis • 2 TB HDDs • Option 1: one raidz2 set ✦ 24 TB available space ✤ 12 data ✤ 2 parity ✤ 2 hot spares, 48 hour disk replacement time ✦ MTTDL[1] = 1,790,000 years • Option 2: two raidz2 sets ✦ 24 TB available space ✤ 6 data ✤ 2 parity (each set) ✤ no hot spares ✦ MTTDL[1] = 7,450,000 years ZFS Tutorial USENIX LISA’11 50
  • 51. Ditto Blocks • Recall that each blkptr_t contains 3 DVAs • Dataset property used to indicate how many copies (aka ditto blocks) of data are desired ✦ Write all copies ✦ Read any copy ✦ Recover corrupted read from a copy • Not a replacement for mirroring ✦ For a single disk, can handle data loss in approximately 1/8 of contiguous space • Easier to describe in pictures... • copies parameter → data copies / metadata copies: copies=1 (default) → 1 / 2; copies=2 → 2 / 3; copies=3 → 3 / 3 ZFS Tutorial USENIX LISA’11 51
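  A minimal example of requesting ditto copies on a hypothetical dataset (the pool and dataset names are made up; only newly written data gets the extra copies):
  # zfs set copies=2 tank/important
  # zfs get copies tank/important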
  • 52. Copies in Pictures November 8, 2010 USENIX LISA’10 52
  • 53. Copies in Pictures ZFS Tutorial USENIX LISA’11 53
  • 54. When Good Data Goes Bad (diagram) • The file system does a bad read • If it’s a metadata block, the FS panics • Or we get back bad data and can not tell ✦ a disk rebuild does not help ZFS Tutorial USENIX LISA’11 54
  • 55. Checksum Verification ZFS verifies checksums for every read Repairs data when possible (mirror, raidz, copies>1) Read bad data Read good data Repair bad data ZFS Tutorial USENIX LISA’11 55
  • 56. ZIO - ZFS I/O Layer 56
  • 57. ZIO Framework • All physical disk I/O goes through ZIO Framework • Translates DVAs into Logical Block Address (LBA) on leaf vdevs ✦ Keeps free space maps (spacemap) ✦ If contiguous space is not available: ✤ Allocate smaller blocks (the gang) ✤ Allocate gang block, pointing to the gang • Implemented as multi-stage pipeline ✦ Allows extensions to be added fairly easily • Handles I/O errors ZFS Tutorial USENIX LISA’11 57
  • 58. ZIO Write Pipeline ZIO State Compression Checksum DVA vdev I/O open compress if savings > 12.5% generate allocate start start start done done done assess assess assess done Gang and deduplication activity elided for clarity ZFS Tutorial USENIX LISA’11 58
  • 59. ZIO Read Pipeline ZIO State Compression Checksum DVA vdev I/O open start start start done done done assess assess assess verify decompress done Gang and deduplication activity elided for clarity ZFS Tutorial USENIX LISA’11 59
  • 60. VDEV – Virtual Device Subsystem • Where mirrors, RAIDZ, and RAIDZ2 are implemented ✦ Surprisingly few lines of code needed to implement RAID • Leaf vdev (physical device) I/O management ✦ Number of outstanding iops ✦ Read-ahead cache • Priority scheduling: NOW 0, SYNC_READ 0, SYNC_WRITE 0, FREE 0, CACHE_FILL 0, LOG_WRITE 0, ASYNC_READ 4, ASYNC_WRITE 4, RESILVER 10, SCRUB 20 ZFS Tutorial USENIX LISA’11 60
  • 62. Object Cache • UFS uses page cache managed by the virtual memory system • ZFS does not use the page cache, except for mmap'ed files • ZFS uses an Adaptive Replacement Cache (ARC) • ARC used by DMU to cache DVA data objects • Only one ARC per system, but caching policy can be changed on a per-dataset basis • Seems to work much better than page cache ever did for UFS ZFS Tutorial USENIX LISA’11 62
  • 63. Traditional Cache • Works well when data being accessed was recently added • Doesn't work so well when frequently accessed data is evicted • (diagram) Misses cause an insert at the MRU end of the cache; the oldest entry is evicted from the LRU end; dynamic caches can change size by either not evicting or aggressively evicting ZFS Tutorial USENIX LISA’11 63
  • 64. ARC – Adaptive Replacement Cache Evict the oldest single-use entry LRU Recent Cache Miss MRU Evictions and dynamic MFU size resizing needs to choose best Hit cache to evict (shrink) Frequent Cache LFU Evict the oldest multiple accessed entry ZFS Tutorial USENIX LISA’11 64
  • 65. ARC with Locked Pages Evict the oldest single-use entry Cannot evict LRU locked pages! Recent Cache Miss MRU MFU size Hit Frequent If hit occurs Cache within 62 ms LFU Evict the oldest multiple accessed entry ZFS ARC handles mixed-size pages ZFS Tutorial USENIX LISA’11 65
  • 66. L2ARC – Level 2 ARC • Data soon to be evicted from the ARC is added to a queue to be sent to the cache vdev ✦ Another thread sends the queue to the cache vdev ✦ Data is copied to the cache vdev with a throttle to limit bandwidth consumption ✦ Under heavy memory pressure, not all evictions will arrive in the cache vdev • ARC directory remains in memory • Good idea - optimize cache vdev for fast reads ✦ lower latency than pool disks ✦ inexpensive way to “increase memory” • Content considered volatile, no raid needed • Monitor usage with zpool iostat and ARC kstats ZFS Tutorial USENIX LISA’11 66
  • 67. ARC Directory • Each ARC directory entry contains arc_buf_hdr structs ✦ Info about the entry ✦ Pointer to the entry • Directory entries have size, ~200 bytes • ZFS block size is dynamic, sector size to 128 kBytes • Disks are large • Suppose we use a Seagate LP 2 TByte disk for the L2ARC ✦ Disk has 3,907,029,168 512 byte sectors, guaranteed ✦ Workload uses 8 kByte fixed record size ✦ RAM needed for arc_buf_hdr entries ✤ Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes • Don't underestimate the RAM needed for large L2ARCs ZFS Tutorial USENIX LISA’11 67
  • 68. ARC Tips • In general, it seems to work well for most workloads • ARC size will vary, based on usage ✦ Default target max is 7/8 of physical memory or (memory - 1 GByte) ✦ Target min is 64 MB ✦ Metadata capped at 1/4 of max ARC size • Dynamic size can be reduced when: ✦ page scanner is running ✤ freemem < lotsfree + needfree + desfree ✦ swapfs does not have enough space so that anonymous reservations can succeed ✤ availrmem < swapfs_minfree + swapfs_reserve + desfree ✦ [x86 only] kernel heap space more than 75% full • Can limit at boot time ZFS Tutorial USENIX LISA’11 68
  • 69. Observing ARC • ARC statistics stored in kstats • kstat -n arcstats • Interesting statistics: ✦ size = current ARC size ✦ p = size of MFU cache ✦ c = target ARC size ✦ c_max = maximum target ARC size ✦ c_min = minimum target ARC size ✦ l2_hdr_size = space used in ARC by L2ARC ✦ l2_size = size of data in L2ARC ZFS Tutorial USENIX LISA’11 69
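  For example, the kstats can be pulled in script-friendly form; the zfs:0:arcstats names below assume the usual Solaris-derived kstat layout:
  # kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
  # kstat -n arcstats | grep l2_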
  • 70. General Status - ARC ZFS Tutorial USENIX LISA’11 70
  • 71. More ARC Tips • Performance ✦ Prior to b107, L2ARC fill rate was limited to 8 MB/sec ✦ After b107, cold L2ARC fill rate increases to 16 MB/sec • Internals tracked by kstats in Solaris ✦ Use memory_throttle_count to observe pressure to evict • Dedup Table (DDT) also uses ARC ✦ lots of dedup objects need lots of RAM ✦ field reports that L2ARC can help with dedup L2ARC keeps its directory in kernel memory ZFS Tutorial USENIX LISA’11 71
  • 73. Source Code Structure File system Mgmt Device Consumer Consumer libzfs Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration November 8, 2010 USENIX LISA’10 73
  • 74. Transaction Engine • Manages physical I/O • Transactions grouped into transaction group (txg) ✦ txg updates ✦ All-or-nothing ✦ Commit interval ✤ Older versions: 5 seconds ✤ Less old versions: 30 seconds ✤ b143 and later: 5 seconds • Delay committing data to physical storage ✦ Improves performance ✦ A bad thing for sync workload performance – hence the ZFS Intent Log (ZIL) 30 second delay can impact failure detection time ZFS Tutorial USENIX LISA’11 74
  • 75. ZIL – ZFS Intent Log • DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers ✦ NFS ✦ Databases • ZIL recordsize inflation can occur for some workloads ✦ May cause larger than expected actual I/O for sync workloads ✦ Oracle redo logs ✦ No slog: can tune zfs_immediate_write_sz, zvol_immediate_write_sz ✦ With slog: use logbias property instead • Never read, except at import (eg reboot), when transactions may need to be rolled forward ZFS Tutorial USENIX LISA’11 75
  • 76. Separate Logs (slogs) • ZIL competes with pool for IOPS ✦ Applications wait for sync writes to be on nonvolatile media ✦ Very noticeable on HDD JBODs • Put ZIL on separate vdev, outside of pool ✦ ZIL writes tend to be sequential ✦ No competition with pool for IOPS ✦ Downside: slog device required to be operational at import ✦ NexentaStor 3 allows slog device removal ✦ Size of separate log < than size of RAM (duh) • 10x or more performance improvements possible ✦ Nonvolatile RAM card ✦ Write-optimized SSD ✦ Nonvolatile write cache on RAID array ZFS Tutorial USENIX LISA’11 76
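  As a sketch, adding a mirrored slog to an existing pool (pool and device names are hypothetical); mirroring the slog is common because, as noted above, the log device must be present at import on older releases:
  # zpool add tank log mirror c4t0d0 c4t1d0
  # zpool iostat -v tank 10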
  • 77. zilstat • http://www.richardelling.com/Home/scripts-and-programs-1/ zilstat • Integrated into NexentaStor 3.0.3 ✦ nmc: show performance zil ZFS Tutorial USENIX LISA’11 77
  • 78. Synchronous Write Destination • Without a separate log: is the sync I/O size > zfs_immediate_write_sz? no → ZIL (log blocks in the pool); yes → bypass, write directly to the pool • With a separate log: logbias=latency (default) → log device; logbias=throughput → bypass to the pool • Default zfs_immediate_write_sz = 32 kBytes ZFS Tutorial USENIX LISA’11 78
  • 79. ZIL Synchronicity Project • All-or-nothing policies don’t work well, in general • ZIL Synchronicity project proposed by Robert Milkowski ✦ http://milek.blogspot.com • Adds new sync property to datasets • Arrived in b140 • sync parameter behaviour: standard (default) – policy follows the previous design (immediate write size and separate logs); always – all writes become synchronous (slow); disabled – synchronous write requests are ignored ZFS Tutorial USENIX LISA’11 79
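  A short example of the per-dataset policy, assuming a hypothetical dataset and a release that has the sync property (b140 or later):
  # zfs get sync tank/db
  # zfs set sync=always tank/db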
  • 80. Disabling the ZIL • Preferred method: change dataset sync property • Rule 0: Don’t disable the ZIL • If you love your data, do not disable the ZIL • You can find references to this as a way to speed up ZFS ✦ NFS workloads ✦ “tar -x” benchmarks • Golden Rule: Don’t disable the ZIL • Can set via mdb, but need to remount the file system • Friends don’t let friends disable the ZIL • Older Solaris - can set in /etc/system • NexentaStor has checkbox for disabling ZIL • Nostradamus wrote, “disabling the ZIL will lead to the apocalypse” ZFS Tutorial USENIX LISA’11 80
  • 81. DSL - Dataset and Snapshot Layer 81
  • 82. Dataset & Snapshot Layer • Object ✦ Allocated storage ✦ dnode describes collection of blocks • Object Set ✦ Group of related objects • Dataset ✦ Snapmap: snapshot relationships ✦ Space usage • Dataset directory ✦ Childmap: dataset relationships ✦ Properties ZFS Tutorial USENIX LISA’11 82
  • 83. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free ZFS Tutorial USENIX LISA’11 83
  • 84. zfs snapshot • Create a read-only, point-in-time window into the dataset (file system or Zvol) • Computationally free, because of COW architecture • Very handy feature ✦ Patching/upgrades • Basis for time-related snapshot interfaces ✦ Solaris Time Slider ✦ NexentaStor Delorean Plugin ✦ NexentaStor Virtual Machine Data Center ZFS Tutorial USENIX LISA’11 84
  • 85. Snapshot • Create a snapshot by not free'ing COWed blocks • Snapshot creation is fast and easy • Number of snapshots determined by use – no hardwired limit • Recursive snapshots also possible Snapshot tree Current tree root root ZFS Tutorial USENIX LISA’11 85
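  For instance, a recursive snapshot of a hypothetical tank/home and its children, then listing the results:
  # zfs snapshot -r tank/home@2011-12-04
  # zfs list -t snapshot -r tank/home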
  • 86. auto-snap service ZFS Tutorial USENIX LISA’11 86
  • 87. Clones • Snapshots are read-only • Clones are read-write based upon a snapshot • Child depends on parent ✦ Cannot destroy parent without destroying all children ✦ Can promote children to be parents • Good ideas ✦ OS upgrades ✦ Change control ✦ Replication ✤ zones ✤ virtual disks ZFS Tutorial USENIX LISA’11 87
  • 88. zfs clone • Create a read-write file system from a read-only snapshot • Solaris boot environment administration (diagram: install → checkpoint → clone → checkpoint; rootfs-nmu-001 is cloned and patched/upgraded to rootfs-nmu-002, selected via the grub boot manager) • Origin snapshot cannot be destroyed, if clone exists ZFS Tutorial USENIX LISA’11 88
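  A sketch of the clone workflow with hypothetical names; zfs promote reverses the parent/child dependency so the original file system can later be destroyed:
  # zfs snapshot tank/ws@stable
  # zfs clone tank/ws@stable tank/ws-test
  # zfs promote tank/ws-test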
  • 90. What is Deduplication? • A $2.1 Billion feature • 2009 buzzword of the year • Technique for improving storage space efficiency ✦ Trades big I/Os for small I/Os ✦ Does not eliminate I/O • Implementation styles ✦ offline or post processing ✤ data written to nonvolatile storage ✤ process comes along later and dedupes data ✤ example: tape archive dedup ✦ inline ✤ data is deduped as it is being allocated to nonvolatile storage ✤ example: ZFS ZFS Tutorial USENIX LISA’11 90
  • 91. Dedup how-to • Given a bunch of data • Find data that is duplicated • Build a lookup table of references to data • Replace duplicate data with a pointer to the entry in the lookup table • Granularity ✦ file ✦ block ✦ byte ZFS Tutorial USENIX LISA’11 91
  • 92. Dedup in ZFS • Leverage block-level checksums ✦ Identify blocks which might be duplicates ✦ Variable block size is ok • Synchronous implementation ✦ Data is deduped as it is being written • Scalable design ✦ No reference count limits • Works with existing features ✦ compression ✦ copies ✦ scrub ✦ resilver • Implemented in ZIO pipeline ZFS Tutorial USENIX LISA’11 92
  • 93. Deduplication Table (DDT) • Internal implementation ✦ Adelson-Velskii, Landis (AVL) tree ✦ Typical table entry ~270 bytes ✤ checksum ✤ logical size ✤ physical size ✤ references ✦ Table entry size increases as the number of references increases ZFS Tutorial USENIX LISA’11 93
  • 94. Reference Counts Eggs courtesy of Richard’s chickens ZFS Tutorial USENIX LISA’11 94
  • 95. Reference Counts • Problem: loss of the referenced data affects all referrers • Solution: make additional copies of referred data based upon a threshold count of referrers ✦ leverage copies (ditto blocks) ✦ pool-level threshold for automatically adding ditto copies ✤ set via dedupditto pool property # zpool set dedupditto=50 zwimming ✤ add 2nd copy when dedupditto references (50) reached ✤ add 3rd copy when dedupditto2 references (2500) reached ZFS Tutorial USENIX LISA’11 95
  • 96. Verification (flowchart) write() → compress → checksum → DDT entry lookup • no DDT match → new entry • DDT match, verify off → add reference • DDT match, verify on → read the existing data and compare: data matches → add reference; data does not match → new entry ZFS Tutorial USENIX LISA’11 96
  • 97. Enabling Dedup • Set dedup property for each dataset to be deduped • Remember: properties are inherited • Remember: only applies to newly written data • dedup setting → checksum / verify?: on or sha256 → SHA256, no verify; on,verify or sha256,verify → SHA256, with verify • Fletcher is considered too weak, without verify ZFS Tutorial USENIX LISA’11 97
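  For example, enabling verified dedup on a hypothetical dataset; the property is inherited by descendants and only affects data written afterwards:
  # zfs set dedup=sha256,verify tank/vmimages
  # zpool list tank
  The DEDUP column of zpool list shows the achieved ratio.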
  • 98. Dedup Accounting • ...and you thought compression accounting was hard... • Remember: dedup works at pool level ✦ dataset-level accounting doesn’t see other datasets ✦ pool-level accounting is always correct zfs list NAME USED AVAIL REFER MOUNTPOINT bar 7.56G 449G 22K /bar bar/ws 7.56G 449G 7.56G /bar/ws dozer 7.60G 455G 22K /dozer dozer/ws 7.56G 455G 7.56G /dozer/ws tank 4.31G 456G 22K /tank tank/ws 4.27G 456G 4.27G /tank/ws zpool list NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT bar 464G 7.56G 456G 1% 1.00x ONLINE - dozer 464G 1.43G 463G 0% 5.92x ONLINE - tank 464G 957M 463G 0% 5.39x ONLINE - Data courtesy of the ZFS team ZFS Tutorial USENIX LISA’11 98
  • 99. DDT Histogram # zdb -DD tank DDT-sha256-zap-duplicate: 110173 entries, size 295 on disk, 153 in core DDT-sha256-zap-unique: 302 entries, size 42194 on disk, 52827 in core DDT histogram (aggregated over all DDTs): bucket! allocated! referenced ______ ___________________________ ___________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 302 7.26M 4.24M 4.24M 302 7.26M 4.24M 4.24M 2 103K 1.12G 712M 712M 216K 2.64G 1.62G 1.62G 4 3.11K 30.0M 17.1M 17.1M 14.5K 168M 95.2M 95.2M 8 503 11.6M 6.16M 6.16M 4.83K 129M 68.9M 68.9M 16 100 4.22M 1.92M 1.92M 2.14K 101M 45.8M 45.8M ZFS Tutorial USENIX LISA’11 Data courtesy of the ZFS team 99
  • 100. DDT Histogram $ zdb -DD zwimming DDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in core DDT-sha256-zap-unique: 52369639 entries, size 284 on disk, 159 in core DDT histogram (aggregated over all DDTs): bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 49.9M 25.0G 25.0G 25.0G 49.9M 25.0G 25.0G 25.0G 2 16.7K 8.33M 8.33M 8.33M 33.5K 16.7M 16.7M 16.7M 4 610 305K 305K 305K 3.33K 1.66M 1.66M 1.66M 8 661 330K 330K 330K 6.67K 3.34M 3.34M 3.34M 16 242 121K 121K 121K 5.34K 2.67M 2.67M 2.67M 32 131 65.5K 65.5K 65.5K 5.54K 2.77M 2.77M 2.77M 64 897 448K 448K 448K 84K 42M 42M 42M 128 125 62.5K 62.5K 62.5K 18.0K 8.99M 8.99M 8.99M 8K 1 512 512 512 12.5K 6.27M 6.27M 6.27M Total 50.0M 25.0G 25.0G 25.0G 50.1M 25.1G 25.1G 25.1G dedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00 ZFS Tutorial USENIX LISA’11 100
  • 101. Over-the-wire Dedup • Dedup is also possible over the send/receive pipe ✦ Blocks with no checksum are considered duplicates (no verify option) ✦ First copy sent as usual ✦ Subsequent copies sent by reference • Independent of dedup status of originating pool ✦ Receiving pool knows about blocks which have already arrived • Can be a win for dedupable data, especially over slow wires • Remember: send/receive version rules still apply # zfs send -DR zwimming/stuff ZFS Tutorial USENIX LISA’11 101
  • 102. Dedup Performance • Dedup can save space and bandwidth • Dedup increases latency ✦ Caching data improves latency ✦ More memory → more data cached ✦ Cache performance hierarchy ✤ RAM: fastest ✤ L2ARC on SSD: slower ✤ Pool HDD: dreadfully slow • ARC is currently not deduped • Difficult to predict ✦ Dependent variable: number of blocks ✦ Estimate 270 bytes per unique block ✦ Example: ✤ 50M blocks * 270 bytes/block = 13.5 GBytes ZFS Tutorial USENIX LISA’11 102
  • 103. Deduplication Use Cases Data type Dedupe Compression Home directories ✔✔ ✔✔ Internet content ✔ ✔ Media and video ✔✔ ✔ Life sciences ✘ ✔✔ Oil and Gas (seismic) ✘ ✔✔ Virtual machines ✔✔ ✘ Archive ✔✔✔✔ ✔ ZFS Tutorial USENIX LISA’11 103
  • 105. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? November 8, 2010 USENIX LISA’10 105
  • 106. zpool create • zpool create poolname vdev-configuration • nmc: setup volume create ✦ vdev-configuration examples ✤ mirror c0t0d0 c3t6d0 ✤ mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 ✤ mirror disk1s0 disk2s0 cache disk4s0 log disk5 ✤ raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 • Solaris ✦ Additional checks for disk/slice overlaps or in use ✦ Whole disks are given EFI labels • Can set initial pool or dataset properties • By default, creates a file system with the same name ✦ poolname pool → /poolname file system People get confused by a file system with same name as the pool ZFS Tutorial USENIX LISA’11 106
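  A dry-run sketch with hypothetical disk names; -n prints the resulting layout without creating anything, and the same command without -n does the real work:
  # zpool create -n tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 log c0t4d0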
  • 107. zpool destroy • Destroy the pool and all datasets therein • zpool destroy poolname ✦ Can (try to) force with “-f” ✦ There is no “are you sure?” prompt – if you weren't sure, you would not have typed “destroy” • nmc: destroy volume volumename ✦ nmc prompts for confirmation, by default zpool destroy is destructive... really! Use with caution! ZFS Tutorial USENIX LISA’11 107
  • 108. zpool add • Adds a device to the pool as a top-level vdev • Does NOT add columns to a raidz set • Does NOT attach a mirror – use zpool attach instead • zpool add poolname vdev-configuration ✦ vdev-configuration can be any combination also used for zpool create ✦ Complains if the added vdev-configuration would cause a different data protection scheme than is already in use ✤ use “-f” to override ✦ Good idea: try with “-n” flag first ✤ will show final configuration without actually performing the add • nmc: setup volume volumename grow Do not add a device which is in use as a cluster quorum device ZFS Tutorial USENIX LISA’11 108
  • 109. zpool remove • Remove a top-level vdev from the pool • zpool remove poolname vdev • nmc: setup volume volumename remove-lun • Today, you can only remove the following vdevs: ✦ cache ✦ hot spare ✦ separate log (b124, NexentaStor 3.0) Don't confuse “remove” with “detach” ZFS Tutorial USENIX LISA’11 109
  • 110. zpool attach • Attach a vdev as a mirror to an existing vdev • zpool attach poolname existing-vdev vdev • nmc: setup volume volumename attach-lun • Attaching vdev must be the same size or larger than the existing vdev vdev Configurations ok simple vdev → mirror ok mirror ok log → mirrored log no RAIDZ no RAIDZ2 no RAIDZ3 ZFS Tutorial USENIX LISA’11 110
  • 111. zpool detach • Detach a vdev from a mirror • zpool detach poolname vdev • nmc: setup volume volumename detach-lun • A resilvering vdev will wait until resilvering is complete ZFS Tutorial USENIX LISA’11 111
  • 112. zpool replace • Replaces an existing vdev with a new vdev • zpool replace poolname existing-vdev vdev • nmc: setup volume volumename replace-lun • Effectively, a shorthand for “zpool attach” followed by “zpool detach” • Attaching vdev must be the same size or larger than the existing vdev • Works for any top-level vdev-configuration, including RAIDZ “Same size” literally means the same number of blocks until b117. Many “same size” disks have different number of available blocks. ZFS Tutorial USENIX LISA’11 112
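  For example, replacing a suspect disk in a hypothetical pool and then watching the resilver:
  # zpool replace tank c0t4d0 c0t5d0
  # zpool status tank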
  • 113. zpool import • Import a pool and mount all mountable datasets • Import a specific pool ✦ zpool import poolname ✦ zpool import GUID ✦ nmc: setup volume import • Scan LUNs for pools which may be imported ✦ zpool import • Can set options, such as alternate root directory or other properties ✦ alternate root directory important for rpool or syspool Beware of zpool.cache interactions Beware of artifacts, especially partial artifacts ZFS Tutorial USENIX LISA’11 113
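  A couple of hypothetical invocations; running a bare zpool import first shows what can be imported, and -R keeps a root pool from stepping on the running system:
  # zpool import
  # zpool import -R /a rpool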
  • 114. zpool export • Unmount datasets and export the pool • zpool export poolname • nmc: setup volume volumename export • Removes pool entry from zpool.cache ✦ useful when unimported pools remain in zpool.cache ZFS Tutorial USENIX LISA’11 114
  • 115. zpool upgrade • Display current versions ✦ zpool upgrade • View available upgrade versions, with features, but don't actually upgrade ✦ zpool upgrade -v • Upgrade pool to latest version ✦ zpool upgrade poolname ✦ nmc: setup volume volumename version- upgrade • Upgrade pool to specific version Once you upgrade, there is no downgrade Beware of grub and rollback issues ZFS Tutorial USENIX LISA’11 115
  • 116. zpool history • Show history of changes made to the pool • nmc and Solaris use same command # zpool history rpool History for 'rpool': 2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0 2009-03-04.07:29:47 zfs set canmount=noauto rpool 2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool 2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT 2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap 2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump 2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106 2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106 2009-03-04.07:29:51 zfs set canmount=on rpool 2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export 2009-03-04.07:29:51 zfs create rpool/export/home 2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943 2009-03-04.00:21:42 zpool export rpool 2009-03-04.08:47:08 zpool set bootfs=rpool rpool 2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108 2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108 ... ZFS Tutorial USENIX LISA’11 116
  • 117. zpool status • Shows the status of the current pools, including their configuration • Important troubleshooting step • nmc and Solaris use same command # zpool status … pool: zwimming state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM zwimming ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be tricky ZFS Tutorial USENIX LISA’11 117
  • 118. zpool clear • Clears device errors • Clears device error counters • Starts any resilvering, as needed • Improves sysadmin sanity and reduces sweating • zpool clear poolname • nmc: setup volume volumename clear-errors ZFS Tutorial USENIX LISA’11 118
  • 119. zpool iostat • Show pool physical I/O activity, in an iostat-like manner • Solaris: fsstat will show I/O activity looking into a ZFS file system • Especially useful for showing slog activity • nmc and Solaris use same command # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- zwimming 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latency ZFS Tutorial USENIX LISA’11 119
  • 120. zpool scrub • Manually starts scrub ✦ zpool scrub poolname • Scrubbing performed in background • Use zpool status to track scrub progress • Stop scrub ✦ zpool scrub -s poolname • How often to scrub? ✦ Depends on level of paranoia ✦ Once per month seems reasonable ✦ After a repair or recovery procedure • NexentaStor auto-scrub features easily manages scrubs and schedules Estimated scrub completion time improves over time ZFS Tutorial USENIX LISA’11 120
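  For instance, against a hypothetical pool:
  # zpool scrub tank
  # zpool status tank
  # zpool scrub -s tank
  zpool status reports scrub progress; the last command stops a running scrub.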
  • 121. auto-scrub service ZFS Tutorial USENIX LISA’11 121
  • 122. zfs Command 122
  • 123. Dataset Management raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? November 8, 2010 USENIX LISA’10 123
  • 124. zfs create, destroy • By default, a file system with the same name as the pool is created by zpool create • Dataset name format is: pool/name[/name ...] • File system / folder ✦ zfs create dataset-name ✦ nmc: create folder ✦ zfs destroy dataset-name ✦ nmc: destroy folder • Zvol ✦ zfs create -V size dataset-name ✦ nmc: create zvol ✦ zfs destroy dataset-name ✦ nmc: destroy zvol ZFS Tutorial USENIX LISA’11 124
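  Short examples with hypothetical names; properties can be set at creation time:
  # zfs create -o compression=on tank/home
  # zfs create -V 10G tank/vol1
  # zfs destroy tank/vol1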
  • 125. zfs mount, unmount • Note: mount point is a file system parameter ✦ zfs get mountpoint fs-name • Rarely used subcommand (!) • Display mounted file systems ✦ zfs mount • Mount a file system ✦ zfs mount fs-name ✦ zfs mount -a • Unmount (not umount) ✦ zfs unmount fs-name ✦ zfs unmount -a ZFS Tutorial USENIX LISA’11 125
  • 126. zfs list • List mounted datasets • NexentaStor 2: listed everything • NexentaStor 3: do not list snapshots ✦ See zpool listsnapshots property • Examples ✦ zfs list ✦ zfs list -t snapshot ✦ zfs list -H -o name ZFS Tutorial USENIX LISA’11 126
  • 127. Replication Services (chart: Recovery Point Objective vs. system I/O performance, slower → faster) • Days – traditional backup, NDMP • Hours – Auto-Tier (rsync), Auto-Sync (ZFS send/receive) • Seconds – Auto-CDP (AVS/SNDR), application-level mirror replication ZFS Tutorial USENIX LISA’11 127
  • 128. zfs send, receive • Send ✦ send a snapshot to stdout ✦ data is decompressed • Receive ✦ receive a snapshot from stdin ✦ receiving file system parameters apply (compression, et al.) • Can incrementally send snapshots in time order • Handy way to replicate dataset snapshots • NexentaStor ✦ simplifies management ✦ manages snapshots and send/receive to remote systems • Only method for replicating dataset properties, except quotas • NOT a replacement for traditional backup solutions ZFS Tutorial USENIX LISA’11 128
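  A sketch of incremental replication between two snapshots, with hypothetical hosts and dataset names; -F on the receiving side rolls the target back to the most recent common snapshot first:
  # zfs send -i tank/ws@monday tank/ws@tuesday | ssh backuphost zfs receive -F backup/ws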
  • 129. auto-sync Service ZFS Tutorial USENIX LISA’11 129
  • 130. zfs upgrade • Display current versions ✦ zfs upgrade • View available upgrade versions, with features, but don't actually upgrade ✦ zfs upgrade -v • Upgrade pool to latest version ✦ zfs upgrade dataset • Upgrade pool to specific version ✦ zfs upgrade -V version dataset • NexentaStor: not needed until 3.0 You can upgrade, there is no downgrade Beware of grub and rollback issues ZFS Tutorial USENIX LISA’11 130
  • 131. Sharing 131
  • 132. Sharing • zfs share dataset • Type of sharing set by parameters ✦ shareiscsi = [on | off] ✦ sharenfs = [on | off | options] ✦ sharesmb = [on | off | options] • Shortcut to manage sharing ✦ Uses external services (nfsd, iscsi target, smbshare, etc) ✦ Importing pool will also share ✦ Implementation is OS-specific ✤ sharesmb uses in-kernel SMB server for Solaris-derived OSes ✤ sharesmb uses Samba for FreeBSD ZFS Tutorial USENIX LISA’11 132
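  For example, with a hypothetical dataset (the share settings are inherited by descendants and re-established when the pool is imported):
  # zfs set sharenfs=on tank/home
  # zfs set sharesmb=on tank/home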
  • 133. Properties 133
  • 134. Properties • Properties are stored in an nvlist • By default, are inherited • Some properties are common to all datasets, but a specific dataset type may have additional properties • Easily set or retrieved via scripts • In general, properties affect future file system activity zpool get doesn't script as nicely as zfs get ZFS Tutorial USENIX LISA’11 134
  • 135. Getting Properties • zpool get all poolname • nmc: show volume volumename property propertyname • zpool get propertyname poolname • zfs get all dataset-name • nmc: show folder foldername property • nmc: show zvol zvolname property ZFS Tutorial USENIX LISA’11 135
  • 136. Setting Properties • zpool set propertyname=value poolname • nmc: setup volume volumename property propertyname • zfs set propertyname=value dataset-name • nmc: setup folder foldername property propertyname ZFS Tutorial USENIX LISA’11 136
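  Because properties script cleanly, a hypothetical example of setting a value and reading it back without headers:
  # zfs set compression=on tank/home
  # zfs get -H -o value compression tank/home
  on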
  • 137. User-defined Properties • Names ✦ Must include colon ':' ✦ Can contain lower case alphanumerics or “+” “.” “_” ✦ Max length = 256 characters ✦ By convention, module:property ✤ com.sun:auto-snapshot • Values ✦ Max length = 1024 characters • Examples ✦ com.sun:auto-snapshot=true ✦ com.richardelling:important_files=true ZFS Tutorial USENIX LISA’11 137
  • 138. Clearing Properties • Reset to inherited value ✦ zfs inherit compression export/home/relling • Clear user-defined parameter ✦ zfs inherit com.sun:auto-snapshot export/ home/relling • NexentaStor doesn’t offer method in nmc ZFS Tutorial USENIX LISA’11 138
  • 139. Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoexpand Policy for expanding when vdev size changes autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policy ZFS Tutorial USENIX LISA’11 139
  • 140. More Pool Properties Property Change? Brief Description guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk version ZFS Tutorial USENIX LISA’11 140
  • 141. Common Dataset Properties Property Change? Brief Description available readonly Space available to dataset & children checksum Checksum algorithm compression Compression algorithm compressratio readonly Compression ratio – logical size : referenced physical size copies Number of copies of user data creation readonly Dataset creation time dedup Deduplication policy logbias Separate log write policy mlslabel Multilayer security label origin readonly For clones, origin snapshot ZFS Tutorial USENIX LISA’11 141
  • 142. More Common Dataset Properties Property Change? Brief Description primarycache ARC caching policy readonly Is dataset in readonly mode? referenced readonly Size of data accessible by this dataset refreservation Minimum space guaranteed to a dataset, excluding descendants (snapshots & clones) reservation Minimum space guaranteed to dataset, including descendants secondarycache L2ARC caching policy sync Synchronous write policy type readonly Type of dataset (filesystem, snapshot, volume) ZFS Tutorial USENIX LISA’11 142
  • 143. More Common Dataset Properties Property Change? Brief Description used readonly Sum of usedby* (see below) usedbychildren readonly Space used by descendants usedbydataset readonly Space used by dataset usedbyrefreservation readonly Space used by a refreservation for this dataset usedbysnapshots readonly Space used by all snapshots of this dataset zoned readonly Is dataset added to non-global zone (Solaris) ZFS Tutorial USENIX LISA’11 143
  • 144. Volume Dataset Properties Property Change? Brief Description shareiscsi iSCSI service (not COMSTAR) volblocksize creation Fixed block size volsize Implicit quota zoned readonly Set if dataset delegated to non-global zone (Solaris) ZFS Tutorial USENIX LISA’11 144
  • 145. File System Properties Property Change? Brief Description aclinherit ACL inheritance policy, when files or directories are created aclmode ACL modification policy, when chmod is used atime Disable access time metadata updates canmount Mount policy casesensitivity creation Filename matching algorithm (CIFS client feature) devices Device opening policy for dataset exec File execution policy for dataset mounted readonly Is file system currently mounted? ZFS Tutorial USENIX LISA’11 145
  • 146. More File System Properties Property Change? Brief Description nbmand export/import File system should be mounted with non-blocking mandatory locks (CIFS client feature) normalization creation Unicode normalization of file names for matching quota Max space dataset and descendants can consume recordsize Suggested maximum block size for files refquota Max space dataset can consume, not including descendants setuid setuid mode policy sharenfs NFS sharing options sharesmb File system shared with CIFS ZFS Tutorial USENIX LISA’11 146
  • 147. File System Properties Property Change? Brief Description snapdir Controls whether .zfs directory is hidden utf8only creation UTF-8 character file name policy vscan Virus scan enabled xattr Extended attributes policy ZFS Tutorial USENIX LISA’11 147
  • 148. Forking Properties Pool Properties Release Property Brief Description illumos comment Human-readable comment field Dataset Properties Release Property Brief Description Solaris 11 encryption Dataset encryption Delphix/illumos clones Clone descendants Delphix/illumos refratio Compression ratio for references Solaris 11 share Combines sharenfs & sharesmb Solaris 11 shadow Shadow copy NexentaOS/illumos worm WORM feature Delphix/illumos written Amount of data written since last snapshot ZFS Tutorial USENIX LISA’11 148
  • 149. More Goodies 149
  • 150. Dataset Space Accounting • used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation • Lazy updates, may not be correct until txg commits • ls and du will show size of allocated files which includes all copies of a file • Shorthand report available $ zfs list -o space NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD rpool 126G 18.3G 0 35.5K 0 18.3G rpool/ROOT 126G 15.3G 0 18K 0 15.3G rpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0 rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0 rpool/dump 126G 1.00G 0 1.00G 0 0 rpool/export 126G 37K 0 19K 0 18K rpool/export/home 126G 18K 0 18K 0 0 rpool/swap 128G 2G 0 193M 1.81G 0 ZFS Tutorial USENIX LISA’11 150
  • 151. Pool Space Accounting • Pool space accounting changed in b128, along with deduplication • Compression, deduplication, and raidz complicate pool accounting (the numbers are correct, the interpretation is suspect) • Capacity planning for remaining free space can be challenging $ zpool list zwimming NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT zwimming 100G 43.9G 56.1G 43% 1.00x ONLINE - ZFS Tutorial USENIX LISA’11 151
  • 152. zfs vs zpool Space Accounting • zfs list != zpool list • zfs list shows space used by the dataset plus space for internal accounting • zpool list shows physical space available to the pool • For simple pools and mirrors, they are nearly the same • For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space available for parity Users will be confused about reported space available ZFS Tutorial USENIX LISA’11 152
  • 153. NexentaStor Snapshot Services ZFS Tutorial USENIX LISA’11 153
  • 154. Accessing Snapshots • By default, snapshots are accessible in .zfs directory • Visibility of .zfs directory is tunable via snapdir property ✦ Don't really want find to find the .zfs directory • Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS) # zfs snapshot rpool/export/home/relling@20090415 # ls -a /export/home/relling … .Xsession .xsession-errors # ls /export/home/relling/.zfs shares snapshot # ls /export/home/relling/.zfs/snapshot 20090415 # ls /export/home/relling/.zfs/snapshot/20090415 Desktop Documents Downloads Public ZFS Tutorial USENIX LISA’11 154
  • 155. Time Slider - Automatic Snapshots • Solaris feature similar to OSX's Time Machine • SMF service for managing snapshots • SMF properties used to specify policies: frequency (interval) and number to keep • Creates cron jobs • GUI tool makes it easy to select individual file systems • Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion Service Name Interval (default) Keep (default) auto-snapshot:frequent 15 minutes 4 auto-snapshot:hourly 1 hour 24 auto-snapshot:daily 1 day 31 auto-snapshot:weekly 7 days 4 auto-snapshot:monthly 1 month 12 ZFS Tutorial USENIX LISA’11 155
  • 156. Nautilus • File system views which can go back in time ZFS Tutorial USENIX LISA’11 156
  • 157. Resilver & Scrub • Can be read IOPS bound • Resilver can also be bandwidth bound to the resilvering device • Both work at lower I/O scheduling priority than normal work, but that may not matter for read IOPS bound devices • Dueling RFEs: ✦ Resilver should go faster ✦ Resilver should go slower ✤ Integrated in b140 ZFS Tutorial USENIX LISA’11 157
  • 158. Time-based Resilvering • Block pointers contain birth txg number • Resilvering begins with oldest blocks first • Interrupted resilver will still result in a valid file system view • (diagram: block tree with blocks of birth txg = 27, 68, and 73) ZFS Tutorial USENIX LISA’11 158
  • 159. ACL – Access Control List • Based on NFSv4 ACLs • Similar to Windows NT ACLs • Works well with CIFS services • Supports ACL inheritance • Change using chmod • View using ls • Some changes in b146 to make behaviour more consistent ZFS Tutorial USENIX LISA’11 159
  • 160. Checksums for Data • DVA contains 256 bits for checksum • Checksum is in the parent, not in the block itself • Types ✦ none ✦ fletcher2: truncated 2nd order Fletcher-like algorithm ✦ fletcher4: 4th order Fletcher-like algorithm ✦ SHA-256 • There are open proposals for better algorithms ZFS Tutorial USENIX LISA’11 160
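  As an example, switching a hypothetical dataset to SHA-256 checksums; like other properties, this applies only to newly written blocks:
  # zfs set checksum=sha256 tank/important
  # zfs get checksum tank/important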
  • 161. Checksum Use Pool Algorithm Notes Uberblock SHA-256 self-checksummed Metadata fletcher4 Labels SHA-256 Gang block SHA-256 self-checksummed Dataset Algorithm Notes Metadata fletcher4 fletcher2 Data zfs checksum parameter fletcher4 (b114) fletcher2 ZIL log self-checksummed fletcher4 (b135) Send stream fletcher4 Note: ZIL log has additional checking beyond the checksum ZFS Tutorial USENIX LISA’11 161