This presentation is from the ZFS Tutorial presented at the USENIX LISA'09 conference in Baltimore, Maryland, in November 2009.
Later versions are available on slideshare.net, too.
3. Ground Rules
No religious discussion
No licensing discussion
No “future of <company>” discussion
No zones/containers/jails discussion
No “when is it going to be in Solaris 10” discussion... ok maybe a few...
3
4. History
Announced September 14, 2004
Integration history
SXCE b27 (November 2005)
FreeBSD (April 2007)
Mac OS X Leopard
Preview shown, but removed from Snow Leopard
Disappointed community reforming as the zfs-macos Google group (Oct 2009)
OpenSolaris 2008.05
Solaris 10 6/06 (June 2006)
Linux FUSE (summer 2006)
greenBytes ZFS+ (September 2008)
More than 45 patents, contributed to the CDDL Patents Common
4
5. Brief List of Features
Features:
Future-proof
Cutting-edge data integrity
High performance
Simplified administration
Eliminates need for volume managers
Reduced costs
Compatibility with POSIX file system & block devices
Self-healing
Marketing claims:
“No silent data corruption ever”
“Mind-boggling scalability”
“Breathtaking speed”
“Near zero administration”
“Radical new architecture”
“Greatly simplifies support issues”
“RAIDZ saves money”
Marketing: 2 drink minimum
5
6. ZFS Design Goals
Figure out why storage has gotten so complicated
Blow away 20+ years of obsolete assumptions
Gotta replace UFS
Design an integrated system from scratch
End the suffering
6
7. Limits
2^48 — Number of entries in any individual directory
2^56 — Number of attributes of a file [1]
2^56 — Number of files in a directory [1]
16 EiB (2^64 bytes) — Maximum size of a file system
16 EiB — Maximum size of a single file
16 EiB — Maximum size of any attribute
2^64 — Number of devices in any pool
2^64 — Number of pools in a system
2^64 — Number of file systems in a pool
2^64 — Number of snapshots of any file system
256 ZiB (2^78 bytes) — Maximum size of any pool
[1] actually constrained to 2^48 for the number of files in a ZFS file system
7
8. Sidetrack: Understanding Builds
Build is often referenced when speaking of feature/bug integration
Short-hand notation: b#
OpenSolaris and SXCE are based on NV (Nevada)
SXCE will soon end
OpenSolaris carries forward
ZFS development done for NV
Bi-weekly build cycle
Schedule at http://opensolaris.org/os/community/on/schedule/
ZFS is ported to Solaris 10 and other OSes
8
15. Versioning
Features can be added and identified by nvlist entries
Changes in pool or dataset versions do not change the physical on-disk format (!)
they do change nvlist parameters
Older versions can be used
might see warning messages, but they are harmless
Available versions and features can be easily viewed
zpool upgrade -v
zfs upgrade -v
Online references
zpool: www.opensolaris.org/os/community/zfs/version/N
zfs: www.opensolaris.org/os/community/zfs/version/zpl/N
Don't confuse zpool and zfs versions
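For example, to list the supported versions and upgrade a pool and its datasets (pool name is hypothetical):
zpool upgrade -v
zpool upgrade mypool
zfs upgrade -r mypool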
15
16. zpool versions
VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit support
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 snapshot user holds
19 Log device removal
16
17. zfs versions
VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS filesystem version
2 Enhanced directory entries
3 Case insensitive and File system unique identifier (FUID)
4 userquota, groupquota properties
17
18. Copy on Write
[Diagram: 1. Initial block tree, 2. COW some data, 3. COW metadata, 4. Update uberblocks & free]
18
19. COW Notes
COW works on blocks, not files
ZFS reserves 32 MBytes or 1/64 of pool size
COW needs some free space to remove files
need space for ZIL
For fixed record size workloads, “fragmentation” and “poor performance” can occur if the recordsize is not matched
Spatial distribution is good fodder for performance speculation
affects HDDs
moot for SSDs
19
24. To fsck or not to fsck
fsck was created to fix known inconsistencies in file system metadata
UFS is not transactional
metadata inconsistencies must be reconciled
does NOT repair data – how could it?
ZFS doesn't need fsck, as-is
all on-disk changes are transactional
COW means previously existing, consistent metadata is not
overwritten
ZFS can repair itself
metadata is at least dual-redundant
data can also be redundant
Reality check – this does not mean that ZFS is not susceptible to
corruption
nor is any other file system
24
26. Dynamic Striping
RAID-0
− SNIA definition: fixed-length sequences of virtual disk data
addresses are mapped to sequences of member disk
addresses in a regular rotating pattern
Dynamic Stripe
− Data is dynamically mapped to member disks
− No fixed-length sequences
− Allocate up to ~1 MByte/vdev before changing vdev
− vdevs can be different size
− Good combination of the concatenation feature with RAID-0
performance
26
28. Mirroring
Straightforward: put N copies of the data on N vdevs
Unlike RAID-1
− No 1:1 mapping at the block level
− vdev labels are still at beginning and end
− vdevs can be of different size
effective space is that of smallest vdev
Arbitration: ZFS does not blindly trust either side of mirror
− Most recent, correct view of data wins
− Checksums validate data
28
30. Dynamic vdev Replacement
zpool replace poolname vdev [vdev]
Today, replacing vdev must be same size or larger
− Before b117: as measured by blocks
− After b117: as measured by metaslabs
Replacing all vdevs in a top-level vdev with larger vdevs results in
top-level vdev resizing
Policy controlled by zpool autoexpand property
[Diagram: a 10G mirror grows to a 15G and then a 20G mirror as its vdevs are replaced with 15G and 20G devices]
30
31. RAIDZ
RAID-5
− Parity check data is distributed across the RAID array's disks
− Must read/modify/write when data is smaller than stripe width
RAIDZ
− Dynamic data placement
− Parity added as needed
− Writes are full-stripe writes
− No read/modify/write (avoids the write hole)
Arbitration: ZFS does not blindly trust any device
− Does not rely on disk reporting read error
− Checksums validate data
− If checksum fails, read parity
Space used depends on how the pool is used
31
33. RAID-5 Write Hole
Occurs when data to be written is smaller than stripe size
Must read unallocated columns to recalculate the parity or the parity
must be read/modify/write
Read/modify/write is risky for consistency
− Multiple disks
− Reading independently
− Writing independently
− System failure before all writes are complete to media could
result in data loss
Effects can be hidden from host using RAID array with nonvolatile
write cache, but extra I/O cannot be hidden from disks
33
34. RAIDZ2 and RAIDZ3
RAIDZ2 = double parity RAIDZ
RAIDZ3 = triple parity RAIDZ
Sorta like RAID-6
− Parity 1: XOR
− Parity 2: another Reed-Solomon syndrome
− Parity 3: yet another Reed-Solomon syndrome
Arbitration: ZFS does not blindly trust any device
− Does not rely on disk reporting read error
− Checksums validate data
− If data not valid, read parity
− If data still not valid, read other parity
Space used depends on how the pool is used
34
35. Evaluating Data Retention
MTTDL = Mean Time To Data Loss
Note: MTBF is not constant in the real world, but assuming a constant MTBF keeps the math simple
MTTDL[1] is a simple MTTDL model
No parity (single vdev, striping, RAID-0)
− MTTDL[1] = MTBF / N
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
− MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
− MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
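A quick worked example with hypothetical numbers (MTBF = 100,000 hours, N = 8 disks, MTTR = 24 hours):
− No parity: MTTDL[1] = 100,000 / 8 = 12,500 hours ≈ 1.4 years
− Single parity: MTTDL[1] = 100,000^2 / (8 * 7 * 24) ≈ 7.4 million hours ≈ 850 years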
35
36. Another MTTDL Model
MTTDL[1] model doesn't take unrecoverable reads into account
But unrecoverable reads (UER) are becoming the dominant failure mode
− UER specified as errors per bits read
− More bits = higher probability of loss per vdev
MTTDL[2] model considers UER
36
37. Why Worry about UER?
Richard's study
− 3,684 hosts with 12,204 LUNs
− 11.5% of all LUNs reported read errors
Bairavasundaram et al., FAST08
www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
− 1.53M LUNs over 41 months
− RAID reconstruction discovers 8% of checksum mismatches
− 4% of disks studied developed checksum errors over 17 months
37
38. MTTDL[2] Model
Probability that a reconstruction will fail
− Precon_fail = (N-1) * size / UER
Model doesn't work for non-parity schemes (single vdev, striping,
RAID-0)
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
− MTTDL[2] = MTBF / (N * Precon_fail)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
− MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
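Continuing the hypothetical example (N = 8, MTBF = 100,000 hours, MTTR = 24 hours, 1 TByte disks ≈ 8 x 10^12 bits, UER = 10^15 bits per error):
− Precon_fail = 7 * 8 x 10^12 / 10^15 ≈ 0.056
− Single parity: MTTDL[2] = 100,000 / (8 * 0.056) ≈ 223,000 hours ≈ 25 years – far less than MTTDL[1] suggests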
38
42. Ditto Blocks
Recall that each blkptr_t contains 3 DVAs
Dataset property used to indicate how many copies (aka ditto blocks) of data are desired
Write all copies
Read any copy
Recover corrupted read from a copy
Not a replacement for mirroring
Easier to describe in pictures...
copies parameter     Data copies   Metadata copies
copies=1 (default)   1             2
copies=2             2             3
copies=3             3             3
42
46. ZIO Framework
All physical disk I/O goes through ZIO Framework
Translates DVAs into Logical Block Address (LBA) on leaf vdevs
Keeps free space maps (spacemap)
If contiguous space is not available:
Allocate smaller blocks (the gang)
Allocate gang block, pointing to the gang
Implemented as multi-stage pipeline
Allows extensions to be added fairly easily
Handles I/O errors
46
52. Object Cache
UFS uses page cache managed by the virtual memory system
ZFS does not use the page cache, except for mmap'ed files
ZFS uses an Adaptive Replacement Cache (ARC)
ARC used by DMU to cache DVA data objects
Only one ARC per system, but caching policy can be changed on a
per-dataset basis
Seems to work much better than page cache ever did for UFS
52
53. Traditional Cache
Works well when data being accessed was recently added
Doesn't work so well when frequently accessed data is evicted
[Diagram: cache misses cause an insert at the MRU end; the oldest entry is evicted from the LRU end; dynamic caches can change size by either not evicting or aggressively evicting]
53
54. ARC – Adaptive Replacement Cache
[Diagram: two lists, a Recent Cache and a Frequent Cache; a cache miss inserts at the Recent Cache MRU end; a hit in the Recent Cache moves the entry to the Frequent Cache; the oldest single-use entry is evicted from the Recent Cache LRU end and the oldest multiply-accessed entry from the Frequent Cache LRU end; evictions and dynamic resizing need to choose the best cache to evict from (shrink)]
54
55. ZFS ARC – Adaptive Replacement Cache with Locked Pages
[Diagram: as above, except locked pages cannot be evicted; an entry is promoted from the Recent Cache to the Frequent Cache if a hit occurs within 62 ms]
ZFS ARC handles mixed-size pages
55
56. ARC Directory
Each ARC directory entry contains arc_buf_hdr structs
Info about the entry
Pointer to the entry
Directory entries have size, ~200 bytes
ZFS block size is dynamic, 512 bytes – 128 kBytes
Disks are large
Suppose we use a Seagate LP 2 TByte disk for the L2ARC
Disk has 3,907,029,168 512 byte sectors, guaranteed
Workload uses 8 kByte fixed record size
RAM needed for arc_buf_hdr entries
Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes
Don't underestimate the RAM needed for large L2ARCs
56
57. L2ARC – Level 2 ARC
ARC evictions are sent to the cache vdev
ARC directory remains in memory
Works well when cache vdev is optimized for fast reads
lower latency than pool disks
inexpensive way to “increase memory”
Content considered volatile, no ZFS data protection allowed
Monitor usage with zpool iostat
[Diagram: data evicted from the ARC flows to one or more “cache” vdevs]
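A minimal sketch of adding and monitoring a cache device (pool and device names are hypothetical):
zpool add mypool cache c5t0d0
zpool iostat -v mypool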
57
58. ARC Tips
In general, it seems to work well for most workloads
ARC size will vary, based on usage
Default max is 3/4 of memory or (memory - 1 GByte), whichever is greater
Min is 64 MB
Metadata capped at 1/4 of max ARC size
Internals tracked by kstats in Solaris
Use memory_throttle_count to observe pressure to evict
Can limit at boot time
Solaris – set zfs:zfs_arc_max in /etc/system
Performance
Prior to b107, L2ARC fill rate was limited to 8 MBytes/s
L2ARC keeps its directory in kernel memory
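As an example of the boot-time limit mentioned above, an /etc/system entry capping the ARC at 4 GBytes (value is illustrative):
set zfs:zfs_arc_max = 0x100000000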
58
61. DMU – Data Management Unit
Datasets issue transactions to the DMU
Transaction-based object model
Transactions are
Atomic
Grouped (txg = transaction group)
Responsible for on-disk data
ZFS Attribute Processor (ZAP)
Dataset and Snapshot Layer (DSL)
ZFS Intent Log (ZIL)
61
62. Transaction Engine
Manages physical I/O
Transactions grouped into transaction group (txg)
txg updates
All-or-nothing
Commit interval
Older versions: 5 seconds
Now: 30 seconds max, dynamically scale based on time required to
commit txg
Delay committing data to physical storage
Improves performance
A bad thing for sync workloads – hence the ZFS Intent Log (ZIL)
30 second delay can impact failure detection time
62
63. ZIL – ZFS Intent Log
DMU is transactional, and likes to group I/O into transactions for later
commits, but still needs to handle “write it now” desire of sync
writers
NFS
Databases
ZIL recordsize inflation can occur for some workloads
May cause larger than expected actual I/O for sync workloads
Oracle redo logs
Can tune zfs_immediate_write_sz, but after b122 use logbias
property instead
Never read, except at import (e.g., reboot), when transactions may need
to be rolled forward
63
64. Separate Logs (slogs)
ZIL competes with pool for iops
Applications will wait for sync writes to be on nonvolatile media
Very noticeable on HDD JBODs
Put ZIL on separate vdev, outside of pool
ZIL writes tend to be sequential
No competition with pool for IOPS
Downside: slog device required to be operational at import
b125 adds slog device removal support
Size of separate log < size of RAM (duh)
10x or more performance improvements possible
Use write-optimized SSD or non-volatile write cache on RAID array
Use zilstat to observe ZIL activity
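A minimal sketch of adding a mirrored separate log (pool and device names are hypothetical):
zpool add mypool log mirror c4t0d0 c4t1d0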
64
65. Synchronous Write Destination
Without separate log:
Sync I/O size > zfs_immediate_write_sz?   ZIL destination
no                                        ZIL log (in pool)
yes                                       bypass to pool
With separate log:
Sync I/O size > zfs_immediate_write_sz?   logbias?                  ZIL destination
no                                        -                         log device
yes                                       prior to logbias (b122)   log device
yes                                       latency (default)         log device
yes                                       throughput                bypass to pool
Default zfs_immediate_write_sz = 32 kBytes
65
66. Disabling the ZIL
Rule 0: Don’t disable the ZIL
If you love your data, do not disable the ZIL
You can find references to this as a way to speed up ZFS
NFS workloads
“tar -x” benchmarks
Golden Rule: Don’t disable the ZIL
Can set via mdb, but need to remount the file system under test
Friends don’t let friends disable the ZIL
Solaris - can set in /etc/system
*** TEMPORARY disable ZIL for non-production use
*** disabled by <your name> on <date>
set zfs:zil_disable=1
Nostradamus wrote, “disabling the ZIL will lead to the apocalypse”
66
68. flash
Copy on Write
[Diagram: 1. Initial block tree, 2. COW some data, 3. COW metadata, 4. Update uberblocks & free]
68
69. zfs snapshot
Create a read-only, point-in-time window into the dataset (file system
or Zvol)
Computationally free, because of COW architecture
Very handy feature
Patching/upgrades
Basis for Time Slider
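Minimal examples (dataset names are hypothetical):
zfs snapshot mypool/home/relling@before-patch
zfs snapshot -r mypool@nightly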
69
70. Snapshot
[Diagram: the snapshot tree root and the current tree root share unchanged blocks]
Create a snapshot by not freeing COWed blocks
Snapshot creation is fast and easy
Number of snapshots determined by use – no hardwired limit
Recursive snapshots also possible
70
71. Clones
Snapshots are read-only
Clones are read-write based upon a snapshot
Child depends on parent
Cannot destroy parent without destroying all children
Can promote children to be parents
Good ideas
OS upgrades
Change control
Replication
zones
virtual disks
71
72. zfs clone
Create a read-write file system from a read-only snapshot
Used extensively for OpenSolaris upgrades
[Diagram: an OS rev1 boot environment is snapshotted, cloned, and the clone upgraded to OS rev2; the boot manager selects between them]
Origin snapshot cannot be destroyed, if clone exists
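A minimal sketch of the clone/promote flow (names are hypothetical):
zfs clone rpool/ROOT/rev1@pre-upgrade rpool/ROOT/rev2
zfs promote rpool/ROOT/rev2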
72
73. zfs rollback
[Diagram: zfs rollback returns rpool/ROOT/b104 to the state captured in the snapshot rpool/ROOT/b104@today]
73
76. Dataset & Snapshot Layer
Object
allocated storage
dnode describes collection of blocks
Object Set
group of related objects
Dataset
object set
Snapmap: snapshot relationships
space usage
Dataset directory
Childmap: dataset relationships
properties
[Diagram: a Dataset Directory holds a Childmap and Properties and points to a Dataset; the Dataset holds a Snapmap and points to an Object Set containing Objects]
76
77. zpool create
zpool create poolname vdev-configuration
vdev-configuration examples
mirror c0t0d0 c3t6d0
mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
mirror disk1s0 disk2s0 cache disk4s0 log disk5
raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
Solaris
Additional checks to see if disk/slice overlaps or is currently in use
Whole disks are given EFI labels
Can set initial pool or dataset properties
By default, creates a file system with the same name
poolname pool → /poolname file system
People get confused by a file system with the same name as the pool
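For example, a dry run and then a create that also sets an initial file system property (pool and device names are hypothetical):
zpool create -n mypool mirror c0t0d0 c3t6d0
zpool create -O compression=on mypool mirror c0t0d0 c3t6d0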
77
78. zpool add
Adds a device to the pool as a top-level vdev
zpool add poolname vdev-configuration
vdev-configuration can be any combination also used for zpool create
Complains if the added vdev-configuration would cause a different data
protection scheme than is already in use – use “-f” to override
Good idea: try with “-n” flag first – will show final configuration without
actually performing the add
Do not add a device which is in use as a quorum device
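For example, previewing with -n and then adding another mirror as a new top-level vdev (names are hypothetical):
zpool add -n mypool mirror c4t0d0 c4t1d0
zpool add mypool mirror c4t0d0 c4t1d0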
78
79. zpool remove
Remove a top-level vdev from the pool
zpool remove poolname vdev
Today, you can only remove the following vdevs:
cache
hot spare
separate log (b124)
An RFE is open to allow removal of other top-level vdevs
Don't confuse “remove” with “detach”
79
80. zpool attach
Attach a vdev as a mirror to an existing vdev
zpool attach poolname existing-vdev vdev
Attaching vdev must be the same size or larger than the existing vdev
Note: today, not available for RAIDZ, RAIDZ2, or RAIDZ3 vdevs
vdev Configurations
ok simple vdev → mirror
ok mirror
ok log → mirrored log
no RAIDZ
no RAIDZ2
no RAIDZ3
“Same size” literally means the same number of blocks until b117.
Beware that many “same size” disks have different number of
available blocks.
80
81. zpool import
Import a pool and mount all mountable datasets
Import a specific pool
zpool import poolname
zpool import GUID
Scan LUNs for pools which may be imported
zpool import
Can set options, such as alternate root directory or other properties
Beware of zpool.cache interactions
Beware of artifacts, especially partial artifacts
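Examples of the options mentioned above (pool name and paths are hypothetical):
zpool import -d /dev/dsk mypool
zpool import -f -R /a mypool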
81
82. zpool history
Show history of changes made to the pool
# zpool history rpool
History for 'rpool':
2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o
cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
2009-03-04.07:29:47 zfs set canmount=noauto rpool
2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
2009-03-04.07:29:51 zfs set canmount=on rpool
2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
2009-03-04.07:29:51 zfs create rpool/export/home
2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
2009-03-04.00:21:42 zpool export rpool
2009-03-04.08:47:08 zpool set bootfs=rpool rpool
2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/
snv_b108
...
82
83. zpool status
Shows the status of the current pools, including their configuration
Important troubleshooting step
# zpool status
…
pool: zwimming
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zwimming ONLINE 0 0 0
mirror ONLINE 0 0 0
c0t2d0s0 ONLINE 0 0 0
c0t0d0s7 ONLINE 0 0 0
errors: No known data errors
Understanding status output error messages can be tricky
83
84. zpool iostat
Show pool physical I/O activity, in an iostat-like manner
Solaris: fsstat will show I/O activity looking into a ZFS file system
Especially useful for showing slog activity
# zpool iostat -v
capacity operations bandwidth
pool used avail read write read write
------------ ----- ----- ----- ----- ----- -----
rpool 16.5G 131G 0 0 1.16K 2.80K
c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K
------------ ----- ----- ----- ----- ----- -----
zwimming 135G 14.4G 0 5 2.09K 27.3K
mirror 135G 14.4G 0 5 2.09K 27.3K
c0t2d0s0 - - 0 3 1.25K 27.5K
c0t0d0s7 - - 0 2 1.27K 27.5K
------------ ----- ----- ----- ----- ----- -----
Unlike iostat, does not show latency
84
86. zfs create, destroy
By default, a file system with the same name as the pool is created by
zpool create
Name format is: pool/name[/name ...]
File system
zfs create fs-name
zfs destroy fs-name
Zvol
zfs create -V size vol-name
zfs destroy vol-name
Parameters can be set at create time
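Minimal examples (names and sizes are hypothetical):
zfs create -o compression=on mypool/home
zfs create -V 10g mypool/vol1
zfs destroy -r mypool/home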
86
87. zfs list
List mounted datasets
Old versions: listed everything
After b108: do not list snapshots
See zpool listsnapshots property
Examples
zfs list
zfs list -t snapshot
zfs list -H -o name
87
88. zfs send, receive
Send
send a snapshot to stdout
data is decompressed
Receive
receive a snapshot from stdin
receiving file system parameters apply (compression, et al.)
Can incrementally send snapshots in time order
Handy way to replicate dataset snapshots
Only method for replicating dataset properties, except quotas
NOT a replacement for traditional backup solutions
All-or-nothing design per snapshot
In general, does not send files (!)
Send streams from b35 (or older) no longer supported after b89
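A minimal replication sketch (names are hypothetical):
zfs send mypool/home@snap1 | zfs receive backup/home
zfs send -i snap1 mypool/home@snap2 | ssh otherhost zfs receive -F backup/home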
88
90. Sharing
zfs share dataset
Type of sharing set by parameters
shareiscsi = [on | off]
sharenfs = [on | off | options]
sharesmb = [on | off | options]
Shortcut to manage sharing
Uses external services (nfsd, iscsi target, smbshare, etc)
Importing pool will also share
May vary by OS
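For example (dataset name and options are hypothetical):
zfs set sharenfs=rw mypool/export/home
zfs set sharesmb=on mypool/export/home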
90
91. NFS
ZFS file systems work as expected
use ACLs based on NFSv4 ACLs
Parallel NFS, aka pNFS, aka NFSv4.1
Still a work-in-progress
http://opensolaris.org/os/project/nfsv41/
zfs create -t pnfsdata mypnfsdata
[Diagram: a pNFS client talks to a pNFS metadata server and to pNFS data servers, each holding a pnfsdata dataset in its pool]
91
92. CIFS
UID mapping
casesensitivity parameter
Good idea, set when file system is created
zfs create -o casesensitivity=insensitive mypool/Shared
Shadow Copies for Shared Folders (VSS) supported
CIFS clients cannot create shadow remotely (yet)
CIFS features vary by OS, Samba, etc.
92
93. iSCSI
SCSI over IP
Block-level protocol
Uses Zvols as storage
Solaris has 2 iSCSI target implementations
shareiscsi enables old, user-land iSCSI target
To use COMSTAR, enable using itadm(1m)
b116 more closely integrates COMSTAR (zpool version 16)
iSCSI performance hiccup
Prior to b107, iSCSI over Zvols didn’t properly handle sync writes
b107-b113, iSCSI over Zvols made all writes sync (read: slow)
Workaround: enable write cache enable in the iSCSI target, see
CR6770534
OpenSolaris 2009.06 is b111
b114, write cache enable works automatically for iSCSI over Zvols
93
95. Properties
Properties are stored in an nvlist
By default, are inherited
Some properties are common to all datasets, but a specific dataset
type may have additional properties
Easily set or retrieved via scripts
In general, properties affect future file system activity
zpool get doesn't script as nicely as zfs get
95
96. User-defined Properties
Names
Must include colon ':'
Can contain lower case alphanumerics or “+” “.” “_”
Max length = 256 characters
By convention, module:property
com.sun:auto-snapshot
Values
Max length = 1024 characters
Examples
com.sun:auto-snapshot=true
com.richardelling:important_files=true
96
97. set & get properties
Set
zfs set compression=on export/home/relling
Get
zfs get compression export/home/relling
Reset to inherited value
zfs inherit compression export/home/relling
Clear user-defined parameter
zfs inherit com.sun:auto-snapshot export/home/relling
97
98. Pool Properties
Property Change? Brief Description
altroot Alternate root directory (ala chroot)
autoexpand Policy for expanding when vdev size changes
autoreplace vdev replacement policy
available readonly Available storage space
bootfs Default bootable dataset for root pool
cachefile Cache file to use other than /etc/zfs/zpool.cache
capacity readonly Percent of pool space used
delegation Master pool delegation switch
failmode Catastrophic pool failure policy
98
99. More Pool Properties
Property Change? Brief Description
guid readonly Unique identifier
health readonly Current health of the pool
listsnapshots zfs list policy
size readonly Total size of pool
used readonly Amount of space used
version readonly Current on-disk version
99
100. Common Dataset Properties
Property Change? Brief Description
available readonly Space available to dataset & children
checksum Checksum algorithm
compression Compression algorithm
compressratio readonly Compression ratio – logical size:referenced physical
copies Number of copies of user data
creation readonly Dataset creation time
logbias Separate log write policy
origin readonly For clones, origin snapshot
primarycache ARC caching policy
readonly Is dataset in readonly mode?
referenced readonly Size of data accessible by this dataset
100
101. More Common Dataset Properties
Property Change? Brief Description
refreservation Max space guaranteed to a dataset, excluding descendants (snapshots & clones)
reservation Minimum space guaranteed to dataset, including descendants
secondarycache L2ARC caching policy
type readonly Type of dataset (filesystem, snapshot, volume)
101
102. More Common Dataset Properties
Property Change? Brief Description
used readonly Sum of usedby* (see below)
usedbychildren readonly Space used by descendants
usedbydataset readonly Space used by dataset
usedbyrefreservation readonly Space used by a refreservation for this dataset
usedbysnapshots readonly Space used by all snapshots of this dataset
zoned readonly Is dataset added to non-global zone (Solaris)
102
103. Volume Dataset Properties
Property Change? Brief Description
shareiscsi iSCSI service (not COMSTAR)
volblocksize creation fixed block size
volsize Implicit quota
zoned readonly Set if dataset delegated to non-global zone (Solaris)
103
104. File System Properties
Property Change? Brief Description
aclinherit ACL inheritance policy, when files or directories are created
aclmode ACL modification policy, when chmod is used
atime Disable access time metadata updates
canmount Mount policy
casesensitivity creation Filename matching algorithm (CIFS client feature)
devices Device opening policy for dataset
exec File execution policy for dataset
mounted readonly Is file system currently mounted?
104
105. More File System Properties
Property Change? Brief Description
nbmand export/import File system should be mounted with non-blocking mandatory locks (CIFS client feature)
normalization creation Unicode normalization of file names for matching
quota Max space dataset and descendants can consume
recordsize Suggested maximum block size for files
refquota Max space dataset can consume, not including descendants
setuid setuid mode policy
sharenfs NFS sharing options
sharesmb File system shared with CIFS
105
106. File System Properties
Property Change? Brief Description
snapdir Controls whether .zfs directory is hidden
utf8only creation UTF-8 character file name policy
vscan Virus scan enabled
xattr Extended attributes policy
106
108. Dataset Space Accounting
used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation
Lazy updates, may not be correct until txg commits
ls and du will show size of allocated files which includes all copies of a file
Shorthand report available
$ zfs list -o space
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
rpool 126G 18.3G 0 35.5K 0 18.3G
rpool/ROOT 126G 15.3G 0 18K 0 15.3G
rpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0
rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0
rpool/dump 126G 1.00G 0 1.00G 0 0
rpool/export 126G 37K 0 19K 0 18K
rpool/export/home 126G 18K 0 18K 0 0
rpool/swap 128G 2G 0 193M 1.81G 0
108
109. zfs vs zpool Space Accounting
zfs list != zpool list
zfs list shows space used by the dataset plus space for internal
accounting
zpool list shows physical space available to the pool
For simple pools and mirrors, they are nearly the same
For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space
available for parity
Users will be confused about reported space available
109
110. Accessing Snapshots
By default, snapshots are accessible in .zfs directory
Visibility of .zfs directory is tunable via snapdir property
Don't really want find to find the .zfs directory
Windows CIFS clients can see snapshots as Shadow Copies for
Shared Folders (VSS)
# zfs snapshot rpool/export/home/relling@20090415
# ls -a /export/home/relling
…
.Xsession
.xsession-errors
# ls /export/home/relling/.zfs
shares snapshot
# ls /export/home/relling/.zfs/snapshot
20090415
# ls /export/home/relling/.zfs/snapshot/20090415
Desktop Documents Downloads Public
110
111. Time-based Resilvering
Block pointers contain birth txg number
Resilvering begins with oldest blocks first
Interrupted resilver will still result in a valid file system view
[Diagram: a block tree whose blocks are labeled with birth txg = 27, 68, and 73]
111
112. Time Slider - Automatic Snapshots
Underpinnings for Solaris feature similar to OSX's Time Machine
SMF service for managing snapshots
SMF properties used to specify policies: frequency (interval) and number to keep
Creates cron jobs
GUI tool makes it easy to select individual file systems
Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion
Service Name Interval (default) Keep (default)
auto-snapshot:frequent 15 minutes 4
auto-snapshot:hourly 1 hour 24
auto-snapshot:daily 1 day 31
auto-snapshot:weekly 7 days 4
auto-snapshot:monthly 1 month 12
112
114. ACL – Access Control List
Based on NFSv4 ACLs
Similar to Windows NT ACLs
Works well with CIFS services
Supports ACL inheritance
Change using chmod
View using ls
114
115. Checksums for Data
DVA contains 256 bits for checksum
Checksum is in the parent, not in the block itself
Types
none
fletcher2: truncated 2nd order Fletcher-like algorithm (default prior
to b114)
fletcher4: 4th order Fletcher-like algorithm (default, starting b114)
SHA-256
There are open proposals for better algorithms
115
116. Checksum Use
Pool Algorithm Notes
Uberblock SHA-256 self-checksummed
Metadata fletcher4
Labels SHA-256
Gang block SHA-256 self-checksummed
Dataset Algorithm Notes
Metadata fletcher4
Data fletcher4 (default) zfs checksum parameter
ZIL log fletcher2 self-checksummed
Send stream fletcher4
Note: fletcher2 was the default for data prior to b114
Note: ZIL log has additional checking beyond the checksum
116
117. Compression
Builtin
lzjb, Lempel-Ziv by Jeff Bonwick
gzip, levels 1-9
Extensible
new compressors can be added
backwards compatibility issues
Uses taskqs to take advantage of multi-processor systems
Do you have a better compressor in mind?
http://richardelling.blogspot.com/2009/08/justifying-new-compression-algorithms.html
Cannot boot from gzip compressed root (RFE is open)
117
119. Quotas
File system quotas
quota includes descendants (snapshots, clones)
refquota does not include descendants
User and group quotas
b114, Solaris 10 10/09 (patch 141444-03 or 141445-03)
Works like refquota, descendants don't count
Not inherited
zfs userspace and groupspace subcommands show quotas
Users can only see their own and group quota, but can delegate
Managed like properties
[user|group]quota@[UID|username|SID name|SID number]
not visible via zfs get all
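Minimal examples (user, dataset, and size are hypothetical):
zfs set userquota@relling=10g mypool/home
zfs get userquota@relling mypool/home
zfs userspace mypool/home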
119
120. zpool.cache
Old way
mount /
read /etc/[v]fstab
mount file systems
ZFS
import pool(s)
find mountable datasets and mount them
/etc/zfs/zpool.cache is a cache of pools to be imported at boot time
No scanning of all available LUNs for pools to import
Binary: dump contents with zdb -C
cachefile property permits selecting an alternate zpool.cache
Useful for OS installers
Useful for clusters, where you don't want a booting node to
automatically import a pool
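For the cluster case above, a sketch that avoids the default cache file so the pool is not auto-imported at boot (pool and device names are hypothetical):
zpool create -o cachefile=none clusterpool mirror c2t0d0 c3t0d0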
Not persistent (!)
120
121. Mounting ZFS File Systems
By default, mountable file systems are mounted when the pool is
imported
Controlled by canmount policy (not inherited)
on – (default) file system is mountable
off – file system is not mountable
if you want children to be mountable, but not the parent
noauto – file system must be explicitly mounted (boot environment)
Can zfs set mountpoint=legacy to use /etc/vfstab
By default, cannot mount on top of non-empty directory
Can override explicitly using zfs mount -O or legacy mountpoint
Mount properties are persistent, use zfs mount -o for temporary
changes
Imports are done in parallel, beware of mountpoint races
prior to b104
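Examples of the policies above (dataset names are hypothetical):
zfs set mountpoint=/export/web mypool/web
zfs set canmount=off mypool
zfs set mountpoint=legacy mypool/legacyfs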
121
122. recordsize
Dynamic
Max 128 kBytes
Min 512 Bytes
Power of 2
For most workloads, don't worry about it
For fixed size workloads, can set to match workloads
Databases
iSCSI Zvols serving NTFS or ext3 (use 4 KB)
File systems or Zvols
zfs set recordsize=8k dataset
122
123. Delegated Administration
Fine grain control
users or groups of users
subcommands, parameters, or sets
Similar to Solaris' Role Based Access Control (RBAC)
Enable/disable at the pool level
zpool set delegation=on mypool (default)
Allow/unallow at the dataset level
zfs allow relling snapshot mypool/relling
zfs allow @backupusers snapshot,send mypool/sw
zfs allow mypool/relling
123
131. Solaris Swap and Dump
Swap
Solaris does not have automatic swap resizing
Swap as a separate dataset
Swap device is raw, with a refreservation
Blocksize matched to pagesize: 8 kB SPARC, 4 kB x86
Don't really need or want snapshots or clones
Can resize while online, manually
Dump
Only used during crash dump
Preallocated
No refreservation
Checksum off
Compression off (dumps are already compressed)
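For the online swap resize mentioned above, a sketch (size is illustrative; the swap device may need to be re-added with swap(1M)):
zfs set volsize=4g rpool/swap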
131
133. General Comments
In general, performs well out of the box
Standard performance improvement techniques apply
Lots of DTrace knowledge available
Typical areas of concern:
ZIL
check with zilstat, improve with slogs
COW “fragmentation”
check iostat, improve with L2ARC
Memory consumption
check with arcstat
set primarycache property
can be capped
can compete with large page aware apps
Compression, or lack thereof
133
134. ZIL Performance : NFS
Big performance increases demonstrated
especially with SSDs
for RAID arrays with nonvolatile RAM cache, not so much
NFS servers
32kByte threshold (zfs_immediate_write_sz) also corresponds to
NFSv3 write size
May cause more work than needed
See CR6686887
134
135. ZIL Performance : Databases
The logbias property can be set on a dataset to control threshold for
writing to pool when a slog is used
logbias=latency (default) all writes go to slog
logbias=throughput, writes > zfs_immediate_write_sz go to pool
Settable on-the-fly
Consider changing policy during database loads
Can have different sync policies for logs and data
Oracle: separate latency-sensitive redo log traffic from data file traffic
Redo logs: logbias=latency
Indexes: logbias=latency
Data files: logbias=throughput
MySQL with InnoDB
logbias=latency
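For example, switching policy for a database data file system during a bulk load (dataset names are hypothetical):
zfs set logbias=throughput mypool/db/data
zfs set logbias=latency mypool/db/redo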
135
136. More ZIL Performance : Databases
I/O size inflation
Once a file grows to use a block size, it will keep that block size
Block size is capped by recordsize
recordsize is a power of 2: 512 bytes, 1 KB, 2 KB, 4 KB, ... 128 KB
Can be inefficient if the workload is sync and writes variable sized
data
Oracle performance work: Roch reports 40% improvement for JBOD
(HDD) + separate log (SSD) with:
File system or Zvol Role recordsize logbias
data files 8 KB throughput
redo logs 128 KB (default) latency (default)
indices 8-32 KB? latency (default)
136
137. vdev Cache
vdev cache occurs at the SPA level
readahead
10 MBytes per vdev
only caches metadata (b70 or later)
Stats collected as Solaris kstats
# kstat -n vdev_cache_stats
module: zfs instance: 0
name: vdev_cache_stats class: misc
crtime 38.83342625
delegations 14030
hits 105169
misses 59452
snaptime 4564628.18130739
Hit rate = 59%, not bad...
137
138. Intelligent Prefetching
Intelligent file-level prefetching occurs at the DMU level
Feeds the ARC
In a nutshell, prefetch hits cause more prefetching
Read a block, prefetch a block
If we used the prefetched block, read 2 more blocks
Up to 256 blocks
Recognizes strided reads
2 sequential reads of same length and a fixed distance will be
coalesced
Fetches backwards
Seems to work pretty well, as-is, for most workloads
138
139. Unintelligent Prefetch?
Some workloads don't do so well with intelligent prefetch
CR6859997, zfs caching performance problem, fixed in NV b124
Look for time spent in zfetch_* functions using lockstat
lockstat -I sleep 10
Easy to disable in mdb for testing on Solaris
echo zfs_prefetch_disable/W0t1 | mdb -kw
Re-enable with
echo zfs_prefetch_disable/W0t0 | mdb -kw
Set via /etc/system
set zfs:zfs_prefetch_disable = 1
139
140. I/O Queues
By default, for devices which can support multiple I/Os, up to 35 I/Os
are queued to each vdev
Tunable with zfs_vdev_max_pending, set to 10 with:
echo zfs_vdev_max_pending/W0t10 | mdb -kw
Implies that more vdevs is better
Consider avoiding RAID array with a single, large LUN
ZFS I/O scheduler loses control once iops are queued
CR6471212 proposes reserved slots for high-priority iops
May need to match queues for the entire data path
zfs_vdev_max_pending
Fibre channel, SCSI, SAS, SATA driver
RAID array controller
Fast disks → small queues, slow disks → larger queues
140
141. COW Penalty
COW can negatively affect workloads which have updates and
sequential reads
Initial writes will be sequential
Updates (writes) will cause seeks to read data
Lots of people seem to worry a lot about this
Only affects HDDs
Very difficult to speculate about the impact on real-world apps
Large sequential scans of random data hurt anyway
Reads are cached in many places in the data path
Databases can COW, too
Sysbench benchmark used to test on MySQL w/InnoDB engine
One hour read/write test
select count(*)
repeat, for a week
141
142. COW Penalty
Performance seems to level at about 25% penalty
Results compliments of Allan Packer & Neelakanth Nadgir
http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf
142
143. About Disks...
Disks still the most important performance bottleneck
Modern processors are multi-core
Default checksums and compression are computationally efficient
Disk      Size   RPM      Max Size (GBytes)   Average Rotational Latency (ms)   Average Seek (ms)
HDD       2.5”   5,400    500                 5.5                               11
HDD       3.5”   5,900    2,000               5.1                               16
HDD       3.5”   7,200    1,500               4.2                               8 - 8.5
HDD       2.5”   10,000   300                 3                                 4.2 - 4.6
HDD       2.5”   15,000   146                 2                                 3.2 - 3.5
SSD (w)   2.5”   N/A      73                  0                                 0.02 - 0.15
SSD (r)   2.5”   N/A      500                 0                                 0.02 - 0.15
143
144. DirectIO
UFS forcedirectio option brought the early 1980s design of UFS up to
the 1990s
ZFS designed to run on modern multiprocessors
Databases or applications which manage their data cache may benefit
by disabling file system caching
Expect L2ARC to improve random reads (secondarycache)
Prefetch disabled by primarycache=none|metadata
UFS DirectIO                   ZFS
Unbuffered I/O                 primarycache=metadata or primarycache=none
Concurrency                    Available at inception
Improved Async I/O code path   Available at inception
144
145. Hybrid Storage Pool
SPA components: separate log device, main pool, L2ARC cache device
                  separate log device     Main Pool   L2ARC cache device
Device            Write-optimized SSD     HDD         Read-optimized SSD
Size (GBytes)     < 1 GByte               large       big
Cost              write iops/$            size/$      size/$
Performance       low-latency writes      -           low-latency reads
145
146. RAID-Z Bandwidth
Traditional RAID-Z had a “mind the gap” feature
Impacts possible bandwidth
Mirrors could show higher bandwidth
Now RAID-Z shows better bandwidth, when channel bandwidth is the
constrained resource
Implementation caused spurious errors for b118-b123
146
149. flash
Copy on Write
[Diagram: 1. Initial block tree, 2. COW some data, 3. COW metadata, 4. Update uberblocks & free]
What if the uberblock is updated prior to leaves?
149
150. What if flush is ignored?
Some devices ignore cache flush commands (!)
Virtualization default=ignore flush: VirtualBox, others?
Some USB/Firewire to IDE/SATA converters
Problem: uberblock could be updated before leaves
Symptom: can’t import pool, uberblock points to random data
Affected systems
Many OSes and file systems
Laptops - rarely because of battery
Enterprise-class systems - rarely because of power redundancy and
solid design
Desktops - more frequently
Solution (pending further automation)
Check integrity of recent transaction groups
If damaged, rollback to older uberblock
Today, can do this by hand, but process is tedious
150
151. Can't Import Pool?
Check device paths with zpool import
Be aware of /etc/zfs/zpool.cache
May need zpool import -d directory option
“phantom paths”?
Check for 4 labels
zdb -l /dev/dsk/c0t0d0s0
Beware of device short names: c0d0 != c0d0s0
151
152. Slow Pool Import?
Case: zvols with snapshots
Symptom: reboot or zpool import is really slllooooowwwwwww...
Cause: inefficient incrementing over all zvols creating entries in
/dev/zvol/dsk
Cure: CR6761786 integrated in b125
152
153. File System Mounts B0rken?
Prevention
Avoid complex hierarchies (KISS)
Be aware of legacy mounts
Be aware of alternate boot environments (Solaris)
Check mountpoint properties
zfs list -o name,mountpoint
Shared file systems
Be aware of inherited shares
Some clients do not mirror mount (Linux)
NFS version differences?
Check name services
153
154. Can't Boot?
Check if BIOS/OBP supports booting from device
Make sure LUN has SMI label, not EFI
Common mistake when mirroring root
OK: zpool attach rpool c0t0d0s0 c0t1d0s7
Not OK: zpool attach rpool c0t0d0s0 c0t1d0
installboot?
grub issues
Boot environments usually handled by grub
Check grub menu.lst
Know how to do a failsafe boot
Be aware of LiveCD import
Be aware of zpool.cache interactions
154
155. Future Plans
Announced enhancements in the pipeline from
Kernel Conference Australia, July 15-17 2009
Encryption
Deduplication
Block pointer rewrite
Shadow migration
More performance tweaks
New block allocator
Pipeline improvements
Raw scrub
Scrub prefetch
Just in time decompression or decryption
Native iSCSI (COMSTAR)
Zero copy I/O
Parallel device open
155
156. More Future Plans
Snapshot holds (b124)
Access-based enumeration (b125)
Multiple mount protection
Separate log offlining (b125) (removal later)
156
157. Now you know...
ZFS structure: pools, datasets
Data redundancy: mirrors, RAIDZ, copies
Data verification: checksums
Data replication: snapshots, clones, send, receive
Hybrid storage: separate logs, cache devices, ARC
Security: allow, deny, encryption
Resource management: quotas, references, I/O scheduler
Performance: latency, COW, zilstat, arcstat, logbias, recordsize
Troubleshooting: FMA, zdb, importance of cache flushes
157
158. Its a wrap!
Thank You!
Questions?
Richard.Elling@RichardElling.com
158